How do I pass files to the master node?

小编典典

python

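I have written Python code that performs binary classification, and I want to use Apache Spark to parallelize the classification across different data files on my local machine. I have done the following:

I wrote the whole project as four Python files: "run_classifer.py" (runs the classification application), "classifer.py" (the binary classifier), "load_params.py" (loads the learned parameters used for classification), and "preprocessing.py" (preprocesses the data). The project also depends on two other files, "tokenizer.perl" and "nonbreaking_prefixes/nonbreaking_prefix.en", both used in the preprocessing step.
The main part of my script "run_classifer.py" is defined as follows: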
```python
from pyspark import SparkConf, SparkContext

# Initialize Spark
conf = SparkConf().setAppName("ruofan").setMaster("local")
sc = SparkContext(conf=conf,
    pyFiles=['''All python files in my project as
             well as "nonbreaking_prefix.en" and "tokenizer.perl"'''])

# Read the data directory from S3 storage and create an RDD
datafile = sc.wholeTextFiles("s3n://bucket/data_dir")

# Send the application to each of the slave nodes
datafile.foreach(lambda kv: classifier(kv[0], kv[1]))
```

However, when I run the script "run_classifier.py", it cannot seem to find the file "nonbreaking_prefix.en". This is the error I get:

ERROR: No abbreviations files found in /tmp/spark-f035270e-e267-4d71-9bf1-8c42ca2097ee/userFiles-88093e1a-6096-4592-8a71-be5548a4f8ae/nonbreaking_prefixes

But I did pass the file "nonbreaking_prefix.en" to the master node, and I have no idea what is causing this error. I would appreciate any help in solving it.

2020-12-20

1 answer

小编典典

You can upload the files with `sc.addFile` and get their paths on the workers with `SparkFiles.get`:

```python
from pyspark import SparkContext, SparkFiles

sc = SparkContext(conf=conf,
    pyFiles=["All", "Python", "Files", "in", "your", "project"])

# Assuming both files are in your working directory
sc.addFile("nonbreaking_prefix.en")
sc.addFile("tokenizer.perl")

def classifier(path, content):
    # Get the path of an uploaded file on the worker
    print(SparkFiles.get("tokenizer.perl"))

    with open(SparkFiles.get("nonbreaking_prefix.en")) as fr:
        lines = [line for line in fr]
```
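
The error in the question names a `nonbreaking_prefixes` directory inside the `userFiles-...` directory, which suggests that tokenizer.perl looks for its prefix files in a subdirectory next to itself, whereas files added with `sc.addFile` end up side by side. If that reading is right, one option is to rebuild the expected layout on the worker before calling the tokenizer. The following is only a sketch under that assumption; `stage_tokenizer_files` and `classifier_with_staging` are hypothetical names, not part of Spark or of the original project.

```python
import os
import shutil
import tempfile


def stage_tokenizer_files(script_path, prefix_path, workdir=None):
    """Copy the script and the prefix file into <workdir>/ and
    <workdir>/nonbreaking_prefixes/ and return the staged script path."""
    # Hypothetical helper: uses a fresh temporary directory unless one is given.
    workdir = workdir or tempfile.mkdtemp(prefix="tokenizer-")
    prefix_dir = os.path.join(workdir, "nonbreaking_prefixes")
    os.makedirs(prefix_dir, exist_ok=True)
    staged_script = shutil.copy(script_path, workdir)
    shutil.copy(prefix_path, prefix_dir)
    return staged_script


def classifier_with_staging(path, content):
    # Imported here so the helper above can be used without Spark installed.
    from pyspark import SparkFiles

    script = stage_tokenizer_files(
        SparkFiles.get("tokenizer.perl"),
        SparkFiles.get("nonbreaking_prefix.en"),
    )
    # Assumption: the preprocessing step can be pointed at this staged copy.
    return script
```

The staged script path can then be handed to whatever part of the preprocessing code invokes the tokenizer, so the prefix file sits where the error message says it is being looked for.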

2020-12-20
