I'm trying to run the following Python script locally, using spark-submit command:

import sys
sys.path.insert(0, '.')

from pyspark import SparkContext, SparkConf
from commons.Utils import Utils

def splitComma(line):
    splits = Utils.COMMA_DELIMITER.split(line)
    return "{}, {}".format(splits[1], splits[2])

if __name__ == "__main__":
    conf = SparkConf().setAppName("airports").setMaster("local[2]")
    sc = SparkContext(conf=conf)

    airports = sc.textFile("in/airports.text")
    airportsInUSA = airports\
        .filter(lambda line: Utils.COMMA_DELIMITER.split(line)[3] == "\"United States\"")

    airportsNameAndCityNames = airportsInUSA.map(splitComma)
    airportsNameAndCityNames.saveAsTextFile("out/airports_in_usa.text")

The command used (while inside the project directory):

spark-submit rdd/AirportsInUsaSolution.py 

I keep getting this error:

Traceback (most recent call last):
  File "/home/gustavo/Documentos/TCC/python_spark_yt/python-spark-tutorial/rdd/AirportsInUsaSolution.py", line 4, in <module>
    from commons.Utils import Utils
ImportError: No module named commons.Utils

Even though there is a commons.Utils module with a Utils class in it.

It seems that the only imports it accepts are the ones from Spark, because this error persists when I try to import any other class or file from my project.


4 Answers

from pyspark import SparkContext, SparkConf

def splitComma(line):
    splits = Utils.COMMA_DELIMITER.split(line)
    return "{}, {}".format(splits[1], splits[2])

if __name__ == "__main__":
    conf = SparkConf().setAppName("airports").setMaster("local[2]")
    sc = SparkContext(conf=conf)

    sc.addPyFile('.../pathto commons.zip')
    from commons import Utils

    airports = sc.textFile("in/airports.text")
    airportsInUSA = airports\
        .filter(lambda line: Utils.COMMA_DELIMITER.split(line)[3] == "\"United States\"")

    airportsNameAndCityNames = airportsInUSA.map(splitComma)
    airportsNameAndCityNames.saveAsTextFile("out/airports_in_usa.text")

Yes, by default it only accepts the imports that ship with Spark. You can zip the required files (Utils, numpy, etc.) and pass the archive with the --py-files parameter of spark-submit.

spark-submit --py-files rdd/file.zip rdd/AirportsInUsaSolution.py 
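For example (a sketch only; the archive name and paths are assumptions, adjust them to your project layout), you could zip the commons package from the project root and then hand the archive to spark-submit:

zip -r commons.zip commons/
spark-submit --py-files commons.zip rdd/AirportsInUsaSolution.py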

For Python to consider a directory a package, you need to create an __init__.py file in that directory. The __init__.py file doesn't need to contain anything.

In this case, once you create __init__.py in the commons directory, you will be able to import that package.
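Assuming the layout implied by the question (directory names other than commons and rdd are guesses), the project would then look roughly like this, and the empty file can be created with a simple touch:

python-spark-tutorial/
    commons/
        __init__.py        (can be empty)
        Utils.py
    rdd/
        AirportsInUsaSolution.py

touch commons/__init__.py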

I think the problem is in the Spark configuration. Please add the PYSPARK_PYTHON environment variable to your ~/.bashrc. In my case it looks like: export PYSPARK_PYTHON=/home/comrade/environments/spark/bin/python3, where PYSPARK_PYTHON is the path to the Python executable in my "spark" environment.
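For example (the interpreter path below is only a placeholder; point it at whatever Python your Spark jobs should use):

# in ~/.bashrc
export PYSPARK_PYTHON=/path/to/your/environment/bin/python3

# reload the shell configuration
source ~/.bashrc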

Hope it helps.

Create a Python script named Utils.py containing:

import re

class Utils():
    COMMA_DELIMITER = re.compile(''',(?=(?:[^"]*"[^"]*")*[^"]*$)''')

Put this Utils.py script in a commons folder, and put that folder in your working directory (type pwd to find it). You can then import the Utils class:

from commons.Utils import Utils 
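As a quick sanity check (the sample line below is made up for illustration; it is not taken from airports.text), the regex splits only on commas that sit outside double quotes:

from commons.Utils import Utils

line = '123,"Name, with comma","Some City","United States"'
print(Utils.COMMA_DELIMITER.split(line))
# ['123', '"Name, with comma"', '"Some City"', '"United States"']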

Hope it will help you.
