
I am saving a DataFrame to a CSV file in PySpark using the statement below:

df_all.repartition(1).write.csv("xyz.csv", header=True, mode='overwrite') 

But I am getting the error below:

Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 218, in main
    func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
  File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 138, in read_udfs
    arg_offsets, udf = read_single_udf(pickleSer, infile, eval_type)
  File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 118, in read_single_udf
    f, return_type = read_command(pickleSer, infile)
  File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 58, in read_command
    command = serializer._read_with_length(file)
  File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 170, in _read_with_length
    return self.loads(obj)
  File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 559, in loads
    return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'app'

I am using PySpark version 2.3.0.

I am getting this error while trying to write to a file.

import json, jsonschema
from pyspark.sql import functions
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType, StringType, FloatType
from datetime import datetime
import os

feb = self.filter_data(self.SRC_DIR + "tl_feb19.csv", 13)
apr = self.filter_data(self.SRC_DIR + "tl_apr19.csv", 15)

df_all = feb.union(apr)
df_all = df_all.dropDuplicates(subset=["PRIMARY_ID"])

create_emi_amount_udf = udf(create_emi_amount, FloatType())
df_all = df_all.withColumn("EMI_Amount", create_emi_amount_udf('Sanction_Amount', 'Loan_Type'))

df_all.write.csv(self.DST_DIR + "merged_amounts.csv", header=True, mode='overwrite')
  • This is a code issue caused by a wrong import. You may want to add your code to the question. Commented Jul 5, 2019 at 16:01
  • Please add more code, at least something about "app" Commented Jul 5, 2019 at 16:34
  • @syadav added the code. I don't have anything named 'app' in my code. Commented Jul 9, 2019 at 6:17
  • @DuyNguyenHoang added the code Commented Jul 9, 2019 at 6:17
  • If you're experiencing this error on Azure Databricks, install the dependencies under Libraries under Your_Cluster_Name in Compute. This will resolve the error. Commented Aug 31, 2021 at 12:35

2 Answers


The error is very clear: there is no module 'app'. Your Python code runs on the driver, but your UDF runs on the executor's Python VM. When you call the UDF, Spark serializes create_emi_amount to send it to the executors.
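For illustration, here is a minimal sketch of that failure mode (the package layout and the import path app.emi are assumptions, not taken from your code):

# create_emi_amount is defined inside a local package named `app`,
# which is importable on the driver but not present on the executors.
from app.emi import create_emi_amount   # works on the driver only

from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

create_emi_amount_udf = udf(create_emi_amount, FloatType())

# The function is pickled with a reference to its defining module, so each
# executor effectively runs `import app` while deserializing the UDF and raises
# ModuleNotFoundError: No module named 'app' if the package is missing there.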

So, somewhere in your method create_emi_amount you use or import the app module. A solution to your problem is to use the same environment on both the driver and the executors. In spark-env.sh, set the same Python virtualenv in PYSPARK_DRIVER_PYTHON=... and PYSPARK_PYTHON=....
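For example, conf/spark-env.sh might contain something like this (a sketch; the virtualenv path is a placeholder for your own environment):

# spark-env.sh -- use the same Python interpreter on the driver and the executors
# /path/to/your/venv is a placeholder; point it at the virtualenv that has your dependencies installed
export PYSPARK_DRIVER_PYTHON=/path/to/your/venv/bin/python
export PYSPARK_PYTHON=/path/to/your/venv/bin/python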


1 Comment

Could you please expand your answer? I'm a newbie with Spark, and I am facing the same problem.

Thanks to ggeop! He helped me out, and he has explained the problem. But his solution may not be correct if 'app' is your own package.

My solution is to add the file to the SparkContext:

from pyspark import SparkContext

sc = SparkContext()
sc.addPyFile("app.zip")  # ships app.zip to every executor and puts it on their PYTHONPATH

But you have to zip the app package first, and you have to make sure the zipped archive contains the app directory itself.
I.e. if your app package is at /home/workplace/app, you have to create the zip from inside workplace, which will zip everything under workplace, including app (see the sketch below).
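A minimal sketch of creating such an archive with Python's standard library (the /home/workplace path is just the example location from above):

import shutil

# Creates ./app.zip whose entries are prefixed with app/,
# by archiving relative to /home/workplace rather than from inside app itself.
shutil.make_archive("app", "zip", root_dir="/home/workplace", base_dir="app")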

The other way is to pass the file to spark-submit, as below:

--py-files app.zip
--py-files myapp.py
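For example, a full invocation might look like this (a sketch; your_driver_script.py is a placeholder for whatever your entry-point script is called):

spark-submit --py-files app.zip your_driver_script.py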
