I am saving a DataFrame to a CSV file in PySpark with the statement below:

```python
df_all.repartition(1).write.csv("xyz.csv", header=True, mode='overwrite')
```

but I am getting the following error:
```
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 218, in main
    func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
  File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 138, in read_udfs
    arg_offsets, udf = read_single_udf(pickleSer, infile, eval_type)
  File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 118, in read_single_udf
    f, return_type = read_command(pickleSer, infile)
  File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 58, in read_command
    command = serializer._read_with_length(file)
  File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 170, in _read_with_length
    return self.loads(obj)
  File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 559, in loads
    return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'app'
```

I am using PySpark version 2.3.0.
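From the traceback, it looks like the executor's Python worker fails while unpickling the UDF because it cannot import the module (`app`) where the UDF's function is defined. My guess is that the module has to be shipped to the workers explicitly; below is a minimal sketch of what I think might be needed, assuming the function lives in a local file named `app.py` (that file name is my assumption):

```python
# Sketch: make the module that defines the UDF's function importable on the
# executors, so pickle can resolve it during deserialization.
# Assumes the function is defined in a local file named app.py.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("emi-merge").getOrCreate()

# Ship app.py to every executor and add it to their import path.
spark.sparkContext.addPyFile("app.py")

from app import create_emi_amount  # should now resolve on the workers too
```

I have also seen `--py-files` mentioned for `spark-submit`, but I am not sure whether that applies here or whether something else is keeping the workers from seeing the module.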
The error is raised only when I try to write the DataFrame to a file. Here is the relevant code:

```python
import json, jsonschema
from pyspark.sql import functions
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType, StringType, FloatType
from datetime import datetime
import os

# Load and filter the two monthly extracts.
feb = self.filter_data(self.SRC_DIR + "tl_feb19.csv", 13)
apr = self.filter_data(self.SRC_DIR + "tl_apr19.csv", 15)

# Merge the two months and drop duplicate records by primary key.
df_all = feb.union(apr)
df_all = df_all.dropDuplicates(subset=["PRIMARY_ID"])

# Derive the EMI amount with a Python UDF.
create_emi_amount_udf = udf(create_emi_amount, FloatType())
df_all = df_all.withColumn("EMI_Amount", create_emi_amount_udf('Sanction_Amount', 'Loan_Type'))

df_all.write.csv(self.DST_DIR + "merged_amounts.csv", header=True, mode='overwrite')
```
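For context, `create_emi_amount` is defined elsewhere in my project (presumably in the `app` module the traceback complains about). A simplified, purely illustrative stand-in with the same signature, not my real logic:

```python
# Purely illustrative stand-in for the real create_emi_amount, which lives
# in another module; the actual EMI calculation is more involved.
def create_emi_amount(sanction_amount, loan_type):
    if sanction_amount is None:
        return None
    # Hypothetical flat rates, just to make the snippet self-contained.
    rate = 0.10 if loan_type == "PERSONAL" else 0.08
    return float(sanction_amount) * rate
```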