52

I am trying to overwrite a Spark dataframe using the following option in PySpark but I am not successful

spark_df.write.format('com.databricks.spark.csv').option("header", "true",mode='overwrite').save(self.output_file_path) 

The mode='overwrite' argument is not successful.

2 Answers

86

Try:

spark_df.write.format('com.databricks.spark.csv').mode('overwrite').option("header", "true").save(self.output_file_path)

1 Comment

It worked for updating a JSON file on HDFS: doc.write.format('json').mode("append").option("header","true").save('/path/to/hdfs_file')
32

Update 2023

Docs changed again. DataFrameWriter.csv: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.csv.html#pyspark.sql.DataFrameWriter.csv

EDIT 2021

The docs have had a major facelift. That may help new users discover functionality starting from a requirement, but it takes some getting used to.

DataFrameReader and DataFrameWriter are now part of the Input/Output section in the API docs: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#input-and-output

The DataFrameWriter.csv callable is now here: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrameWriter.csv.html#pyspark.sql.DataFrameWriter.csv

Original answer

Spark 2.0 and above has a built-in csv method on DataFrameWriter (the DataFrameWriter class itself dates from Spark 1.4):

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter

e.g.

spark_df.write.csv(path=self.output_file_path, header="true", mode="overwrite", sep="\t") 

Which is syntactic sugar for

spark_df.write.format("csv").mode("overwrite").options(header="true",sep="\t").save(path=self.output_file_path) 

I think what is confusing is finding where exactly the options are available for each format in the docs.

These write-related methods belong to the DataFrameWriter class: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter

The csv method has these options available, also available when using format("csv"): https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter.csv

How you supply parameters also depends on whether the method takes a single key and value or keyword arguments. This is standard Python behaviour (*args, **kwargs); it just differs from the Scala syntax.

For example, the option(key, value) method takes one option as two positional arguments, like option("header", "true"), while the .options(**options) method takes any number of keyword assignments, e.g. .options(header="true", sep="\t").
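
To see why the call in the question fails, here is a minimal pure-Python sketch. ToyWriter is a hypothetical stand-in, not PySpark's DataFrameWriter: it copies only the argument shapes of option, options and mode, and writes nothing. The behaviour it shows comes from Python's argument handling, so it runs without a Spark installation.

```python
class ToyWriter:
    """Hypothetical stand-in mimicking only the argument shapes of
    DataFrameWriter.option / .options / .mode. It writes nothing."""

    def __init__(self):
        self.settings = {}
        self.save_mode = None

    def option(self, key, value):
        # Exactly one key and one value, passed positionally.
        self.settings[key] = value
        return self  # returning self allows method chaining

    def options(self, **options):
        # Any number of keyword arguments in a single call.
        self.settings.update(options)
        return self

    def mode(self, save_mode):
        # The save mode is set by its own method, not through option().
        self.save_mode = save_mode
        return self


# Two equivalent ways to set the same options.
chained = ToyWriter().mode("overwrite").option("header", "true").option("sep", "\t")
bulk = ToyWriter().mode("overwrite").options(header="true", sep="\t")

# Mixing the two shapes, as in the question, is rejected by Python itself
# because option() accepts no keyword named "mode".
try:
    ToyWriter().option("header", "true", mode="overwrite")
    mixed_error = None
except TypeError as exc:
    mixed_error = exc
```

Both chained and bulk end up with the same settings, while the mixed call raises a TypeError before anything could be written. With a real DataFrameWriter the fix is the same: call .mode("overwrite") separately from .option(...).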

1 Comment

May I suggest that the latest edit be on top, not at the bottom.
