Overwriting a Spark output using PySpark

Question

I am trying to overwrite a Spark dataframe's output using the following option in PySpark, but without success:

spark_df.write.format('com.databricks.spark.csv').option("header", "true",mode='overwrite').save(self.output_file_path)

The mode='overwrite' argument does not work.

user6022341 · Accepted Answer · 2016-03-08 07:11:20Z

86

Try:

spark_df.write.format('com.databricks.spark.csv') \
    .mode('overwrite').option("header", "true").save(self.output_file_path)

community wiki

1 Comment

mnis.p Over a year ago

It worked for updating a JSON file on HDFS: doc.write.format('json').mode("append").option("header","true").save("/path/to/hdfs_file")

Davos · Accepted Answer · 2024-10-29 00:46:39Z

Update 2023

Docs changed again. DataFrameWriter.csv: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.csv.html#pyspark.sql.DataFrameWriter.csv

EDIT 2021

The docs have had a huge facelift. That may help new users who are looking for functionality to match a requirement, but it does take some adjusting to.

DataFrameReader and DataFrameWriter are now part of the Input/Output section of the API docs: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#input-and-output

The DataFrameWriter.csv callable is now here: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrameWriter.csv.html#pyspark.sql.DataFrameWriter.csv

Original answer

Spark 2.0 and above has a built-in csv method on DataFrameWriter (the DataFrameWriter class itself dates from Spark 1.4):

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter

e.g.

spark_df.write.csv(path=self.output_file_path, header="true", mode="overwrite", sep="\t")

which is syntactic sugar for:

spark_df.write.format("csv").mode("overwrite").options(header="true",sep="\t").save(path=self.output_file_path)

I think what is confusing is finding where exactly the options are available for each format in the docs.

These write-related methods belong to the DataFrameWriter class: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter

The csv method has these options available, also available when using format("csv"): https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter.csv

How you supply parameters depends on whether the method takes positional arguments or keyword arguments. This is standard Python, using (*args, **kwargs); it just differs from the Scala syntax.

For example, the option(key, value) method takes one option as two positional arguments, like option("header", "true"), whereas the options(**options) method takes any number of keyword assignments, e.g. options(header="true", sep="\t").
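To make the difference concrete, here is a minimal pure-Python sketch. It is not PySpark: WriterSketch is a hypothetical stand-in that only mirrors the shape of the mode, option and options signatures, and it records settings instead of writing anything. Under that assumption, it shows that chaining option calls and a single options call end up with the same settings, and it suggests why the call in the question is rejected: option accepts exactly a key and a value, so an extra mode= keyword raises a TypeError.

```python
class WriterSketch:
    """Hypothetical stand-in for DataFrameWriter; records settings, writes nothing."""

    def __init__(self):
        # "errorifexists" mirrors Spark's default save mode.
        self.settings = {"mode": "errorifexists", "options": {}}

    def mode(self, save_mode):
        # The save mode is set through its own method, not through option().
        self.settings["mode"] = save_mode
        return self

    def option(self, key, value):
        # Exactly two positional arguments: one key and one value.
        self.settings["options"][key] = value
        return self

    def options(self, **options):
        # Any number of keyword assignments.
        self.settings["options"].update(options)
        return self


# Chained option() calls and a single options() call give the same settings.
chained = WriterSketch().mode("overwrite").option("header", "true").option("sep", "\t")
bulk = WriterSketch().mode("overwrite").options(header="true", sep="\t")
same_settings = chained.settings == bulk.settings

# The call shape from the question: mode= is not a parameter of option().
try:
    WriterSketch().option("header", "true", mode="overwrite")
    question_call_error = None
except TypeError as exc:
    question_call_error = str(exc)

print(same_settings)
print(question_call_error)
```

With the real DataFrameWriter the fix is the same as in the answers above: call .mode("overwrite") as its own step in the chain, or pass mode="overwrite" to write.csv.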

May I suggest that the latest edit be on top, not at the bottom.
