
I am trying to store stream data into HDFS using Spark Streaming, but it keeps creating new files instead of appending to one single file or a few files.

If it keeps creating n files, I feel it won't be very efficient.

[Screenshot: HDFS file system showing the many small output files]

Code

lines.foreachRDD { f =>
  if (!f.isEmpty()) {
    val df = f.toDF().coalesce(1)
    df.write.mode(SaveMode.Append).json("hdfs://localhost:9000/MT9")
  }
}

In my pom.xml I am using these dependencies:

  • spark-core_2.11
  • spark-sql_2.11
  • spark-streaming_2.11
  • spark-streaming-kafka-0-10_2.11
  • If you're reading data from Kafka into HDFS, I suggest you look at using NiFi or Kafka Connect; don't rewrite code for existing solutions. Commented Jun 25, 2018 at 11:14
  • HDFS is meant to be write-once, read-many; you cannot write to the same file. To do that you would need a compaction-style process, which is what Hive and HBase follow. Commented Jun 25, 2018 at 13:00

2 Answers


As you already realized, Append in Spark means write-to-existing-directory, not append-to-file.

This is intentional and desirable behavior (think of what would happen if the process failed in the middle of "appending", even if the format and file system allowed it).

Operations like merging files should be applied by a separate process, if necessary at all, so that correctness and fault tolerance are preserved. Unfortunately this requires a full copy, which for obvious reasons is not desirable on a batch-to-batch basis.
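If fewer, larger files are really needed, one option is a separate, periodically scheduled compaction job. A minimal sketch, assuming the output path from the question (the object name, app name, and compacted path are made-up placeholders):

import org.apache.spark.sql.{SaveMode, SparkSession}

// Hypothetical stand-alone compaction job: run it separately from the
// streaming job, e.g. on a schedule, so the streaming writes stay small
// and fault tolerant.
object CompactJson {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CompactJson").getOrCreate()

    // Read everything the streaming job has written so far.
    val df = spark.read.json("hdfs://localhost:9000/MT9")

    // Rewrite it as a single file into a different directory; writing to
    // a separate path (not the one being streamed into) keeps re-runs safe.
    df.coalesce(1)
      .write
      .mode(SaveMode.Overwrite)
      .json("hdfs://localhost:9000/MT9_compacted") // placeholder path

    spark.stop()
  }
}

Note that this performs exactly the full copy mentioned above, which is why it belongs in its own process rather than inside each micro-batch.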


9 Comments

You can go through this link: spark.apache.org/docs/2.1.1/api/java/org/apache/spark/sql/… "Append mode means that when saving a DataFrame to a data source, if data/table already exists, contents of the DataFrame are expected to be appended to existing data."
@andani That's appending in Spark... For HDFS, appending means adding new files into a directory rather than overwriting that directory completely
@cricket_007 then is there any way to store data in the same file, the way it is done in Storm?
@andani I have never used Storm, but I know it isn't used for persistent data storage
@cricket_007 what I meant to say is: are there built-in libraries which store data in HDFS in the required fashion?

It's creating a file for each RDD because you are reinitialising the DataFrame variable every time. I would suggest declaring a DataFrame variable outside the loop, initialised to null, and inside the loop unioning each RDD's DataFrame with it. After the loop, write using the outer DataFrame, as in the sketch below.
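For what it's worth, a literal sketch of this suggestion might look like the following, assuming the StreamingContext is named ssc, implicits are in scope as in the question, and the accumulator name is a made-up placeholder (note the caveats in the comments):

import org.apache.spark.sql.{DataFrame, SaveMode}

var accumulated: DataFrame = null // hypothetical accumulator across batches

lines.foreachRDD { f =>
  if (!f.isEmpty()) {
    val batch = f.toDF()
    // Union each micro-batch into the outer variable instead of writing it.
    accumulated = if (accumulated == null) batch else accumulated.union(batch)
  }
}

ssc.start()
ssc.awaitTermination() // blocks until the streaming context is stopped

// Only reached after the context stops. Also note that the union chain
// grows the DataFrame's lineage with every batch, so this is only viable
// for short-lived jobs.
if (accumulated != null) {
  accumulated.coalesce(1)
    .write
    .mode(SaveMode.Append)
    .json("hdfs://localhost:9000/MT9")
}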

4 Comments

Still the same case:

var empty = sqlContext.emptyDataFrame
lines.foreachRDD { f =>
  if (!f.isEmpty()) {
    empty = f.toDF().coalesce(1)
    empty.write.mode(SaveMode.Append).json(warehouseLocation)
  }
}
Inside your condition add this: if (empty == null) empty = f.toDF() else empty = empty.union(f.toDF()). After the loop ends, write with empty.coalesce(1).write.mode(...) and the rest of your options. Please do not write it inside the loop.
The df does not have a fixed number of columns, so I am getting an error with your condition.
