
I'm performing a union operation on 3 RDDs. I'm aware that union doesn't preserve ordering, but in my case the result is quite strange. Can someone explain what's wrong in my code?

I have a DataFrame of rows (myDF) and converted it to an RDD:

val myRdd = myDF.rdd.map(row => row.toSeq.toList.mkString(":")).map(rec => (2, rec))
myRdd.collect
/*
Deepak:7321c:Stack Overflow:AIR:INDIA:AIR999:N:2020-04-22T10:28:33.087
Veeru:596621c:Medium:POWER:USA:LN49:Y:2020-14-22T10:38:43.287
Rajeev:1612801:Udemy:LEARN:ITALY:P4399:N:2020-04-22T13:08:43.887
*/

val rowCount = myRdd.count() // count of records in myRdd
val header = "name:country:date:nextdate:1" // random header

// Generating header RDD
val headerRdd = sparkContext.parallelize(Array(header), 1).map(rec => (1, rec))

// Generating trailer RDD
val trailerRdd = sparkContext.parallelize(Array("T" + ":" + rowCount), 1).map(rec => (3, rec))

// Performing union
val unionRdd = headerRdd.union(myRdd).union(trailerRdd).map(rec => rec._2)
unionRdd.saveAsTextFile("pathLocation")

Since union doesn't preserve ordering, it should not give the result below.

Output

name:country:date:nextdate:1
Deepak:7321c:Stack Overflow:AIR:INDIA:AIR999:N:2020-04-22T10:28:33.087
Veeru:596621c:Medium:POWER:USA:LN49:Y:2020-14-22T10:38:43.287
Rajeev:1612801:Udemy:LEARN:ITALY:P4399:N:2020-04-22T13:08:43.887
T:3

How is it possible to get the above output without using any sorting, such as:

sortByKey(true, 1)
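If a guaranteed order is the goal, the keys already attached in the question's code (1 for the header, 2 for data, 3 for the trailer) can be sorted explicitly instead of relying on union's behaviour. A minimal plain-Scala sketch of that idea, with no Spark required (in Spark the equivalent would be sortByKey on the keyed unioned RDD):

```scala
// Tag each record with a key, then sort by key to force header-first,
// trailer-last order, regardless of how the union interleaved the records.
val header  = Seq((1, "name:country:date:nextdate:1"))
val data    = Seq(
  (2, "Deepak:7321c:Stack Overflow:AIR:INDIA:AIR999:N:2020-04-22T10:28:33.087"),
  (2, "Veeru:596621c:Medium:POWER:USA:LN49:Y:2020-14-22T10:38:43.287"),
  (2, "Rajeev:1612801:Udemy:LEARN:ITALY:P4399:N:2020-04-22T13:08:43.887"))
val trailer = Seq((3, "T:3"))

// Simulate an arbitrary interleaving of the three sources.
val shuffled = scala.util.Random.shuffle(header ++ data ++ trailer)

// Sorting by the key recovers the intended layout; dropping the key
// afterwards gives the lines to write out.
val ordered = shuffled.sortBy(_._1).map(_._2)
```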

But when I remove the map from headerRdd, myRdd and trailerRdd, the order is:

Deepak:7321c:Stack Overflow:AIR:INDIA:AIR999:N:2020-04-22T10:28:33.087
name:country:date:nextdate:1
Veeru:596621c:Medium:POWER:USA:LN49:Y:2020-14-22T10:38:43.287
Rajeev:1612801:Udemy:LEARN:ITALY:P4399:N:2020-04-22T13:08:43.887
T:3

What is the possible reason for the above behaviour?

1 Answer


In Spark, the elements within a particular partition are unordered; however, the partitions themselves are ordered.
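The observed output is consistent with that: union concatenates the partition lists of its inputs in order, and saveAsTextFile writes partitions in order, so headerRdd's single partition lands first and trailerRdd's lands last. A rough plain-Scala model of this, representing each RDD as an ordered list of partitions (a sketch of the observable behaviour, not Spark's actual internals):

```scala
// headerRdd and trailerRdd were built with parallelize(..., 1),
// so each contributes exactly one partition.
val headerParts  = Vector(Vector("name:country:date:nextdate:1"))
val dataParts    = Vector(
  Vector("Deepak:7321c:Stack Overflow:AIR:INDIA:AIR999:N:2020-04-22T10:28:33.087"),
  Vector("Veeru:596621c:Medium:POWER:USA:LN49:Y:2020-14-22T10:38:43.287"),
  Vector("Rajeev:1612801:Udemy:LEARN:ITALY:P4399:N:2020-04-22T13:08:43.887"))
val trailerParts = Vector(Vector("T:3"))

// union appends the partition lists in order: the header partition stays
// first and the trailer partition stays last.
val unionParts = headerParts ++ dataParts ++ trailerParts

// saveAsTextFile writes partitions in this order (part-00000, part-00001, ...),
// so concatenating the files reads back in partition order.
val written = unionParts.flatten
```

This is why the keyed version looks sorted even without sortByKey: the order within each partition may be arbitrary, but the partition-to-partition order of a union is deterministic.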


1 Comment

It's not ordering when I remove the map from the RDDs:

val headerRdd = sparkContext.parallelize(Array(header), 1)
val trailerRdd = sparkContext.parallelize(Array("T" + ":" + rowCount), 1)
val myRdd = myDF.rdd.map(row => row.toSeq.toList.mkString(":"))
val unionRdd = headerRdd.union(myRdd).union(trailerRdd)
unionRdd.saveAsTextFile("pathLocation")
