
I'm performing a union operation on 3 RDDs. I'm aware that union doesn't preserve ordering, but in my case the result is quite strange. Can someone explain what's wrong in my code?

I have a DataFrame of rows (myDF) and converted it to an RDD:

val myRdd = myDF.rdd.map(row => row.toSeq.toList.mkString(":")).map(rec => (2, rec))
myRdd.collect
/*
Deepak:7321c:Stack Overflow:AIR:INDIA:AIR999:N:2020-04-22T10:28:33.087
Veeru:596621c:Medium:POWER:USA:LN49:Y:2020-14-22T10:38:43.287
Rajeev:1612801:Udemy:LEARN:ITALY:P4399:N:2020-04-22T13:08:43.887
*/

val rowCount = myRdd.count() // count of records in myRdd
val header = "name:country:date:nextdate:1" // random header

// Generating header RDD
val headerRdd = sparkContext.parallelize(Array(header), 1).map(rec => (1, rec))

// Generating trailer RDD
val trailerRdd = sparkContext.parallelize(Array("T" + ":" + rowCount), 1).map(rec => (3, rec))

// Performing union
val unionRdd = headerRdd.union(myRdd).union(trailerRdd).map(rec => rec._2)
unionRdd.saveAsTextFile("pathLocation")

Since union doesn't preserve ordering, it should not give the result below.

Output

name:country:date:nextdate:1
Deepak:7321c:Stack Overflow:AIR:INDIA:AIR999:N:2020-04-22T10:28:33.087
Veeru:596621c:Medium:POWER:USA:LN49:Y:2020-14-22T10:38:43.287
Rajeev:1612801:Udemy:LEARN:ITALY:P4399:N:2020-04-22T13:08:43.887
T:3

How is it possible to get the above output without using any sorting, such as:

sortByKey(true, 1)
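If a guaranteed order is the goal, the keys already attached in the question's code (1 for the header, 2 for data, 3 for the trailer) can be sorted explicitly instead of relying on union's behaviour. A minimal plain-Scala sketch of that idea, with no Spark required (in Spark the equivalent would be sortByKey on the keyed unioned RDD):

```scala
// Tag each record with a key, then sort by key to force header-first,
// trailer-last order, regardless of how the union interleaved the records.
val header  = Seq((1, "name:country:date:nextdate:1"))
val data    = Seq(
  (2, "Deepak:7321c:Stack Overflow:AIR:INDIA:AIR999:N:2020-04-22T10:28:33.087"),
  (2, "Veeru:596621c:Medium:POWER:USA:LN49:Y:2020-14-22T10:38:43.287"),
  (2, "Rajeev:1612801:Udemy:LEARN:ITALY:P4399:N:2020-04-22T13:08:43.887"))
val trailer = Seq((3, "T:3"))

// Simulate an arbitrary interleaving of the three sources.
val shuffled = scala.util.Random.shuffle(header ++ data ++ trailer)

// Sorting by the key recovers the intended layout; dropping the key
// afterwards gives the lines to write out.
val ordered = shuffled.sortBy(_._1).map(_._2)
```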

But when I remove the map from headerRdd, myRdd and trailerRdd, the order is:

Deepak:7321c:Stack Overflow:AIR:INDIA:AIR999:N:2020-04-22T10:28:33.087
name:country:date:nextdate:1
Veeru:596621c:Medium:POWER:USA:LN49:Y:2020-14-22T10:38:43.287
Rajeev:1612801:Udemy:LEARN:ITALY:P4399:N:2020-04-22T13:08:43.887
T:3

What is the possible reason for the above behaviour?

1 Answer


In Spark, the elements within a particular partition are unordered; however, the partitions themselves are ordered.
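The observed output is consistent with that: union concatenates the partition lists of its inputs in order, and saveAsTextFile writes partitions in order, so headerRdd's single partition lands first and trailerRdd's lands last. A rough plain-Scala model of this, representing each RDD as an ordered list of partitions (a sketch of the observable behaviour, not Spark's actual internals):

```scala
// headerRdd and trailerRdd were built with parallelize(..., 1),
// so each contributes exactly one partition.
val headerParts  = Vector(Vector("name:country:date:nextdate:1"))
val dataParts    = Vector(
  Vector("Deepak:7321c:Stack Overflow:AIR:INDIA:AIR999:N:2020-04-22T10:28:33.087"),
  Vector("Veeru:596621c:Medium:POWER:USA:LN49:Y:2020-14-22T10:38:43.287"),
  Vector("Rajeev:1612801:Udemy:LEARN:ITALY:P4399:N:2020-04-22T13:08:43.887"))
val trailerParts = Vector(Vector("T:3"))

// union appends the partition lists in order: the header partition stays
// first and the trailer partition stays last.
val unionParts = headerParts ++ dataParts ++ trailerParts

// saveAsTextFile writes partitions in this order (part-00000, part-00001, ...),
// so concatenating the files reads back in partition order.
val written = unionParts.flatten
```

This is why the keyed version looks sorted even without sortByKey: the order within each partition may be arbitrary, but the partition-to-partition order of a union is deterministic.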


1 Comment

It's not ordering when I remove the map from the RDDs:

val headerRdd = sparkContext.parallelize(Array(header), 1)
val trailerRdd = sparkContext.parallelize(Array("T" + ":" + rowCount), 1)
val myRdd = myDF.rdd.map(row => row.toSeq.toList.mkString(":"))
val unionRdd = headerRdd.union(myRdd).union(trailerRdd)
unionRdd.saveAsTextFile("pathLocation")
