I'm performing Union operation on 3 RDD's, I'm aware Union doesn't preserve ordering but my in my case it is quite weird. Can someone explain me what's wrong in my code??
I've a (myDF)dataframe of rows and converted to RDD :-
myRdd = myDF.rdd.map(row => row.toSeq.toList.mkString(":")).map(rec => (2, rec)) myRdd.collect /* Deepak:7321c:Stack Overflow:AIR:INDIA:AIR999:N:2020-04-22T10:28:33.087 Veeru:596621c:Medium:POWER:USA:LN49:Y:2020-14-22T10:38:43.287 Rajeev:1612801:Udemy:LEARN:ITALY:P4399:N:2020-04-22T13:08:43.887 */ val rowCount = myRdd.count() // Count of Records in myRdd val header = "name:country:date:nextdate:1" // random header // Generating Header Rdd headerRdd = sparkContext.parallelize(Array(header), 1).map(rec => (1, rec)) //Generating Trailer Rdd val trailerRdd = sparkContext.parallelize(Array("T" + ":" + rowCount),1).map(rec => (3, rec)) //Performing Union val unionRdd = headerRdd.union(myRdd).union(trailerdd).map(rec => rec._2) unionRdd.saveAsTextFile("pathLocation") As Union doesn't preserve ordering it should not give below result
Output
name:country:date:nextdate:1 Deepak:7321c:Stack Overflow:AIR:INDIA:AIR999:N:2020-04-22T10:28:33.087 Veeru:596621c:Medium:POWER:USA:LN49:Y:2020-14-22T10:38:43.287 Rajeev:1612801:Udemy:LEARN:ITALY:P4399:N:2020-04-22T13:08:43.887 T:3 Without using any sorting, How's that possible to get above output??
sortByKey("true", 1) But When I Remove map from headerRdd, myRdd & TrailerRdd the oder is like
Deepak:7321c:Stack Overflow:AIR:INDIA:AIR999:N:2020-04-22T10:28:33.087 name:country:date:nextdate:1 Veeru:596621c:Medium:POWER:USA:LN49:Y:2020-14-22T10:38:43.287 Rajeev:1612801:Udemy:LEARN:ITALY:P4399:N:2020-04-22T13:08:43.887 T:3 What is the possible reason for above behaviour??