How to order a Spark RDD based on two columns

Question

I have the RDD data set below:

ABC [G4, G3, G1] 3
FFF [G5, G4, G3] 3
CDE [G5,G4,G3,G2] 4
XYZ [G4, G3] 2

I need to sort by the last column in descending order first; if the last column is the same, order by the first tuple item in descending order. The expected result is:

CDE [G5,G4,G3,G2] 4
FFF [G5, G4, G3] 3
ABC [G4, G3, G1] 3
XYZ [G4, G3] 2

thanks in advance.

mtoto · Accepted Answer · 2017-01-17 10:08:16Z

You can use sortBy:

rdd.sortBy(r => (r._3, r._2(0)), false)

In the above, r._3 stands for the last column, r._2(0) for the first element of the second column (which is an array), and false specifies that the order should be descending. Bear in mind though that sorting is an expensive operation due to shuffling.
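To see what the tuple key does without starting a cluster, here is a minimal sketch on a local Scala collection. It assumes the rows already have the shape (String, List[String], Int); the names rows, key and sorted are only for illustration. RDD.sortBy relies on an implicit Ordering for the key type, so reversing the tuple ordering locally should mirror passing false on the RDD:

```scala
// Local-collection sketch (no Spark needed) of the ordering produced by the sortBy key.
val rows = Seq(
  ("ABC", List("G4", "G3", "G1"), 3),
  ("FFF", List("G5", "G4", "G3"), 3),
  ("CDE", List("G5", "G4", "G3", "G2"), 4),
  ("XYZ", List("G4", "G3"), 2)
)

// Same key as in the answer: (last column, first element of the list).
val key = (r: (String, List[String], Int)) => (r._3, r._2(0))

// Tuples compare component by component; reversing that ordering
// makes both components descending, like `false` in rdd.sortBy(key, false).
val sorted = rows.sortBy(key)(Ordering[(Int, String)].reverse)

sorted.foreach(println)
```

Note that the descending flag applies to the whole key, so both columns are reversed together.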

Update

Here's a reproducible example if we assume you start with a pair rdd:

// Generate data
val rdd = sc.parallelize(Seq(
  ("ABC","G4"), ("ABC","G3"), ("ABC","G1"),
  ("FFF","G5"), ("FFF","G4"), ("FFF","G3"),
  ("CDE","G5"), ("CDE","G4"), ("CDE","G3"), ("CDE","G2"),
  ("XYZ","G4"), ("XYZ","G3")))

// Put values in a list and calculate its size
val rdd_new = rdd.groupByKey.mapValues(_.toList).map(x => (x._1, x._2, x._2.size))

// Now this works
rdd_new.sortBy(r => (r._3, r._2(0)), false).collect()
// Array[(String, List[String], Int)] = Array((CDE,List(G5, G4, G3, G2),4), (FFF,List(G5, G4, G3),3), (ABC,List(G4, G3, G1),3), (XYZ,List(G4, G3),2))

Mtoto, I tried it, but the result is not exactly as expected: (CDE ,[ G5, G4, G3, G2],4) (ABC ,[ G4, G3, G1],3) (FFF ,[ G5, G4, G3],3) (XYZ ,[ G4, G2],2). It orders by the last column descending correctly, but not by the first item in the array.
Hi Phoenix/Mtoto, thanks for your help. I am really new to Spark, and I don't think I explained it properly: this is a result from another process. I opened another thread for the question. Would you please help me through the link below: stackoverflow.com/questions/41681804/…. Thanks for your help.
What you need to do is share a reproducible example of your dataset; the new question you linked is essentially the same as this one. The problem is probably that your second column is one long string. You'll need to convert it to an array first, then the above should work.
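As a rough illustration of the conversion suggested in the last comment, the sketch below assumes the second column arrives as a bracketed, comma-separated string such as "[G4, G3, G1]". That format is a guess based on the question, and parseList, rawRows, parsedRows and sortedParsed are hypothetical names. It uses a local collection; on an RDD the same map followed by sortBy(r => (r._3, r._2(0)), false) would be the analogous step:

```scala
// Hypothetical helper: turn a bracketed string such as "[G4, G3, G1]" into an array.
// Assumes items are comma-separated and contain no commas or brackets themselves.
def parseList(s: String): Array[String] =
  s.trim.stripPrefix("[").stripSuffix("]").split(",").map(_.trim).filter(_.nonEmpty)

// Assumed input shape: (id, list stored as one string, count).
val rawRows = Seq(
  ("ABC", "[G4, G3, G1]", 3),
  ("FFF", "[G5, G4, G3]", 3),
  ("CDE", "[G5,G4,G3,G2]", 4),
  ("XYZ", "[G4, G3]", 2)
)

// Convert the string column to an array before building the sort key.
val parsedRows = rawRows.map { case (id, items, n) => (id, parseList(items), n) }

// Sort by (count, first array element), both descending.
// Note: r._2(0) would fail on an empty array; this data has none.
val sortedParsed = parsedRows.sortBy(r => (r._3, r._2(0)))(Ordering[(Int, String)].reverse)

sortedParsed.foreach(r => println((r._1, r._2.mkString("[", ", ", "]"), r._3)))
```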

arghtype · Accepted Answer · 2018-10-10 17:36:46Z

I am not sure why the above answer is not working. It looks fine to me. Just try with this code.

Here is my input:

i1,array1,10
i5,array2,50
i4,array3,20
i2,array4,20

Code:

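val idRDD = sc.textFile(inputPath)
val idSorted = idRDD.map { rec =>
  val fields = rec.split(",")
  // Key: (third column as Int so it sorts numerically, first column); value: second column
  ((fields(2).toInt, fields(0)), fields(1))
}.sortByKey(false).map(rec => (rec._1._1, rec._2, rec._1._2))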

Here is the output:

(50,array2,i5)
(20,array3,i4)
(20,array4,i2)
(10,array1,i1)
