
I have the RDD data set below:

    ABC [G4, G3, G1] 3
    FFF [G5, G4, G3] 3
    CDE [G5,G4,G3,G2] 4
    XYZ [G4, G3] 2

I need to sort by the last column in descending order first; if the last column is the same, order by the first tuple item in descending order. The expected result is:

    CDE [G5,G4,G3,G2] 4
    FFF [G5, G4, G3] 3
    ABC [G4, G3, G1] 3
    XYZ [G4, G3] 2

Thanks in advance.

2 Answers


You can use sortBy:

rdd.sortBy(r => (r._3, r._2(0)), false) 

In the above, r._3 stands for the last column, r._2(0) for the first element of the second column (which is an array), and false specifies that the order should be descending. Bear in mind though that sorting is an expensive operation due to shuffling.
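To see why a tuple key gives "last column first, then first array element", here is a minimal sketch on a plain local Seq rather than an RDD (an assumption made so it runs in any Scala REPL without Spark). Scala's implicit Ordering for tuples compares element by element, and reversing it makes both parts descending, which is what the false flag does in sortBy on an RDD. The name rows is illustrative; the data is copied from the question.

```scala
// Local stand-in for the RDD rows: (name, groups, count)
val rows = Seq(
  ("ABC", List("G4", "G3", "G1"), 3),
  ("FFF", List("G5", "G4", "G3"), 3),
  ("CDE", List("G5", "G4", "G3", "G2"), 4),
  ("XYZ", List("G4", "G3"), 2)
)

// Tuple keys compare left to right; .reverse makes both parts descending
val sortedRows =
  rows.sortBy(r => (r._3, r._2.head))(Ordering[(Int, String)].reverse)

sortedRows.foreach(println)
```

FFF comes before ABC because both have count 3 and "G5" compares greater than "G4".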

Update

Here's a reproducible example if we assume you start with a pair rdd:

    // Generate data
    val rdd = sc.parallelize(Seq(
      ("ABC","G4"), ("ABC","G3"), ("ABC","G1"),
      ("FFF","G5"), ("FFF","G4"), ("FFF","G3"),
      ("CDE","G5"), ("CDE","G4"), ("CDE","G3"), ("CDE","G2"),
      ("XYZ","G4"), ("XYZ","G3")))

    // Put values in a list and calculate its size
    val rdd_new = rdd.groupByKey.mapValues(_.toList).map(x => (x._1, x._2, x._2.size))

    // Now this works
    rdd_new.sortBy(r => (r._3, r._2(0)), false).collect()
    // Array[(String, List[String], Int)] = Array((CDE,List(G5, G4, G3, G2),4), (FFF,List(G5, G4, G3),3), (ABC,List(G4, G3, G1),3), (XYZ,List(G4, G3),2))

4 Comments

Mtoto, I tried it, but the result is not exactly as expected: (CDE ,[ G5, G4, G3, G2],4) (ABC ,[ G4, G3, G1],3) (FFF ,[ G5, G4, G3],3) (XYZ ,[ G4, G2],2). It orders by the last column descending correctly, but not by the first item in the array.
Hi Phoenix/Mtoto, thanks for your help. I am really new to Spark, and I don't think I explained it properly: this is a result from another process. I opened another thread for the question. Would you please help me through the link below: stackoverflow.com/questions/41681804/…. Thanks for your help.
What you need to do is share a reproducible example of your dataset; the new question you linked is essentially the same as this one. The problem is probably that your second column is a long string. You'll need to convert it to an array first, then the above should work.
rdd.sortBy(r => (r._3, r._1), false) - try this
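As a rough sketch of the string-to-array conversion suggested in the comments above: if the second column really is a String such as "[G4, G3, G1]", then r._2(0) is the character '[' for every row, so ties in the last column are not broken. The helper parseGroups and the sample raw rows below are hypothetical, written against a local Seq so they run without Spark; the same map step would apply to an RDD before calling sortBy.

```scala
// Hypothetical helper: turn "[G4, G3, G1]" into List("G4", "G3", "G1")
def parseGroups(s: String): List[String] =
  s.trim.stripPrefix("[").stripSuffix("]")
    .split(",").map(_.trim).filter(_.nonEmpty).toList

// Rows as they might look when the second column is still a String
val raw = Seq(
  ("ABC", "[G4, G3, G1]", 3),
  ("FFF", "[G5, G4, G3]", 3),
  ("CDE", "[G5,G4,G3,G2]", 4),
  ("XYZ", "[G4, G3]", 2)
)

// Convert the String column to a List, then sort on (count, first element)
val parsed = raw.map { case (name, groups, n) => (name, parseGroups(groups), n) }
val sortedParsed =
  parsed.sortBy(r => (r._3, r._2.head))(Ordering[(Int, String)].reverse)
```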

I am not sure why the above answer is not working; it looks fine to me. Try this code instead.

Here is my input:

    i1,array1,10
    i5,array2,50
    i4,array3,20
    i2,array4,20

Code:

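    val idRDD = sc.textFile(inputPath)
    val idSorted = idRDD.map { rec =>
      val fields = rec.split(",")
      // Key: (last column as Int, id); value: the array column.
      // toInt avoids lexicographic ordering of the numeric column.
      ((fields(2).trim.toInt, fields(0)), fields(1))
    }.sortByKey(false).map(rec => (rec._1._1, rec._2, rec._1._2))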

Here is the output:

    (50,array2,i5)
    (20,array3,i4)
    (20,array4,i2)
    (10,array1,i1)
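One caveat on sort keys built from split text, assuming the last column is meant to be numeric: a String key compares character by character, so "9" would rank above "50", while an Int key gives numeric order. A small local sketch (no Spark needed; the record i3,array5,9 is a made-up addition to the sample input) shows the difference:

```scala
// Two records from the sample input plus one hypothetical single-digit record
val lines = Seq("i1,array1,10", "i5,array2,50", "i3,array5,9")

// String keys compare character by character, so "9" sorts above "50"
val byString = lines.sortBy(_.split(",")(2))(Ordering[String].reverse)

// Converting the field to Int gives the numeric descending order
val byInt = lines.sortBy(_.split(",")(2).toInt)(Ordering[Int].reverse)

println(byString.mkString(" | "))
println(byInt.mkString(" | "))
```

With the sample input in this answer the two orders happen to coincide, because every value has two digits.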
