
In Cassandra I have a column of list type. I am new to Spark and Scala and have no idea where to start. In Spark I want to get the count of each value across all the lists; is it possible to do so? Below is the DataFrame:

+--------------------+------------+
|                  id|        data|
+--------------------+------------+
|53e5c3b0-8c83-11e...|      [b, c]|
|508c1160-8c83-11e...|      [a, b]|
|4d16c0c0-8c83-11e...|   [a, b, c]|
|5774dde0-8c83-11e...|[a, b, c, d]|
+--------------------+------------+

I want the output to be:

+-----+-----+
|value|count|
+-----+-----+
|    a|    3|
|    b|    4|
|    c|    3|
|    d|    1|
+-----+-----+

Spark version: 1.4


2 Answers


Here you go:

scala> val rdd = sc.parallelize(Seq(
         ("53e5c3b0-8c83-11e", Array("b", "c")),
         ("53e5c3b0-8c83-11e1", Array("a", "b")),
         ("53e5c3b0-8c83-11e2", Array("a", "b", "c")),
         ("53e5c3b0-8c83-11e3", Array("a", "b", "c", "d"))))
// rdd: org.apache.spark.rdd.RDD[(String, Array[String])] = ParallelCollectionRDD[22] at parallelize at <console>:27

scala> rdd.flatMap(_._2).map((_, 1)).reduceByKey(_ + _)
// res11: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[21] at reduceByKey at <console>:30

scala> rdd.flatMap(_._2).map((_, 1)).reduceByKey(_ + _).collect
// res16: Array[(String, Int)] = Array((a,3), (b,4), (c,3), (d,1))

This is also quite easy with the DataFrame API:

scala> import sqlContext.implicits._              // for toDF and the $"..." column syntax
scala> import org.apache.spark.sql.functions.explode

scala> val df = rdd.toDF("id", "data")
// df: org.apache.spark.sql.DataFrame = [id: string, data: array<string>]

scala> df.select(explode($"data").as("value")).groupBy("value").count.show
// +-----+-----+
// |value|count|
// +-----+-----+
// |    d|    1|
// |    c|    3|
// |    b|    4|
// |    a|    3|
// +-----+-----+

1 Comment

Can you provide pyspark implementation of the solution?
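For reference, a rough PySpark equivalent of the DataFrame approach above might look like the sketch below. It is illustrative only: the ids are made up, the SparkContext/SQLContext setup assumes you are not already in the pyspark shell (where sc and sqlContext exist), and it relies on explode being available in pyspark.sql.functions, which it is from Spark 1.4 onward.

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import explode

sc = SparkContext(appName="value-counts")   # in the pyspark shell, sc already exists
sqlContext = SQLContext(sc)

# Sample rows mirroring the DataFrame in the question
df = sqlContext.createDataFrame(
    [("53e5c3b0-8c83-11e", ["b", "c"]),
     ("508c1160-8c83-11e", ["a", "b"]),
     ("4d16c0c0-8c83-11e", ["a", "b", "c"]),
     ("5774dde0-8c83-11e", ["a", "b", "c", "d"])],
    ["id", "data"])

# Explode the array column into one row per element, then count occurrences of each value
df.select(explode(df["data"]).alias("value")).groupBy("value").count().show()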

You need something like this (from Apache Spark Examples):

val textFile = sc.textFile("hdfs://...")
val counts = textFile
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

Assuming you already have (value, 1) pairs, .reduceByKey(_ + _) will return what you need.

You can also try something like this in the spark shell:

sc.parallelize(Array[Integer](1, 1, 1, 2, 2), 3).map(x => (x, 1)).reduceByKey(_ + _).foreach(println)

