
I have the following json file that I am reading in Spark:

{"id" : "123", "category" : "A"}
{"id" : "456", "category" : "B"}
{"id" : "567", "category" : "C"}
{"id" : "678", "category" : "D"}
{"id" : "789", "category" : "E"}
{"id" : "123", "category" : "B"}
{"id" : "123", "category" : "C"}

I want to pivot it to generate an m x n matrix of categories. Below is my code:

val df = spark.read.json("test.json")
val pivots = df.groupBy("category").pivot("category").count()
pivots.show()

This generates the following output:

+--------+----+----+----+----+----+
|category|   A|   B|   C|   D|   E|
+--------+----+----+----+----+----+
|       E|null|null|null|null|   1|
|       B|null|   2|null|null|null|
|       D|null|null|null|   1|null|
|       C|null|null|   2|null|null|
|       A|   1|null|null|null|null|
+--------+----+----+----+----+----+

What I really want to do is pivot by id and show the counts in this matrix. For example, id '123' is mapped to categories A, B and C, so I want the above matrix to show a count of 1 for A-A, A-B and A-C (and vice versa); currently, it only shows 1 for A-A.

I changed the code to group by 'id' but that results in id values in first column whereas I want categories.

Is there any way I can do this (maybe using a completely different approach)?

1 Answer
First, rename the columns and apply a self-join:

val leftRight = df
  .withColumnRenamed("category", "left")
  .join(df.withColumnRenamed("category", "right"), Seq("id"))

to get the co-occurrences for each id. Next, apply crosstab:

leftRight.stat.crosstab("left", "right") 

to aggregate data across all ids. The result is:

+----------+---+---+---+---+---+
|left_right|  A|  B|  C|  D|  E|
+----------+---+---+---+---+---+
|         E|  0|  0|  0|  0|  1|
|         A|  1|  1|  1|  0|  0|
|         B|  1|  2|  1|  0|  0|
|         C|  1|  1|  2|  0|  0|
|         D|  0|  0|  0|  1|  0|
+----------+---+---+---+---+---+
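To see why this matrix comes out the way it does, here is a minimal plain-Scala sketch (collections only, no Spark) of the same self-join-then-count logic, using the sample data from the question:

```scala
object CooccurrenceSketch extends App {
  // (id, category) rows from the question's JSON.
  val rows = Seq(
    ("123", "A"), ("456", "B"), ("567", "C"),
    ("678", "D"), ("789", "E"), ("123", "B"), ("123", "C")
  )

  // Self-join on id: every pair of categories sharing an id co-occurs.
  val pairs = for {
    (idL, left)  <- rows
    (idR, right) <- rows
    if idL == idR
  } yield (left, right)

  // Crosstab equivalent: count each (left, right) pair.
  val counts: Map[(String, String), Int] =
    pairs.groupBy(identity).map { case (k, v) => k -> v.size }

  // B-B is 2 because B occurs under both id 123 and id 456;
  // A-B is 1 because A and B co-occur only under id 123.
  assert(counts(("B", "B")) == 2)
  assert(counts(("A", "B")) == 1)
  assert(!counts.contains(("A", "D")))
}
```

Note that each id contributes every ordered pair of its categories (id 123 alone yields 3 x 3 = 9 pairs, including the diagonal), which is exactly why the crosstab is symmetric and the diagonal counts how many ids carry each category.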