
I have the following json file that I am reading in Spark:

{"id" : "123", "category" : "A"}
{"id" : "456", "category" : "B"}
{"id" : "567", "category" : "C"}
{"id" : "678", "category" : "D"}
{"id" : "789", "category" : "E"}
{"id" : "123", "category" : "B"}
{"id" : "123", "category" : "C"}

I want to pivot it to generate an m x n matrix of categories. Below is my code:

val df = spark.read.json("test.json")
val pivots = df.groupBy("category").pivot("category").count()
pivots.show()

This generates the following output:

+--------+----+----+----+----+----+
|category|   A|   B|   C|   D|   E|
+--------+----+----+----+----+----+
|       E|null|null|null|null|   1|
|       B|null|   2|null|null|null|
|       D|null|null|null|   1|null|
|       C|null|null|   2|null|null|
|       A|   1|null|null|null|null|
+--------+----+----+----+----+----+

What I really want to do is pivot by id and show the counts in this matrix. For example, id '123' is mapped to categories A, B and C, so I want the above matrix to show a count of 1 for A-A, A-B and A-C (and vice versa); currently, it only shows 1 for A-A.

I changed the code to group by 'id' but that results in id values in first column whereas I want categories.

Is there any way I can do this (maybe using a completely different approach)?

1 Answer
First, rename the columns and apply a self-join:

val leftRight = df
  .withColumnRenamed("category", "left")
  .join(df.withColumnRenamed("category", "right"), Seq("id"))

to get the co-occurrences for each id. Next, apply crosstab:

leftRight.stat.crosstab("left", "right") 

to aggregate data across all ids. The result is:

+----------+---+---+---+---+---+
|left_right|  A|  B|  C|  D|  E|
+----------+---+---+---+---+---+
|         E|  0|  0|  0|  0|  1|
|         A|  1|  1|  1|  0|  0|
|         B|  1|  2|  1|  0|  0|
|         C|  1|  1|  2|  0|  0|
|         D|  0|  0|  0|  1|  0|
+----------+---+---+---+---+---+
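To see why this matrix comes out the way it does, here is a minimal plain-Scala sketch (collections only, no Spark) of the same self-join-then-count logic, using the sample data from the question:

```scala
object CooccurrenceSketch extends App {
  // (id, category) rows from the question's JSON.
  val rows = Seq(
    ("123", "A"), ("456", "B"), ("567", "C"),
    ("678", "D"), ("789", "E"), ("123", "B"), ("123", "C")
  )

  // Self-join on id: every pair of categories sharing an id co-occurs.
  val pairs = for {
    (idL, left)  <- rows
    (idR, right) <- rows
    if idL == idR
  } yield (left, right)

  // Crosstab equivalent: count each (left, right) pair.
  val counts: Map[(String, String), Int] =
    pairs.groupBy(identity).map { case (k, v) => k -> v.size }

  // B-B is 2 because B occurs under both id 123 and id 456;
  // A-B is 1 because A and B co-occur only under id 123.
  assert(counts(("B", "B")) == 2)
  assert(counts(("A", "B")) == 1)
  assert(!counts.contains(("A", "D")))
}
```

Note that each id contributes every ordered pair of its categories (id 123 alone yields 3 x 3 = 9 pairs, including the diagonal), which is exactly why the crosstab is symmetric and the diagonal counts how many ids carry each category.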