I have the following JSON file that I am reading in Spark:
```json
{"id" : "123", "category" : "A"}
{"id" : "456", "category" : "B"}
{"id" : "567", "category" : "C"}
{"id" : "678", "category" : "D"}
{"id" : "789", "category" : "E"}
{"id" : "123", "category" : "B"}
{"id" : "123", "category" : "C"}
```

I want to pivot it to generate an m x n matrix of categories. Below is my code:
```scala
val df = spark.read.json("test.json")
val pivots = df.groupBy("category").pivot("category").count()
pivots.show()
```

This generates the following output:
```
+--------+----+----+----+----+----+
|category|   A|   B|   C|   D|   E|
+--------+----+----+----+----+----+
|       E|null|null|null|null|   1|
|       B|null|   2|null|null|null|
|       D|null|null|null|   1|null|
|       C|null|null|   2|null|null|
|       A|   1|null|null|null|null|
+--------+----+----+----+----+----+
```

What I really want is to pivot by id and show the counts in this matrix. E.g. id '123' is mapped to categories A, B and C, so I want the matrix above to show a 1 for A-A, A-B and A-C (and vice versa); currently it only shows a 1 for A-A.
I changed the code to group by 'id', but that puts the id values in the first column, whereas I want categories there.
Is there any way I can do this (maybe using a completely different approach)?
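One direction I have been considering is a self-join on id, so that every pair of categories sharing an id becomes a row, and then pivoting the second category. This is an untested sketch; it assumes a local `SparkSession`, and inlines the same rows as `test.json` so it is self-contained:

```scala
import org.apache.spark.sql.SparkSession

object CategoryMatrix {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("category-cooccurrence")
      .getOrCreate()
    import spark.implicits._

    // Same rows as test.json, inlined for reproducibility.
    val df = Seq(
      ("123", "A"), ("456", "B"), ("567", "C"), ("678", "D"),
      ("789", "E"), ("123", "B"), ("123", "C")
    ).toDF("id", "category")

    // Self-join on id: each id contributes one row per ordered pair
    // of its categories, e.g. id 123 yields (A,A), (A,B), (A,C),
    // (B,A), (B,B), (B,C), (C,A), (C,B), (C,C).
    val pairs = df.as("l").join(df.as("r"), "id")
      .select($"l.category".as("category"), $"r.category".as("other"))

    // Pivot the second leg of the join instead of the grouping column.
    val matrix = pairs.groupBy("category").pivot("other").count()
      .orderBy("category")
    matrix.show()

    spark.stop()
  }
}
```

Because the join produces both orderings of each pair, the count appears at A-B and B-A alike, which is the symmetric matrix described above (if my reasoning is right, row A would show 1 under A, B and C, and row B would show 2 under B from ids 123 and 456).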