
I have a dataframe like this:

root
 |-- runKeyId: string (nullable = true)
 |-- entities: string (nullable = true)

+--------+--------------------------------------------------------------------------------------------+
|runKeyId|entities                                                                                    |
+--------+--------------------------------------------------------------------------------------------+
|1       |{"Partition":[{"Name":"ABC"},{"Name":"DBC"}],"id":339},{"Partition":{"Name":"DDD"},"id":339}|
+--------+--------------------------------------------------------------------------------------------+

and I would like to explode it into this with Scala:

+--------+--------------------------------------------------------------------------------------------+
|runKeyId|entities                                                                                    |
+--------+--------------------------------------------------------------------------------------------+
|1       |{"Partition":[{"Name":"ABC"},{"Name":"DBC"}],"id":339}                                      |
|2       |{"Partition":{"Name":"DDD"},"id":339}                                                       |
+--------+--------------------------------------------------------------------------------------------+
  • How did you read the file? It looks like JSONL format; then you can simply use spark.read.json("json_path"), which automatically separates the JSON into rows (see the sketch after these comments). Commented Aug 17, 2020 at 7:00
  • Here the input I am getting is a string, not JSON. Commented Aug 17, 2020 at 7:04
  • How are you reading the data of the input JSONs? Commented Aug 17, 2020 at 7:05
  • val parseDF = decompressDataDF.select($"_1.entities") Commented Aug 17, 2020 at 7:13
  • I have provided an answer for a similar question here. Please have a look: stackoverflow.com/a/63375812/4758823 Commented Aug 17, 2020 at 8:42
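For reference, the JSONL approach suggested in the first comment would look roughly like this (a minimal sketch; the file path is hypothetical and assumes one JSON record per line):

import org.apache.spark.sql.SparkSession

// Assumes an active SparkSession. spark.read.json treats the input as
// JSON Lines by default, so each line becomes its own row with an
// inferred schema.
val spark = SparkSession.builder().getOrCreate()
val jsonDF = spark.read.json("/path/to/entities.jsonl") // hypothetical path
jsonDF.show(false)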

1 Answer


It looks like you don't have valid JSON, so fix the JSON first; then you can parse it as JSON and explode it as shown below.

import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  ("1", "{\"Partition\":[{\"Name\":\"ABC\"},{\"Name\":\"DBC\"}],\"id\":339},{\"Partition\":{\"Name\":\"DDD\"},\"id\":339}")
).toDF("runKeyId", "entities")
  .withColumn("entities", concat(lit("["), $"entities", lit("]"))) // fix the JSON by wrapping the records in [] to form a valid array

val resultDF = df.withColumn("entities",
    explode(from_json($"entities", schema_of_json(df.select($"entities").first().getString(0))))
  )
  .withColumn("entities", to_json($"entities")) // render each exploded element back to a JSON string

resultDF.show(false)

Output:

+--------+----------------------------------------------------------------+
|runKeyId|entities                                                        |
+--------+----------------------------------------------------------------+
|1       |{"Partition":"[{\"Name\":\"ABC\"},{\"Name\":\"DBC\"}]","id":339}|
|1       |{"Partition":"{\"Name\":\"DDD\"}","id":339}                     |
+--------+----------------------------------------------------------------+
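To see why Partition comes back as an escaped string, it helps to inspect what schema_of_json infers from the sample row (a small diagnostic sketch building on the df above; the exact DDL formatting varies by Spark version):

val sample = df.select($"entities").first().getString(0)
// Partition is an array in one element and an object in the other, so
// inference falls back to string for that field, giving something like
// ARRAY<STRUCT<Partition: STRING, id: BIGINT>>
df.select(schema_of_json(sample).as("inferred")).show(false)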

5 Comments

Is it possible to get the result as JSON instead of a string, because it is affecting further logic? The desired schema:
root
 |-- Id: string (nullable = true)
 |-- entities: json (nullable = true)
What do you mean by result as JSON? There is no such JSON type. Can you share how your output should look, or the output schema?
It adds extra quotes around "[{\"Name\":\"ABC\"},{\"Name\":\"DBC\"}]", making it a string.
@shreypavagadhi This is because you have invalid JSON again in the Partition field as well: the first Partition is an array type and the second is an object. (A sketch of keeping entities as a struct follows below.)
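Following up on the comment thread: Spark has no json column type, but you can keep entities as a struct instead of a string by supplying an explicit schema and skipping the final to_json. This is a sketch building on the df above, under the assumption that Partition is normalized upstream so it is always an array of {Name} objects; fields that don't match the schema come back as null:

import org.apache.spark.sql.types._

// Hypothetical explicit schema; only valid once every record uses the
// array form of Partition.
val entitySchema = ArrayType(StructType(Seq(
  StructField("Partition", ArrayType(StructType(Seq(StructField("Name", StringType))))),
  StructField("id", LongType)
)))

val structDF = df.withColumn("entities", explode(from_json($"entities", entitySchema)))
// entities is now a struct, not a string, so downstream logic can address
// $"entities.Partition" and $"entities.id" directly.
structDF.printSchema()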
