
I have a dataframe like this:

root
 |-- runKeyId: string (nullable = true)
 |-- entities: string (nullable = true)

+--------+--------------------------------------------------------------------------------------------+
|runKeyId|entities                                                                                    |
+--------+--------------------------------------------------------------------------------------------+
|1       |{"Partition":[{"Name":"ABC"},{"Name":"DBC"}],"id":339},{"Partition":{"Name":"DDD"},"id":339}|
+--------+--------------------------------------------------------------------------------------------+

and I would like to explode it into this with Scala:

+--------+--------------------------------------------------------------------------------------------+
|runKeyId|entities                                                                                    |
+--------+--------------------------------------------------------------------------------------------+
|1       |{"Partition":[{"Name":"ABC"},{"Name":"DBC"}],"id":339}                                      |
|2       |{"Partition":{"Name":"DDD"},"id":339}                                                       |
+--------+--------------------------------------------------------------------------------------------+
  • How did you read the file? It looks like JSONL format; then you can simply use spark.read.json("json_path"), which automatically separates the JSON into rows (see the sketch after these comments). Commented Aug 17, 2020 at 7:00
  • Here the input I am getting is a string, not JSON. Commented Aug 17, 2020 at 7:04
  • How are you reading the data of the input JSONs? Commented Aug 17, 2020 at 7:05
  • val parseDF = decompressDataDF.select($"_1.entities") Commented Aug 17, 2020 at 7:13
  • I have provided an answer for a similar question here. Please have a look: stackoverflow.com/a/63375812/4758823 Commented Aug 17, 2020 at 8:42
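For reference, the JSONL approach suggested in the first comment would look roughly like this (a minimal sketch; the file path is hypothetical and assumes one JSON record per line):

import org.apache.spark.sql.SparkSession

// Assumes an active SparkSession. spark.read.json treats the input as
// JSON Lines by default, so each line becomes its own row with an
// inferred schema.
val spark = SparkSession.builder().getOrCreate()
val jsonDF = spark.read.json("/path/to/entities.jsonl") // hypothetical path
jsonDF.show(false)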

1 Answer


It looks like you don't have valid JSON, so fix the JSON first; then you can parse it as JSON and explode it as shown below.

import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  ("1", "{\"Partition\":[{\"Name\":\"ABC\"},{\"Name\":\"DBC\"}],\"id\":339},{\"Partition\":{\"Name\":\"DDD\"},\"id\":339}")
).toDF("runKeyId", "entities")
  .withColumn("entities", concat(lit("["), $"entities", lit("]"))) // fix the JSON by wrapping the records in [] to form a valid array

val resultDF = df.withColumn("entities",
    explode(from_json($"entities", schema_of_json(df.select($"entities").first().getString(0))))
  )
  .withColumn("entities", to_json($"entities")) // render each exploded element back to a JSON string

resultDF.show(false)

Output:

+--------+----------------------------------------------------------------+
|runKeyId|entities                                                        |
+--------+----------------------------------------------------------------+
|1       |{"Partition":"[{\"Name\":\"ABC\"},{\"Name\":\"DBC\"}]","id":339}|
|1       |{"Partition":"{\"Name\":\"DDD\"}","id":339}                     |
+--------+----------------------------------------------------------------+
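To see why Partition comes back as an escaped string, it helps to inspect what schema_of_json infers from the sample row (a small diagnostic sketch building on the df above; the exact DDL formatting varies by Spark version):

val sample = df.select($"entities").first().getString(0)
// Partition is an array in one element and an object in the other, so
// inference falls back to string for that field, giving something like
// ARRAY<STRUCT<Partition: STRING, id: BIGINT>>
df.select(schema_of_json(sample).as("inferred")).show(false)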

5 Comments

Is it possible to get the result as JSON instead of a string, because it is affecting further logic? The desired schema:
root
 |-- Id: string (nullable = true)
 |-- entities: json (nullable = true)
What do you mean by result as JSON? There is no such JSON type. Can you share how your output should look, or the output schema?
It adds extra quotes around "[{\"Name\":\"ABC\"},{\"Name\":\"DBC\"}]", making it a string.
@shreypavagadhi This is because you have invalid JSON again in the Partition field as well: the first Partition is an array type and the second is an object. (A sketch of keeping entities as a struct follows below.)
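Following up on the comment thread: Spark has no json column type, but you can keep entities as a struct instead of a string by supplying an explicit schema and skipping the final to_json. This is a sketch building on the df above, under the assumption that Partition is normalized upstream so it is always an array of {Name} objects; fields that don't match the schema come back as null:

import org.apache.spark.sql.types._

// Hypothetical explicit schema; only valid once every record uses the
// array form of Partition.
val entitySchema = ArrayType(StructType(Seq(
  StructField("Partition", ArrayType(StructType(Seq(StructField("Name", StringType))))),
  StructField("id", LongType)
)))

val structDF = df.withColumn("entities", explode(from_json($"entities", entitySchema)))
// entities is now a struct, not a string, so downstream logic can address
// $"entities.Partition" and $"entities.id" directly.
structDF.printSchema()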
