I am reading from Kafka through Spark Structured Streaming. The input Kafka message is of the below JSON format:
```json
[
  { "customer": "Jim", "sex": "male", "country": "US" },
  { "customer": "Pam", "sex": "female", "country": "US" }
]
```

I have defined the schema like below to parse it:
```scala
val schemaAsJson = ArrayType(
  StructType(Seq(
    StructField("customer", StringType, true),
    StructField("sex", StringType, true),
    StructField("country", StringType, true))),
  true)
```

My code looks like this:
```scala
df.select(from_json($"col", schemaAsJson) as "json")
  .select("json.customer", "json.sex", "json.country")
```

The current output looks like this:
```
+--------------+----------------+----------------+
|      customer|             sex|         country|
+--------------+----------------+----------------+
|    [Jim, Pam]|  [male, female]|        [US, US]|
+--------------+----------------+----------------+
```

Expected output:
```
+--------------+----------------+----------------+
|      customer|             sex|         country|
+--------------+----------------+----------------+
|           Jim|            male|              US|
|           Pam|          female|              US|
+--------------+----------------+----------------+
```

How do I split the array of structs into individual rows as above? Can someone please help?
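For what it's worth, one common way to get one row per array element is `explode` from `org.apache.spark.sql.functions`; a minimal sketch, assuming the same `df` and `schemaAsJson` as above:

```scala
import org.apache.spark.sql.functions.{explode, from_json}

// Parse the JSON string into an array of structs, then explode the
// array so that each struct becomes its own row before selecting fields.
val result = df
  .select(explode(from_json($"col", schemaAsJson)) as "json")
  .select("json.customer", "json.sex", "json.country")
```

Exploding before the final `select` is what turns the single `[Jim, Pam]` row into two separate rows.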