
I've been struggling with this for a few days.

I have a nested JSON file with a complex schema (arrays inside structs, structs inside arrays) and I need to load the data into a DataFrame.

This is what I have as input (as an example):

+-----+----------------+-----------------------------------+---------+
| id  | name           | detail                            | item    |
+-----+----------------+-----------------------------------+---------+
| 100 | Peter Castle   | [[D100A, Credit],[D100B, Debit]]  | [10,31] |
| 101 | Quino Yukimori | [[D101A, Credit],[D101B, Credit]] | [55,49] |
+-----+----------------+-----------------------------------+---------+

And I need to read it like this:

+-----+----------------+-----------+--------+----------+
| id  | name           | detail_id | type   | item_qty |
+-----+----------------+-----------+--------+----------+
| 100 | Peter Castle   | D100A     | Credit | 10       |
| 100 | Peter Castle   | D100B     | Debit  | 31       |
| 101 | Quino Yukimori | D101A     | Credit | 55       |
| 101 | Quino Yukimori | D101B     | Credit | 49       |
+-----+----------------+-----------+--------+----------+

But what I get is this:

df.withColumn('detail', explode('detail')).withColumn('item', explode('item'))

+-----+----------------+-----------+--------+----------+
| id  | name           | detail_id | type   | item_qty |
+-----+----------------+-----------+--------+----------+
| 100 | Peter Castle   | D100A     | Credit | 10       |
| 100 | Peter Castle   | D100A     | Debit  | 10       |
| 100 | Peter Castle   | D100B     | Credit | 31       |
| 100 | Peter Castle   | D100B     | Debit  | 31       |
| 101 | Quino Yukimori | D101A     | Credit | 55       |
| 101 | Quino Yukimori | D101A     | Credit | 55       |
| 101 | Quino Yukimori | D101B     | Credit | 49       |
| 101 | Quino Yukimori | D101B     | Credit | 49       |
+-----+----------------+-----------+--------+----------+

I have tried combining the columns with arrays_zip and then exploding, but the problem is that there are arrays inside arrays: if I explode the detail array column and then explode the item array column, the rows get multiplied.

Any idea how I can implement that?

Sorry about my English; it's not my native language.

UPDATED

Here is my schema; the multiple nested arrays are what make it complicated to read:

 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- detail: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- detail_id: string (nullable = true)
 |    |    |-- type: string (nullable = true)
 |-- item: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- item_qty: long (nullable = true)
 |-- deliveryTrack: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- date: string (nullable = true)
 |    |    |-- track: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- time: string (nullable = true)
 |    |    |    |    |-- driver: string (nullable = true)
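
In case it helps to reproduce the problem, a minimal DataFrame with roughly this structure can be built like this (the deliveryTrack values are invented and only the shape matters; the rest mirrors the example table above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rows mirroring the example table; the deliveryTrack values are made up,
# only the nesting matters here.
data = [
    ("100", "Peter Castle",
     [("D100A", "Credit"), ("D100B", "Debit")],
     [(10,), (31,)],
     [("2021-07-20", [("10:00", "John")])]),
    ("101", "Quino Yukimori",
     [("D101A", "Credit"), ("D101B", "Credit")],
     [(55,), (49,)],
     [("2021-07-21", [("11:30", "Ann")])]),
]

schema = (
    "id string, name string, "
    "detail array<struct<detail_id:string, type:string>>, "
    "item array<struct<item_qty:long>>, "
    "deliveryTrack array<struct<date:string, "
    "track:array<struct<time:string, driver:string>>>>"
)

df = spark.createDataFrame(data, schema)
df.printSchema()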
  • Also post the code that gave you that output.

1 Answer


Zip both arrays with arrays_zip and use explode only once. After that, use the expr function to pull out the individual fields.

from pyspark.sql.functions import explode, arrays_zip, col, expr

df1 = (df
       .withColumn('buffer', explode(arrays_zip(col('detail'), col('item'))))
       .withColumn('detail_id', expr("buffer.detail.detail_id"))
       .withColumn('type', expr("buffer.detail.type"))
       .withColumn('item_qty', expr("buffer.item.item_qty"))
       .drop(*['detail', 'item', 'buffer'])
      )
df1.show()

+---+--------------+---------+------+--------+
|id |name          |detail_id|type  |item_qty|
+---+--------------+---------+------+--------+
|100|Peter Castle  |D100A    |Credit|10      |
|100|Peter Castle  |D100B    |Debit |31      |
|101|Quino Yukimori|D101A    |Credit|55      |
|101|Quino Yukimori|D101B    |Credit|49      |
+---+--------------+---------+------+--------+

8 Comments

Hello, and thanks for helping me. I've tried your solution but get this error: AnalysisException: cannot resolve 'element_at(buffer.detail, 1)' due to data type mismatch: The first argument to function element_at should have been array or map type, but its struct...
I've updated my question, adding the schema for better understanding.
I tried to replicate your schema and have edited my answer.
Hey, it works! I'm so thankful. But what can I do if I need to get the time and driver columns, which are inside the deliveryTrack array? Could you help me with that? The more arrays I have to read, the more the same problem occurs. In any case, you answered my original question, which is why I marked it as answered.
You can include the deliveryTrack column inside arrays_zip, then access the data later using a nested column name like "buffer.deliveryTrack.track.driver".
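
For example, a rough, untested sketch of that idea (the df2 name and the second explode of track are additions beyond the comment; accessing buffer.deliveryTrack.track.driver directly would return an array of drivers per row, while exploding track gives one row per time/driver entry):

from pyspark.sql.functions import explode, arrays_zip, col, expr

# Zip all three top-level arrays so a single explode keeps them aligned
df2 = (df
       .withColumn('buffer', explode(arrays_zip(col('detail'), col('item'), col('deliveryTrack'))))
       .withColumn('detail_id', expr("buffer.detail.detail_id"))
       .withColumn('type', expr("buffer.detail.type"))
       .withColumn('item_qty', expr("buffer.item.item_qty"))
       .withColumn('date', expr("buffer.deliveryTrack.date"))
       # track is still an array of structs, so explode it as well
       # to get one row per (time, driver) entry
       .withColumn('track', explode(expr("buffer.deliveryTrack.track")))
       .withColumn('time', col('track.time'))
       .withColumn('driver', col('track.driver'))
       .drop('detail', 'item', 'deliveryTrack', 'buffer', 'track')
      )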
