I'm having troubles for some days trying to resolve this.
I have a nested json file with a complex schema (array inside structure, structure inside array) and I need to put data in dataframe.
What I have in input is this (as an example):
+-----+----------------+-----------------------------------+---------+ | id | name | detail | item | +-----+----------------+-----------------------------------+---------+ | 100 | Peter Castle | [[D100A, Credit],[D100B, Debit]] | [10,31] | | 101 | Quino Yukimori | [[D101A, Credit],[D101B, Credit]] | [55,49] | +-----+----------------+-----------------------------------+---------+ I should read like this
+-----+----------------+-----------+--------+-----------+ | id | name | detail_id | type | item_qty | +-----+----------------+-----------+--------+-----------+ | 100 | Peter Castle | D100A | Credit | 10 | | 100 | Peter Castle | D100B | Debit | 31 | | 101 | Quino Yukimori | D101A | Credit | 55 | | 101 | Quino Yukimori | D101B | Credit | 49 | +-----+----------------+-----------+--------+-----------+ But what I get is this:
df.withColumn('detail', explode('detail')).withColumn('item', explode('item')) +-----+----------------+-----------+--------+-----------+ | id | name | detail_id | type | item_qty | +-----+----------------+-----------+--------+-----------+ | 100 | Peter Castle | D100A | Credit | 10 | | 100 | Peter Castle | D100A | Debit | 10 | | 100 | Peter Castle | D100B | Credit | 31 | | 100 | Peter Castle | D100B | Debit | 31 | | 101 | Quino Yukimori | D101A | Credit | 55 | | 101 | Quino Yukimori | D101A | Credit | 55 | | 101 | Quino Yukimori | D101B | Credit | 49 | | 101 | Quino Yukimori | D101B | Credit | 49 | +-----+----------------+-----------+--------+-----------+ I have tried combining columns with arrays_zip and then explode, but the problem is there is array inside array, and if I explode detail array columns, the explode of item array columns multiply data.
Any idea how can I implement that?
Sorry about my english, is not my birth language.
UPDATED
Here is my schema, which complicates me reading it for the multiple nested arrays:
|-- id: string(nullable = true) |-- name: string(nullable = true) |-- detail: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- detail_id: string(nullable = true) | | |-- type: string(nullable = true) |-- item: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- item_qty : long(nullable = true) |-- deliveryTrack: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- date: string(nullable = true) | | |-- track: array (nullable = true) | | | |-- element: struct (containsNull = true) | | | | |-- time: string (nullable = true) | | | | |-- driver: string (nullable = true)