
I've been struggling with this for a few days.

I have a nested JSON file with a complex schema (arrays inside structs, structs inside arrays) and I need to load the data into a DataFrame.

This is what I have as input (as an example):

+-----+----------------+-----------------------------------+---------+
| id  | name           | detail                            | item    |
+-----+----------------+-----------------------------------+---------+
| 100 | Peter Castle   | [[D100A, Credit],[D100B, Debit]]  | [10,31] |
| 101 | Quino Yukimori | [[D101A, Credit],[D101B, Credit]] | [55,49] |
+-----+----------------+-----------------------------------+---------+

And I need to read it like this:

+-----+----------------+-----------+--------+----------+
| id  | name           | detail_id | type   | item_qty |
+-----+----------------+-----------+--------+----------+
| 100 | Peter Castle   | D100A     | Credit | 10       |
| 100 | Peter Castle   | D100B     | Debit  | 31       |
| 101 | Quino Yukimori | D101A     | Credit | 55       |
| 101 | Quino Yukimori | D101B     | Credit | 49       |
+-----+----------------+-----------+--------+----------+

But what I get is this:

df.withColumn('detail', explode('detail')).withColumn('item', explode('item'))

+-----+----------------+-----------+--------+----------+
| id  | name           | detail_id | type   | item_qty |
+-----+----------------+-----------+--------+----------+
| 100 | Peter Castle   | D100A     | Credit | 10       |
| 100 | Peter Castle   | D100A     | Debit  | 10       |
| 100 | Peter Castle   | D100B     | Credit | 31       |
| 100 | Peter Castle   | D100B     | Debit  | 31       |
| 101 | Quino Yukimori | D101A     | Credit | 55       |
| 101 | Quino Yukimori | D101A     | Credit | 55       |
| 101 | Quino Yukimori | D101B     | Credit | 49       |
| 101 | Quino Yukimori | D101B     | Credit | 49       |
+-----+----------------+-----------+--------+----------+

I have tried combining the columns with arrays_zip and then exploding, but the problem is that there are arrays inside arrays: if I explode the detail array column and then explode the item array column, the rows get multiplied.

Any idea how I can implement that?

Sorry about my English; it's not my native language.

UPDATED

Here is my schema; the multiple nested arrays are what make it complicated to read:

 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- detail: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- detail_id: string (nullable = true)
 |    |    |-- type: string (nullable = true)
 |-- item: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- item_qty: long (nullable = true)
 |-- deliveryTrack: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- date: string (nullable = true)
 |    |    |-- track: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- time: string (nullable = true)
 |    |    |    |    |-- driver: string (nullable = true)
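
In case it helps to reproduce the problem, a minimal DataFrame with roughly this structure can be built like this (the deliveryTrack values are invented and only the shape matters; the rest mirrors the example table above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rows mirroring the example table; the deliveryTrack values are made up,
# only the nesting matters here.
data = [
    ("100", "Peter Castle",
     [("D100A", "Credit"), ("D100B", "Debit")],
     [(10,), (31,)],
     [("2021-07-20", [("10:00", "John")])]),
    ("101", "Quino Yukimori",
     [("D101A", "Credit"), ("D101B", "Credit")],
     [(55,), (49,)],
     [("2021-07-21", [("11:30", "Ann")])]),
]

schema = (
    "id string, name string, "
    "detail array<struct<detail_id:string, type:string>>, "
    "item array<struct<item_qty:long>>, "
    "deliveryTrack array<struct<date:string, "
    "track:array<struct<time:string, driver:string>>>>"
)

df = spark.createDataFrame(data, schema)
df.printSchema()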
  • Also post the code that gave you that output.

1 Answer


Zip both arrays with arrays_zip and use explode only once. After that, use the expr function to pull out the individual fields.

from pyspark.sql.functions import explode, arrays_zip, col, expr

df1 = (df
       .withColumn('buffer', explode(arrays_zip(col('detail'), col('item'))))
       .withColumn('detail_id', expr("buffer.detail.detail_id"))
       .withColumn('type', expr("buffer.detail.type"))
       .withColumn('item_qty', expr("buffer.item.item_qty"))
       .drop(*['detail', 'item', 'buffer'])
      )
df1.show()

+---+--------------+---------+------+--------+
|id |name          |detail_id|type  |item_qty|
+---+--------------+---------+------+--------+
|100|Peter Castle  |D100A    |Credit|10      |
|100|Peter Castle  |D100B    |Debit |31      |
|101|Quino Yukimori|D101A    |Credit|55      |
|101|Quino Yukimori|D101B    |Credit|49      |
+---+--------------+---------+------+--------+

8 Comments

Hello, and thanks for helping me. I've tried your solution but get this error: AnalysisException: cannot resolve 'element_at(buffer.detail, 1)' due to data type mismatch: The first argument to function element_at should have been array or map type, but its struct...
I've updated my question, adding the schema for better understanding.
I tried to replicate your schema and have edited my answer.
Hey, it works! I'm so thankful. But what can I do if I need to get the time and driver columns, which are inside the deliveryTrack array? Could you help me with that? The more arrays I have to read, the more the same problem occurs. In any case, you answered my original question, which is why I marked it as answered.
You can include the deliveryTrack column inside arrays_zip, then access the data later using a nested column name like "buffer.deliveryTrack.track.driver".
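
For example, a rough, untested sketch of that idea (the df2 name and the second explode of track are additions beyond the comment; accessing buffer.deliveryTrack.track.driver directly would return an array of drivers per row, while exploding track gives one row per time/driver entry):

from pyspark.sql.functions import explode, arrays_zip, col, expr

# Zip all three top-level arrays so a single explode keeps them aligned
df2 = (df
       .withColumn('buffer', explode(arrays_zip(col('detail'), col('item'), col('deliveryTrack'))))
       .withColumn('detail_id', expr("buffer.detail.detail_id"))
       .withColumn('type', expr("buffer.detail.type"))
       .withColumn('item_qty', expr("buffer.item.item_qty"))
       .withColumn('date', expr("buffer.deliveryTrack.date"))
       # track is still an array of structs, so explode it as well
       # to get one row per (time, driver) entry
       .withColumn('track', explode(expr("buffer.deliveryTrack.track")))
       .withColumn('time', col('track.time'))
       .withColumn('driver', col('track.driver'))
       .drop('detail', 'item', 'deliveryTrack', 'buffer', 'track')
      )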
