
How to change schema in PySpark from this

|-- id: string (nullable = true)
|-- device: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- device_vendor: string (nullable = true)
|    |    |-- device_name: string (nullable = true)
|    |    |-- device_manufacturer: string (nullable = true)

to this

|-- id: string (nullable = true)
|-- device_vendor: string (nullable = true)
|-- device_name: string (nullable = true)
|-- device_manufacturer: string (nullable = true)

2 Answers


Use a combination of explode and the * selector:

import pyspark.sql.functions as F

df_flat = df.withColumn('device_exploded', F.explode('device')) \
            .select('id', 'device_exploded.*')

df_flat.printSchema()
# root
#  |-- id: string (nullable = true)
#  |-- device_vendor: string (nullable = true)
#  |-- device_name: string (nullable = true)
#  |-- device_manufacturer: string (nullable = true)

explode creates a separate record for each element of the array-valued column, repeating the value(s) of the other column(s). The column.* selector turns all fields of the struct-valued column into separate columns.
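As a quick, self-contained sketch of that behaviour (the sample ids and device values below are made up for illustration):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data: one id that owns two devices.
df = spark.createDataFrame(
    [('a1', [('vendorX', 'phone1', 'makerX'), ('vendorY', 'phone2', 'makerY')])],
    'id string, device array<struct<device_vendor:string, device_name:string, device_manufacturer:string>>',
)

# explode yields one row per array element, repeating the id for each device,
# and device_exploded.* spreads the struct fields into top-level columns.
df.withColumn('device_exploded', F.explode('device')) \
  .select('id', 'device_exploded.*') \
  .show()
# two rows, both with id 'a1', one per device struct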


First, take the array's first element with element_at, then expand all fields of that struct into top-level columns with *.

import pyspark.sql.functions as F

df = df.withColumn('d', F.element_at('device', 1))
df = df.select('id', 'd.*')
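Note that element_at uses 1-based indexing and returns only the first struct in the array, so any additional devices in a row are dropped, whereas the explode-based answer above produces one output row per device. A rough sketch of the difference, assuming df is the original DataFrame:

import pyspark.sql.functions as F

# Keeps at most one device per id (the first element of the array).
first_only = df.withColumn('d', F.element_at('device', 1)).select('id', 'd.*')

# Keeps every device: one output row per (id, device) pair.
all_devices = df.withColumn('d', F.explode('device')).select('id', 'd.*')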
