How to transform nested dataframe schema in PySpark

Question

I have a dataframe with the following schema:

root
 |-- _1: struct (nullable = true)
 |    |-- key: string (nullable = true)
 |-- _2: struct (nullable = true)
 |    |-- value: long (nullable = true)

I want to transform the dataframe into the following schema:

root
 |-- _1: struct (nullable = true)
 |    |-- key: string (nullable = true)
 |    |-- value: long (nullable = true)

Community · Accepted Answer · 2020-06-20 09:12:55Z

Use struct:

pyspark.sql.functions.struct(*cols)

Creates a new struct column.

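from pyspark.sql.functions import struct, col
from pyspark.sql import Row

# Assumes an active SparkSession named `spark` (as in the pyspark shell).
df = spark.createDataFrame([Row(_1=Row(key="a"), _2=Row(value=1))])

# Pick the nested fields explicitly and wrap them in a new struct named _1.
result = df.select(struct(col("_1.key"), col("_2.value")).alias("_1"))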

which gives:

result.printSchema()
# root
#  |-- _1: struct (nullable = false)
#  |    |-- key: string (nullable = true)
#  |    |-- value: long (nullable = true)

and

result.show()
# +-----+
# |   _1|
# +-----+
# |[a,1]|
# +-----+

Anahcolus · Accepted Answer · 2018-02-16 00:58:42Z

If your dataframe has the following schema:

root
 |-- _1: struct (nullable = true)
 |    |-- key: string (nullable = true)
 |-- _2: struct (nullable = true)
 |    |-- value: long (nullable = true)

Then you can use * to select all the fields of the struct columns as separate columns, and then use the built-in struct function to combine them back into a single struct field:

from pyspark.sql import functions as F

df.select(F.struct("_1.*", "_2.*").alias("_1"))

You should get the desired output dataframe:

root
 |-- _1: struct (nullable = false)
 |    |-- key: string (nullable = true)
 |    |-- value: long (nullable = true)
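One thing worth checking before expanding every struct with `.*`: if two struct columns happen to share a field name, the merged struct may end up with two fields of the same name, which can make later access by name ambiguous. A minimal sketch of such a check follows. `duplicate_fields` is a hypothetical helper, not part of PySpark, and it works on a plain mapping of column name to field names so it can be tried without a Spark session.

```python
from collections import Counter
from typing import Dict, List


def duplicate_fields(fields_by_column: Dict[str, List[str]]) -> List[str]:
    """Return the field names that occur more than once across all columns."""
    counts = Counter(
        field for fields in fields_by_column.values() for field in fields
    )
    return sorted(name for name, n in counts.items() if n > 1)


# With a real dataframe the mapping could be built from the schema, e.g.
#   {c: df.schema[c].dataType.fieldNames() for c in df.columns}
# (this assumes every column is a struct).
example = {"_1": ["key"], "_2": ["value"]}
print(duplicate_fields(example))  # → []
```

For the schema in the question the result is empty, so merging `_1.*` and `_2.*` is safe in that respect.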

Updated

A more general form of the code above, for the case where all columns in the original dataframe are structs:

df.select(F.struct(["{}.*".format(x) for x in df.columns]).alias("_1"))
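If the dataframe might also contain non-struct columns, the one-liner above would fail on them, because `.*` can only expand struct columns. The sketch below is one possible way to restrict the expansion to struct-typed columns. `star_paths` and `merge_struct_columns` are hypothetical helper names, not PySpark functions, and the sketch silently drops any non-struct columns from the result, which may or may not be what you want.

```python
from typing import Iterable, List


def star_paths(struct_columns: Iterable[str]) -> List[str]:
    """Build the "<column>.*" selectors that expand each struct column."""
    return ["{}.*".format(name) for name in struct_columns]


def merge_struct_columns(df, alias="_1"):
    """Combine every struct-typed column of df into one struct column.

    Hypothetical helper; non-struct columns are not carried over.
    """
    # Imported here so star_paths stays usable without a Spark installation.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType

    # Keep only the columns whose data type is a struct.
    struct_columns = [
        field.name
        for field in df.schema.fields
        if isinstance(field.dataType, StructType)
    ]
    return df.select(F.struct(*star_paths(struct_columns)).alias(alias))
```

With the dataframe from the question, `merge_struct_columns(df)` should be equivalent to `df.select(F.struct("_1.*", "_2.*").alias("_1"))`.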
