
I'm attempting to create some sample DataFrames to test some functions against. I regularly receive JSON objects with nested objects (arrays or further JSON objects), and I need to test these for their differing types, namely Struct and Array, and pass each to the correct function based on its type in order to produce a tabular DataFrame.

These objects come from APIs, some internal and some external, so I'm at the mercy of the app developers.
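For context, detecting the type isn't the hard part; inspecting df.schema.fields and checking each field's dataType gets me that much. Roughly (a minimal sketch; handle_array and handle_struct are hypothetical placeholders for my conversion functions):

from pyspark.sql.types import ArrayType, StructType

def dispatch(df, field):
    # Route a column to a handler based on its Spark SQL type.
    if isinstance(field.dataType, ArrayType):
        return handle_array(df, field.name)    # hypothetical array handler
    if isinstance(field.dataType, StructType):
        return handle_struct(df, field.name)   # hypothetical struct handler
    return df                                  # atomic types pass through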

Let's assume I want to create a table as follows to test against:

+----+------+------------------------------+
|    |   id | arr                          |
|----+------+------------------------------|
|  0 |    1 | [[0, 1, 2, 3], [4, 5, 6, 7]] |
|  1 |    2 | [[1, 2, 3], [4, 5, 6]]       |
+----+------+------------------------------+

My assumption was that I would need to create a schema as follows:

from pyspark.sql.types import StructField, StructType, StringType, IntegerType, ArrayType

schema = StructType([
    StructField('id', IntegerType(), True),
    StructField('arr', ArrayType(ArrayType(IntegerType(), True), True), True)
])

data = [
    [1, 2],                           # < id
    [[[0, 1, 2, 3], [4, 5, 6, 7]],    # < arr
     [[1, 2, 3], [4, 5, 6]]]
]

df = spark.createDataFrame(data, schema)

which returns a TypeError:

field arr: ArrayType(IntegerType,true) can not accept object 2 in type <class 'int'> 

Where have I made an error?

When all is said and done, this is the output I expect after passing the data through a recursive function:

+----+------+-------+
|    |   id |   arr |
|----+------+-------|
|  0 |    1 |     0 |
|  0 |    1 |     1 |
|  0 |    1 |     2 |
|  0 |    1 |     3 |
|  0 |    1 |     4 |
|  0 |    1 |     5 |
|  0 |    1 |     6 |
|  0 |    1 |     7 |
|  1 |    2 |     1 |
|  1 |    2 |     2 |
|  1 |    2 |     3 |
|  1 |    2 |     4 |
|  1 |    2 |     5 |
|  1 |    2 |     6 |
+----+------+-------+

1 Answer

The list data should contain a list of rows, not a list of columns. Spark matches each top-level element of data against the schema as one row, so with your layout the first row becomes [1, 2] and the integer 2 is assigned to the arr field, which is exactly what the TypeError is complaining about.

from pyspark.sql.types import StructField, StructType, StringType, IntegerType, ArrayType

schema = StructType([
    StructField('id', IntegerType(), True),
    StructField('arr', ArrayType(ArrayType(IntegerType(), True), True), True)
])

data = [
    [1, [[0, 1, 2, 3], [4, 5, 6, 7]]],
    [2, [[1, 2, 3], [4, 5, 6]]]
]

df = spark.createDataFrame(data, schema)
df.show(truncate=False)

+---+----------------------------+
|id |arr                         |
+---+----------------------------+
|1  |[[0, 1, 2, 3], [4, 5, 6, 7]]|
|2  |[[1, 2, 3], [4, 5, 6]]      |
+---+----------------------------+

To explode the arrays, you can do this:

import pyspark.sql.functions as F

df.withColumn('arr', F.explode(F.flatten('arr'))).show()

+---+---+
| id|arr|
+---+---+
|  1|  0|
|  1|  1|
|  1|  2|
|  1|  3|
|  1|  4|
|  1|  5|
|  1|  6|
|  1|  7|
|  2|  1|
|  2|  2|
|  2|  3|
|  2|  4|
|  2|  5|
|  2|  6|
+---+---+
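If you also need the generic, type-driven recursion described in the question (dispatching on Struct vs. Array for arbitrary nested payloads), a rough sketch could look like this. It assumes every nested level should be flattened all the way down to atomic values; flatten_df is my own helper, not a library API:

import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StructType

def flatten_df(df):
    # Repeatedly explode arrays and expand structs until only
    # atomic columns remain.
    for field in df.schema.fields:
        if isinstance(field.dataType, ArrayType):
            # One row per element; recursion handles arrays of arrays.
            # Note: explode drops rows with null/empty arrays;
            # use explode_outer to keep them.
            return flatten_df(df.withColumn(field.name, F.explode(field.name)))
        if isinstance(field.dataType, StructType):
            # Promote each struct field to a top-level column.
            others = [c for c in df.columns if c != field.name]
            expanded = [
                F.col(field.name + '.' + sub.name).alias(field.name + '_' + sub.name)
                for sub in field.dataType.fields
            ]
            return flatten_df(df.select(*others, *expanded))
    return df

flatten_df(df).show()  # same one-value-per-row output as above

For this particular schema it gives the same result as the explode/flatten one-liner, but it also handles struct columns and deeper nesting.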