
I'm attempting to create some sample DataFrames to test some functions against. I regularly receive JSON objects with nested objects (arrays or further JSON objects), and I need to test these for their differing types, namely Struct and Array, and pass each to the correct function based on its type in order to produce a tabular DataFrame.

These objects come from APIs, some internal and some external, so I'm at the mercy of the app developers.
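For context, detecting the type isn't the hard part; inspecting df.schema.fields and checking each field's dataType gets me that much. Roughly (a minimal sketch; handle_array and handle_struct are hypothetical placeholders for my conversion functions):

from pyspark.sql.types import ArrayType, StructType

def dispatch(df, field):
    # Route a column to a handler based on its Spark SQL type.
    if isinstance(field.dataType, ArrayType):
        return handle_array(df, field.name)    # hypothetical array handler
    if isinstance(field.dataType, StructType):
        return handle_struct(df, field.name)   # hypothetical struct handler
    return df                                  # atomic types pass through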

Let's assume I want to create a table as follows to test against:

+----+------+------------------------------+
|    |   id | arr                          |
|----+------+------------------------------|
|  0 |    1 | [[0, 1, 2, 3], [4, 5, 6, 7]] |
|  1 |    2 | [[1, 2, 3], [4, 5, 6]]       |
+----+------+------------------------------+

My assumption was that I would need to create a schema as follows:

from pyspark.sql.types import StructField, StructType, StringType, IntegerType, ArrayType

schema = StructType([
    StructField('id', IntegerType(), True),
    StructField('arr', ArrayType(ArrayType(IntegerType(), True), True), True)
])

data = [
    [1, 2],                           # < id
    [[[0, 1, 2, 3], [4, 5, 6, 7]],    # < arr
     [[1, 2, 3], [4, 5, 6]]]
]

df = spark.createDataFrame(data, schema)

which returns a TypeError:

field arr: ArrayType(IntegerType,true) can not accept object 2 in type <class 'int'> 

Where have I made an error?

When all is said and done, this is the output I expect after passing the data through a recursive function:

+----+------+-------+
|    |   id |   arr |
|----+------+-------|
|  0 |    1 |     0 |
|  0 |    1 |     1 |
|  0 |    1 |     2 |
|  0 |    1 |     3 |
|  0 |    1 |     4 |
|  0 |    1 |     5 |
|  0 |    1 |     6 |
|  0 |    1 |     7 |
|  1 |    2 |     1 |
|  1 |    2 |     2 |
|  1 |    2 |     3 |
|  1 |    2 |     4 |
|  1 |    2 |     5 |
|  1 |    2 |     6 |
+----+------+-------+

1 Answer

The list data should contain a list of rows, not a list of columns. Spark matches each top-level element of data against the schema as one row, so with your layout the first row becomes [1, 2] and the integer 2 is assigned to the arr field, which is exactly what the TypeError is complaining about.

from pyspark.sql.types import StructField, StructType, StringType, IntegerType, ArrayType

schema = StructType([
    StructField('id', IntegerType(), True),
    StructField('arr', ArrayType(ArrayType(IntegerType(), True), True), True)
])

data = [
    [1, [[0, 1, 2, 3], [4, 5, 6, 7]]],
    [2, [[1, 2, 3], [4, 5, 6]]]
]

df = spark.createDataFrame(data, schema)
df.show(truncate=False)

+---+----------------------------+
|id |arr                         |
+---+----------------------------+
|1  |[[0, 1, 2, 3], [4, 5, 6, 7]]|
|2  |[[1, 2, 3], [4, 5, 6]]      |
+---+----------------------------+

To explode the arrays, you can do this:

import pyspark.sql.functions as F

df.withColumn('arr', F.explode(F.flatten('arr'))).show()

+---+---+
| id|arr|
+---+---+
|  1|  0|
|  1|  1|
|  1|  2|
|  1|  3|
|  1|  4|
|  1|  5|
|  1|  6|
|  1|  7|
|  2|  1|
|  2|  2|
|  2|  3|
|  2|  4|
|  2|  5|
|  2|  6|
+---+---+
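If you also need the generic, type-driven recursion described in the question (dispatching on Struct vs. Array for arbitrary nested payloads), a rough sketch could look like this. It assumes every nested level should be flattened all the way down to atomic values; flatten_df is my own helper, not a library API:

import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StructType

def flatten_df(df):
    # Repeatedly explode arrays and expand structs until only
    # atomic columns remain.
    for field in df.schema.fields:
        if isinstance(field.dataType, ArrayType):
            # One row per element; recursion handles arrays of arrays.
            # Note: explode drops rows with null/empty arrays;
            # use explode_outer to keep them.
            return flatten_df(df.withColumn(field.name, F.explode(field.name)))
        if isinstance(field.dataType, StructType):
            # Promote each struct field to a top-level column.
            others = [c for c in df.columns if c != field.name]
            expanded = [
                F.col(field.name + '.' + sub.name).alias(field.name + '_' + sub.name)
                for sub in field.dataType.fields
            ]
            return flatten_df(df.select(*others, *expanded))
    return df

flatten_df(df).show()  # same one-value-per-row output as above

For this particular schema it gives the same result as the explode/flatten one-liner, but it also handles struct columns and deeper nesting.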