I'm attempting to create some sample data frames to test against some functions. I regularly receive JSON objects with nested objects (arrays or further JSON objects), and I need to check whether each nested field is a struct or an array and pass it to the correct function based on its type in order to produce a tabular DataFrame.
These objects come from APIs, some internal and some external, so I'm at the mercy of the app developers.
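To give a sense of the type-based routing I have in mind, here is a rough sketch; `handle_struct`, `handle_array` and `handle_scalar` are just placeholders for my own flattening helpers, not real functions:

```python
from pyspark.sql.types import StructType, ArrayType

def route_field(field):
    # Dispatch on the Spark data type of a StructField.
    # handle_struct / handle_array / handle_scalar are hypothetical helpers.
    if isinstance(field.dataType, StructType):
        return handle_struct(field)
    elif isinstance(field.dataType, ArrayType):
        return handle_array(field)
    else:
        return handle_scalar(field)
```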
Let's assume I want to create a table as follows to test against:
```
+----+------+------------------------------+
|    |   id | arr                          |
|----+------+------------------------------|
|  0 |    1 | [[0, 1, 2, 3], [4, 5, 6, 7]] |
|  1 |    2 | [[1, 2, 3], [4, 5, 6]]       |
+----+------+------------------------------+
```

My assumption would be that I need to create a schema as follows:
```python
from pyspark.sql.types import StructField, StructType, StringType, IntegerType, ArrayType

schema = StructType([
    StructField('id', IntegerType(), True),
    StructField('arr', ArrayType(ArrayType(IntegerType(), True), True), True)
])

data = [
    [1, 2],                          # < id
    [[[0, 1, 2, 3], [4, 5, 6, 7]],   # < arr
     [[1, 2, 3], [4, 5, 6]]]
]

df = spark.createDataFrame(data, schema)
```

which returns a TypeError:
```
field arr: ArrayType(IntegerType,true) can not accept object 2 in type <class 'int'>
```

Where have I made an error?
When all is said and done, this is the output I will get once I've passed these through a recursive function:
```
+----+------+-------+
|    |   id |   arr |
|----+------+-------|
|  0 |    1 |     0 |
|  0 |    1 |     1 |
|  0 |    1 |     2 |
|  0 |    1 |     3 |
|  0 |    1 |     4 |
|  0 |    1 |     5 |
|  0 |    1 |     6 |
|  0 |    1 |     7 |
|  1 |    2 |     1 |
|  1 |    2 |     2 |
|  1 |    2 |     3 |
|  1 |    2 |     4 |
|  1 |    2 |     5 |
|  1 |    2 |     6 |
+----+------+-------+
```
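For context, once the DataFrame builds correctly, I expect the flattening to behave roughly like a pair of explodes; this is a simplified, non-recursive sketch of what my recursive function does for this particular shape:

```python
from pyspark.sql import functions as F

# Explode the outer array, then the inner arrays, keeping id on every row.
flat = (
    df.select("id", F.explode("arr").alias("inner"))
      .select("id", F.explode("inner").alias("arr"))
)
flat.show()
```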