I have a Parquet file with multiple columns. Two of those columns hold JSON (struct-like) data, but their type is string. Each array can contain any number of array_element entries.

    {
      "addressline": [
        { "array_element": "F748DK’8U1P9’2ZLKXE" },
        { "array_element": "’O’P0BQ04M-" },
        { "array_element": "’fvrvrWEM-" }
      ],
      "telephone": [
        {
          "array_element": {
            "locationtype": "8.PLT",
            "countrycode": null,
            "phonenumber": "000000000",
            "phonetechtype": "1.PTT",
            "countryaccesscode": null,
            "phoneremark": null
          }
        }
      ]
    }

How can I create a schema to handle these columns in PySpark?

1 Answer

Treating the example you provided as a string, I created this DataFrame:

    from pyspark.sql import functions as F, types as T

    df = spark.createDataFrame(
        [('{"addressline":[{"array_element":"F748DK’8U1P9’2ZLKXE"},{"array_element":"’O’P0BQ04M-"},{"array_element":"’fvrvrWEM-"}],"telephone":[{"array_element":{"locationtype":"8.PLT","countrycode":null,"phonenumber":"000000000","phonetechtype":"1.PTT","countryaccesscode":null,"phoneremark":null}}]}',)],
        ['c1'])

This is the schema to apply to that column:

    schema = T.StructType([
        T.StructField('addressline', T.ArrayType(T.StructType([
            T.StructField('array_element', T.StringType())
        ]))),
        T.StructField('telephone', T.ArrayType(T.StructType([
            T.StructField('array_element', T.StructType([
                T.StructField('locationtype', T.StringType()),
                T.StructField('countrycode', T.StringType()),
                T.StructField('phonenumber', T.StringType()),
                T.StructField('phonetechtype', T.StringType()),
                T.StructField('countryaccesscode', T.StringType()),
                T.StructField('phoneremark', T.StringType()),
            ]))
        ])))
    ])
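As an alternative sketch, from_json also accepts the schema as a DDL-formatted string, which can be shorter to write than nested StructType calls. The snippet below only builds that string in plain Python; the names PHONE_FIELDS, phone_struct and ddl_schema are illustrative and not part of the original answer, and every leaf is assumed to be a string, as in the schema above.

```python
# Field names taken from the sample JSON; every leaf is treated as a string,
# mirroring the StructType schema above.
PHONE_FIELDS = [
    "locationtype",
    "countrycode",
    "phonenumber",
    "phonetechtype",
    "countryaccesscode",
    "phoneremark",
]

# Nested struct fields are written as "name: type" inside struct<...>.
phone_struct = "struct<" + ", ".join(f"{name}: string" for name in PHONE_FIELDS) + ">"

# Top-level fields are written as "name type", separated by commas.
ddl_schema = (
    "addressline array<struct<array_element: string>>, "
    f"telephone array<struct<array_element: {phone_struct}>>"
)

print(ddl_schema)
```

The resulting string can then be passed in place of the StructType, for example F.from_json('c1', ddl_schema). Because both fields are arrays, the same schema covers any number of array_element entries.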

Result of passing the schema to the from_json function:

    df = df.withColumn('c1', F.from_json('c1', schema))

    # truncate=False prints the full cell instead of cutting it at 20 characters
    df.show(truncate=False)
    # +-------------------------------------------------------------------------------------------------------+
    # |c1                                                                                                     |
    # +-------------------------------------------------------------------------------------------------------+
    # |{[{F748DK’8U1P9’2ZLKXE}, {’O’P0BQ04M-}, {’fvrvrWEM-}], [{{8.PLT, null, 000000000, 1.PTT, null, null}}]}|
    # +-------------------------------------------------------------------------------------------------------+

    df.printSchema()
    # root
    #  |-- c1: struct (nullable = true)
    #  |    |-- addressline: array (nullable = true)
    #  |    |    |-- element: struct (containsNull = true)
    #  |    |    |    |-- array_element: string (nullable = true)
    #  |    |-- telephone: array (nullable = true)
    #  |    |    |-- element: struct (containsNull = true)
    #  |    |    |    |-- array_element: struct (nullable = true)
    #  |    |    |    |    |-- locationtype: string (nullable = true)
    #  |    |    |    |    |-- countrycode: string (nullable = true)
    #  |    |    |    |    |-- phonenumber: string (nullable = true)
    #  |    |    |    |    |-- phonetechtype: string (nullable = true)
    #  |    |    |    |    |-- countryaccesscode: string (nullable = true)
    #  |    |    |    |    |-- phoneremark: string (nullable = true)
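To sanity-check the shape that the schema describes without starting Spark, the sample string can be parsed with Python's standard json module. This is only an illustration of the structure, not part of the Spark solution; the helper unwrap and the variable names are hypothetical.

```python
import json

# Same sample string as in the DataFrame above.
sample = (
    '{"addressline":[{"array_element":"F748DK’8U1P9’2ZLKXE"},'
    '{"array_element":"’O’P0BQ04M-"},{"array_element":"’fvrvrWEM-"}],'
    '"telephone":[{"array_element":{"locationtype":"8.PLT","countrycode":null,'
    '"phonenumber":"000000000","phonetechtype":"1.PTT",'
    '"countryaccesscode":null,"phoneremark":null}}]}'
)

parsed = json.loads(sample)


def unwrap(items):
    """Strip the {"array_element": ...} wrapper from each list entry.

    Works for any list length; a missing or null list yields [].
    """
    return [item["array_element"] for item in items or []]


address_lines = unwrap(parsed.get("addressline"))  # list of strings
phones = unwrap(parsed.get("telephone"))           # list of dicts

print(len(address_lines), "address lines;", len(phones), "phone entries")
```

On the parsed Spark column, the analogous unwrapping is plain field access: selecting c1.addressline.array_element returns an array of strings, because accessing a field on an array of structs yields an array of that field's values.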