I have a nested source JSON file that contains an array of structs. The number of structs varies greatly from row to row, and I would like to use Spark (Scala) to dynamically create new dataframe columns from the key/value pairs of the structs, where the key is the column name and the value is the column value.
Example Minified json record
    {"key1":{"key2":{"key3":"AK","key4":"EU","key5":{"key6":"001","key7":"N","values":[{"name":"valuesColumn1","value":"9.876"},{"name":"valuesColumn2","value":"1.2345"},{"name":"valuesColumn3","value":"8.675309"}]}}}}

dataframe schema
    scala> val df = spark.read.json("file:///tmp/nested_test.json")
    root
     |-- key1: struct (nullable = true)
     |    |-- key2: struct (nullable = true)
     |    |    |-- key3: string (nullable = true)
     |    |    |-- key4: string (nullable = true)
     |    |    |-- key5: struct (nullable = true)
     |    |    |    |-- key6: string (nullable = true)
     |    |    |    |-- key7: string (nullable = true)
     |    |    |    |-- values: array (nullable = true)
     |    |    |    |    |-- element: struct (containsNull = true)
     |    |    |    |    |    |-- name: string (nullable = true)
     |    |    |    |    |    |-- value: string (nullable = true)

What's been done so far
    df.select(
      ($"key1.key2.key3").as("key3"),
      ($"key1.key2.key4").as("key4"),
      ($"key1.key2.key5.key6").as("key6"),
      ($"key1.key2.key5.key7").as("key7"),
      ($"key1.key2.key5.values").as("values")).
      show(truncate=false)

    +----+----+----+----+----------------------------------------------------------------------------+
    |key3|key4|key6|key7|values                                                                      |
    +----+----+----+----+----------------------------------------------------------------------------+
    |AK  |EU  |001 |N   |[[valuesColumn1, 9.876], [valuesColumn2, 1.2345], [valuesColumn3, 8.675309]]|
    +----+----+----+----+----------------------------------------------------------------------------+

There is an array of 3 structs here, but the structs need to be split into separate columns dynamically (the number of structs can vary greatly from row to row), and I am not sure how to do it.
Sample Desired output
Notice that 3 new columns were produced, one for each element of the values array.
    +----+----+----+----+-------------+-------------+-------------+
    |key3|key4|key6|key7|valuesColumn1|valuesColumn2|valuesColumn3|
    +----+----+----+----+-------------+-------------+-------------+
    |AK  |EU  |001 |N   |9.876        |1.2345       |8.675309     |
    +----+----+----+----+-------------+-------------+-------------+

Reference
I believe that the desired solution is something similar to what was discussed in this SO post but with 2 main differences:
- The number of columns is hardcoded to 3 in the SO post but in my circumstance, the number of array elements is unknown
- The column names need to be driven by the `name` field and the column values by the `value` field.
    ...
     |    |    |    |    |-- element: struct (containsNull = true)
     |    |    |    |    |    |-- name: string (nullable = true)
     |    |    |    |    |    |-- value: string (nullable = true)
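
For illustration, one possible approach to the two requirements above is to explode the `values` array into one row per struct and then pivot on `name`, so that the set of output columns is discovered from the data rather than hardcoded. This is a minimal sketch, not a confirmed solution. It assumes Spark 2.2 or later (for `spark.read.json` on a `Dataset[String]`), a local session, and that `key3`, `key4`, `key6` and `key7` together identify a source row. The object `PivotValues` and the helper `widen` are hypothetical names introduced only for this example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode, first}

object PivotValues {
  // Sample record from the question, inlined so the sketch is self-contained.
  val sample: String =
    """{"key1":{"key2":{"key3":"AK","key4":"EU","key5":{"key6":"001","key7":"N","values":[{"name":"valuesColumn1","value":"9.876"},{"name":"valuesColumn2","value":"1.2345"},{"name":"valuesColumn3","value":"8.675309"}]}}}}"""

  // Hypothetical helper: turns each {name, value} struct into its own column.
  def widen(df: DataFrame): DataFrame = {
    // Assumption: these four columns together identify one source row.
    val keyCols = Seq("key3", "key4", "key6", "key7")

    // Step 1: flatten the scalar fields and explode the array,
    // producing one row per struct in `values`.
    val exploded = df.select(
      col("key1.key2.key3").as("key3"),
      col("key1.key2.key4").as("key4"),
      col("key1.key2.key5.key6").as("key6"),
      col("key1.key2.key5.key7").as("key7"),
      explode(col("key1.key2.key5.values")).as("kv"))

    // Step 2: pull name/value out of the struct so they are top-level columns.
    val longForm = exploded.select(
      keyCols.map(col) ++ Seq(col("kv.name").as("name"), col("kv.value").as("value")): _*)

    // Step 3: pivot on `name`; the distinct names found in the data become
    // the new column names, and `first` picks the value for each one.
    longForm
      .groupBy(keyCols.map(col): _*)
      .pivot("name")
      .agg(first("value"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("pivot-values")
      .getOrCreate()
    import spark.implicits._

    val df = spark.read.json(Seq(sample).toDS())
    widen(df).show(truncate = false)

    spark.stop()
  }
}
```

A few caveats with this sketch: `explode` drops rows whose `values` array is null or empty, so those rows would need separate handling if they must be kept. If the key columns do not uniquely identify a row, rows sharing the same keys are merged by the `groupBy`, in which case a row identifier would have to be added before exploding. Calling `pivot` without an explicit list of values also triggers an extra pass over the data to collect the distinct names, and the number of distinct names is capped by the `spark.sql.pivotMaxValues` setting.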