
How can I flatten an array into a DataFrame that contains the columns [a, b, c, d, e]?

root
 |-- arry: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- a: string (nullable = true)
 |    |    |-- b: long (nullable = true)
 |    |    |-- c: string (nullable = true)
 |    |    |-- d: string (nullable = true)
 |    |    |-- e: long (nullable = true)

Any help is appreciated.

1 Answer


Say you have a JSON file with the following structure:

{
  "array": [
    { "a": "asdf", "b": 1234, "c": "a", "d": "str", "e": 1234 },
    { "a": "asdf", "b": 1234, "c": "a", "d": "str", "e": 1234 },
    { "a": "asdf", "b": 1234, "c": "a", "d": "str", "e": 1234 }
  ]
}
  1. Read the file
scala> val nested = spark.read.option("multiline", true).json("nested.json")
nested: org.apache.spark.sql.DataFrame = [array: array<struct<a:string,b:bigint,c:string,d:string,e:bigint>>]
  2. Check the schema
scala> nested.printSchema
root
 |-- array: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- a: string (nullable = true)
 |    |    |-- b: long (nullable = true)
 |    |    |-- c: string (nullable = true)
 |    |    |-- d: string (nullable = true)
 |    |    |-- e: long (nullable = true)
  3. Use the explode function
scala> nested.select(explode($"array").as("exploded")).select("exploded.*").show
+----+----+---+---+----+
|   a|   b|  c|  d|   e|
+----+----+---+---+----+
|asdf|1234|  a|str|1234|
|asdf|1234|  a|str|1234|
|asdf|1234|  a|str|1234|
+----+----+---+---+----+
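As a variation on the same idea, Spark SQL also has an inline function that explodes an array of structs and promotes the struct fields to top-level columns in a single step, so the intermediate alias and the "exploded.*" projection are not needed. A minimal sketch, assuming the same nested DataFrame as above:

```scala
// inline(array) = explode the array AND expand each struct's fields
// into top-level columns in one call. Available as a SQL function
// since early Spark versions, so selectExpr works broadly.
val flat = nested.selectExpr("inline(array)")

flat.show()
// Same five columns (a, b, c, d, e) and three rows as the
// explode + select("exploded.*") version above.
```

Note that explode (and inline) drop rows where the array is null or empty; use explode_outer if those rows should be kept with null columns instead.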
