
I have a column of JSON strings and would like to be able to convert them to structs, similar to how SQLContext.read.json() makes that transformation on the initial read from a file.

Alternatively, is there a way to nest my DataFrames? I could do that as well.

  • I ended up going an entirely different direction. I was hoping there was some kind of method in place that could transform a JSON string to a struct for each row in a column. Commented Jul 25, 2016 at 19:49

3 Answers


Spark does not support DataFrame (or Dataset or RDD) nesting.

You can break down your problem into two separate steps.

First, you need to parse JSON and build a case class consisting entirely of types Spark supports. This problem has nothing to do with Spark so let's assume you've coded this as:

 def buildMyCaseClass(json: String): MyCaseClass = { ... } 
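
This step has nothing to do with Spark, so any JSON library will do. Here is one minimal sketch of that function, assuming the JSON looks like {"b":1}; the MyCaseClass shape and the choice of json4s (which Spark already bundles) are illustrative assumptions, not part of the answer:

import org.json4s.{DefaultFormats, Formats}
import org.json4s.jackson.JsonMethods.parse

// Hypothetical target shape -- adjust to match your actual JSON.
case class MyCaseClass(b: Int)

def buildMyCaseClass(json: String): MyCaseClass = {
  implicit val formats: Formats = DefaultFormats
  parse(json).extract[MyCaseClass]  // deserialize the string into the case class
}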

Then, you need to transform your dataframe such that the string column becomes a struct column. The easiest way to do this is via a UDF.

val builderUdf = udf(buildMyCaseClass _)
df.withColumn("myCol", builderUdf('myCol))
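
Put together, a hedged end-to-end sketch (the sample data and column name are invented for illustration; a udf can return a case class directly and Spark maps it to a struct column):

import org.apache.spark.sql.functions.udf
import spark.implicits._  // assumes a SparkSession named spark, as in spark-shell

// Hypothetical input: one JSON string per row
val df = Seq("""{"b":1}""", """{"b":2}""").toDF("myCol")

val builderUdf = udf(buildMyCaseClass _)
val parsed = df.withColumn("myCol", builderUdf('myCol))

parsed.printSchema()
// root
//  |-- myCol: struct (nullable = true)
//  |    |-- b: integer (nullable = false)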

2 Comments

I figured this might be the way to go, but I was hoping there'd be built-in functionality for it. Initially, before encrypting my data, I'd use sqlContext.read.json("path") on a file with nested JSON and it would produce a struct column, like you said. I was just hoping schema inference would be available through some method, instead of having to write my own parser.
JSON schema inference is buried in the guts of Spark. I don't agree with the design decision, but I do understand it, as inference requires two passes through the data (a discovery pass and a merging/union pass). If you are very motivated, you can register your own classes in the Spark package namespace and gain access to this functionality, but I would not recommend it. You can always write the JSON out and then read it back with automatic schema discovery, as long as you add an extra column to join to your existing data, e.g., by transforming before the write to {"joinKey": ..., "jsonCol": ...}. Slower but reliable.

Spark SQL provides functions like to_json() to encode a struct as a string and from_json() to retrieve the struct as a complex type.

{ "a": "{\"b\":1}" } val schema = new StructType().add("b", IntegerType) events.select(from_json('a, schema) as 'c) // output { "c": { "b": 1 } } 

You can read more at https://spark.apache.org/docs/2.2.2/api/java/org/apache/spark/sql/functions.html#from_json-org.apache.spark.sql.Column-org.apache.spark.sql.types.DataType-
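
For context, here is a runnable sketch of the same idea; the events DataFrame and its contents are assumptions for illustration (from_json is available from Spark 2.1 on):

import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{IntegerType, StructType}
import spark.implicits._  // assumes a SparkSession named spark, as in spark-shell

// Hypothetical DataFrame with a JSON string in column "a"
val events = Seq("""{"b":1}""", """{"b":2}""").toDF("a")

val schema = new StructType().add("b", IntegerType)
val parsed = events.select(from_json('a, schema) as 'c)

parsed.printSchema()
// root
//  |-- c: struct (nullable = true)
//  |    |-- b: integer (nullable = true)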



On recent versions of Spark, if you have your JSON in:

Dataset[String] 

You can do:

spark.read.json(theJsonStringDataset) 

From the docs for DataFrameReader:

def json(jsonDataset: Dataset[String]): DataFrame 

Loads a Dataset[String] storing JSON objects (JSON Lines text format or newline-delimited JSON) and returns the result as a DataFrame.
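
A short sketch under that API (the sample strings are invented; note that inferred integer fields come back as long):

import org.apache.spark.sql.Dataset
import spark.implicits._  // assumes a SparkSession named spark, as in spark-shell

// Hypothetical Dataset[String], one JSON object per element
val theJsonStringDataset: Dataset[String] =
  Seq("""{"b":1}""", """{"b":2}""").toDS()

val df = spark.read.json(theJsonStringDataset)
df.printSchema()
// root
//  |-- b: long (nullable = true)   <- schema inferred per the docs above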

