
I have a column of JSON strings and would like to be able to convert them to structs, similar to how SQLContext.read.json() makes that transformation on the initial read from a file.

Alternatively, is there a way to nest my DataFrames? I could do that as well.

  • I ended up going an entirely different direction. I was hoping there was some kind of method in place that could transform a JSON string to a struct for each row in a column. Commented Jul 25, 2016 at 19:49

3 Answers


Spark does not support DataFrame (or Dataset or RDD) nesting.

You can break down your problem into two separate steps.

First, you need to parse JSON and build a case class consisting entirely of types Spark supports. This problem has nothing to do with Spark so let's assume you've coded this as:

 def buildMyCaseClass(json: String): MyCaseClass = { ... } 
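
This step has nothing to do with Spark, so any JSON library will do. Here is one minimal sketch of that function, assuming the JSON looks like {"b":1}; the MyCaseClass shape and the choice of json4s (which Spark already bundles) are illustrative assumptions, not part of the answer:

import org.json4s.{DefaultFormats, Formats}
import org.json4s.jackson.JsonMethods.parse

// Hypothetical target shape -- adjust to match your actual JSON.
case class MyCaseClass(b: Int)

def buildMyCaseClass(json: String): MyCaseClass = {
  implicit val formats: Formats = DefaultFormats
  parse(json).extract[MyCaseClass]  // deserialize the string into the case class
}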

Then, you need to transform your dataframe such that the string column becomes a struct column. The easiest way to do this is via a UDF.

val builderUdf = udf(buildMyCaseClass _)
df.withColumn("myCol", builderUdf('myCol))
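
Put together, a hedged end-to-end sketch (the sample data and column name are invented for illustration; a udf can return a case class directly and Spark maps it to a struct column):

import org.apache.spark.sql.functions.udf
import spark.implicits._  // assumes a SparkSession named spark, as in spark-shell

// Hypothetical input: one JSON string per row
val df = Seq("""{"b":1}""", """{"b":2}""").toDF("myCol")

val builderUdf = udf(buildMyCaseClass _)
val parsed = df.withColumn("myCol", builderUdf('myCol))

parsed.printSchema()
// root
//  |-- myCol: struct (nullable = true)
//  |    |-- b: integer (nullable = false)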

2 Comments

I figured this might be the way to go, but I was hoping there'd be built-in functionality for it. Initially, before encrypting my data, I'd use sqlContext.read.json("path") on a file with nested JSON and it would produce a struct column, like you said. I was just hoping schema inference would be available through some method, instead of having to write my own parser.
JSON schema inference is buried in the guts of Spark. I don't agree with the design decision, but I do understand it, as inference requires two passes through the data (a discovery pass and a merging/union pass). If you are very motivated, you can register your own classes in the Spark package namespace and gain access to this functionality, but I would not recommend it. You can always write the JSON out and then read it back with automatic schema discovery, as long as you add an extra column to join to your existing data, e.g., by transforming before the write to {"joinKey": ..., "jsonCol": ...}. Slower but reliable.

Spark SQL provides functions like to_json() to encode a struct as a string and from_json() to retrieve the struct as a complex type.

{ "a": "{\"b\":1}" } val schema = new StructType().add("b", IntegerType) events.select(from_json('a, schema) as 'c) // output { "c": { "b": 1 } } 

You can read more at https://spark.apache.org/docs/2.2.2/api/java/org/apache/spark/sql/functions.html#from_json-org.apache.spark.sql.Column-org.apache.spark.sql.types.DataType-
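
For context, here is a runnable sketch of the same idea; the events DataFrame and its contents are assumptions for illustration (from_json is available from Spark 2.1 on):

import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{IntegerType, StructType}
import spark.implicits._  // assumes a SparkSession named spark, as in spark-shell

// Hypothetical DataFrame with a JSON string in column "a"
val events = Seq("""{"b":1}""", """{"b":2}""").toDF("a")

val schema = new StructType().add("b", IntegerType)
val parsed = events.select(from_json('a, schema) as 'c)

parsed.printSchema()
// root
//  |-- c: struct (nullable = true)
//  |    |-- b: integer (nullable = true)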



On recent versions of Spark, if you have your JSON in:

Dataset[String] 

You can do:

spark.read.json(theJsonStringDataset) 

From the docs for DataFrameReader:

def json(jsonDataset: Dataset[String]): DataFrame 

Loads a Dataset[String] storing JSON objects (JSON Lines text format or newline-delimited JSON) and returns the result as a DataFrame.
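
A short sketch under that API (the sample strings are invented; note that inferred integer fields come back as long):

import org.apache.spark.sql.Dataset
import spark.implicits._  // assumes a SparkSession named spark, as in spark-shell

// Hypothetical Dataset[String], one JSON object per element
val theJsonStringDataset: Dataset[String] =
  Seq("""{"b":1}""", """{"b":2}""").toDS()

val df = spark.read.json(theJsonStringDataset)
df.printSchema()
// root
//  |-- b: long (nullable = true)   <- schema inferred per the docs above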

