I have JSON files describing a table structure. I want to read each file from S3 as a single String so that I can then apply the fromJson() method of org.apache.spark.sql.types.DataType:

DataType.fromJson(jsonString).asInstanceOf[StructType] 

So far I have only managed to read the files into a DataFrame:

 val testJsonData = sqlContext.read.option("multiline", "true").json("/s3Bucket/metrics/metric1.json") 

But I don't need the df.schema of that DataFrame; instead I need to parse the file's contents as a JSON string into a StructType.

The contents of a JSON file:

{ "type" : "struct", "fields" : [ { "name" : "metric_name", "type" : "string", "nullable" : true, "metadata" : { } }, { "name" : "metric_time", "type" : "long", "nullable" : true, "metadata" : { } }, { "name" : "metric_value", "type" : "string", "nullable" : true, "metadata" : { } }] } 

1 Answer


It looks like what you want is sc.wholeTextFiles (where sc is a SparkContext).

This results in an RDD[(String, String)] where ._1 is the file name and ._2 is the entire file content. Maybe you can try:

import org.apache.spark.sql.types.{DataType, StructType}

// 16 is the minimum number of partitions; each element is (fileName, fileContent)
val files = sc.wholeTextFiles("/s3Bucket/metrics/", 16)
val schemas = files.map { case (_, content) =>
  DataType.fromJson(content).asInstanceOf[StructType]
}

Which, in theory, gives you an RDD[StructType]. (Turning it into a Dataset[StructType] with .toDS() would additionally require an explicit encoder, e.g. Encoders.kryo[StructType], since the built-in encoders don't cover StructType.) Unfortunately, I'm not finding a similar function in the pure Spark SQL API, but this may work.
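
That said, if you are on Spark 2.2 or later, the text data source's wholetext option may be the pure-SQL-API route you're after (a sketch, assuming that option is available on your version):

import org.apache.spark.sql.types.{DataType, StructType}

// "wholetext" makes the text source produce one Row per file instead of one Row per line
val schemas: Array[StructType] = spark.read
  .option("wholetext", "true")
  .text("/s3Bucket/metrics/")
  .collect()
  .map(row => DataType.fromJson(row.getString(0)).asInstanceOf[StructType])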
