With this structure I believe you are hiding/wrapping the really useful part of your data. The only useful information here is {"customerId":"1","name":"a"},{"customerId":"2","name":"b"}; the customers and datum wrappers just hide the data you actually need. To access the data as it stands, you must first slightly change it to:
{"customers":[{"customerId":"1","name":"a"},{"customerId":"2","name":"b"}]}
And then access this JSON with the following code:
case class Customer(customerId: String, name: String)
case class Data(customers: Array[Customer])

val df = spark.read.json(path).as[Data]
If you try to print this dataframe you get:
+----------------+
|       customers|
+----------------+
|[[1, a], [2, b]]|
+----------------+
which is, of course, your data wrapped into arrays. Now comes the interesting part: in order to access this you must do something like the following:
df.foreach{ data => data.customers.foreach(println _) }
This will print:
Customer(1,a)
Customer(2,b)
which is the real data that you need, but it is not easily accessed at all.
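To see why this nesting is awkward even outside Spark, here is a minimal pure-Scala sketch (no Spark required; the Customer and Data classes mirror the ones above) showing that you need an extra flattening step before you can work with the customers directly:

```scala
// Mirrors the case classes used with Spark above.
case class Customer(customerId: String, name: String)
case class Data(customers: Array[Customer])

object FlattenDemo {
  def main(args: Array[String]): Unit = {
    // One Data row wrapping the two customers, as Spark would deserialize it.
    val rows = Seq(Data(Array(Customer("1", "a"), Customer("2", "b"))))

    // Flattening the wrapper gives direct access to each Customer.
    val customers = rows.flatMap(_.customers)
    customers.foreach(println)
  }
}
```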
EDIT:
Instead of using two classes I would use just one, the Customer class. Then leverage the built-in Spark functions for selecting inner JSON objects. Finally, you can explode each array of customers and generate, from the exploded column, a strongly typed dataset of class Customer.
Here is the final code:
case class Customer(customerId: String, name: String)

val path = "C:\\temp\\json_data.json"
val df = spark.read.json(path)

df.select(explode($"data.customers"))
  .map{ r => Customer(r.getStruct(0).getString(0), r.getStruct(0).getString(1)) }
  .show(false)
And the output:
+----------+----+
|customerId|name|
+----------+----+
|1         |a   |
|2         |b   |
+----------+----+
inputJson.as[Customer].mapPartitions { partition =>
  Iterator(Datum(Some(Customers(Some(partition.toList)))))
}

This should do what you need. Note that Dataset.mapPartitions expects a function from Iterator to Iterator, so the per-partition result is wrapped in an Iterator rather than a List.
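For reference, the Datum and Customers wrappers assumed above would look something like the following; the names and Option nesting are inferred from the snippet, so adjust them to match your actual schema. This pure-Scala sketch shows the re-wrapping that the mapPartitions call performs on each partition's customers:

```scala
case class Customer(customerId: String, name: String)
// Assumed wrapper shapes, inferred from the mapPartitions snippet.
case class Customers(customer: Option[List[Customer]])
case class Datum(customers: Option[Customers])

object WrapDemo {
  def main(args: Array[String]): Unit = {
    // Stand-in for one partition's worth of customers.
    val partition = Iterator(Customer("1", "a"), Customer("2", "b"))

    // The same re-wrapping the mapPartitions call performs per partition.
    val wrapped = Iterator(Datum(Some(Customers(Some(partition.toList)))))
    wrapped.foreach(println)
  }
}
```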