
I have created the following case classes:

case class Data(ads: Option[Ads])

case class Ads(
  subject: Option[String],
  body: Option[String],
  price: Option[Int],
  location: Option[Location],
  attribut: Option[Seq[Attribut]]
)

case class Location(city: Option[String], zipcode: Option[String])

case class Attribut(key_label: Option[String], value_label: Option[String])

I parse a JSON document (embedded in an HTML page) with the Play framework.

I finally obtain an Ads object:

JsSuccess(Ads(Some("Subject"), Some("Body"), Some(Price), Some(Location(Some("City"), Some("Zipcode"))), Some(Seq(Attribut(Some("key_label"), Some("value_label"))))))

I want to save this in a CSV file in the following way:

Subject  Body           Price  City    Zipcode  Key_Label  Value_Label
Play     Playing games  532    Geneve  95       GEN        Gen2

I convert the object into a List of Ads values and then convert this list into a DataFrame.

But I get only one column, Value, which contains all the elements of the object:

Value
(Some("Subject"), Some("Body"), Some(Price), Some(Location(Some("City"), Some("Zipcode"))), Some(Seq(Attribut(Some("key_label"), Some("value_label")))))
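For comparison, converting a list of case classes with toDF normally yields one column per field, and nested case classes become struct columns. This minimal sketch (assuming a SparkSession named spark) shows what I expected to see:

import spark.implicits._

val df = List(Ads(Some("Play"), Some("Playing games"), Some(532),
                  Some(Location(Some("Geneve"), Some("95"))),
                  Some(Seq(Attribut(Some("GEN"), Some("Gen2")))))).toDF()

df.printSchema()
// root
//  |-- subject: string (nullable = true)
//  |-- body: string (nullable = true)
//  |-- price: integer (nullable = true)
//  |-- location: struct (nullable = true)
//  |    |-- city: string (nullable = true)
//  |    |-- zipcode: string (nullable = true)
//  |-- attribut: array (nullable = true)
//  |    |-- element: struct (containsNull = true)
//  |    |    |-- key_label: string (nullable = true)
//  |    |    |-- value_label: string (nullable = true)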

Does anyone have an idea, please? I don't really understand how to link a Scala object with a Dataset or DataFrame. Thank you for your help.

  • How did you convert the List into the DataFrame? Also, I believe there are better alternatives for writing a local CSV file rather than pulling in Spark.
  • It's unclear why you would want to use Spark with a single JSON file. If you do df.printSchema() it probably shows a single column Value which is a struct. If so, you just need to unpack/flatten it. You can unpack by using .withColumn(...) or flatten as in the example above.
  • Thank you for your answers. I finally think it's not necessary to use Spark. I want to convert JsSuccess(Ads(Some("Subject"), Some("Body"), Some(Price), Some(Location(Some("City"), Some("Zipcode"))), Some(Seq(Attribut(Some("key_label"), Some("value_label")))))) into a CSV file with 7 columns: Subject, Body, Price, City, Zipcode, Key_Label and Value_Label. Any ideas?
  • I did it like this: implicit class CSVWrapper(val prod: Product) extends AnyVal { def toCSV: String = prod.productIterator.map { case p: Product => p.toCSV; case rest => rest }.mkString("|") } (an expanded sketch follows below)
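Building on the last two comments, here is a minimal non-Spark sketch that writes the 7-column CSV. It uses the case classes defined above; the output path and the empty-field handling are assumptions:

import java.io.PrintWriter

// One CSV row per Attribut element (mirrors Spark's explode).
// Missing Options become empty fields. Note: no quoting/escaping
// of embedded commas -- a real CSV writer should handle that.
def toCsvRows(ads: Ads): Seq[String] = {
  val base = Seq(
    ads.subject.getOrElse(""),
    ads.body.getOrElse(""),
    ads.price.map(_.toString).getOrElse(""),
    ads.location.flatMap(_.city).getOrElse(""),
    ads.location.flatMap(_.zipcode).getOrElse("")
  )
  val attrs = ads.attribut.getOrElse(Seq.empty)
  if (attrs.isEmpty) Seq((base :+ "" :+ "").mkString(","))
  else attrs.map(a => (base :+ a.key_label.getOrElse("") :+ a.value_label.getOrElse("")).mkString(","))
}

val ads = Ads(Some("Play"), Some("Playing games"), Some(532),
              Some(Location(Some("Geneve"), Some("95"))),
              Some(Seq(Attribut(Some("GEN"), Some("Gen2")))))

val writer = new PrintWriter("ads.csv") // hypothetical output path
writer.println("Subject,Body,Price,City,Zipcode,Key_Label,Value_Label")
toCsvRows(ads).foreach(writer.println)
writer.close()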

1 Answer


The comments are helpful, but a generic flattening function may not output columns in the desired order, or put array elements into their own separate columns.

Supposing your JSON file contains lines such as:

{"ads": {"subject": "abc", "body": "doing something", "price": 13, "location": {"city": "Houston", "zipcode": 39014}, "attribut": [{"key_label": "a", "value_label": "b"}]}} 

If the file is fairly consistent and you're already including Spark as a dependency, you probably do not need to use a separate library to parse the JSON.

You will need the explode function to handle the fact that the "attribut" column is an array. If the arrays may be empty but you would like to preserve the other columns' values, use explode_outer instead (a sketch follows at the end of this answer).

import org.apache.spark.sql.functions._

// assuming spark is the SparkSession
val df = spark.read.json("mydata.json")

val df1 = df.select(
  col("ads.subject").alias("Subject"),
  col("ads.body").alias("Body"),
  col("ads.price").alias("Price"),
  col("ads.location.city").alias("City"),
  col("ads.location.zipcode").alias("Zipcode"),
  explode(col("ads.attribut")) // one row per Attribut element, in a column named "col"
)

val resultDF = df1.select(
  col("Subject"), col("Body"), col("Price"), col("City"), col("Zipcode"),
  col("col.key_label"), col("col.value_label")
)

resultDF.show would output:

+-------+---------------+-----+-------+-------+---------+-----------+
|Subject|           Body|Price|   City|Zipcode|key_label|value_label|
+-------+---------------+-----+-------+-------+---------+-----------+
|    abc|doing something|   13|Houston|  39014|        a|          b|
+-------+---------------+-----+-------+-------+---------+-----------+

To output as a single CSV file in the specified directory, with headers:

resultDF.repartition(1).write.option("header", "true").csv("/tmp/my-output-dir/") 
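As noted above, if some records have an empty or missing attribut array and you still want to keep their rows, swap explode for explode_outer in the first select (a sketch with the same column names):

val df1Outer = df.select(
  col("ads.subject").alias("Subject"),
  col("ads.body").alias("Body"),
  col("ads.price").alias("Price"),
  col("ads.location.city").alias("City"),
  col("ads.location.zipcode").alias("Zipcode"),
  explode_outer(col("ads.attribut")) // keeps rows whose attribut is null or empty, with null key/value labels
)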
