
I have a column of type array<struct>, inferred from a JSON file. I want to convert the array<struct> into a string so that I can keep this array column as-is in Hive and export it to an RDBMS as a single column.

temp.json

{"properties":{"items":[{"invoicid":{"value":"923659"},"job_id": {"value":"296160"},"sku_id": {"value":"312002"}}],"user_id":"6666","zip_code":"666"}} 

Processing:

scala> val temp = spark.read.json("s3://check/1/temp1.json")
temp: org.apache.spark.sql.DataFrame = [properties: struct<items: array<struct<invoicid:struct<value:string>,job_id:struct<value:string>,sku_id:struct<value:string>>>, user_id: string ... 1 more field>]

scala> temp.printSchema
root
 |-- properties: struct (nullable = true)
 |    |-- items: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- invoicid: struct (nullable = true)
 |    |    |    |    |-- value: string (nullable = true)
 |    |    |    |-- job_id: struct (nullable = true)
 |    |    |    |    |-- value: string (nullable = true)
 |    |    |    |-- sku_id: struct (nullable = true)
 |    |    |    |    |-- value: string (nullable = true)
 |    |-- user_id: string (nullable = true)
 |    |-- zip_code: string (nullable = true)

scala> temp.select("properties").show
+--------------------+
|          properties|
+--------------------+
|[WrappedArray([[9...|
+--------------------+

scala> temp.select("properties.items").show
+--------------------+
|               items|
+--------------------+
|[[[923659],[29616...|
+--------------------+

scala> temp.createOrReplaceTempView("tempTable")

scala> spark.sql("select properties.items from tempTable").show
+--------------------+
|               items|
+--------------------+
|[[[923659],[29616...|
+--------------------+

How can I get a result like this:

+-----------------------------------------------------------------------------------------+
|items                                                                                    |
+-----------------------------------------------------------------------------------------+
|[{"invoicid":{"value":"923659"},"job_id":{"value":"296160"},"sku_id":{"value":"312002"}}]|
+-----------------------------------------------------------------------------------------+

i.e., the array element values preserved without any change?
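(For context, once the column is a plain JSON string, the export step itself is a straightforward JDBC write. A minimal sketch, assuming a hypothetical MySQL endpoint and table name, and a DataFrame stringified whose items column is already a string:

import java.util.Properties

// Hypothetical connection details -- replace with your own RDBMS endpoint.
val jdbcUrl = "jdbc:mysql://host:3306/mydb"
val props = new Properties()
props.setProperty("user", "username")
props.setProperty("password", "password")

// stringified is assumed to hold items as a plain string column.
stringified.write.mode("append").jdbc(jdbcUrl, "items_table", props)
)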


1 Answer


to_json is the function you're looking for.

import org.apache.spark.sql.functions.{get_json_object, to_json}
import spark.implicits._  // for the $"..." column syntax

val df = spark.read.json(sc.parallelize(Seq(
  """ {"properties":{"items":[{"invoicid":{"value":"923659"},"job_id": {"value":"296160"},"sku_id": {"value":"312002"}}],"user_id":"6666","zip_code":"666"}}""")))

// Serialize the whole properties struct to JSON, then pull out the items array as a string.
df
  .select(get_json_object(to_json($"properties"), "$.items").alias("items"))
  .show(false)
+-----------------------------------------------------------------------------------------+
|items                                                                                    |
+-----------------------------------------------------------------------------------------+
|[{"invoicid":{"value":"923659"},"job_id":{"value":"296160"},"sku_id":{"value":"312002"}}]|
+-----------------------------------------------------------------------------------------+
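As a side note, on more recent Spark versions (2.4+), to_json can be applied directly to an array<struct> column, so the intermediate get_json_object step can likely be dropped. An untested sketch; verify against your Spark version:

// to_json serializes the nested array directly to a JSON string.
df
  .select(to_json($"properties.items").alias("items"))
  .show(false)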

2 Comments

How would you extract all the columns attached to the root struct? For example, if "properties" did not exist, I was hoping select(get_json_object(to_json(($".*")),"$.value")) would work, but it doesn't.
to_json(struct(df.columns.map(col): _*))
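Spelled out, that comment packs every top-level column into one struct and serializes it, which covers the no-root-struct case. A minimal self-contained sketch (the jsonPerRow name and alias are illustrative):

import org.apache.spark.sql.functions.{col, struct, to_json}

// Pack all top-level columns into a single struct, then serialize that struct to JSON.
// Works even when there is no single root column such as "properties".
val jsonPerRow = df.select(to_json(struct(df.columns.map(col): _*)).alias("json"))
jsonPerRow.show(false)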

