I'm reading a Hive table that has two columns, id and jsonString. I can easily transform the jsonString into a Spark data structure by calling spark.read.json, but I also have to keep the id column.
val jsonStr1 = """{"fruits":[{"fruit":"banana"},{"fruid":"apple"},{"fruit":"pera"}],"bar":{"foo":"[\"daniel\",\"pedro\",\"thing\"]"},"daniel":"daniel data random","cars":["montana","bagulho"]}""" val jsonStr2 = """{"fruits":[{"dt":"banana"},{"fruid":"apple"},{"fruit":"pera"}],"bar":{"foo":"[\"daniel\",\"pedro\",\"thing\"]"},"daniel":"daniel data random","cars":["montana","bagulho"]}""" val jsonStr3 = """{"fruits":[{"a":"banana"},{"fruid":"apple"},{"fruit":"pera"}],"bar":{"foo":"[\"daniel\",\"pedro\",\"thing\"]"},"daniel":"daniel data random","cars":["montana","bagulho"]}""" case class Foo(id: Integer, json: String) val ds = Seq(new Foo(1,jsonStr1), new Foo(2,jsonStr2), new Foo(3,jsonStr3)).toDS val jsonDF = spark.read.json(ds.select($"json").rdd.map(r => r.getAs[String](0)).toDS) jsonDF.show() jsonDF.show +--------------------+------------------+------------------+--------------------+ | bar| cars| daniel| fruits| +--------------------+------------------+------------------+--------------------+ |[["daniel","pedro...|[montana, bagulho]|daniel data random|[[,,, banana], [,...| |[["daniel","pedro...|[montana, bagulho]|daniel data random|[[, banana,,], [,...| |[["daniel","pedro...|[montana, bagulho]|daniel data random|[[banana,,,], [,,...| +--------------------+------------------+------------------+--------------------+ I would like to add the column id from the Hive table, like this:
+--------------------+------------------+------------------+--------------------+---+
|                 bar|              cars|            daniel|              fruits| id|
+--------------------+------------------+------------------+--------------------+---+
|[["daniel","pedro...|[montana, bagulho]|daniel data random|[[,,, banana], [,...|  1|
|[["daniel","pedro...|[montana, bagulho]|daniel data random|[[, banana,,], [,...|  2|
|[["daniel","pedro...|[montana, bagulho]|daniel data random|[[banana,,,], [,,...|  3|
+--------------------+------------------+------------------+--------------------+---+

I don't want to use regular expressions for this.
I created a UDF that takes these two fields as arguments and, using a proper JSON library, injects the desired field (id) and returns a new JSON string. It works like a charm, but I hope the Spark API offers a better way to do it. I'm using Apache Spark 2.3.0.
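For reference, here is a minimal sketch of that UDF approach, assuming the json4s library that is bundled with Spark; addId and withId are names I made up for this example:

import org.apache.spark.sql.functions.udf
import org.json4s._
import org.json4s.jackson.JsonMethods._

// Hypothetical helper: merges an "id" field into the top level of the JSON string.
val addId = udf { (id: Int, json: String) =>
  compact(render(parse(json) merge JObject("id" -> JInt(id))))
}

// Rewrite each row's JSON to carry its id, then let spark.read.json infer the schema.
val withId = ds.select(addId($"id", $"json").as("json")).as[String]
val jsonDF = spark.read.json(withId)
jsonDF.show()

The merge keeps all of the original keys and simply adds id at the top level, so the inferred schema ends up with the extra id column shown above.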