I'm trying to read a JSON file using spark.read.json("<path>"), but Spark sorts the columns alphabetically by default.
There are a lot of nested columns, and new columns are added to the schema frequently, so I can't define the schema for all the columns manually.
Is there any way to preserve the column order when reading with spark.read.json, without defining the schema manually?
Example:
```python
json_str = """{"zip":"a","address":{"state":"la","pin":"1234","city":"go"},"street":"bar","building":"123"}"""
spark.read.json(sc.parallelize([json_str])).printSchema()
# root
#  |-- address: struct (nullable = true)
#  |    |-- city: string (nullable = true)
#  |    |-- pin: string (nullable = true)
#  |    |-- state: string (nullable = true)
#  |-- building: string (nullable = true)
#  |-- street: string (nullable = true)
#  |-- zip: string (nullable = true)
```
As you can see, zip is the first key in the source JSON string, but Spark puts that column last.
I also tried schema_of_json, but the column order is still not preserved:
```python
spark.sql("""select schema_of_json('{"zip":"a","address":{"state":"la","pin":"1234","city":"go"},"street":"bar","building":"123"}') as json_schema""").show(10, False)
#+----------------------------------------------------------------------------------------------------+
#|json_schema                                                                                         |
#+----------------------------------------------------------------------------------------------------+
#|struct<address:struct<city:string,pin:string,state:string>,building:string,street:string,zip:string>|
#+----------------------------------------------------------------------------------------------------+
```
Please let me know if there is any way to preserve the order without defining the schema manually.
Thanks for the help!
With spark.read.json, Spark takes a sample from your JSON file and infers the schema from it. How about doing the same thing manually: extract the schema dynamically from a sample record yourself, preserving the key order, then pass that schema to the reader?