Column order not preserved after reading a JSON file with Spark

Question

I'm trying to read a JSON file using spark.read.json("<path>"), but Spark sorts the columns alphabetically by default.

A lot of nested and new columns are added to the schema frequently, so I can't define the schema for all the columns by hand.

Is there a way to preserve the column order when reading with spark.read.json, without defining the schema manually?

Example:

    json_str = """{"zip":"a","address":{"state":"la","pin":"1234","city":"go"},"street":"bar","building":"123"}"""
    spark.read.json(sc.parallelize([json_str])).printSchema()
    #root
    # |-- address: struct (nullable = true)
    # |    |-- city: string (nullable = true)
    # |    |-- pin: string (nullable = true)
    # |    |-- state: string (nullable = true)
    # |-- building: string (nullable = true)
    # |-- street: string (nullable = true)
    # |-- zip: string (nullable = true)

As you can see, zip is the first key in the source JSON string, but Spark places that column last.

I also tried schema_of_json, and the column order is still not preserved:

    spark.sql("""select schema_of_json('{"zip":"a","address":{"state":"la","pin":"1234","city":"go"},"street":"bar","building":"123"}') as json_schema""").show(10, False)
    #+----------------------------------------------------------------------------------------------------+
    #|json_schema                                                                                         |
    #+----------------------------------------------------------------------------------------------------+
    #|struct<address:struct<city:string,pin:string,state:string>,building:string,street:string,zip:string>|
    #+----------------------------------------------------------------------------------------------------+

Is there any way to preserve the order without defining the schema manually?

Thanks for the help!

If you don't pass a schema to spark.read.json, Spark takes a sample from your JSON file and infers the schema from it. How about doing the same thing manually: extract the schema dynamically, then use it?
– pltc, Commented May 16, 2021 at 5:43
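
One way to read this suggestion, as a rough sketch rather than a confirmed solution: parse a single sample record with Python's json module (which keeps keys in source order), turn it into a DDL schema string, and hand that string to the reader. The helper names ddl_type and ddl_schema are hypothetical, and the type mapping (integers to BIGINT, null values to STRING, the first element deciding an array's type) is an assumption. Only fields present in the sample record end up in the schema, so a record that lacks newly added columns would cause them to be dropped on read.

```python
import json


def _quote(name):
    # Backtick-quote a field name, doubling any backticks inside it.
    return "`" + name.replace("`", "``") + "`"


def ddl_type(value):
    # Map a parsed JSON value to a Spark SQL DDL type (assumed mapping).
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "BOOLEAN"
    if isinstance(value, int):
        return "BIGINT"
    if isinstance(value, float):
        return "DOUBLE"
    if isinstance(value, dict):
        # dict preserves insertion order, so nested fields keep source order
        fields = ", ".join(f"{_quote(k)}: {ddl_type(v)}" for k, v in value.items())
        return f"STRUCT<{fields}>"
    if isinstance(value, list):
        # Assumption: the first element is representative of the whole array
        elem = ddl_type(value[0]) if value else "STRING"
        return f"ARRAY<{elem}>"
    return "STRING"  # strings and nulls


def ddl_schema(json_str):
    # Build a DDL schema string whose field order follows the sample record.
    record = json.loads(json_str)
    return ", ".join(f"{_quote(k)} {ddl_type(v)}" for k, v in record.items())
```

With the json_str from the question, this could then be used as spark.read.schema(ddl_schema(json_str)).json(sc.parallelize([json_str])), since the reader's schema method accepts a DDL-formatted string as well as a StructType.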

Michael Westblade · Accepted Answer · 2021-04-28 01:42:15Z

2

You can use select to define the order:

    df = spark.read.json(sc.parallelize([json_str]))
    df.select("zip","address".....).show()
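
If typing the column names by hand is the sticking point, the list passed to select could be derived from a sample record instead. The sketch below assumes one representative JSON string is available; ordered_columns is a hypothetical helper, not part of Spark. It only reorders top-level columns, so fields inside a struct such as address keep the order Spark inferred.

```python
import json


def ordered_columns(columns, json_str):
    # Key order of the sample record (json.loads keeps source order).
    wanted = list(json.loads(json_str).keys())
    # Columns the sample knows about, in source order.
    present = [c for c in wanted if c in columns]
    # Columns Spark found that the sample lacks, kept in their existing order.
    extra = [c for c in columns if c not in wanted]
    return present + extra
```

Usage would be df.select(*ordered_columns(df.columns, json_str)). Column names containing dots would need backtick quoting before being passed to select.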

2 Comments

ashley Over a year ago

Thanks, Michael. Yes, we can use .select, but as mentioned in the question, I was looking to preserve the order without manually specifying the schema.

Michael Westblade Over a year ago

Ah, ok. My guess is there's probably a way to do what you want but it's not worth the effort.
