I'm trying to read a JSON file using spark.read.json("<path>"), but Spark sorts the columns alphabetically by default.

There are a lot of nested columns, and new columns are added to the schema frequently, so I can't define the schema for all of them by hand.

Is there any way to preserve the column order when reading with spark.read.json without defining the schema manually?

Example:

    json_str = """{"zip":"a","address":{"state":"la","pin":"1234","city":"go"},"street":"bar","building":"123"}"""
    spark.read.json(sc.parallelize([json_str])).printSchema()
    # root
    #  |-- address: struct (nullable = true)
    #  |    |-- city: string (nullable = true)
    #  |    |-- pin: string (nullable = true)
    #  |    |-- state: string (nullable = true)
    #  |-- building: string (nullable = true)
    #  |-- street: string (nullable = true)
    #  |-- zip: string (nullable = true)

As you can see, zip is the first key in the source JSON string, but Spark places that column last.

I also tried schema_of_json, and the column order is still not preserved:

    spark.sql("""select schema_of_json('{"zip":"a","address":{"state":"la","pin":"1234","city":"go"},"street":"bar","building":"123"}') as json_schema""").show(10, False)
    # +----------------------------------------------------------------------------------------------------+
    # |json_schema                                                                                         |
    # +----------------------------------------------------------------------------------------------------+
    # |struct<address:struct<city:string,pin:string,state:string>,building:string,street:string,zip:string>|
    # +----------------------------------------------------------------------------------------------------+

Please let me know if there is any way to preserve the order without defining the schema manually.

Thanks for the help!

  • If you're not passing schema to spark.read.json, Spark would take a sample from your JSON file and infer schema from it. How about you do the same thing manually to extract the schema dynamically, then use it? Commented May 16, 2021 at 5:43
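One way to act on this suggestion is sketched below, assuming a single sample record carries every field of interest. Python's json.loads keeps object keys in source order (dicts are insertion-ordered since Python 3.7), so a sample can be turned into a DDL schema string whose field order matches the file. The helper names ddl_type and schema_ddl are hypothetical, and the type mapping is a simplification, not Spark's own inference.

```python
import json


def ddl_type(value):
    """Map a parsed JSON value to a Spark SQL DDL type, keeping key order."""
    if isinstance(value, dict):
        fields = ", ".join(f"`{k}`: {ddl_type(v)}" for k, v in value.items())
        return f"STRUCT<{fields}>"
    if isinstance(value, list):
        # Assumption: arrays are homogeneous, so the first element decides the type.
        return f"ARRAY<{ddl_type(value[0]) if value else 'STRING'}>"
    if isinstance(value, bool):  # must precede int: bool is a subclass of int
        return "BOOLEAN"
    if isinstance(value, int):
        return "BIGINT"
    if isinstance(value, float):
        return "DOUBLE"
    return "STRING"  # strings, and nulls as a fallback


def schema_ddl(sample_json):
    """Build a top-level DDL schema string from one sample JSON object."""
    record = json.loads(sample_json)
    return ", ".join(f"`{k}` {ddl_type(v)}" for k, v in record.items())


json_str = """{"zip":"a","address":{"state":"la","pin":"1234","city":"go"},"street":"bar","building":"123"}"""
ddl = schema_ddl(json_str)
print(ddl)

# With a live SparkSession, the string can be passed as the schema, e.g.:
# spark.read.schema(ddl).json("<path>").printSchema()
```

Because the schema comes from one record, fields that appear only in other records would be dropped on read; sampling several records and merging their keys would be needed for a schema that changes often.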

1 Answer

You can use select to define the column order:

    df = spark.read.json(sc.parallelize([json_str]))
    df.select("zip", "address".....).show()
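If typing the column list by hand is the objection, the list passed to select could be derived from the data itself. This is a minimal sketch, assuming the file is line-delimited JSON and that reading a few sample lines on the driver is acceptable. The helper first_seen_order is hypothetical, and it only fixes the top-level order; nested struct fields would keep Spark's inferred order.

```python
import json


def first_seen_order(json_lines):
    """Return top-level keys in the order they first appear across the lines."""
    order = {}
    for line in json_lines:
        for key in json.loads(line):
            order.setdefault(key, None)  # dict keeps insertion order
    return list(order)


# Hypothetical sample lines; the second one introduces a new column.
sample = [
    '{"zip":"a","address":{"state":"la"},"street":"bar","building":"123"}',
    '{"zip":"b","country":"us","street":"baz"}',
]
cols = first_seen_order(sample)
print(cols)

# With the DataFrame from the question:
# df.select(*cols).show()
```

Keys that show up only in later records are appended at the end, so newly added columns do not disturb the existing order.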

2 Comments

Thanks Michael. Yes, we can use .select, but as mentioned in the question, I was looking to preserve the order without manually specifying the schema.
Ah, ok. My guess is there's probably a way to do what you want but it's not worth the effort.
