I am using the Apache Spark to Azure Cosmos DB connector to store a Scala DataFrame in Cosmos DB. This works, but there is an odd thing with the element order in the stored document.
```scala
import org.apache.spark.sql.types._

val schema = new StructType()
  .add("a", StringType, true)
  .add("b", StringType, true)
  .add("c", new StructType()
    .add("d", StringType, true)
    .add("e", StringType, true)
    .add("f", StringType, true)
  )

val dataDS = Seq("""
{
  "a": "a",
  "b": "b",
  "c": {
    "d": "d",
    "e": "e",
    "f": "f"
  }
}""").toDS()

val df = spark.read.schema(schema).json(dataDS)
println(df.printSchema())
df.write.mode(SaveMode.Overwrite).cosmosDB(writeConfig)
```

generates this document in CosmosDB:
{ "a": "a", "b": "b", "c": { "d": "d", "e": "e", "f": "f" }, "id": "7c2ef8b9-86a6-4aa3-b190-d5083c885ea8", "_rid": ..... } while this code
```scala
import org.apache.spark.sql.types._

val schema = new StructType()
  .add("a", StringType, true)
  .add("beta", StringType, true)   <<-----
  .add("c", new StructType()
    .add("d", StringType, true)
    .add("echo", StringType, true) <<-----
    .add("f", StringType, true)
  )

val dataDS = Seq("""
{
  "a": "a",
  "beta": "b",                     <<-----
  "c": {
    "d": "d",
    "echo": "e",                   <<-----
    "f": "f"
  }
}""").toDS()

val df = spark.read.schema(schema).json(dataDS)
println(df.printSchema())
df.write.mode(SaveMode.Overwrite).cosmosDB(writeConfig)
```

generates this document in CosmosDB:
{ "a": "a", "c": { "d": "d", "f": "f", "echo": "e" <<----- }, "id": "509c6c94-139a-4b73-a2dc-1ff424519adb", "beta": "b", <<----- "_rid": ..... } Why is it so that the order of the elements are modified in the two examples. I want the structure of the document to be as in the first example. I am not sure why chaning e->echo and b->beta changes the document structure in CosmosDB.
Does anyone have an idea as to why this is happening and what can be done to solve it?
`printSchema`, as the name says, prints. Thus, you don't need to call `println(df.printSchema())`; since `printSchema` returns `Unit`, the `println` would just print the `Unit` value, which is `()`.
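A minimal sketch of the difference, assuming the `df` from the question (the expected-output comments reflect its schema):

```scala
// printSchema writes the schema tree to stdout and returns Unit
df.printSchema()
// root
//  |-- a: string (nullable = true)
//  |-- b: string (nullable = true)
//  |-- c: struct (nullable = true)
//  |    |-- d: string (nullable = true)
//  |    |-- e: string (nullable = true)
//  |    |-- f: string (nullable = true)

// Wrapping it in println prints the schema tree first, then the Unit value "()"
println(df.printSchema())

// If you need the schema as a String value instead, use StructType.treeString
val schemaText: String = df.schema.treeString
```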