Scala: Incorrect element order when storing dataframe to CosmosDB in Databricks

Question

I am using the Apache Spark to Azure Cosmos DB connector to store a DataFrame from Scala in Cosmos DB. This works, but the order of the elements in the stored document is odd.

    import org.apache.spark.sql.types._

    val schema = new StructType()
      .add("a", StringType, true)
      .add("b", StringType, true)
      .add("c", new StructType()
        .add("d", StringType, true)
        .add("e", StringType, true)
        .add("f", StringType, true)
      )

    val dataDS = Seq("""
    {
      "a": "a",
      "b": "b",
      "c": {
        "d": "d",
        "e": "e",
        "f": "f"
      }
    }""").toDS()

    val df = spark.read.schema(schema).json(dataDS)
    println(df.printSchema())
    df.write.mode(SaveMode.Overwrite).cosmosDB(writeConfig)

generates this document in CosmosDB

    {
      "a": "a",
      "b": "b",
      "c": {
        "d": "d",
        "e": "e",
        "f": "f"
      },
      "id": "7c2ef8b9-86a6-4aa3-b190-d5083c885ea8",
      "_rid": .....
    }

while this code

    import org.apache.spark.sql.types._

    val schema = new StructType()
      .add("a", StringType, true)
      .add("beta", StringType, true)     // <-- renamed from "b"
      .add("c", new StructType()
        .add("d", StringType, true)
        .add("echo", StringType, true)   // <-- renamed from "e"
        .add("f", StringType, true)
      )

    // The keys "beta" and "echo" below are renamed to match the schema.
    val dataDS = Seq("""
    {
      "a": "a",
      "beta": "b",
      "c": {
        "d": "d",
        "echo": "e",
        "f": "f"
      }
    }""").toDS()

    val df = spark.read.schema(schema).json(dataDS)
    println(df.printSchema())
    df.write.mode(SaveMode.Overwrite).cosmosDB(writeConfig)

generates this document in CosmosDB

    {
      "a": "a",
      "c": {
        "d": "d",
        "f": "f",
        "echo": "e"      <<-----
      },
      "id": "509c6c94-139a-4b73-a2dc-1ff424519adb",
      "beta": "b",       <<-----
      "_rid": .....
    }

Why does the order of the elements differ between the two examples? I want the structure of the document to be as in the first example. I am not sure why changing e -> echo and b -> beta changes the document structure in Cosmos DB.

Does anyone have an idea as to why this is happening and what can be done to solve it?

I'm not sure why it happens, but your data looks like JSON, and as such the order should not matter at all. Is there any reason for you to be concerned about that, or is it not JSON? Also, printSchema, as the name says, prints. Thus you don't need to call println(df.printSchema); that will print the Unit value, which is ().
– Luis Miguel Mejía Suárez, Commented Jan 2, 2019 at 15:41

mauridb · Accepted Answer · 2019-01-03 01:48:49Z

Cosmos DB doesn't really store JSON as JSON; it decomposes it into an ARS (atom-record-sequence) structure:

https://azure.microsoft.com/en-us/blog/a-technical-overview-of-azure-cosmos-db/

When you ask for the JSON document, it recreates it on the fly, so the property order you wrote is not necessarily the order you get back. More details on the internal representation can be found in the "Documents as Trees" section here:

http://www.vldb.org/pvldb/vol8/p1668-shukla.pdf

Why is preserving the order of properties important for you?
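To illustrate why the reordering is usually harmless, here is a minimal sketch in plain Scala (no Spark or Cosmos DB involved). It assumes the two documents are modelled as nested maps; the names `PropertyOrderDemo`, `written`, `readBack` and `lookup` are hypothetical and only serve the example. `ListMap` is used so that each map really does keep a different insertion order, mirroring the document as written and as returned by Cosmos DB.

```scala
import scala.collection.immutable.ListMap

object PropertyOrderDemo {
  // Hypothetical stand-in for the document as it was written
  // (system properties such as "id" and "_rid" are left out).
  val written: Map[String, Any] = ListMap(
    "a"    -> "a",
    "beta" -> "b",
    "c"    -> ListMap("d" -> "d", "echo" -> "e", "f" -> "f")
  )

  // Hypothetical stand-in for the document as Cosmos DB returned it:
  // same properties, different order.
  val readBack: Map[String, Any] = ListMap(
    "a"    -> "a",
    "c"    -> ListMap("d" -> "d", "f" -> "f", "echo" -> "e"),
    "beta" -> "b"
  )

  // Look a property up by name, following a path through nested maps.
  // Returns None if any segment of the path is missing.
  def lookup(doc: Map[String, Any], path: String*): Option[Any] =
    path.toList match {
      case Nil => Some(doc)
      case key :: rest =>
        doc.get(key).flatMap {
          case child: Map[_, _] =>
            lookup(child.asInstanceOf[Map[String, Any]], rest: _*)
          case leaf =>
            if (rest.isEmpty) Some(leaf) else None
        }
    }

  def main(args: Array[String]): Unit = {
    // Map equality compares key/value pairs and ignores iteration order.
    println(written == readBack)            // true
    println(lookup(readBack, "c", "echo"))  // Some(e)
  }
}
```

As long as the consuming code reads properties by name, as `lookup` does here, the order in which Cosmos DB returns them makes no difference. Order only becomes a problem for code that compares raw JSON strings or relies on field position.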
