
I am using the Apache Spark to Azure Cosmos DB connector to write a DataFrame from Scala to Cosmos DB. This works, but the order of the elements in the stored document is odd.

    import org.apache.spark.sql.types._

    val schema = new StructType()
      .add("a", StringType, true)
      .add("b", StringType, true)
      .add("c", new StructType()
        .add("d", StringType, true)
        .add("e", StringType, true)
        .add("f", StringType, true)
      )

    val dataDS = Seq("""
    {
      "a": "a",
      "b": "b",
      "c": {
        "d": "d",
        "e": "e",
        "f": "f"
      }
    }""").toDS()

    val df = spark.read.schema(schema).json(dataDS)
    println(df.printSchema())

    df.write.mode(SaveMode.Overwrite).cosmosDB(writeConfig)

generates this document in CosmosDB

    {
      "a": "a",
      "b": "b",
      "c": {
        "d": "d",
        "e": "e",
        "f": "f"
      },
      "id": "7c2ef8b9-86a6-4aa3-b190-d5083c885ea8",
      "_rid": .....
    }

while this code

    import org.apache.spark.sql.types._

    val schema = new StructType()
      .add("a", StringType, true)
      .add("beta", StringType, true)      <<-----
      .add("c", new StructType()
        .add("d", StringType, true)
        .add("echo", StringType, true)    <<-----
        .add("f", StringType, true)
      )

    val dataDS = Seq("""
    {
      "a": "a",
      "beta": "b",      <<-----
      "c": {
        "d": "d",
        "echo": "e",    <<-----
        "f": "f"
      }
    }""").toDS()

    val df = spark.read.schema(schema).json(dataDS)
    println(df.printSchema())

    df.write.mode(SaveMode.Overwrite).cosmosDB(writeConfig)

generates this document in CosmosDB

    {
      "a": "a",
      "c": {
        "d": "d",
        "f": "f",
        "echo": "e"     <<-----
      },
      "id": "509c6c94-139a-4b73-a2dc-1ff424519adb",
      "beta": "b",      <<-----
      "_rid": .....
    }

Why does the order of the elements differ between the two examples? I want the structure of the document to be as in the first example. I am not sure why changing e -> echo and b -> beta changes the document structure in Cosmos DB.

Does anyone have an idea as to why this is happening and what can be done to solve it?

  • I'm not sure why it happens, but your data looks like JSON, and as such, order should not matter at all. Is there any reason for you to be concerned about that, or is it not JSON? Also, printSchema, as the name says, prints. Thus, you don't need to call println(df.printSchema()); that only prints the Unit value, which is (). Commented Jan 2, 2019 at 15:41
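The point about `println(df.printSchema())` can be seen without Spark. The following is a minimal sketch in plain Scala; `printSchemaLike` is a hypothetical stand-in for a method that, like `printSchema`, prints as a side effect and returns `Unit`.

```scala
object UnitPrintSketch {
  // Hypothetical stand-in for df.printSchema(): prints as a side effect
  // and returns Unit, so there is no useful value to pass to println.
  def printSchemaLike(): Unit = println("root")

  def main(args: Array[String]): Unit = {
    // Wrapping the call in println prints the text first, then the
    // string form of the Unit value, which is "()".
    println(printSchemaLike())

    // Calling the method on its own is enough.
    printSchemaLike()
  }
}
```

In the question's code, this suggests replacing `println(df.printSchema())` with a plain `df.printSchema()`.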

1 Answer


Cosmos DB doesn't really store JSON as JSON; it decomposes each document into an atom-record-sequence (ARS) structure:

https://azure.microsoft.com/en-us/blog/a-technical-overview-of-azure-cosmos-db/

When you ask for the JSON, Cosmos DB recreates it on the fly. More details on the internal representation can be found in the "Documents as Trees" section of this paper:

http://www.vldb.org/pvldb/vol8/p1668-shukla.pdf

Why is preserving the order of properties important to you?
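To illustrate why consumers should not depend on property order, here is a minimal sketch in plain Scala, with no Spark or Cosmos DB involved. The `Json`, `Str`, `Obj`, and `lookup` names are hypothetical and exist only to model the two documents from the question; they are not part of the connector or of any Cosmos DB API. The sketch assumes the `id` and system properties such as `_rid` can be ignored for the comparison.

```scala
import scala.collection.immutable.ListMap

object PropertyOrderSketch {
  // Hypothetical model of a JSON value: a string leaf or a nested object.
  // ListMap keeps insertion order, so each value remembers how it was written.
  sealed trait Json
  final case class Str(value: String) extends Json
  final case class Obj(fields: ListMap[String, Json]) extends Json

  // Property order as declared in the DataFrame schema.
  val written: Json = Obj(ListMap(
    "a" -> Str("a"),
    "beta" -> Str("b"),
    "c" -> Obj(ListMap("d" -> Str("d"), "echo" -> Str("e"), "f" -> Str("f")))
  ))

  // Property order as returned in the question's second example
  // (id and system properties omitted).
  val stored: Json = Obj(ListMap(
    "a" -> Str("a"),
    "c" -> Obj(ListMap("d" -> Str("d"), "f" -> Str("f"), "echo" -> Str("e"))),
    "beta" -> Str("b")
  ))

  // Resolve a value by property names rather than by position.
  def lookup(doc: Json, path: String*): Option[Json] =
    path.foldLeft(Option(doc)) {
      case (Some(Obj(fields)), key) => fields.get(key)
      case _                        => None
    }

  def main(args: Array[String]): Unit = {
    // Scala Map equality ignores entry order, so the documents compare equal.
    println(written == stored)
    // Lookup by name finds the same value in either ordering.
    println(lookup(stored, "c", "echo"))
  }
}
```

If a fixed order matters only for display or export, one option is to impose it on the reading side (for example by selecting fields by name in the desired order) rather than expecting the store to preserve it.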
