I have multiple source JSON files where each JSON file will have different schemas present in it.
For example below 3 Json files, I've presented it in a tabular view for better understanding, but I've also provided the schema as well:
Json File 1:
| Key | Value | Column_Name |
|---|---|---|
sample_column.pull.notify.roid.alert | aaa | Column A |
sample_column.pull.notify.roid.title | bbb | Column B |
sample_column.pull.notify.roid.action.pan.content | ccc | Column C |
JSON File 1 Schema:
|-- sample_column: struct (nullable = true) | |-- pull: struct (nullable = true) | | |-- notify: struct (nullable = true) | | | |-- roid: struct (nullable = true) | | | | |-- action: struct (nullable = true) | | | | | |-- pan: struct (nullable = true) | | | | | | |-- content: string (nullable = true) | | | | |-- alert: string (nullable = true) | | | | |-- title: string (nullable = true) Json File 2:
| Key | Value | Column_Name |
|---|---|---|
sample_column.cone.pull.notify.roid.alert | a1a1a1 | Column A |
sample_column.cone.pull.notify.roid.title | b1b1b1 | Column B |
sample_column.cone.pull.notify.roid.action.pan.content | c1c1c1 | Column C |
JSON File 2 Schema:
|-- sample_column: struct (nullable = true) | |-- cone: struct (nullable = true) | | |-- pull: struct (nullable = true) | | | |-- notify: struct (nullable = true) | | | | |-- roid: struct (nullable = true) | | | | | |-- action: struct (nullable = true) | | | | | | |-- pan: struct (nullable = true) | | | | | | | |-- content: string (nullable = true) | | | | | |-- alert: string (nullable = true) | | | | | |-- title: string (nullable = true) Json File 3:
| Key | Value | Column_Name |
|---|---|---|
sample_column.var.pull.notify.roid.alert | a2a2a2 | Column A |
sample_column.var.pull.notify.roid.title | b2b2b2 | Column B |
sample_column.var.pull.notify.roid.action.pan.content | c2c2c2 | Column C |
JSON File 3 Schema:
|-- sample_column: struct (nullable = true) | |-- var: struct (nullable = true) | | |-- pull: struct (nullable = true) | | | |-- notify: struct (nullable = true) | | | | |-- roid: struct (nullable = true) | | | | | |-- action: struct (nullable = true) | | | | | | |-- pan: struct (nullable = true) | | | | | | | |-- content: string (nullable = true) | | | | | |-- alert: string (nullable = true) | | | | | |-- title: string (nullable = true) I need to get "alert", "title" and "content" from those 3 different schemas and put all the "Alert" values under Column A, then all the "Title" values under Column B and all the "Content" value under Column C
This will be the expected output Table:
| Column A | Column B | Column C |
|---|---|---|
| aaa | bbb | ccc |
| a1a1a1 | b1b1b1 | c1c1c1 |
| a2a2a2 | b2b2b2 | c2c2c2 |
So, as the above scenario I tried to append all different schema like all ('a') values under column A then ('b') values under column B and ('c') values under column C but the values are getting overwritten like below:
| Column A | Column B | Column C |
|---|---|---|
| a2a2a2 | b2b2b2 | c2c2c2 |
Below is the Pyspark Code that I've tried which overwrites the values but not appending it:
def has_column(df, col): # Checks whether the dataframe has a column, if the dataframe has a column, returns True else False. try: df[col] return True except: return False if has_column(df, "sample_column.pull.notify.roid.alert"): try: df = df.withColumn('Column A', when(df.sample_column.pull.notify.roid.alert.isNotNull(), df["sample_column.pull.notify.roid.alert"])) except: df = df.withColumn('Column A', lit("").cast("string")) if has_column(df, "sample_column.cone.pull.notify.roid.title"): try: df = df.withColumn('Column A', when(df.sample_column.cone.pull.notify.roid.title.isNotNull(), df["sample_column.cone.pull.notify.roid.title"])) except: df = df.withColumn('Column A', lit("").cast("string")) if has_column(df, "sample_column.var.pull.notify.roid.action.pan.content"): try: df = df.withColumn('Column A', when(df.sample_column.var.pull.notify.roid.action.pan.content.isNotNull(), df["sample_column.var.pull.notify.roid.action.pan.content"])) except: df = df.withColumn('Column A', lit("").cast("string")) Help is much appreciated!
df.printSchema()? it is a little hard to understand whether you mean the key is the nested object or literally your key has "." in it.