1

I have this API response:

{ '02/09/2021': { 'ABC': { 'emp': 'A1', 'value': '12421' }, 'DEF': { 'emp': 'D1', 'value': '3345' }, 'GHI': { 'emp': 'G2', 'value': '260048836600' }, 'JKL': { 'emp': 'J1', 'value': '66654654' } } } 

and would like to normalize a table in this format:

CODE | EMP | VALUE | DATE ======================================== ABC | A1 | 12421 | 02/09/2021 DEF | D1 | 3445 | 02/09/2021 GHI | G2 | 260048836600 | 02/09/2021 JKL | J1 | 66654654 | 02/09/2021 

I tried to make an explode but I couldn't do it, how can I get this result?


To reproduce it:

import json api_response = {'02/09/2021':{'ABC':{'emp':'A1','value':'12421'},'DEF':{'emp':'D1','value':'3345'},'GHI':{'emp':'G2','value':'260048836600'},'JKL':{'emp':'J1','value':'66654654'}}} rdd = spark.sparkContext.parallelize([json.dumps(api_response)]) input_df = spark.read.json(rdd) 
4
  • How are you reading it? Your JSON seems to be invalid Commented Sep 19, 2021 at 20:22
  • I use pyspark, this is a example… Commented Sep 19, 2021 at 20:52
  • yeah but I don't know how to read it. Update your question doing these two steps, first collecting some data from your dataframe running [dataframe].limit(3).collect() and get the schema [dataframe].schema.simpleString() where [dataframe] is your variable name. It will be easier to reproduce it Commented Sep 19, 2021 at 21:30
  • @Kafels done, tks! Commented Sep 19, 2021 at 21:56

1 Answer 1

2

The easier way to parse your api response into dataframe is first transforming the result, removing the key date and putting it into the body. To transforming it into a table it's required to use stack:

import json def transform_api_content(api_response): for key, value in api_response.items(): value['DATE'] = key yield json.dumps(value) api_response = {'02/09/2021':{'ABC':{'emp':'A1','value':'12421'},'DEF':{'emp':'D1','value':'3345'},'GHI':{'emp':'G2','value':'260048836600'},'JKL':{'emp':'J1','value':'66654654'}}} input_df = spark.read.json(sc.parallelize(transform_api_content(api_response))) # Hard-coded stack = 'stack(4, "ABC", ABC.emp, ABC.value, "DEF", DEF.emp, DEF.value, "GHI", GHI.emp, GHI.value, "JKL", JKL.emp, JKL.value) AS (CODE, EMP, VALUE)' output_df = input_df.selectExpr(stack, 'DATE') output_df.show(truncate=False) # Dynamic def stack_columns(dataframe): struct_cols = [column for column, schema in dataframe.dtypes if schema.startswith('struct')] stack_expr = ['"{0}", {0}.emp, {0}.value'.format(column) for column in struct_cols] return 'stack({length}, {stack}) AS (CODE, EMP, VALUE)'.format(length=len(struct_cols), stack=','.join(stack_expr)) output_df = input_df.selectExpr(stack_columns(input_df), 'DATE') output_df.show(truncate=False) 

Output

+----+---+------------+----------+ |CODE|EMP|VALUE |DATE | +----+---+------------+----------+ |ABC |A1 |12421 |02/09/2021| |DEF |D1 |3345 |02/09/2021| |GHI |G2 |260048836600|02/09/2021| |JKL |J1 |66654654 |02/09/2021| +----+---+------------+----------+ 
Sign up to request clarification or add additional context in comments.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.