Pyspark - normalize a dataframe

Question

I have this API response:

{ '02/09/2021': { 'ABC': { 'emp': 'A1', 'value': '12421' }, 'DEF': { 'emp': 'D1', 'value': '3345' }, 'GHI': { 'emp': 'G2', 'value': '260048836600' }, 'JKL': { 'emp': 'J1', 'value': '66654654' } } }

and would like to normalize a table in this format:

CODE | EMP | VALUE | DATE ======================================== ABC | A1 | 12421 | 02/09/2021 DEF | D1 | 3445 | 02/09/2021 GHI | G2 | 260048836600 | 02/09/2021 JKL | J1 | 66654654 | 02/09/2021

I tried to make an explode but I couldn't do it, how can I get this result?

To reproduce it:

import json api_response = {'02/09/2021':{'ABC':{'emp':'A1','value':'12421'},'DEF':{'emp':'D1','value':'3345'},'GHI':{'emp':'G2','value':'260048836600'},'JKL':{'emp':'J1','value':'66654654'}}} rdd = spark.sparkContext.parallelize([json.dumps(api_response)]) input_df = spark.read.json(rdd)

yeah but I don't know how to read it. Update your question doing these two steps, first collecting some data from your dataframe running [dataframe].limit(3).collect() and get the schema [dataframe].schema.simpleString() where [dataframe] is your variable name. It will be easier to reproduce it — Kafels
– Kafels, Commented Sep 19, 2021 at 21:30

Kafels · Accepted Answer · 2021-09-19 22:46:11Z

The easier way to parse your api response into dataframe is first transforming the result, removing the key date and putting it into the body. To transforming it into a table it's required to use stack:

import json def transform_api_content(api_response): for key, value in api_response.items(): value['DATE'] = key yield json.dumps(value) api_response = {'02/09/2021':{'ABC':{'emp':'A1','value':'12421'},'DEF':{'emp':'D1','value':'3345'},'GHI':{'emp':'G2','value':'260048836600'},'JKL':{'emp':'J1','value':'66654654'}}} input_df = spark.read.json(sc.parallelize(transform_api_content(api_response))) # Hard-coded stack = 'stack(4, "ABC", ABC.emp, ABC.value, "DEF", DEF.emp, DEF.value, "GHI", GHI.emp, GHI.value, "JKL", JKL.emp, JKL.value) AS (CODE, EMP, VALUE)' output_df = input_df.selectExpr(stack, 'DATE') output_df.show(truncate=False) # Dynamic def stack_columns(dataframe): struct_cols = [column for column, schema in dataframe.dtypes if schema.startswith('struct')] stack_expr = ['"{0}", {0}.emp, {0}.value'.format(column) for column in struct_cols] return 'stack({length}, {stack}) AS (CODE, EMP, VALUE)'.format(length=len(struct_cols), stack=','.join(stack_expr)) output_df = input_df.selectExpr(stack_columns(input_df), 'DATE') output_df.show(truncate=False)

Output

+----+---+------------+----------+ |CODE|EMP|VALUE |DATE | +----+---+------------+----------+ |ABC |A1 |12421 |02/09/2021| |DEF |D1 |3345 |02/09/2021| |GHI |G2 |260048836600|02/09/2021| |JKL |J1 |66654654 |02/09/2021| +----+---+------------+----------+

Collectives™ on Stack Overflow

Pyspark - normalize a dataframe

1 Answer 1

Comments

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Related