If your column names are consistent (always hyphen-delimited, with the same casing and spelling), you can unpivot the columns to rows and extract the metadata embedded in each column name. That info can then be pivoted into the desired result.
from pyspark.sql import functions as func

# input data - maintain consistency in column names
data_sdf = spark.sparkContext.parallelize(data_ls). \
    toDF(['id',
          'Question-How are you-response',
          'Question-How are you-date',
          'Question-How are you-expected response',
          'Question-XXX-response',
          'Question-XXX-date',
          'Question-XXX-expected response'
          ])

# +---+-----------------------------+-------------------------+--------------------------------------+---------------------+-----------------+------------------------------+
# | id|Question-How are you-response|Question-How are you-date|Question-How are you-expected response|Question-XXX-response|Question-XXX-date|Question-XXX-expected response|
# +---+-----------------------------+-------------------------+--------------------------------------+---------------------+-----------------+------------------------------+
# |123|                      value a|                17-Aug-22|                               value b|              value c|        15-Aug-22|                       value d|
# +---+-----------------------------+-------------------------+--------------------------------------+---------------------+-----------------+------------------------------+
This consistency is what lets you recover the info after the unpivot: use split to break each column name into an array, then pick out the question and field by index.
unpivot_data_sdf = data_sdf. \
    withColumn('allcol_struct_arr',
               func.array(*[func.struct(func.lit(k).alias('key'), func.col(k).alias('val'))
                            for k in data_sdf.columns if k != 'id']
                          )
               ). \
    selectExpr('id', 'inline(allcol_struct_arr)'). \
    withColumn('key_split', func.split('key', '-')). \
    withColumn('question', func.col('key_split')[1]). \
    withColumn('question_field', func.col('key_split')[2])

# +---+--------------------------------------+---------+------------------------------------------+-----------+-----------------+
# |id |key                                   |val      |key_split                                 |question   |question_field   |
# +---+--------------------------------------+---------+------------------------------------------+-----------+-----------------+
# |123|Question-How are you-response         |value a  |[Question, How are you, response]         |How are you|response         |
# |123|Question-How are you-date             |17-Aug-22|[Question, How are you, date]             |How are you|date             |
# |123|Question-How are you-expected response|value b  |[Question, How are you, expected response]|How are you|expected response|
# |123|Question-XXX-response                 |value c  |[Question, XXX, response]                 |XXX        |response         |
# |123|Question-XXX-date                     |15-Aug-22|[Question, XXX, date]                     |XXX        |date             |
# |123|Question-XXX-expected response        |value d  |[Question, XXX, expected response]        |XXX        |expected response|
# +---+--------------------------------------+---------+------------------------------------------+-----------+-----------------+
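If you want to sanity-check what the split step yields before running it on the cluster, here is a plain-Python sketch of the same parsing logic (the column names are copied from the example above; no Spark needed):

```python
# Mirrors func.split('key', '-') followed by indexing [1] and [2]:
# each hyphen-delimited column name decomposes into (question, field).
columns = [
    'Question-How are you-response',
    'Question-How are you-date',
    'Question-How are you-expected response',
]

parsed = []
for name in columns:
    parts = name.split('-')              # e.g. ['Question', 'How are you', 'response']
    parsed.append((parts[1], parts[2]))  # (question, question_field)

print(parsed)
# [('How are you', 'response'), ('How are you', 'date'), ('How are you', 'expected response')]
```

This is also a quick way to spot inconsistent names: any column whose split produces fewer than three parts would raise an IndexError here, and would produce nulls in the Spark version.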
All that's left to do is to pivot the question_field column.
unpivot_data_sdf. \
    groupBy('id', 'question'). \
    pivot('question_field'). \
    agg(func.first('val')). \
    show(truncate=False)

# +---+-----------+---------+-----------------+--------+
# |id |question   |date     |expected response|response|
# +---+-----------+---------+-----------------+--------+
# |123|XXX        |15-Aug-22|value d          |value c |
# |123|How are you|17-Aug-22|value b          |value a |
# +---+-----------+---------+-----------------+--------+