0

Input:

ID Question-How are you-Response Question-How are you-date Question How are you-expected response ...... Question-XXX-response Question-XXX-date question-XXX-expected-response
123 value a 17-Aug-22 value b ..... value c 15-Aug-22 value d

Expected output:

ID Question response_given date Expected_Response
123 How are you value a 17-Aug-22 value b
.. .. .. .. ..
123 XXX value c 15-Aug-22 value d

2 Answers 2

1

You can use the loop and union:

# Given that n = 100 n = 100 for idx in range(1, 101): new_df = df.select('id', func.lit(idx).alias('question'), func.col(f"question_{idx}_response").alias('response_given'), func.col(f"question_{idx}_date").alias('date'), func.col(f"question_{idx}_expected_response").alias('expected_response')) if idx == 1: output = new_df else: output = output.unionAll(new_df) 

Edit: 2022-08-18

Although this is not a spark related question, based on your edit question, you have the format of the column name and you can:

column_name_list = list(set([col.split('-')[1] for col in df.columns[1:]])) for col in column_name_list: # same as before .... 
Sign up to request clarification or add additional context in comments.

1 Comment

Thank you, How can we loop this if question is text and alphanumeric? have updated my question accordingly.
1

If your column names are consistent - as in it is always delimited by hyphens and the cases/spellings are same, we can unpivot the columns to rows and extract info from the column names. This info can be pivoted to get the desired result.

# input data - maintain consistency in column names data_sdf = spark.sparkContext.parallelize(data_ls). \ toDF(['id', 'Question-How are you-response', 'Question-How are you-date', 'Question-How are you-expected response', 'Question-XXX-response', 'Question-XXX-date', 'Question-XXX-expected response' ]) # +---+-----------------------------+-------------------------+--------------------------------------+---------------------+-----------------+------------------------------+ # | id|Question-How are you-response|Question-How are you-date|Question-How are you-expected response|Question-XXX-response|Question-XXX-date|Question-XXX-expected response| # +---+-----------------------------+-------------------------+--------------------------------------+---------------------+-----------------+------------------------------+ # |123| value a| 17-Aug-22| value b| value c| 15-Aug-22| value d| # +---+-----------------------------+-------------------------+--------------------------------------+---------------------+-----------------+------------------------------+ 

Use this detail to extract the info after unpivot. Use split to create an array out of the column names.

unpivot_data_sdf = data_sdf. \ withColumn('allcol_struct_arr', func.array(*[func.struct(func.lit(k).alias('key'), func.col(k).alias('val')) for k in data_sdf.columns if k != 'id'] ) ). \ selectExpr('id', 'inline(allcol_struct_arr)'). \ withColumn('key_split', func.split('key', '-')). \ withColumn('question', func.col('key_split')[1]). \ withColumn('question_field', func.col('key_split')[2]) # +---+--------------------------------------+---------+------------------------------------------+-----------+-----------------+ # |id |key |val |key_split |question |question_field | # +---+--------------------------------------+---------+------------------------------------------+-----------+-----------------+ # |123|Question-How are you-response |value a |[Question, How are you, response] |How are you|response | # |123|Question-How are you-date |17-Aug-22|[Question, How are you, date] |How are you|date | # |123|Question-How are you-expected response|value b |[Question, How are you, expected response]|How are you|expected response| # |123|Question-XXX-response |value c |[Question, XXX, response] |XXX |response | # |123|Question-XXX-date |15-Aug-22|[Question, XXX, date] |XXX |date | # |123|Question-XXX-expected response |value d |[Question, XXX, expected response] |XXX |expected response| # +---+--------------------------------------+---------+------------------------------------------+-----------+-----------------+ 

All that's left to do is to pivot the question_field column

unpivot_data_sdf. \ groupBy('id', 'question'). \ pivot('question_field'). \ agg(func.first('val')). \ show(truncate=False) # +---+-----------+---------+-----------------+--------+ # |id |question |date |expected response|response| # +---+-----------+---------+-----------------+--------+ # |123|XXX |15-Aug-22|value d |value c | # |123|How are you|17-Aug-22|value b |value a | # +---+-----------+---------+-----------------+--------+ 

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.