If your column names are consistent (always hyphen-delimited, with the same casing and spelling), you can unpivot the columns to rows and extract the metadata embedded in each column name. That info can then be pivoted into the desired result.
from pyspark.sql import functions as func

# input data - maintain consistency in column names
data_sdf = spark.sparkContext.parallelize(data_ls). \
    toDF(['id',
          'Question-How are you-response',
          'Question-How are you-date',
          'Question-How are you-expected response',
          'Question-XXX-response',
          'Question-XXX-date',
          'Question-XXX-expected response'
          ])

# +---+-----------------------------+-------------------------+--------------------------------------+---------------------+-----------------+------------------------------+
# | id|Question-How are you-response|Question-How are you-date|Question-How are you-expected response|Question-XXX-response|Question-XXX-date|Question-XXX-expected response|
# +---+-----------------------------+-------------------------+--------------------------------------+---------------------+-----------------+------------------------------+
# |123|                      value a|                17-Aug-22|                               value b|              value c|        15-Aug-22|                       value d|
# +---+-----------------------------+-------------------------+--------------------------------------+---------------------+-----------------+------------------------------+
This consistency is what lets you recover the info after the unpivot: use split to break each column name into an array, then pick out the question and field by index.
unpivot_data_sdf = data_sdf. \
    withColumn('allcol_struct_arr',
               func.array(*[func.struct(func.lit(k).alias('key'), func.col(k).alias('val'))
                            for k in data_sdf.columns if k != 'id']
                          )
               ). \
    selectExpr('id', 'inline(allcol_struct_arr)'). \
    withColumn('key_split', func.split('key', '-')). \
    withColumn('question', func.col('key_split')[1]). \
    withColumn('question_field', func.col('key_split')[2])

# +---+--------------------------------------+---------+------------------------------------------+-----------+-----------------+
# |id |key                                   |val      |key_split                                 |question   |question_field   |
# +---+--------------------------------------+---------+------------------------------------------+-----------+-----------------+
# |123|Question-How are you-response         |value a  |[Question, How are you, response]         |How are you|response         |
# |123|Question-How are you-date             |17-Aug-22|[Question, How are you, date]             |How are you|date             |
# |123|Question-How are you-expected response|value b  |[Question, How are you, expected response]|How are you|expected response|
# |123|Question-XXX-response                 |value c  |[Question, XXX, response]                 |XXX        |response         |
# |123|Question-XXX-date                     |15-Aug-22|[Question, XXX, date]                     |XXX        |date             |
# |123|Question-XXX-expected response        |value d  |[Question, XXX, expected response]        |XXX        |expected response|
# +---+--------------------------------------+---------+------------------------------------------+-----------+-----------------+
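If you want to sanity-check what the split step yields before running it on the cluster, here is a plain-Python sketch of the same parsing logic (the column names are copied from the example above; no Spark needed):

```python
# Mirrors func.split('key', '-') followed by indexing [1] and [2]:
# each hyphen-delimited column name decomposes into (question, field).
columns = [
    'Question-How are you-response',
    'Question-How are you-date',
    'Question-How are you-expected response',
]

parsed = []
for name in columns:
    parts = name.split('-')              # e.g. ['Question', 'How are you', 'response']
    parsed.append((parts[1], parts[2]))  # (question, question_field)

print(parsed)
# [('How are you', 'response'), ('How are you', 'date'), ('How are you', 'expected response')]
```

This is also a quick way to spot inconsistent names: any column whose split produces fewer than three parts would raise an IndexError here, and would produce nulls in the Spark version.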
All that's left to do is to pivot the question_field column.
unpivot_data_sdf. \
    groupBy('id', 'question'). \
    pivot('question_field'). \
    agg(func.first('val')). \
    show(truncate=False)

# +---+-----------+---------+-----------------+--------+
# |id |question   |date     |expected response|response|
# +---+-----------+---------+-----------------+--------+
# |123|XXX        |15-Aug-22|value d          |value c |
# |123|How are you|17-Aug-22|value b          |value a |
# +---+-----------+---------+-----------------+--------+