How to dynamically transpose a single Column to multiple Rows in pyspark?

Question

I have a dataframe that looks like below

ColName
a
b
c
d
e
f
g
h
i
j
k
l

and based on an specific parameter I want to transpose those values into rows. So for example if the parameter value is 3, the new dataframe will look like below

Col1	Col2	Col3
a	b	c
d	e	f
g	h	i
j	k	l

However if the parameter value is 4, it will look like below

Col1	Col2	Col3	Col4
a	b	c	d
e	f	g	h
i	j	k	l

A few things to notice:

The column names are not important
Both the number of items in that single column and the parameter can change

Any idea how to achieve this in pyspark? Thanks in advance.

mck · Accepted Answer · 2021-02-10 16:20:14Z

You can add some helper columns to pivot the dataframe:

import pyspark.sql.functions as F x = 3 result = df.withColumn( 'id', F.monotonically_increasing_id() ).withColumn( 'id2', (F.col('id') / x).cast('int') ).withColumn( 'id3', F.col('id') % x ).groupBy('id2').pivot('id3').agg(F.first('ColName')).orderBy('id2').drop('id2') result.show() +---+---+---+ | 0| 1| 2| +---+---+---+ | a| b| c| | d| e| f| | g| h| i| | j| k| l| +---+---+---+

Would that work given the value of "x" is parameterized and can change based on who is calling that code?
yes, you can try changing the value of x and it should still work as long as its value is sensible

noufel13 · Accepted Answer · 2021-02-11 10:27:29Z

You can use a combination of collect_list and row_number to achieve this.

Step 1: Generate a custom row_number.

from pyspark.sql.functions import floor, row_number, collect_list from pyspark.sql.window import Window no_of_columns = 3 df2 = df.withColumn("row_num", floor((row_number().over(Window.orderBy("ColName"))-1)/no_of_columns))

Step 2: Group the data using this row_number and use collect_list to create a list.

df3 = df2.groupBy("row_num").agg(collect_list("ColName").alias("col_list"))

Step 3: Use python's list comprehension to select all the elements from this list.

df3.select(*[df3.col_list[i].alias(f"col{i+1}") for i in range(no_of_columns)]).show()

Output:

+----+----+----+ |col1|col2|col3| +----+----+----+ | a| b| c| | d| e| f| | g| h| i| | j| k| l| +----+----+----+

Note: The parameter no_of_columns can be changed based on the required number of output columns.

Collectives™ on Stack Overflow

How to dynamically transpose a single Column to multiple Rows in pyspark?

2 Answers 2

2 Comments

Comments

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

2 Comments

Comments

Related