1

I have a dataframe that looks like below

ColName
a
b
c
d
e
f
g
h
i
j
k
l

and based on an specific parameter I want to transpose those values into rows. So for example if the parameter value is 3, the new dataframe will look like below

Col1 Col2 Col3
a b c
d e f
g h i
j k l

However if the parameter value is 4, it will look like below

Col1 Col2 Col3 Col4
a b c d
e f g h
i j k l

A few things to notice:

  1. The column names are not important
  2. Both the number of items in that single column and the parameter can change

Any idea how to achieve this in pyspark? Thanks in advance.

2 Answers 2

1

You can add some helper columns to pivot the dataframe:

import pyspark.sql.functions as F x = 3 result = df.withColumn( 'id', F.monotonically_increasing_id() ).withColumn( 'id2', (F.col('id') / x).cast('int') ).withColumn( 'id3', F.col('id') % x ).groupBy('id2').pivot('id3').agg(F.first('ColName')).orderBy('id2').drop('id2') result.show() +---+---+---+ | 0| 1| 2| +---+---+---+ | a| b| c| | d| e| f| | g| h| i| | j| k| l| +---+---+---+ 
Sign up to request clarification or add additional context in comments.

2 Comments

Would that work given the value of "x" is parameterized and can change based on who is calling that code?
yes, you can try changing the value of x and it should still work as long as its value is sensible
0

You can use a combination of collect_list and row_number to achieve this.

Step 1: Generate a custom row_number.

from pyspark.sql.functions import floor, row_number, collect_list from pyspark.sql.window import Window no_of_columns = 3 df2 = df.withColumn("row_num", floor((row_number().over(Window.orderBy("ColName"))-1)/no_of_columns)) 

Step 2: Group the data using this row_number and use collect_list to create a list.

df3 = df2.groupBy("row_num").agg(collect_list("ColName").alias("col_list")) 

Step 3: Use python's list comprehension to select all the elements from this list.

df3.select(*[df3.col_list[i].alias(f"col{i+1}") for i in range(no_of_columns)]).show() 

Output:

+----+----+----+ |col1|col2|col3| +----+----+----+ | a| b| c| | d| e| f| | g| h| i| | j| k| l| +----+----+----+ 

Note: The parameter no_of_columns can be changed based on the required number of output columns.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.