from pyspark.sql import Row from pyspark.sql.types import StructField, StructType, StringType, IntegerType from pyspark.sql.window import Window from pyspark.sql.functions import create_map, explode, struct, split, row_number, to_json from functools import reduce
# DataFrame schema
# Schema for the input data: one record per (RowID, Name) pair, where Place
# holds a comma-separated list of countries (nullable).
dfSchema = StructType(
    [StructField(col_name, col_type())
     for col_name, col_type in (
         ('RowID', IntegerType),
         ('Name', StringType),
         ('Place', StringType),
     )]
)
# Raw data
# Raw data: (RowID, Name, Place) triples; Place is a comma-separated country
# list and may be None.
_raw_records = [
    (1, 'Gaga', 'India,US,UK'),
    (1, 'Katy', 'UK,India,Europe'),
    (1, 'Bey', 'Europe'),
    (2, 'Gaga', None),
    (2, 'Katy', 'India,Europe'),
    (2, 'Bey', 'US'),
    (3, 'Gaga', 'Europe'),
    (3, 'Katy', 'US'),
    (3, 'Bey', None),
]
rowList = [Row(*record) for record in _raw_records]
# Create the initial DataFrame
# Materialize the raw rows under the declared schema and display the result.
df = spark.createDataFrame(data=rowList, schema=dfSchema)
df.show()
# +-----+----+---------------+
# |RowID|Name|          Place|
# +-----+----+---------------+
# |    1|Gaga|    India,US,UK|
# |    1|Katy|UK,India,Europe|
# |    1| Bey|         Europe|
# |    2|Gaga|           null|
# |    2|Katy|   India,Europe|
# |    2| Bey|             US|
# |    3|Gaga|         Europe|
# |    3|Katy|             US|
# |    3| Bey|           null|
# +-----+----+---------------+
# Use create_map, struct and to_json to create the intermediate (pivoted) output
# Serialize each input row to a one-key JSON document of the shape
# {"<Name>": {"RowID": ..., "Place": ...}} so each person becomes a column
# after re-reading the JSON.
jsonDFCol = df.select(
    to_json(create_map('Name', struct('RowID', 'Place'))).alias('name_place')
)

# Pull the JSON strings to the driver and re-read them as a DataFrame whose
# columns are the person names. NOTE(review): collect + parallelize round-trips
# through the driver — acceptable for this toy dataset.
jsonList = [record[0] for record in jsonDFCol.rdd.collect()]
jsonDF = spark.read.json(sc.parallelize(jsonList))

# For each person column, extract (RowID, Place-renamed-to-person) and drop
# the rows where that person contributed nothing.
intermediateList = []
for name in jsonDF.columns:
    per_person = jsonDF.selectExpr(f'{name}.RowID', f'{name}.Place AS {name}')
    intermediateList.append(per_person.where('RowID is not Null'))

# Join all per-person frames back together on RowID, one join at a time.
intermediateDF = intermediateList[0]
for other in intermediateList[1:]:
    intermediateDF = intermediateDF.join(other, on='RowID')
intermediateDF = intermediateDF.sort('RowID').select('RowID', 'Gaga', 'Katy', 'Bey')
intermediateDF.show()
# +-----+-----------+---------------+------+
# |RowID|       Gaga|           Katy|   Bey|
# +-----+-----------+---------------+------+
# |    1|India,US,UK|UK,India,Europe|Europe|
# |    2|       null|   India,Europe|    US|
# |    3|     Europe|             US|  null|
# +-----+-----------+---------------+------+
# Use a window to create the Id column
# Window over each RowID group; ordering by the partition key itself is
# arbitrary within a group, so row_number() simply hands out 1..n per RowID.
_by_row = Window.partitionBy('RowID')
rowWindow = _by_row.orderBy('RowID')
# Use the split and explode functions to obtain the final output
# Explode each person's comma-separated Place list into one row per place,
# numbering the places 1..n within each RowID via the window.
# FIX: the column is created as 'Id' (it was 'id') and 'RowID' is spelled
# consistently below — the original relied on spark.sql.caseSensitive=false
# and would break with case-sensitive resolution enabled.
finalDFList = [
    intermediateDF
    .select('RowID', explode(split(intermediateDF[col_], ',')).alias(col_))
    .withColumn('Id', row_number().over(rowWindow))
    for col_ in intermediateDF.columns[1:]
]

# Spine of every (RowID, Id) pair that occurs for any person.
# FIX: `union` replaces the deprecated `unionAll` (same positional-union
# semantics; duplicates are removed by the distinct() below).
finalDFID = reduce(
    lambda curr, nxt: curr.select('RowID', 'Id').union(nxt.select('RowID', 'Id')),
    finalDFList)

# Left-join each person's exploded places onto the spine so missing places
# surface as null, then de-duplicate and order the final output.
finalDF = reduce(
    lambda curr, nxt: curr.join(nxt, on=['RowID', 'Id'], how='left'),
    finalDFList, finalDFID)\
    .distinct()\
    .sort('RowID', 'Id')\
    .select('RowID', 'Id', 'Gaga', 'Katy', 'Bey')
finalDF.show()
# +-----+---+------+------+------+
# |RowID| Id|  Gaga|  Katy|   Bey|
# +-----+---+------+------+------+
# |    1|  1| India|    UK|Europe|
# |    1|  2|    US| India|  null|
# |    1|  3|    UK|Europe|  null|
# |    2|  1|  null| India|    US|
# |    2|  2|  null|Europe|  null|
# |    3|  1|Europe|    US|  null|
# +-----+---+------+------+------+