I'm using pyspark. So I have a spark dataframe that looks like:

a | b | c
5 | 2 | 1
5 | 4 | 3
2 | 4 | 2
2 | 3 | 7

Need output:

a | b_list
5 | 2,1,4,3
2 | 4,2,3,7

It's important to keep the sequence as given in the output.
Instead of a udf, for joining the list we can also use the concat_ws function, as suggested in the comments above, like this:

import pyspark.sql.functions as F

df = (df
    .withColumn('lst', F.concat(df['b'], F.lit(','), df['c']))
    .groupBy('a')
    .agg(F.concat_ws(',', F.collect_list('lst')).alias('lst')))
df.show()

+---+-------+
|  a|    lst|
+---+-------+
|  5|2,1,4,3|
|  2|4,2,3,7|
+---+-------+
The following results in the last two columns aggregated into an array column:

import pyspark.sql.functions as f

df1 = (df
    .withColumn('lst', f.concat(df['b'], f.lit(','), df['c']))
    .groupBy('a')
    .agg(f.collect_list('lst').alias('b_list')))

Now join the array elements:

# Simplistic udf to join the array:
def join_array(col):
    return ','.join(col)

join = f.udf(join_array)

df1.select('a', join(df1['b_list']).alias('b_list')).show()

Printing:
+---+-------+
|  a| b_list|
+---+-------+
|  5|2,1,4,3|
|  2|4,2,3,7|
+---+-------+

It is better to use pyspark.sql.functions.concat_ws to do the join, which will be faster than using a udf: pass the output of collect_list directly to concat_ws. For example, take a look at this answer.