
I am quite new to Spark and can't get this to work... Hopefully, there is an easy way of doing it... What I am trying to do is best described by the following table (I need to get the "required" column):

   colA colB colC  ref required
1    a1   b1   c1 colA       a1
2    a2   b2   c2 colA       a2
3    a3   b3   c3 colB       b3
4    a4   b4   c4 colB       b4
5    a5   b5   c5 colC       c5
6    a6   b6   c6 colC       c6

The above is just an example; in the real data I have more than 50 columns, so writing out a condition per column by hand is not going to work...

I know this can be easily done in pandas using something like:

df['required'] = df.apply(lambda x: x.loc[x.ref], axis=1) 

or

df['required'] = df.lookup(df.index, df.ref) 

Any suggestions on how to do this in PySpark?

1 Answer


One way to do this is to use the when and coalesce functions:

import pyspark.sql.functions as F

cols = ['colA', 'colB', 'colC']

df.withColumn('required', F.coalesce(*[F.when(df.ref == c, df[c]) for c in cols])).show()
+----+----+----+----+--------+
|colA|colB|colC| ref|required|
+----+----+----+----+--------+
|  a1|  b1|  c1|colA|      a1|
|  a2|  b2|  c2|colA|      a2|
|  a3|  b3|  c3|colB|      b3|
|  a4|  b4|  c4|colB|      b4|
|  a5|  b5|  c5|colC|      c5|
|  a6|  b6|  c6|colC|      c6|
+----+----+----+----+--------+
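
Since the real data has more than 50 columns, you probably don't want to spell cols out by hand. Assuming every column other than ref is a valid lookup target (an assumption about your schema, not something shown above), the list can be derived from the DataFrame itself:

# Assumption: every column except 'ref' can appear as a value of 'ref'.
cols = [c for c in df.columns if c != 'ref']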

Basically, for each candidate column you check whether its name equals the value in the ref column and, if so, take the value from that column: F.when(df.ref == c, df[c]). The list comprehension produces one such column object per candidate; each keeps its values only on rows where ref names it and is NULL on all other rows. Coalescing that list then fills each row with the first non-NULL value, i.e. the value taken from the column that ref points to.
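
If it helps to see what coalesce is collapsing, here is a minimal sketch (same example data, with hypothetical from_* aliases added purely for display) that materialises the per-column when() expressions before they are combined:

import pyspark.sql.functions as F

# Each when() column keeps its value only on rows where ref names that column,
# and is NULL everywhere else, so every row has exactly one non-NULL entry.
df.select('ref', *[F.when(df.ref == c, df[c]).alias('from_' + c) for c in cols]).show()

Coalescing those columns simply picks that single non-NULL value in each row.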


1 Comment

Worked like a charm! Thank you!
