
I have a dataframe:

+---+---+---+------+
| id|foo|bar|rownum|
+---+---+---+------+
|  1|123|123|     1|
|  2|000|236|     1|
|  2|236|236|     2|
|  2|000|236|     3|
|  3|333|234|     1|
|  3|444|444|     2|
+---+---+---+------+

I want to add a column match which will hold, for each id, the rownum of the row where foo == bar, like:

+---+---+---+------+-----+
| id|foo|bar|rownum|match|
+---+---+---+------+-----+
|  A|123|123|     1|    1|
|  B|000|236|     1|    2|
|  B|236|236|     2|    2|
|  B|000|236|     3|    2|
|  R|333|234|     1|    2|
|  R|444|444|     2|    2|
+---+---+---+------+-----+

I tried this:

df_grp2 = df_grp2.withColumn('match', F.when(F.col('foo') == F.col('bar'), F.col('rownum')))
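
For context, a minimal sketch of what that expression produces on the sample data (assuming the frame above is df_grp2): without a window, when() only fills match on the rows where foo equals bar, so the other rows of each id stay null.

from pyspark.sql import functions as F

df_grp2.withColumn(
    'match',
    F.when(F.col('foo') == F.col('bar'), F.col('rownum'))
).show()
# +---+---+---+------+-----+
# | id|foo|bar|rownum|match|
# +---+---+---+------+-----+
# |  1|123|123|     1|    1|
# |  2|000|236|     1| null|
# |  2|236|236|     2|    2|
# |  2|000|236|     3| null|
# |  3|333|234|     1| null|
# |  3|444|444|     2|    2|
# +---+---+---+------+-----+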

2 Answers


Try using window functions.

from pyspark.sql import functions as F, Window as W

df_grp2 = spark.createDataFrame(
    [(1, '123', '123', 1),
     (2, '000', '236', 1),
     (2, '236', '236', 2),
     (2, '000', '236', 3),
     (3, '333', '234', 1),
     (3, '444', '444', 2)],
    ['id', 'foo', 'bar', 'rownum']
)

df_grp2 = df_grp2.withColumn(
    'match',
    F.first(F.when(F.col('foo') == F.col('bar'), F.col('rownum')), True).over(W.partitionBy('id'))
)

df_grp2.show()
# +---+---+---+------+-----+
# | id|foo|bar|rownum|match|
# +---+---+---+------+-----+
# |  1|123|123|     1|    1|
# |  2|000|236|     1|    2|
# |  2|236|236|     2|    2|
# |  2|000|236|     3|    2|
# |  3|333|234|     1|    2|
# |  3|444|444|     2|    2|
# +---+---+---+------+-----+
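
As a variation (a sketch, not from the answer itself): any aggregate that ignores nulls works over the same partition, so F.max gives the same result here; note it would return the largest matching rownum if a group ever had more than one match.

from pyspark.sql import functions as F, Window as W

# max skips nulls, so it also picks up the matching rownum per id
df_grp2 = df_grp2.withColumn(
    'match',
    F.max(F.when(F.col('foo') == F.col('bar'), F.col('rownum'))).over(W.partitionBy('id'))
)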

1 Comment

Thanks, I now understand how to match values over a window. :)

Use this (pandas syntax):

df['match'] = df.loc[df['foo'] == df['bar']]['rownum'] 

But where they don't match, it returns NaN:

+---+---+---+------+-----+
| id|foo|bar|rownum|match|
+---+---+---+------+-----+
|  A|123|123|     1|    1|
|  B|000|236|     1|  NaN|
|  B|236|236|     2|    2|
|  B|000|236|     3|  NaN|
|  R|333|234|     1|  NaN|
|  R|444|444|     2|    2|
+---+---+---+------+-----+
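
To actually carry the matched rownum to every row of the same id in pandas, one option (a sketch, assuming a pandas frame with the question's columns) is to broadcast the first non-null match within each group:

import pandas as pd

# Hypothetical pandas frame mirroring the question's sample data
df = pd.DataFrame({
    'id':     ['A', 'B', 'B', 'B', 'R', 'R'],
    'foo':    ['123', '000', '236', '000', '333', '444'],
    'bar':    ['123', '236', '236', '236', '234', '444'],
    'rownum': [1, 1, 2, 3, 1, 2],
})

# rownum where foo == bar, NaN elsewhere (the step above)
df['match'] = df.loc[df['foo'] == df['bar']]['rownum']

# broadcast the first non-NaN match within each id group
df['match'] = df.groupby('id')['match'].transform('first')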

