Read value from table and apply condition in Spark

Question

I have dataframe: df1

+------+--------+--------+--------+ | Name | value1 | value2 | value3 | +------+--------+--------+--------+ | A | 100 | null | 200 | | B | 10000 | 300 | 10 | | c | null | 10 | 100 | +------+--------+--------+--------+

second dataframe: df2:

+------+------+ | Col1 | col2 | +------+------+ | X | 1000 | | Y | 2002 | | Z | 3000 | +------+------+

I want to read the values from table1 like value1,value2 and value3

Apply condition to table2 with new columns:

cond1: when name= A and col2>value1, flag it to Y or N

cond2: when name= B and col2>value2 then Y or N

cond3: when name =c and col2>value1 and col2> value3, then Y or N

source code:

df2.withColumn("cond1",when($"col2")>value1,lit("Y)).otherwise(lit("N")) df2.withColumn("cond2",when($"col2")>value2,lit("Y)).otherwise(lit("N")) df2.withColumn("cond3",when($"col2")>value1 && when($"col2")>value3,lit("Y")).otherwise(lit("N"))

output:

+------+------+-------+-------+-------+ | Col1 | col2 | cond1 | cond2 | cond3 | +------+------+-------+-------+-------+ | X | 1000 | Y | Y | y | | Y | 2002 | N | Y | Y | | Z | 3000 | Y | Y | Y | +------+------+-------+-------+-------+

I want to read the values from value1 , value2, value3 store in variable, and apply condition based on value — vikv
– vikv, Commented Sep 17, 2017 at 21:55
@vikv I think what he is asking is, should the "Name" column in df1 equal to "Col1" in df2 for the rows to be matched. Or is it based on order or something else? — Shaido
– Shaido, Commented Sep 18, 2017 at 1:14
I still don't understand what value1, value2 and value3 are in your example. Do you add the three values or do you want to do the process for each row in the first df? — moe
– moe, Commented Sep 18, 2017 at 12:18

Leo C · Accepted Answer · 2017-09-18 16:04:32Z

If I understand your question correctly, you can join the two dataframes and create the condition columns as shown below. A couple of notes:

1) With the described conditions,null in df1 is replaced with Int.MinValue for simplified integer comparison

2) Since df1 is small, broadcast join is used to minimize sorting/shuffling for better performance

val df1 = Seq( ("A", 100, Int.MinValue, 200), ("B", 10000, 300, 10), ("C", Int.MinValue, 10, 100) ).toDF("Name", "value1", "value2", "value3") val df2 = Seq( ("A", 1000), ("B", 2002), ("C", 3000), ("A", 5000), ("A", 150), ("B", 250), ("B", 12000), ("C", 50) ).toDF("Col1", "col2") val df3 = df2.join(broadcast(df1), df2("Col1") === df1("Name")).select( df2("Col1"), df2("col2"), when(df2("col2") > df1("value1"), "Y").otherwise("N").as("cond1"), when(df2("col2") > df1("value2"), "Y").otherwise("N").as("cond2"), when(df2("col2") > df1("value1") && df2("col2") > df1("value3"), "Y").otherwise("N").as("cond3") ) df3.show +----+-----+-----+-----+-----+ |Col1| col2|cond1|cond2|cond3| +----+-----+-----+-----+-----+ | A| 1000| Y| Y| Y| | B| 2002| N| Y| N| | C| 3000| Y| Y| Y| | A| 5000| Y| Y| Y| | A| 150| Y| Y| N| | B| 250| N| N| N| | B|12000| Y| Y| Y| | C| 50| Y| Y| N| +----+-----+-----+-----+-----+

Anahcolus · Accepted Answer · 2017-09-18 02:48:18Z

You can create rowNo column in both dataframes as below

import org.apache.spark.sql.functions._ import org.apache.spark.sql.expressions._ val tempdf1 = df1.withColumn("rowNo", row_number().over(Window.orderBy("Name"))) val tempdf2 = df2.withColumn("rowNo", row_number().over(Window.orderBy("Col1")))

Then you can join them with the created column as below

val joinedDF = tempdf2.join(tempdf1, Seq("rowNo"), "left")

Finally you can use select and when function to get the final dataframe

joinedDF.select($"Col1", $"col2", when($"col2">$"value1" || $"value1".isNull, "Y").otherwise("N").as("cond1"), when($"col2">$"value2" || $"value2".isNull, "Y").otherwise("N").as("cond2"), when(($"col2">$"value1" && $"col2">$"value3") || $"value3".isNull, "Y").otherwise("N").as("cond3"))

you should have your desired dataframe as

+----+----+-----+-----+-----+ |Col1|col2|cond1|cond2|cond3| +----+----+-----+-----+-----+ |X |1000|Y |Y |Y | |Y |2002|N |Y |Y | |Z |3000|Y |Y |Y | +----+----+-----+-----+-----+

I hope the answer is helpful

thanks for response.but this scenario doesnt work if df1 and df2 have different number of rows. In my scenario df1 is of 3 rows small table, but my df2 is always big. Just I gave example usecase in question.
if thats the case then how do you compare for df1 having more than three rows and df2 having only three rows? can you explain that? which rows of df1 will be compared with which rows of df2?

Collectives™ on Stack Overflow

Read value from table and apply condition in Spark

2 Answers 2

Comments

2 Comments

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

2 Comments

Related