0

I have one data frame as ( This is overall data frame) with 0s and 1s

+---+-----+ |key|value| +---+-----+ | a| 0.5| | b| 0.4| | c| 0.5| | d| 0.3| | x| 0.0| | y| 0.0| | z| 0.0| +---+-----+ 

and the second dataframe is ( Bad Output ) ( Should contain only 0s )

+---+-----+ |key|value| +---+-----+ | a| 0.0| | e| 0.0| | f| 0.0| | g| 0.0| +---+-----+ Note : the value of `a` has chnaged 

How to write my script so I can get the following output of my second Dataframe ( only 0s and as value of a is 1 in good data frame , i want to remove it from bad one

+---+-----+ |key|value| +---+-----+ | e| 0.0| | f| 0.0| | g| 0.0| | x| 0.0| | y| 0.0| | z| 0.0| +---+-----+ 

2 Answers 2

1

Non-zero overall values can be removed from bad output, and zero overall values added (Scala):

val overall = Seq( ("a", 0.5), ("b", 0.4), ("c", 0.5), ("d", 0.3), ("x", 0.0), ("y", 0.0), ("z", 0.0), ).toDF("key", "value") val badOutput = Seq( ("a", 0.0), ("e", 0.0), ("f", 0.0), ("g", 0.0) ) .toDF("key", "value") badOutput .except(overall.where($"value"=!=0).withColumn("value", lit(0.0))) .union (overall.where($"value"===0)) 
Sign up to request clarification or add additional context in comments.

Comments

0

You can union two dataframes then groupBy + array_contains function to get the desired result.

Example:

df.show() #+---+-----+ #|key|value| #+---+-----+ #| a| 1| #| b| 1| #| c| 1| #| d| 1| #| x| 0| #| y| 0| #| z| 0| #+---+-----+ df1.show() #+---+-----+ #|key|value| #+---+-----+ #| a| 0| #| e| 0| #| f| 0| #| g| 0| #+---+-----+ df2=df.unionAll(df1) df3=df2.groupBy("key").agg(collect_list(col("value")).alias("lst")) df3.filter(~array_contains("lst",1)).\ withColumn("lst",array_join(col("lst"),'')).\ show() #+---+---+ #|key|lst| #+---+---+ #| x| 0| #| g| 0| #| f| 0| #| e| 0| #| z| 0| #| y| 0| #+---+---+ 

4 Comments

It does not return the right output, but I get the idea of how to move forward. Its gives output as +---+---+ |key|lst| +---+---+ | g| 0| | f| 0| | e| 0| | a| 0| +---+---+
I dont think so.. I tried with your input data mentioned in the question.. are u sure that you are using correct filter condition?
oh actually i see now you are doing union in a previous statement. I missed reading the union line. All good.
Could there be an alternative to df3.filter(array_contains("value",0.0)) as my df can contain double values as well. With the above line, I am getting | a|0.50.0| as well in my final output. ( I am updating dataframe in my question )

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.