
I have a Spark DataFrame and I want to select a few rows/records from it based on a matching value for a particular column. I guess I can do this using a filter operation, or a select operation inside a map transformation.

But I also want to update a status column for those rows/records which have not been selected by the filter.

On applying the filter operation, I get back a new DataFrame consisting of the matching records.

So, how do I identify and update the column value of the rows which are not selected?

1 Answer


On applying the filter operation, you get a new DataFrame consisting of the matching records.

Then, you can use the except function in Scala to get the non-matching records from the input DataFrame.

scala> val inputDF = Seq(("a", 1), ("b", 2), ("c", 3), ("d", 4), ("e", 5)).toDF("id", "count")
inputDF: org.apache.spark.sql.DataFrame = [id: string, count: int]

scala> val filterDF = inputDF.filter($"count" > 3)
filterDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: string, count: int]

scala> filterDF.show()
+---+-----+
| id|count|
+---+-----+
|  d|    4|
|  e|    5|
+---+-----+

scala> val unmatchDF = inputDF.except(filterDF)
unmatchDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: string, count: int]

scala> unmatchDF.show()
+---+-----+
| id|count|
+---+-----+
|  b|    2|
|  a|    1|
|  c|    3|
+---+-----+
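From there, one way to actually update a status column on the non-selected rows is to tag each DataFrame with withColumn and union them back together. A minimal sketch continuing from the REPL session above; the "status" column name and the "selected"/"not_selected" values are illustrative assumptions, not part of the original answer:

import org.apache.spark.sql.functions.lit

// Tag the matching and non-matching rows with an assumed status value
val flaggedMatch   = filterDF.withColumn("status", lit("selected"))
val flaggedUnmatch = unmatchDF.withColumn("status", lit("not_selected"))

// Recombine into a single DataFrame covering all original rows
val resultDF = flaggedMatch.union(flaggedUnmatch)
resultDF.show()

Since DataFrames are immutable, "updating" here means producing a new DataFrame with the extra column rather than modifying rows in place.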

In PySpark you can achieve the same with the subtract function.
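For completeness, a minimal PySpark sketch of the same step, rebuilding the example data from above (the variable names are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Same data as the Scala example above
input_df = spark.createDataFrame(
    [("a", 1), ("b", 2), ("c", 3), ("d", 4), ("e", 5)], ["id", "count"]
)
filter_df = input_df.filter(input_df["count"] > 3)

# subtract plays the role of Scala's except:
# rows of input_df that do not appear in filter_df
unmatch_df = input_df.subtract(filter_df)
unmatch_df.show()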

