How can I delete duplicates from the following DataFrame df, while keeping the minimum value of level for each duplicated pair of item_id and country_id?

    +-----------+----------+---------------+
    |item_id    |country_id|level          |
    +-----------+----------+---------------+
    |     312330|  13535670|             82|
    |     312330|  13535670|            369|
    |     312330|  13535670|            376|
    |     319840|  69731210|            127|
    |     319840|  69730600|            526|
    |     311480|  69628930|            150|
    |     311480|  69628930|            138|
    |     311480|  69628930|            405|
    +-----------+----------+---------------+

The expected output:

    +-----------+----------+---------------+
    |item_id    |country_id|level          |
    +-----------+----------+---------------+
    |     312330|  13535670|             82|
    |     319840|  69731210|            127|
    |     319840|  69730600|            526|
    |     311480|  69628930|            138|
    +-----------+----------+---------------+

I know how to delete duplicates without conditions using dropDuplicates, but I don't know how to do it for my particular case.

1 Answer

One method is to use orderBy (ascending by default), followed by groupBy and the first aggregation:

    import org.apache.spark.sql.functions.first

    df.orderBy("level").groupBy("item_id", "country_id").agg(first("level").as("level")).show(false)

You can also set the order explicitly, using .asc for ascending or .desc for descending:

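    // The $"..." column syntax needs spark.implicits._ in scope (spark-shell imports it automatically).
    df.orderBy($"level".asc).groupBy("item_id", "country_id").agg(first("level").as("level")).show(false)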

You can also do this with a window and the row_number function:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.row_number

    val windowSpec = Window.partitionBy("item_id", "country_id").orderBy($"level".asc)

    df.withColumn("rank", row_number().over(windowSpec)).filter($"rank" === 1).drop("rank").show()
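Since only the smallest level per pair is needed here, a plain min aggregation is another option that does not depend on how rows are ordered inside each group. The following is a minimal sketch, not part of the original answer: the object name, the helper method, and the local SparkSession settings are assumptions for illustration, and the sample rows are copied from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.min

// Hypothetical wrapper object, introduced only for this example.
object MinLevelPerPair {

  // Keeps one row per (item_id, country_id) pair, carrying the smallest level.
  def minLevelPerPair(df: DataFrame): DataFrame =
    df.groupBy("item_id", "country_id").agg(min("level").as("level"))

  def main(args: Array[String]): Unit = {
    // Local session for illustration; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("min-level-sketch")
      .getOrCreate()
    import spark.implicits._

    // Sample rows taken from the question.
    val df = Seq(
      (312330, 13535670, 82),
      (312330, 13535670, 369),
      (312330, 13535670, 376),
      (319840, 69731210, 127),
      (319840, 69730600, 526),
      (311480, 69628930, 150),
      (311480, 69628930, 138),
      (311480, 69628930, 405)
    ).toDF("item_id", "country_id", "level")

    // Row order in the printed result is not guaranteed.
    minLevelPerPair(df).show(false)

    spark.stop()
  }
}
```

This only works as-is when the grouping keys and level are the only columns. If other columns must be carried along with the minimum row, the window approach above is probably the better fit.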

7 Comments

The first method will take the minimum value of level, not the maximum, right?
first will take the first row of the grouping. If the ordering is ascending, then it's the minimum, and if it's descending, then it's the maximum.
Why would somebody downvote without even commenting? Please comment on the drawbacks if you really want to downvote, so that I can improve the answer; if the answer is inappropriate, I will delete it. I just don't understand why people downvote without commenting.
I didn't downvote, but are you sure orderBy with first is guaranteed to work? In my experience it doesn't always do what you might expect in a distributed setup.
In the meantime, I saw that the accepted answer to the question which this one duplicates says exactly the same as me, so I tend to agree with the downvoter that this answer is partly incorrect.