Conditional statement with join

Question

I have two datasets, df1 and df2, where I would like to join the two and then apply a condition that if there are more than one duplicate rows in the host column, take only that one row (to avoid duplicates). I will be joining df1 and df2 ON df1.version = df2.name AND ON df1.date = df2.date

conditions: purpose should = 'hi' or purpose should = 'cat'

df1

version host date pat a16 12/1/2019 fam a16 12/1/2019 emp a16 12/1/2019 dan a16 12/1/2019

df2

name purpose date pat hi 12/1/2019 fam cat 12/1/2019 hello dog 12/1/2019 dan bird 12/1/2019

Here are the join results:

version host date name purpose date pat a16 12/1/2019 pat hi 12/1/2019 fam a16 12/1/2019 fam cat 12/1/2019

DESIRED

version host date name purpose date pat a16 12/1/2019 pat hi 12/1/2019

DOING

select df1.version, df1.host, df1.date, df2.name, df2.purpose, df2.date from df1 left join df2 on df1.version = df2.name AND df1.date = df2.date where df2.purpose = 'hi' OR df2.purpose = 'cat'

I think I have to implement an IF THEN statement within SQL. The above statement only does the join but it does not get rid of the consecutive duplicate host rows. Any suggestion is appreciated

take only the first row ... what defines what the "first" row is here? — Tim Biegeleisen
– Tim Biegeleisen, Commented Jan 21, 2021 at 4:18
In your sample output, yes there are 2 rows with a16 as the host value, but what business logic decides on whether you choose the row with version = pat or the row with version=fam ? That is pretty crucial to suggesting a valid solution. — Martin Cairney
– Martin Cairney, Commented Jan 21, 2021 at 5:03

Venkataraman R · Accepted Answer · 2021-01-21 05:33:29Z

You can apply ROW_NUMBER() function and rank the rows to remove duplicate rows. But, as you are ranking by date, purpose row could either be hi or cat. If you want to specifically pick one of them as first, you have to apply ORDER BY accordingly.

;with cte_df as ( select df1.version, df1.host,df1.date, df2.name, df2.purpose, df2.date as df2date ,row_number() over(partition by df1.host order by df1.date) as rnk FROM df1 inner join df2 on df1.version = df2.name and df1.date = df2.date where df2.purpose in ('hi','cat') ) SELECT * from cte_df WHERE rnk =1

Without_CTE, using derived table

SELECT * FROM ( select df1.version, df1.host,df1.date, df2.name, df2.purpose, df2.date as df2date ,row_number() over(partition by df1.host order by df1.date) as rnk FROM df1 inner join df2 on df1.version = df2.name and df1.date = df2.date where df2.purpose in ('hi','cat') ) AS T WHERE rnk = 1

Thank you, Is there a way to remove rows when there is a duplicate in the 'host' column?
yes. that is what taken care in this query. we are partition by host and generating row_number. So, for the first host row, rank will be 1. Only it is selected in the end
@Lynn, you don't need CTE. You can use derived table. I have added that solution as well to the answer
@Lynn, you have mentioned it as sql server. fiddle is against mySQL

Collectives™ on Stack Overflow

Conditional statement with join

1 Answer 1

9 Comments

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

9 Comments

Related