0

I have two datasets, df1 and df2, where I would like to join the two and then apply a condition that if there are more than one duplicate rows in the host column, take only that one row (to avoid duplicates). I will be joining df1 and df2 ON df1.version = df2.name AND ON df1.date = df2.date

conditions: purpose should = 'hi' or purpose should = 'cat'

df1

version host date pat a16 12/1/2019 fam a16 12/1/2019 emp a16 12/1/2019 dan a16 12/1/2019 

df2

name purpose date pat hi 12/1/2019 fam cat 12/1/2019 hello dog 12/1/2019 dan bird 12/1/2019 

Here are the join results:

version host date name purpose date pat a16 12/1/2019 pat hi 12/1/2019 fam a16 12/1/2019 fam cat 12/1/2019 

DESIRED

version host date name purpose date pat a16 12/1/2019 pat hi 12/1/2019 

DOING

select df1.version, df1.host, df1.date, df2.name, df2.purpose, df2.date from df1 left join df2 on df1.version = df2.name AND df1.date = df2.date where df2.purpose = 'hi' OR df2.purpose = 'cat' 

I think I have to implement an IF THEN statement within SQL. The above statement only does the join but it does not get rid of the consecutive duplicate host rows. Any suggestion is appreciated

6
  • take only the first row ... what defines what the "first" row is here? Commented Jan 21, 2021 at 4:18
  • 1
    add and row_id =1. Commented Jan 21, 2021 at 4:22
  • What do you mean by duplicate row ? Commented Jan 21, 2021 at 4:30
  • I wish to remove any duplicate rows from the 'host' column Commented Jan 21, 2021 at 4:35
  • In your sample output, yes there are 2 rows with a16 as the host value, but what business logic decides on whether you choose the row with version = pat or the row with version=fam ? That is pretty crucial to suggesting a valid solution. Commented Jan 21, 2021 at 5:03

1 Answer 1

2

You can apply ROW_NUMBER() function and rank the rows to remove duplicate rows. But, as you are ranking by date, purpose row could either be hi or cat. If you want to specifically pick one of them as first, you have to apply ORDER BY accordingly.

;with cte_df as ( select df1.version, df1.host,df1.date, df2.name, df2.purpose, df2.date as df2date ,row_number() over(partition by df1.host order by df1.date) as rnk FROM df1 inner join df2 on df1.version = df2.name and df1.date = df2.date where df2.purpose in ('hi','cat') ) SELECT * from cte_df WHERE rnk =1 

Without_CTE, using derived table

SELECT * FROM ( select df1.version, df1.host,df1.date, df2.name, df2.purpose, df2.date as df2date ,row_number() over(partition by df1.host order by df1.date) as rnk FROM df1 inner join df2 on df1.version = df2.name and df1.date = df2.date where df2.purpose in ('hi','cat') ) AS T WHERE rnk = 1 
Sign up to request clarification or add additional context in comments.

9 Comments

Thank you, Is there a way to remove rows when there is a duplicate in the 'host' column?
yes. that is what taken care in this query. we are partition by host and generating row_number. So, for the first host row, rank will be 1. Only it is selected in the end
@Lynn, you don't need CTE. You can use derived table. I have added that solution as well to the answer
Share your fiddle link so that we can help you on it
@Lynn, you have mentioned it as sql server. fiddle is against mySQL
|

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.