I have two datasets, df1 and df2, that I would like to join. After the join, if more than one row shares the same value in the host column, I want to keep only one of those rows (to avoid duplicates). I will be joining df1 and df2 ON df1.version = df2.name AND df1.date = df2.date.
Conditions: purpose should be 'hi' or 'cat'.
df1

version  host  date
pat      a16   12/1/2019
fam      a16   12/1/2019
emp      a16   12/1/2019
dan      a16   12/1/2019

df2

name   purpose  date
pat    hi       12/1/2019
fam    cat      12/1/2019
hello  dog      12/1/2019
dan    bird     12/1/2019

Here are the join results:

version  host  date       name  purpose  date
pat      a16   12/1/2019  pat   hi       12/1/2019
fam      a16   12/1/2019  fam   cat      12/1/2019

DESIRED

version  host  date       name  purpose  date
pat      a16   12/1/2019  pat   hi       12/1/2019

DOING

select df1.version, df1.host, df1.date,
       df2.name, df2.purpose, df2.date
from df1
left join df2
  on df1.version = df2.name
 and df1.date = df2.date
where df2.purpose = 'hi'
   or df2.purpose = 'cat'

I think I have to implement an IF THEN statement within SQL. The above statement only does the join; it does not get rid of the duplicate host rows. Any suggestions are appreciated.
take only the first row... what defines what the "first" row is here?

and row_id = 1.

a16 as the host value, but what business logic decides on whether you choose the row with version = pat or the row with version = fam? That is pretty crucial to suggesting a valid solution.
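
One way to make the "first" row explicit is to number the joined rows within each host and keep only row number 1. The following is only a sketch under stated assumptions, not a definitive answer: the tie-break (prefer purpose 'hi' over 'cat', then the alphabetically smallest version) is an assumption chosen so that the sample data yields the DESIRED row, and the real business rule should replace it. The SQL is wrapped in Python's sqlite3 module purely so it can be run against the sample data. The aliases df1_date, df2_date and row_id, and the helper dedupe_by_host, are illustrative names, and ROW_NUMBER() needs a database with window-function support (SQLite 3.25 or later).

```python
import sqlite3

# Join, filter on purpose, then number the rows within each host.
# ASSUMPTION: "first" means purpose 'hi' before 'cat', then smallest version.
QUERY = """
WITH joined AS (
    SELECT df1.version, df1.host, df1.date AS df1_date,
           df2.name, df2.purpose, df2.date AS df2_date,
           ROW_NUMBER() OVER (
               PARTITION BY df1.host
               ORDER BY CASE df2.purpose WHEN 'hi' THEN 0 ELSE 1 END,
                        df1.version
           ) AS row_id
    FROM df1
    JOIN df2
      ON df1.version = df2.name
     AND df1.date = df2.date
    WHERE df2.purpose IN ('hi', 'cat')
)
SELECT version, host, df1_date, name, purpose, df2_date
FROM joined
WHERE row_id = 1
"""


def dedupe_by_host():
    """Load the sample tables from the question and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE df1 (version TEXT, host TEXT, date TEXT);
        CREATE TABLE df2 (name TEXT, purpose TEXT, date TEXT);
        INSERT INTO df1 VALUES
            ('pat', 'a16', '12/1/2019'),
            ('fam', 'a16', '12/1/2019'),
            ('emp', 'a16', '12/1/2019'),
            ('dan', 'a16', '12/1/2019');
        INSERT INTO df2 VALUES
            ('pat',   'hi',   '12/1/2019'),
            ('fam',   'cat',  '12/1/2019'),
            ('hello', 'dog',  '12/1/2019'),
            ('dan',   'bird', '12/1/2019');
    """)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in dedupe_by_host():
        print(row)
```

The sketch uses a plain inner join because filtering on df2.purpose in the WHERE clause already discards the unmatched rows that a left join would keep. No IF THEN logic is needed; the ORDER BY inside the window is where the rule for choosing between version = pat and version = fam belongs.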