Duplicate Analysis

Question

I'm attempting to write some logic with Oracle SQL, but am having a hard time getting it right. First, I need my script to identify duplicated items. Then determine, of the duplicated items, which is the most recent. The database I'm working with suffers from a high number of manual data inserts outside of the application. This causes items to appear out of sequence when using the ID number (I am using the start date and the ID number as a way to measure sequence, as the table has no other method to do so).

If I need to determine the most current role for Employee 12311, how would I do so?

Here is what I have so far:

Table

 ID | EMPLOYEE | ROLE | STARTDATE | -----|----------|------------|-----------------------| 3432 | 12311 | Supervisor | 2016-07-12T00:00:00Z | 3421 | 12311 | Analyst | 2016-07-12T00:00:00Z | 4321 | 12311 | Help Desk | 2014-05-12T00:00:00Z | 5432 | 23432 | Manager | 2012-11-02T00:00:00Z | 3452 | 23432 | Associate | 2011-04-23T00:00:00Z | 7652 | 54332 | Analyst | 2015-10-15T00:00:00Z | 5691 | 54332 | Assistant | 2013-10-15T00:00:00Z |

Code

SELECT "ID", "EMPLOYEE", "ROLE", "STARTDATE", (CASE WHEN RANK() OVER (PARTITION BY "EMPLOYEE" ORDER BY "STARTDATE" DESC) = 1 THEN 1 ELSE 0 END) "MAX DATE", (CASE WHEN RANK() OVER (PARTITION BY "EMPLOYEE" ORDER BY "ID" DESC) = 1 THEN 1 ELSE 0 END) "MAX ID" FROM ( SELECT DISTINCT EMPLOYEE "E.EMPLOYEE", E.ID "ID", LR.DESCRIPTION "ROLE", ROLE_START_DATE "STARTDATE" FROM EMPLOYEES E JOIN ROLES R ON E.EMPLOYEE_ID = R.EMPLOYEE_ID JOIN LU_ROLES LR ON R.ROLE_ID = LR.ROLE_ID WHERE ROLE_START_DATE <= DATE '2017-12-03') ORDER BY 2

Results

 ID | EMPLOYEE | ROLE | STARTDATE | MAX Date | Max ID -----|----------|------------|----------------------------------|-------- 3432 | 12311 | Supervisor | 2016-07-12T00:00:00Z | 1 | 0 3421 | 12311 | Analyst | 2016-07-12T00:00:00Z | 1 | 0 4321 | 12311 | Help Desk | 2014-05-12T00:00:00Z | 0 | 1 5432 | 23432 | Manager | 2012-11-02T00:00:00Z | 1 | 1 3452 | 23432 | Associate | 2011-04-23T00:00:00Z | 0 | 0 7652 | 54332 | Analyst | 2015-10-15T00:00:00Z | 1 | 1 5691 | 54332 | Assistant | 2013-10-15T00:00:00Z | 0 | 0

Rather than looking at all of the records for each employee and determining the most recent record, I'd like for the script to only use the duplicated STARTDATES.

Basically, if the most recent STARTDATE is duplicated, identify which ID is the highest.

So it should look like this:

 ID | EMPLOYEE | ROLE | STARTDATE | MAX Date | Max ID -----|----------|------------|----------------------------------|-------- 3432 | 12311 | Supervisor | 2016-07-12T00:00:00Z | 1 | 1 3421 | 12311 | Analyst | 2016-07-12T00:00:00Z | 1 | 0 4321 | 12311 | Help Desk | 2014-05-12T00:00:00Z | 0 | 0 5432 | 23432 | Manager | 2012-11-02T00:00:00Z | 1 | 1 3452 | 23432 | Associate | 2011-04-23T00:00:00Z | 0 | 0 7652 | 54332 | Analyst | 2015-10-15T00:00:00Z | 1 | 1 5691 | 54332 | Assistant | 2013-10-15T00:00:00Z | 0 | 0

I am completely open to better ways of doing this. Any help you could provide would be much appreciated!

EDIT WITH SOLUTION:

Thanks to @Littlefoot for this. I was able to modify my script to include the following:

 SELECT "ID", "EMPLOYEE", "ROLE", "STARTDATE", ROW_NUMBER() OVER (PARTITION BY "EMPLOYEE" ORDER BY "STARTDATE" DESC, "ID" DESC) RN FROM ( SELECT DISTINCT EMPLOYEE "E.EMPLOYEE", E.ID "ID", LR.DESCRIPTION "ROLE", ROLE_START_DATE "STARTDATE" FROM EMPLOYEES E JOIN ROLES R ON E.EMPLOYEE_ID = R.EMPLOYEE_ID JOIN LU_ROLES LR ON R.ROLE_ID = LR.ROLE_ID WHERE ROLE_START_DATE <= DATE '2017-12-03') ORDER BY 2

RESULTS:

 ID | EMPLOYEE | ROLE | STARTDATE | RN | -----|----------|------------|----------------------------------| 3432 | 12311 | Supervisor | 2016-07-12T00:00:00Z | 1 | 3421 | 12311 | Analyst | 2016-07-12T00:00:00Z | 2 | 4321 | 12311 | Help Desk | 2014-05-12T00:00:00Z | 3 | 5432 | 23432 | Manager | 2012-11-02T00:00:00Z | 1 | 3452 | 23432 | Associate | 2011-04-23T00:00:00Z | 2 | 7652 | 54332 | Analyst | 2015-10-15T00:00:00Z | 1 | 5691 | 54332 | Assistant | 2013-10-15T00:00:00Z | 2 |

I then filtered my results by RN=1

Littlefoot · Accepted Answer · 2018-07-31 16:02:08Z

If I need to determine the most current role for Employee 12311, how would I do so?

The one whose RN is the lowest? Why do you need two MAX columns, when one does the job itself? Such as:

SQL> with test (id, empid, role, startdate) as 2 (select 3432, 12311, 'supervisor', date '2016-07-12' from dual union 3 select 3421, 12311, 'analyst' , date '2016-07-12' from dual union 4 select 4321, 12311, 'help desk' , date '2014-05-12' from dual union 5 -- 6 select 5432, 23432, 'manager' , date '2012-11-02' from dual union 7 select 3452, 23432, 'associate' , date '2011-04-23' from dual 8 ) 9 select id, empid, role, startdate, 10 row_number() over (partition by empid order by startdate desc, id desc) rn 11 from test; ID EMPID ROLE STARTDATE RN ---------- ---------- ---------- ---------- ---------- 3432 12311 supervisor 2016-07-12 1 3421 12311 analyst 2016-07-12 2 4321 12311 help desk 2014-05-12 3 5432 23432 manager 2012-11-02 1 3452 23432 associate 2011-04-23 2 SQL>

That query would be a source for yet another one, which uses a WHERE clause, i.e.

 <snip> 9 select id, empid, role, startdate 10 from (select id, empid, role, startdate, 11 row_number() over (partition by empid order by startdate desc, id desc) rn 12 from test 13 ) 14 where rn = 1; ID EMPID ROLE STARTDATE ---------- ---------- ---------- ---------- 3432 12311 supervisor 2016-07-12 5432 23432 manager 2012-11-02 SQL>

Thanks for this! I needed it to be a bit more scalable, so I pulled it out of the nested select statement. It seems to be working so far. I like this because it isnt simply 1, 0 - its ranked. SELECT "ID", "EMPLOYEE", "ROLE", "STARTDATE", row_number() over (partition by empid order by startdate desc, id desc) rn FROM ( SELECT DISTINCT EMPLOYEE "E.EMPLOYEE"..... I'm still trying a few others to see which is the most efficient. But this is great! Thanks again!
You're welcome; I'm glad if it helped. If I may suggest: get rid of a (bad) habit naming Oracle objects (and columns) using double quotes. It brings nothing but problems. By default, they are all created in uppercase, but you can reference them any way you want. But, if enclosed into double quotes, you always have to follow (lower/upper/mixed) case you used while creating those objects.
I've run into that very issue recently, actually. Thanks for the tip!

Gordon Linoff · Accepted Answer · 2018-07-31 16:09:51Z

I would use keep:

SELECT EMPLOYEE as "E.EMPLOYEE", E.ID as "ID", MAX(LR.DESCRIPTION) KEEP (DENSE_RANK FIRST ORDER BY ROLE_START_DATE DESC) as "ROLE", MAX(ROLE_START_DATE) as "STARTDATE" FROM EMPLOYEES E JOIN ROLES R ON E.EMPLOYEE_ID = R.EMPLOYEE_ID JOIN LU_ROLES LR ON R.ROLE_ID = LR.ROLE_ID WHERE ROLE_START_DATE <= DATE '2017-12-03' GROUP BY EMPLOYEE;

Alex Poole · Accepted Answer · 2018-07-31 16:16:28Z

You can use the max() aggregate with the last() function to do this in one step; in simplified form:

select employee, max(role) keep (dense_rank last order by startdate, id) as role from employees group by employee

This uses both the startdate and the id to find the 'latest' role; the id is only relevant if there is a tie on startdate.

Demo with your sample data in a CTE:

with employees (ID, EMPLOYEE, ROLE, STARTDATE) as ( select 3432, 12311, 'Supervisor', timestamp '2016-07-12 00:00:00 UTC' from dual union all select 3421, 12311, 'Analyst', timestamp '2016-07-12 00:00:00 UTC' from dual union all select 4321, 12311, 'Help Desk', timestamp '2014-05-12 00:00:00 UTC' from dual union all select 5432, 23432, 'Manager', timestamp '2012-11-02 00:00:00 UTC' from dual union all select 3452, 23432, 'Associate', timestamp '2011-04-23 00:00:00 UTC' from dual union all select 7652, 54332, 'Analyst', timestamp '2015-10-15 00:00:00 UTC' from dual union all select 5691, 54332, 'Assistant', timestamp '2013-10-15 00:00:00 UTC' from dual ) select employee, max(role) keep (dense_rank last order by startdate, id) as role from employees group by employee order by employee; EMPLOYEE ROLE ---------- ---------- 12311 Supervisor 23432 Manager 54332 Analyst

You can use the same functions with your joined tables, without needing to manually calculate the rankings.

Collectives™ on Stack Overflow

Duplicate Analysis

3 Answers 3

3 Comments

Comments

Comments

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

3 Comments

Comments

Comments

Related