2

I have a postgress table incoming_records which contains 2 column supervisor_id, emp_id

create table incoming_records(supervisor_id,emp_id)as values (null,01) --top level resources with blank supervisor id ,(01,02) ,(02,03) ,(03,04) ,(04,05) ,(05,06) ,(06,07) ,(07,08) ,(null,10) -- supervisor with 3 employees directly under them ,(10,11) ,(10,12) ,(10,13) ,(14,14) -- loop length 1 ,(15,16) ,(16,15) -- loop length 2 ,(17,18) ,(18,19) ,(19,17) -- loop length 3 ,(20,21) ,(21,22) ,(22,23) ,(23,20) -- loop length 4 ,(null,24) -- supervisor with no employees ,(null,25) -- fork, then an employee with 2 supervisors ,(25,26),(26,28) ,(25,27),(27,28) ,(null,29) -- supervisor with a null employee ,(29,null) ,(null,null) --nobody working for nobody ,(99,30) -- somebody working for someone who doesn't exist ,(99,null); -- nobody working for someone who doesn't exist 

This table receives data from an external application and gets fully updated every day and it holds around 60000 records. On basis of these records I need to build an hierarchy of resources. To build the hierarchy I wrote a recursive query and it works fine. But sometime my recursive query goes to infinite loop due to wrong data. Take below scenario where a circular dependency is being created. In this case recursive query goes to infinite loop and it never ends.

supervisor_id emp_id
rt08 rt09
rt09 rt08

I am trying to write a query(written below) which can figure out these faulty records. But till now I did not succeed.

WITH recursive tmp AS ( SELECT supervisor_id, emp_id FROM incoming_records UNION ALL SELECT tmp.supervisor_id, t.emp_id FROM tmp INNER JOIN incoming_records AS t ON t.supervisor_id = tmp.emp_id WHERE t.supervisor_id = tmp.emp_id) select * from tmp 

Any suggestion will be appreciated

10
  • Are you using Postgresql or Oracle (PL/SQL)? Commented Apr 4 at 13:01
  • 1
    Can you add some more sample table data, and also specify the expected result, i.e. provide a complete minimal reproducible example. Commented Apr 4 at 13:03
  • @AD7six I just want to filter those records which are creating circular dependencies. Commented Apr 4 at 13:17
  • 1
    @jarlh added more data. I hope it is more clear now. Commented Apr 4 at 13:19
  • 1
    @GD_Java I added more examples to your sample set - let me know if you allow such cases. Both types of forks: upward (1 employee with 2 supervisors) and downward (1 supervisors with 2 employees) affect how this should be approached. Employee 10 is already a supervisor of 3 others, so you allow downward forks. If you don't allow upward ones, it's best to alter table incoming_records add unique(emp_id); and edit that in. Commented Apr 4 at 16:32

4 Answers 4

3

Cycle detection is a built-in feature of recursive CTEs: demo at db<>fiddle

  1. CYCLE clause informs it which value indicates a loop if it repeats
  2. It's followed by a boolean field, populated for the records with loops, then an array field that holds the whole path/group with the cycle.
  3. Since the third field path accumulates all keys, it can grow quickly on larger and more complex sets. As an optimised, fast and memory-efficient alternative, consider pgrouting.
WITH recursive tmp AS ( SELECT supervisor_id, emp_id FROM incoming_records UNION ALL SELECT tmp.supervisor_id, t.emp_id FROM tmp INNER JOIN incoming_records AS t ON t.supervisor_id = tmp.emp_id )CYCLE emp_id SET is_cycle USING path ----- here SELECT * FROM tmp WHERE is_cycle; 
supervisor_id emp_id is_cycle path
rt14 rt14 t {(rt14),(rt14)}
rt15 rt16 t {(rt16),(rt15),(rt16)}
rt16 rt15 t {(rt15),(rt16),(rt15)}
rt17 rt18 t {(rt18),(rt19),(rt17),(rt18)}
rt18 rt19 t {(rt19),(rt17),(rt18),(rt19)}
rt19 rt17 t {(rt17),(rt18),(rt19),(rt17)}
rt20 rt21 t {(rt21),(rt22),(rt23),(rt20),(rt21)}
rt21 rt22 t {(rt22),(rt23),(rt20),(rt21),(rt22)}
rt22 rt23 t {(rt23),(rt20),(rt21),(rt22),(rt23)}
rt23 rt20 t {(rt20),(rt21),(rt22),(rt23),(rt20)}

If you want to avoid them rather than select only those, switch to WHERE NOT is_cycle;.

The SQL-standard CYCLE clause for built-in cycle detection was added in v14.

Sign up to request clarification or add additional context in comments.

Comments

1

Displaying as much as possible until reaching the loop

Instead of identifying the loops in a separate task,
you can get a failsafe walkthrough by remembering the path to reach each item, and stopping when you encounter an already seen employee.

PostgreSQL's array is the right structure to hold it:

WITH recursive tmp AS ( -- We're starting from the top, so no need of supervisor_id anymore: SELECT emp_id, array[emp_id] path -- ← Here init our path. FROM incoming_records WHERE supervisor_id IS NULL UNION ALL SELECT t.emp_id, path||t.emp_id -- ← Here append the current employee. FROM tmp INNER JOIN incoming_records AS t ON t.supervisor_id = tmp.emp_id WHERE t.supervisor_id = tmp.emp_id AND NOT t.emp_id = ANY(path) -- ← Here ensure the new employee is not a supervisor (already in the path) of this one. ) select * from tmp; 

(see it running in a fiddle)

A 2-steps job (first identifying the loops, then computing the tree when there is no more loop) would anyway require computing the path during the recursive CTE,
but would:

  • block your tool until you manually delete one of the culprits
  • hamper your diagnosis (if you start from the bottom as you tried, you will not be able to go up to the top and see where in the hierarchy your two candidates would approximately be)

With culprit display

You can even display the looping entry while tagging instead of filtering out (… and then on next iteration, stop the tagged entry of course):

WITH recursive tmp AS ( SELECT emp_id, array[emp_id] path, FALSE looping FROM incoming_records WHERE supervisor_id IS NULL UNION ALL SELECT t.emp_id, path||t.emp_id, t.emp_id = ANY(path) -- ← If detecting a loop, SELECT but tag. FROM tmp INNER JOIN incoming_records AS t ON t.supervisor_id = tmp.emp_id WHERE t.supervisor_id = tmp.emp_id AND NOT looping -- ← If last iteration was tagged, do not look for its subordinates anymore. ) select * from tmp; 
emp_id path looping
rt01 {rt01} f
rt10 {rt10} f
rt02 {rt01,rt02} f
rt11 {rt10,rt11} f
rt12 {rt10,rt12} f
rt13 {rt10,rt13} f
rt03 {rt01,rt02,rt03} f
rt04 {rt01,rt02,rt03,rt04} f
rt05 {rt01,rt02,rt03,rt04,rt05} f
rt06 {rt01,rt02,rt03,rt04,rt05,rt06} f
rt07 {rt01,rt02,rt03,rt04,rt05,rt06,rt07} f
rt08 {rt01,rt02,rt03,rt04,rt05,rt06,rt07,rt08} f
rt09 {rt01,rt02,rt03,rt04,rt05,rt06,rt07,rt08,rt09} f
rt08 {rt01,rt02,rt03,rt04,rt05,rt06,rt07,rt08,rt09,rt08} t

Then you can use it both in your production query (where you will filter out with a WHERE NOT looping to keep only one rt08) and in your manual diagnosing one.

Limits

Duplicates

Note that there still may be duplicates if you have two ways to enter the loop.

For example if both 8 and 9 are under 7, and one under another,
you'll get 9 both as 7 > 9 and 7 > 8 > 9 (which is not a loop, as 9 still isn't in the path; there will be a loop on next iteration, with 7 > 8 > 9 > 8).

But this isn't specific to loops, here it is a problem of graph.

You can prevent this by a choosing a DISTINCT ON (emp_id) ORDER BY ARRAY_LENGTH(path, 1) DESC (this elects the shortest path to the employee, but if 2 path lead to the same employee (for example if 3 > 5 and 4 > 5) one will be elected arbitrarily; and subordinates of this uncertain one will not necessarily choose the same path, thus you may end with 3 > 5 for 5 but 4 > 5 > 6 for 6).

No point of entry

This top-to-bottom requires your looping elements to have multiple parents, with at least one reachable from a no-supervisored employee.
A pure loop will never appear in the results. (e.g.: if 8 > 9 and 9 > 8 loop together with no 7 > 8 to connect them to the rest of the world).

In this case you will not have other choice than having a separate "pure loop detector", as seen in my dedicated answer.

Comments

1

pgr_connectedComponents()

Compared to roll-your-own recursive CTEs, functions that come in pgrouting extension run faster, using less memory, thanks to years of Boost C++ development behind it.

On sample sets with a few thousand rows it runs a few times faster than all alternatives here and it's the only one applicable above that size. The recursive CTEs run out of memory, leak to disk, then run out of space in the lower thousands. Here's a test on 7.5k rows where it takes 143ms, while pgrouting finishes in 31ms.
If you increase the sample size above that, db<>fiddle runs out of resources necessary to keep the recursive CTE example alive.

On 50k samples, pgrouting example below finds 3899 rows from groups with loops in 170ms:
demo at db<>fiddle

with trees as( select supervisor_id , emp_id , every(supervisor_id is not null)over w1 as has_cycle from pgr_connectedComponents( $x$ select row_number()over() as id , coalesce(emp_id,supervisor_id,1e7::int) as source , coalesce(supervisor_id,emp_id,1e7::int) as target , 1 as cost , 1 as reverse_cost from incoming_records $x$) as c(seq,subgraph,emp_id) join incoming_records using(emp_id) window w1 as (partition by subgraph) ) select supervisor_id , emp_id from trees where has_cycle; 

Note that this assumes you can't make an employee their own supervisor and that each employee can have at most one supervisor at a time. In this scenario, it's enough to establish the top, null-pointing supervisor is missing to determine a tree contains a loop.

Comments

0

Detecting pure loops

If you have pure loops in closed circuit (with no way to go up to a head of hierarchy with no supervisor),
my display as much as possible approach won't work, because it requires a head of hierarchy as entry point.

In that case you can still use PostgreSQL arrays to construct the bottom-to-top path for each employee,
and stop as soon as you encounter an already seen employee:

WITH recursive tmp AS ( SELECT supervisor_id, emp_id, array[]::text[] path -- ← Start with an empty path beyond each employee. FROM incoming_records UNION ALL SELECT t.supervisor_id, t.emp_id, tmp.emp_id||path -- ← Append the employee where we came from to the path. FROM tmp INNER JOIN incoming_records AS t ON tmp.supervisor_id = t.emp_id AND NOT tmp.emp_id = ANY(path) -- ← Do not continue for entries having appeared in their own path. ) select * from tmp where emp_id = ANY(path); -- ← And show only those, not the intermediates who succeeded into reaching a head of hierarchy without loop. 

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.