
I have a Postgres table foo with an external_id BIGINT column that contains gaps. Is there a way, without looping or joining against (or excluding from) an excessively large generate_series, to get n values that are not present in that column?

Say the table has rows with external_ids 1, 3, 5 and 7, and I want 5 values back: it should return 2, 4, 6, 8, 9. The next select (given that nothing changed) should return 10, 11, 12, 13, 14, and so on. So both gaps and continuously rising, sequential ids should be returned. I could do something like

WITH id_series AS ( SELECT generate_series(1, 10) AS ids ) SELECT id_series.ids FROM id_series LEFT JOIN known_ids ki ON id_series.ids = ki.id WHERE ki.id IS NULL LIMIT 5; 

This, however, might not return enough unused ids if the series isn't big enough for it to find any, forcing me to loop until 5 ids have been returned.

My table currently contains ids ranging between 50,000 and 1,800,000,000. Some gaps are a few million ids wide; other ranges don't have a single gap for another few million ids (potentially requiring many, many loop iterations, or a very large series).

Is there any clever & efficient way to solve this?


Here's the reason I'm doing this. The table contains the external_ids from an external data source. This data source has to be queried with individual ids, and it exposes its entities over time, so id 1 billion might be "public" before id 5. I therefore need to know which external ids I haven't gotten a response for yet and query them periodically. As I said before, it's something I can't change because I don't control the external data source; it's an actual business requirement.

  • Any number of reasons why this might not be feasible--depends on why you have gaps and how they get filled--but have you considered building a table containing the list of "gap" ids, and then picking/deleting from there to fill in the gaps? It'd take the "big query" to fill it, but once done it's done. Commented Sep 13, 2023 at 21:23
  • The most effective/efficient solution is to do nothing; it is not a problem. Accept that gaps are normal and expected. Also consider that the column name external_id implies you do not control the value, so why are you manufacturing values? Further, if you are generating the values, it is still not a problem. Starting from the highest value you currently have (1,800,000,000), if you generate 1,000,000 values per second, you have about 290,000 years before you exceed the maximum for a bigint. Commented Sep 13, 2023 at 22:37
  • @Belayer Because I need to know the gaps and query the external source for them periodically. It's not a matter of data hygiene, and the gaps are not a "problem"; they are simply part of my problem domain. Commented Sep 13, 2023 at 22:47
  • What is actually the problem? I see that in the range 1, 3, 5 the numbers 2 and 4 are missing, but I don't see the problem. The numbers 1.1 and 2.8 are also not there. "id" stands for "identifier", and the number 1 does the same job as the number 3, even when the number 2 is not there. Commented Sep 13, 2023 at 22:59
  • @PhilipKelley In that case, as Philip suggested, also keep a table with the ids (or id ranges) that are still missing and that you want to query. Store for each range when it was last queried, what the response was, and when it should be scanned next. When you do get a response, delete/split the range to remove the ids where you found an entity. Commented Sep 13, 2023 at 23:40
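The range-tracking idea from the comments above could be sketched like this. All table and column names here are invented for illustration, and the split statement is not concurrency-safe as written:

```sql
-- Hypothetical table of id ranges that are still missing (bounds inclusive).
CREATE TABLE missing_ranges (
    range_start  BIGINT NOT NULL,
    range_end    BIGINT NOT NULL,
    last_queried TIMESTAMPTZ,
    next_scan    TIMESTAMPTZ NOT NULL DEFAULT now(),
    PRIMARY KEY (range_start)
);

-- When the external source finally returns an entity for, say, id 12345,
-- delete the range that contains it and re-insert up to two narrower ranges.
WITH old AS (
    DELETE FROM missing_ranges
    WHERE 12345 BETWEEN range_start AND range_end
    RETURNING range_start, range_end
)
INSERT INTO missing_ranges (range_start, range_end)
SELECT v.s, v.e
FROM old
CROSS JOIN LATERAL (VALUES (old.range_start, 12345 - 1),
                           (12345 + 1, old.range_end)) AS v(s, e)
WHERE v.s <= v.e;   -- drop empty halves when the id sat at a range edge
```

Picking the next batch to query is then a cheap indexed scan over missing_ranges ordered by next_scan, with no need to rediscover the gaps each time.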

1 Answer


Your query has to scan the entire table anyway. I don't think it would be too inefficient to use something like

SELECT generate_series( (SELECT MIN(external_id) FROM known), (SELECT MAX(external_id) FROM known) ) EXCEPT SELECT external_id FROM known 

Sure, this will not return ids below or above the existing ids, and might generate fewer ids than the n values you need, but you could easily handle that in your application logic. Or use generate_series(1, max(external_id)+n).
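A sketch of that padded variant, assuming the question's table foo and n = 5 (names are illustrative): extending the series n past the current maximum guarantees at least n unused ids exist in the scanned range.

```sql
-- Pad the upper bound by n so the anti-join always finds at least n unused ids.
SELECT s.id
FROM generate_series(
       (SELECT MIN(external_id) FROM foo),
       (SELECT MAX(external_id) FROM foo) + 5
     ) AS s(id)
LEFT JOIN foo ON foo.external_id = s.id
WHERE foo.external_id IS NULL
LIMIT 5;
```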

If you don't actually need individual ids, but rather the range of each gap, look into queries such as Find gaps of a sequence in SQL without creating additional tables.

If you don't need all gaps at once, but only the next ones, with no more than n values, you can write an iterative algorithm to find gap after gap (until you have enough values) using a recursive CTE.
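One hedged sketch of such a recursive CTE, assuming an index on foo(external_id) (names illustrative). Each recursive step stops at the first row whose successor is absent, so it walks from gap start to gap start without materializing a series:

```sql
-- Enumerate the first missing id of each gap, one gap per recursion step.
WITH RECURSIVE gap_starts(gap_after) AS (
    SELECT (SELECT MIN(external_id) FROM foo) - 1   -- seed just below the data
    UNION ALL
    SELECT (
        SELECT a.external_id                        -- last id before next gap
        FROM foo a
        WHERE a.external_id > g.gap_after
          AND NOT EXISTS (SELECT 1 FROM foo b
                          WHERE b.external_id = a.external_id + 1)
        ORDER BY a.external_id
        LIMIT 1
    )
    FROM gap_starts g
    WHERE g.gap_after IS NOT NULL                   -- stop after the last gap
)
SELECT gap_after + 1 AS first_missing_id
FROM gap_starts
WHERE gap_after IS NOT NULL
LIMIT 5 OFFSET 1;                                   -- skip the seed row
```

This yields one id per gap; expanding each gap into individual values (up to n total) would be a second step in SQL or in the application.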


1 Comment

EXCEPT is super slow for me; however, a left join where nothing is found in the known table is crazy fast when used with a limit (which is fine because I'm querying in batches anyway and don't need all ids at once). I guess generate_series is lazy enough. It only gets slow when an offset is used, but that can be worked around by doing the offset via the lower bound of the generate_series call. IDK why I was under the impression that this would be slow. Thanks for actually giving an answer :)
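The batched left-join approach described in this comment might look like the following sketch, where :last_seen is a placeholder for the highest id already handled. Raising the series' lower bound replaces the slow OFFSET, and the LIMIT lets the lazily evaluated generate_series stop early:

```sql
-- Next batch of 5 unused ids, resuming after :last_seen (placeholder).
SELECT s.id
FROM generate_series(:last_seen + 1,
                     (SELECT MAX(external_id) FROM known) + 5) AS s(id)
LEFT JOIN known k ON k.external_id = s.id
WHERE k.external_id IS NULL
LIMIT 5;
```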
