PostgreSQL: Correct way for subset calculations

Question

I have a huge table (around 10 million items). For simplicity, let's say it has only 2 columns: user_id and activity_id like this

user_id | activity_id --------------------- 1 | 1 1 | 2 1 | 3 2 | 1 2 | 2

I want to select all user_id with activity_id = 1, 2 NOT 3. In the case above it will be just one result: user_id = 2. I can do it using SELECT DISTINCT combined with INTERSECT and EXCEPT operators, but it seems to be extremely slow.

From what I know about databases, it can be improved with GIN and table partitioning, however I feel like it's not correct solution in the case of PostgreSQL (because subsets operators are slow by their own).

Please post your table(s) definition(s), query, and execution plan. — Colin 't Hart
– Colin 't Hart, Commented Jan 14, 2019 at 14:46
@a_horse_with_no_name activity_id=4 changes nothing in such a case. It's 10.000.000 rows. — Ximik
– Ximik, Commented Jan 14, 2019 at 15:07
When you say activity id = 1, 2 do you mean 1 AND 2, or 1 OR 2? — Evan Carroll
– Evan Carroll, Commented Jan 14, 2019 at 16:55

score 5 · Accepted Answer · 2019-01-14 13:53:10Z

You can easily do this with arrays in Postgres:

select user_id, array_agg(activity_id) as activities from users group by user_id having array_agg(activity_id) @> array[1,2] and not 3 = any(array_agg(activity_id));

The condition array_agg(activity_id) @> array[1,2] only returns those that have activity_ids 1 and 2 and the condition not 3 = any(array_agg(activity_id)) removes all those that contain activity_id = 3

If the table contains more than just those two columns, an index on (user_id, activitiy_id) will help as it enables Postgres to use an "Index Only Scan" instead of a full table scan. If there are only very users that have activity_ids 1 and two, an additional condition that only returns rows with either one of them (e.g. using a where exists condition) might help as it reduces the number of rows that need to be aggregated. In that case the index should be on (activity_id, user_id) to enable Postgres to remove unwanted rows efficiently.

On a table with 100.000 rows this ran in about 100ms on my laptop with Postgres 11 and a SSD.

Online example: https://rextester.com/YLN7221

sticky bit · Accepted Answer · 2019-01-14 13:19:46Z

You can first try to rewrite the query using EXISTS and a regular (B-tree) index on user_id and activity_id.

CREATE INDEX elbat_user_id_activity_id ON elbat (user_id, activity_id);

SELECT DISTINCT t1.user_id FROM elbat t1 WHERE EXISTS (SELECT * FROM elbat t2 WHERE t2.user_id = t1.user_id AND t2.activity_id = '1') AND EXISTS (SELECT * FROM elbat t2 WHERE t2.user_id = t1.user_id AND t2.activity_id = '2') AND NOT EXISTS (SELECT * FROM elbat t2 WHERE t2.user_id = t1.user_id AND t2.activity_id = '3');

If you have a user table, you might also want to join to that instead of retrieving the distinct user ID from the other table.

Thank you for this answer, seems like having index on two columns gives great perfomance boost. But the answer I've selected still seems to be more "correct". One more time, thanks. — Ximik
– Ximik, Commented Jan 18, 2019 at 14:52

Evan Carroll · Accepted Answer · 2019-01-15 14:59:50Z

You can use bool_or for this.

SELECT user_id FROM users GROUP BY user_id HAVING bool_or(activity_id IN (1,2)) -- assumes '1, 2' mean '1 OR 2' AND NOT bool_or(activity_id IN (3));

Stack Exchange Network

PostgreSQL: Correct way for subset calculations

3 Answers 3

Hot Network Questions

PostgreSQL: Correct way for subset calculations

3 Answers 3

Related

Hot Network Questions