I have a website for showing matches of an RTS game.
One of the main use cases is to show recent matches of a player or a group of players.
PostgreSQL 14.13
Table structure:
```sql
create table public.match (
    match_id integer primary key not null,
    started  timestamp(3) without time zone,
    ...
);
create index "IDX_e7b6cfca8139b9aa85880aab9e" on match using btree (started);

create table public.player (
    match_id   integer  not null,
    profile_id integer  not null,
    slot       smallint not null,
    ...,
    primary key (match_id, profile_id, slot),
    foreign key (match_id) references public.match (match_id)
        match simple on update cascade on delete restrict,
    foreign key (profile_id) references public.profile (profile_id)
        match simple on update cascade on delete restrict
);
create index idx_player_profile_match on player using btree (profile_id, match_id);
create index "IDX_58afd2c450f166eacbdf982841" on player using btree (match_id);
create index "IDX_ba3de28aa98207f3a21145feb8" on player using btree (profile_id);
```

The match table has 50 million records, the player table 200 million.
There are about 50000 matches with 4 players on average added per day to the database.
The tables currently contain data for the last 3 years. The match table has a relation_size of 6 GB and a total_relation_size of 12 GB; for the player table it is 15 GB and 37 GB.
Querying the DB for the most recent 50 matches of a player, or for all matches of a player in the last month, takes almost a second. Repeated executions are faster because of caching, but that does not help me much, since different players are requested most of the time.
```sql
-- most recent 50 matches for a player
EXPLAIN ANALYSE
SELECT *
FROM   match m
JOIN   player p ON p.match_id = m.match_id
WHERE  p.profile_id = 582058
ORDER  BY started DESC
LIMIT  50;
```

```
Limit  (cost=12075.34..12081.09 rows=50 width=107) (actual time=675.517..682.833 rows=50 loops=1)
  ->  Gather Merge  (cost=12075.34..12158.48 rows=723 width=107) (actual time=675.516..682.828 rows=50 loops=1)
        Workers Planned: 1
        Workers Launched: 1
        ->  Sort  (cost=11075.33..11077.14 rows=723 width=107) (actual time=663.726..663.731 rows=38 loops=2)
              Sort Key: m.started DESC
              Sort Method: top-N heapsort  Memory: 39kB
              Worker 0:  Sort Method: top-N heapsort  Memory: 37kB
              ->  Nested Loop  (cost=22.66..11051.31 rows=723 width=107) (actual time=3.286..661.017 rows=1882 loops=2)
                    ->  Parallel Bitmap Heap Scan on player p  (cost=22.09..4853.22 rows=723 width=40) (actual time=2.774..210.891 rows=1882 loops=2)
                          Recheck Cond: (profile_id = 582058)
                          Heap Blocks: exact=1855
                          ->  Bitmap Index Scan on "IDX_ba3de28aa98207f3a21145feb8"  (cost=0.00..21.79 rows=1229 width=0) (actual time=4.818..4.819 rows=3772 loops=1)
                                Index Cond: (profile_id = 582058)
                    ->  Index Scan using match_pkey on match m  (cost=0.56..8.57 rows=1 width=67) (actual time=0.237..0.237 rows=1 loops=3764)
                          Index Cond: (match_id = p.match_id)
Planning Time: 0.305 ms
Execution Time: 682.897 ms
```

```sql
-- matches in the last month for a player
EXPLAIN ANALYSE
SELECT *
FROM   match m
JOIN   player p ON p.match_id = m.match_id
WHERE  m.started > '2024-09-01'
AND    p.profile_id = 271202
ORDER  BY started DESC;
```

```
Gather Merge  (cost=12053.93..12057.61 rows=32 width=107) (actual time=797.710..797.881 rows=161 loops=1)
  Workers Planned: 1
  Workers Launched: 1
  ->  Sort  (cost=11053.92..11054.00 rows=32 width=107) (actual time=782.572..782.578 rows=80 loops=2)
        Sort Key: m.started DESC
        Sort Method: quicksort  Memory: 36kB
        Worker 0:  Sort Method: quicksort  Memory: 35kB
        ->  Nested Loop  (cost=22.66..11053.12 rows=32 width=107) (actual time=171.842..781.756 rows=80 loops=2)
              ->  Parallel Bitmap Heap Scan on player p  (cost=22.09..4853.22 rows=723 width=40) (actual time=1.254..268.523 rows=1682 loops=2)
                    Recheck Cond: (profile_id = 271202)
                    Heap Blocks: exact=1721
                    ->  Bitmap Index Scan on "IDX_ba3de28aa98207f3a21145feb8"  (cost=0.00..21.79 rows=1229 width=0) (actual time=1.712..1.712 rows=3378 loops=1)
                          Index Cond: (profile_id = 271202)
              ->  Index Scan using match_pkey on match m  (cost=0.56..8.58 rows=1 width=67) (actual time=0.304..0.304 rows=0 loops=3365)
                    Index Cond: (match_id = p.match_id)
                    Filter: (started > '2024-09-01 00:00:00'::timestamp without time zone)
                    Rows Removed by Filter: 1
Planning Time: 0.446 ms
Execution Time: 797.931 ms
```

When executed again, the execution time drops to about 40 ms because of caching.
As far as I understand, Postgres first finds all player records for the given profile_id, then joins them to match and sorts by the started date, and only then applies the limit / date filter. This way the query becomes more and more inefficient as more matches per player are added to the database.
My idea is to add the started date of the match redundantly to the player table and create an index on (profile_id, started).
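The denormalization I have in mind would look roughly like this (column and index names are my own; on 200 million rows the backfill would need to be done in batches in practice):

```sql
-- Add the redundant column; adding a nullable column is cheap (no table rewrite)
ALTER TABLE player ADD COLUMN started timestamp(3);

-- Backfill from match (batched in practice to limit lock time and bloat)
UPDATE player p
SET    started = m.started
FROM   match m
WHERE  m.match_id = p.match_id;

-- Index supporting "recent matches of a player"
CREATE INDEX player_profile_started_idx ON player (profile_id, started DESC);
```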
Do you think this is a good solution? Are there other options?
I also thought about partitioning the match/player tables by started date, but this seems to be a lot of effort.
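For reference, declarative range partitioning of match would look something like the sketch below; migrating the existing 50 million rows and reworking the foreign keys is where the real effort lies. Note that the partition key has to be part of the primary key, which changes the match_pkey the player table's foreign key relies on:

```sql
-- Sketch only: a partitioned replacement for match
CREATE TABLE match_partitioned (
    match_id integer NOT NULL,
    started  timestamp(3) NOT NULL,
    PRIMARY KEY (match_id, started)   -- partition key must be included
) PARTITION BY RANGE (started);

-- One partition per year (could also be per month)
CREATE TABLE match_y2024 PARTITION OF match_partitioned
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
```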
Update using the answer from Erwin Brandstetter:
Thank you for your advice!
I created the index and filled the started column of the player table:
```sql
CREATE INDEX player_profile_id_started_match_id
ON player (profile_id, started) INCLUDE (match_id);
```

The recent-matches query for one player is very fast now:
```sql
-- get recent 50 matches for one player
EXPLAIN ANALYSE
SELECT *
FROM  (
   SELECT p.match_id
   FROM   player p
   WHERE  p.profile_id = 199325
   ORDER  BY p.started DESC
   LIMIT  50
   ) p
JOIN  match m USING (match_id);
```

```
Nested Loop  (cost=1.14..500.47 rows=50 width=67) (actual time=2.721..21.442 rows=50 loops=1)
  ->  Limit  (cost=0.57..70.85 rows=50 width=12) (actual time=2.702..14.494 rows=50 loops=1)
        ->  Index Only Scan Backward using player_profile_id_started_match_id on player p  (cost=0.57..1739.19 rows=1237 width=12) (actual time=2.701..14.480 rows=50 loops=1)
              Index Cond: (profile_id = 199325)
              Heap Fetches: 50
  ->  Index Scan using match_pkey on match m  (cost=0.56..8.58 rows=1 width=67) (actual time=0.138..0.138 rows=1 loops=50)
        Index Cond: (match_id = p.match_id)
Planning Time: 0.222 ms
Execution Time: 21.482 ms
```

I also have the use case of querying the recent matches for multiple players (one or more of the given players needs to be in the match).
```sql
-- get recent 50 matches from multiple players
EXPLAIN ANALYSE
SELECT *
FROM  (
   SELECT p.match_id
   FROM   player p
   WHERE  p.profile_id IN (1136191, 19455781)
   ORDER  BY p.started DESC
   LIMIT  50
   ) p
JOIN  match m USING (match_id);
```

```
Nested Loop  (cost=3557.08..3986.26 rows=50 width=67) (actual time=511.692..515.601 rows=50 loops=1)
  ->  Limit  (cost=3556.51..3556.64 rows=50 width=12) (actual time=511.658..511.671 rows=50 loops=1)
        ->  Sort  (cost=3556.51..3562.70 rows=2473 width=12) (actual time=511.657..511.663 rows=50 loops=1)
              Sort Key: p.started DESC
              Sort Method: top-N heapsort  Memory: 29kB
              ->  Index Only Scan using player_profile_id_started_match_id on player p  (cost=0.57..3474.36 rows=2473 width=12) (actual time=0.592..508.846 rows=5277 loops=1)
                    Index Cond: (profile_id = ANY ('{1136191,19455781}'::integer[]))
                    Heap Fetches: 2965
  ->  Index Scan using match_pkey on match m  (cost=0.56..8.58 rows=1 width=67) (actual time=0.078..0.078 rows=1 loops=50)
        Index Cond: (match_id = p.match_id)
Planning Time: 1.090 ms
Execution Time: 515.662 ms
```

In this example it was very slow at first. But after running `VACUUM ANALYZE public.player;`, which took 7 minutes, it runs fast (about 30 ms).
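An alternative I am considering for the multi-player case (my own sketch, not from the accepted answer): fetch the top 50 per player in separate `LATERAL` branches, so each branch can walk the (profile_id, started) index backward and stop after 50 rows, then merge, deduplicate, and re-limit:

```sql
SELECT m.*, d.started
FROM  (
   SELECT DISTINCT ON (t.match_id) t.match_id, t.started
   FROM   unnest('{1136191,19455781}'::int[]) AS ids(profile_id)
   CROSS  JOIN LATERAL (
      SELECT p.match_id, p.started
      FROM   player p
      WHERE  p.profile_id = ids.profile_id
      ORDER  BY p.started DESC   -- walks the index, stops after 50 rows
      LIMIT  50
      ) t
   ORDER  BY t.match_id, t.started DESC   -- DISTINCT ON removes duplicate matches
   ) d
JOIN   match m USING (match_id)
ORDER  BY d.started DESC
LIMIT  50;
```

This reads at most 50 index tuples per given player instead of all rows for all given players, at the cost of a more verbose query.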
Note: I have not yet updated the foreign key and unique index because the started date will never change after the creation of match and player records.
Comments:

- `SELECT *`. Which columns do you actually need? Postgres version? `SELECT version();`
- Can there be multiple entries for the same `(match_id, profile_id)` in table `player`? (If so, why?)
- `WHERE p.profile_id IN (1136191, 19455781)` finds any matches for the given players. Didn't you want matches that all of the given players played together? That would be a different query, key word "relational division". See: stackoverflow.com/a/7774879/939860
- Also, I hope you see index-only scans after `VACUUM`? That's the whole point of `INCLUDE (match_id)`.
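If matches where all of the given players took part are wanted, relational division in the self-join style of the linked answer could look like this (my sketch, using the same profile_ids as above):

```sql
-- Matches that players 1136191 AND 19455781 both played in
SELECT m.*
FROM   match m
JOIN   player p1 ON p1.match_id = m.match_id AND p1.profile_id = 1136191
JOIN   player p2 ON p2.match_id = m.match_id AND p2.profile_id = 19455781
ORDER  BY m.started DESC
LIMIT  50;
```

One join per required player; for a variable-length list of players, a `GROUP BY match_id HAVING count(*) = <number of players>` variant is the usual alternative.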