
When I run a simple benchmark test on a table that has 11,065,763 records, I get the results below:

    select sum(t.amount * t.exchange) from table t;

    Finalize Aggregate  (cost=658027.39..658027.40 rows=1 width=8) (actual time=2391.248..2391.248 rows=1 loops=1)
      Buffers: shared hit=25366 read=550786 dirtied=18 written=18
      I/O Timings: read=1978.454 write=0.205
      ->  Gather  (cost=658027.17..658027.38 rows=2 width=8) (actual time=2391.100..2391.229 rows=3 loops=1)
            Workers Planned: 2
            Workers Launched: 2
            Buffers: shared hit=25366 read=550786 dirtied=18 written=18
            I/O Timings: read=1978.454 write=0.205
            ->  Partial Aggregate  (cost=657027.17..657027.18 rows=1 width=8) (actual time=2377.613..2377.613 rows=1 loops=3)
                  Buffers: shared hit=24961 read=550759 dirtied=18 written=18
                  I/O Timings: read=1977.930 write=0.205
                  ->  Parallel Seq Scan on odeme_kaydi ok  (cost=0.00..622181.24 rows=4646124 width=16) (actual time=0.084..1972.061 rows=3688816 loops=3)
                        Buffers: shared hit=24961 read=550759 dirtied=18 written=18
                        I/O Timings: read=1977.930 write=0.205
    Planning time: 0.279 ms
    Execution time: 2408.745 ms

    select count(t.id) from table t;

    Aggregate  (cost=489270.14..489270.15 rows=1 width=8) (actual time=12256.560..12256.560 rows=1 loops=1)
      Buffers: shared hit=6902688 read=372054 dirtied=32
      I/O Timings: read=3067.841
      ->  Index Only Scan using pk_odeme_kaydi_id on odeme_kaydi ok  (cost=0.43..461393.25 rows=11150756 width=8) (actual time=0.169..11583.174 rows=11066478 loops=1)
            Heap Fetches: 4085161
            Buffers: shared hit=6902688 read=372054 dirtied=32
            I/O Timings: read=3067.841
    Planning time: 0.110 ms
    Execution time: 12256.609 ms

Note: the amount field is always greater than 0 in the table. The PostgreSQL version is 9.5.

Is this normal, or is there something tricky going on?

  • And how much time does select count(*) from table t; take? Commented Dec 6, 2018 at 13:24
  • @Akina for select count(*) the result is: Planning time: 0.964 ms, Execution time: 10891.418 ms Commented Dec 6, 2018 at 13:31
  • It's quite strange that counting all the records needs a scan instead of taking the result from the table statistics... and the same goes for t.id, which I assume is the primary key... Commented Dec 6, 2018 at 13:59
  • Table definition might also help shed some light on this. Commented Dec 6, 2018 at 14:03
  • @Akina: the count has to go through all rows. "Table statistics" are only estimates to begin with, but even if they weren't, the count has to honor concurrent transactions and thus it has to go through all rows. Anything else wouldn't be an accurate answer. Commented Dec 6, 2018 at 14:25

1 Answer


Typically, there is not much difference between sum() and count(); the number of data pages that have to be read is the dominant factor for performance. But count() is generally faster, especially count(*).

Your sum uses a Parallel Seq Scan, which turns out to be much faster here. The count, on the other hand, uses an Index Only Scan, which is typically the fastest way, but there is no parallel support for it here. That's probably what makes the difference. (It may be a limitation of Postgres 9.5, I'm not sure.)

You are using Postgres 9.5, where parallel queries were pretty new and cost estimation not very accurate yet; similar considerations apply to index-only scans, to a lesser degree. So the planner makes at least one bad cost estimate: 489270.15 for the count with an Index Only Scan vs. 658027.40 for the sum, for which an index-only scan is probably not possible (there is no index covering amount and exchange). Maybe you also have cost settings that don't reflect reality well, which is typically involved in bad estimates.
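As a starting point, you can inspect the planner settings that most influence the choice between index and sequential scans. This is only a minimal sketch; the commented-out adjustment is an illustration for fast (SSD-backed) storage, not a recommendation for your system:

    show random_page_cost;        -- default 4.0, often too pessimistic for SSD-backed storage
    show effective_cache_size;    -- should roughly reflect the memory available for caching

    -- example adjustment (assumption: fast random I/O; tune for your hardware):
    -- alter system set random_page_cost = 1.1;
    -- select pg_reload_conf();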

And I see Heap Fetches: 4085161 for the count, which strikes me as surprisingly high for an index-only scan on 11M rows: it means the visibility map is not up to date for a large part of the table, so many index entries still require a visit to the heap. A VACUUM (ANALYZE) on the table might change that.
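For example, assuming odeme_kaydi is the real table name behind the plans above:

    vacuum analyze odeme_kaydi;   -- updates the visibility map and refreshes planner statistics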

All that aside: since t.id is defined NOT NULL, count(*) returns the same result as count(t.id), but count(*) is a bit faster, because Postgres does not have to look into the stored (index) tuple at all. The mere existence of a row is sufficient.
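Both forms below return the same number for this table (a sketch, assuming the table name from your plans and an id column behind pk_odeme_kaydi_id that is NOT NULL):

    select count(*) from odeme_kaydi;    -- counts rows; no column value has to be inspected
    select count(id) from odeme_kaydi;   -- counts non-NULL id values; same result here, slightly more work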

And upgrade to a current version of Postgres, where either variant will be substantially faster. There have been major improvements for handling big data and for parallel queries.

BTW, if the count does not have to be exact, there are much faster options:
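One common option is to read the planner's row estimate from the system catalog instead of scanning anything. A minimal sketch, again assuming the table name from your plans; the estimate is only as fresh as the last VACUUM / ANALYZE / autovacuum run:

    select reltuples::bigint as estimated_rows
    from   pg_class
    where  relname = 'odeme_kaydi';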
