This relates to counting the number of records that match a certain condition, e.g. invoice amount > $100.

I tend to prefer

COUNT(CASE WHEN invoice_amount > 100 THEN 1 END) 

However, this is just as valid

SUM(CASE WHEN invoice_amount > 100 THEN 1 ELSE 0 END) 

I would have thought COUNT is preferable for 2 reasons:

  1. Conveys the intention, which is to COUNT
  2. COUNT probably involves a simple i += 1 operation somewhere, whereas SUM cannot count on its expression to be a simple integer value.

Does anyone have specific facts about the difference on specific RDBMS?


3 Answers


You mostly answered the question yourself already. I have a few morsels to add:

In PostgreSQL (and other RDBMS that support the boolean type) you can use the boolean result of the test directly. Cast it to integer and SUM():

SUM((amount > 100)::int) 

Or use it in a NULLIF() expression and COUNT():

COUNT(NULLIF(amount > 100, FALSE)) 

Or with a simple OR NULL:

COUNT(amount > 100 OR NULL) 

Or various other expressions. Performance is almost identical. COUNT() is typically very slightly faster than SUM(). Unlike SUM(), and as Paul already commented, COUNT() never returns NULL, which may be convenient.
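To illustrate the NULL point, a minimal sketch (assuming the tbl / amount setup from the benchmarks below): over zero input rows, SUM() returns NULL while COUNT() returns 0, so the SUM() variant may need a COALESCE() wrapper.

-- No rows qualify here, so SUM() yields NULL but COUNT() yields 0:
SELECT SUM(CASE WHEN amount > 100 THEN 1 ELSE 0 END) AS sum_result    -- NULL
     , COUNT(CASE WHEN amount > 100 THEN 1 END)      AS count_result  -- 0
FROM   tbl
WHERE  false;

-- COALESCE() restores the 0 if you stick with SUM():
SELECT COALESCE(SUM(CASE WHEN amount > 100 THEN 1 ELSE 0 END), 0)
FROM   tbl
WHERE  false;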

Since Postgres 9.4 there's also the aggregate FILTER clause.

It's faster than all of the above by around 5 - 10 %:

COUNT(*) FILTER (WHERE amount > 100) 

If the query is as simple as your test case, with only a single count and nothing else, you can rewrite:

SELECT count(*) FROM tbl WHERE amount > 100; 

... which is the true king of performance, even without index.
With an applicable index it can be faster by orders of magnitude, especially with index-only scans.
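As a sketch of what an applicable index can look like (index names here are purely illustrative), the benchmarks below use a plain index on the filter column; if you always count the same fixed condition, a partial index can be smaller still:

-- Plain index on the filter column (used in the benchmarks below):
CREATE INDEX ON tbl (amount);

-- Optional: a partial index for a fixed predicate (illustrative name):
CREATE INDEX tbl_amount_gt100_idx ON tbl (amount) WHERE amount > 100;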

Benchmarks

Postgres 13

db<>fiddle here

Basically the same results as for Postgres 10 below. (I also added a test without the new parallelism to the fiddle, to compare apples with apples.)
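One way to switch off parallelism per session, if you want to reproduce that comparison (a sketch; the fiddle may do it differently):

-- Disable parallel query for the current session:
SET max_parallel_workers_per_gather = 0;

SELECT count(*) FROM tbl WHERE amount > 100;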

Postgres 10

I ran a new series of tests for Postgres 10, including the aggregate FILTER clause and demonstrating the role of an index for small and big counts.

Simple setup:

CREATE TABLE tbl (
  tbl_id int
, amount int NOT NULL
);

INSERT INTO tbl
SELECT g, (random() * 150)::int
FROM   generate_series (1, 1000000) g;

-- only relevant for the last test
CREATE INDEX ON tbl (amount);

Actual times vary quite a bit due to background noise and specifics of the test bed. Showing typical best times from a bigger set of tests. These two cases should capture the essence:

Test 1 counting ~ 1 % of all rows

SELECT COUNT(NULLIF(amount > 148, FALSE))             FROM tbl;  -- 140 ms
SELECT SUM((amount > 148)::int)                       FROM tbl;  -- 136 ms
SELECT SUM(CASE WHEN amount > 148 THEN 1 ELSE 0 END)  FROM tbl;  -- 133 ms
SELECT COUNT(CASE WHEN amount > 148 THEN 1 END)       FROM tbl;  -- 130 ms
SELECT COUNT((amount > 148) OR NULL)                  FROM tbl;  -- 130 ms
SELECT COUNT(*) FILTER (WHERE amount > 148)           FROM tbl;  -- 118 ms  -- !

SELECT count(*) FROM tbl WHERE amount > 148;  -- without index   -- 75 ms   -- !!
SELECT count(*) FROM tbl WHERE amount > 148;  -- with index      -- 1.4 ms  -- !!!

db<>fiddle here

Test 2 counting ~ 33 % of all rows

SELECT COUNT(NULLIF(amount > 100, FALSE))             FROM tbl;  -- 140 ms
SELECT SUM((amount > 100)::int)                       FROM tbl;  -- 138 ms
SELECT SUM(CASE WHEN amount > 100 THEN 1 ELSE 0 END)  FROM tbl;  -- 139 ms
SELECT COUNT(CASE WHEN amount > 100 THEN 1 END)       FROM tbl;  -- 138 ms
SELECT COUNT(amount > 100 OR NULL)                    FROM tbl;  -- 137 ms
SELECT COUNT(*) FILTER (WHERE amount > 100)           FROM tbl;  -- 132 ms  -- !

SELECT count(*) FROM tbl WHERE amount > 100;  -- without index   -- 102 ms  -- !!
SELECT count(*) FROM tbl WHERE amount > 100;  -- with index      -- 55 ms   -- !!!

db<>fiddle here

The last test in each set used an index-only scan, which is why it helped even for counting one third of all rows. Plain index or bitmap index scans cannot compete with a sequential scan when roughly 5 % or more of all rows are involved.
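To check which plan a given query actually gets, a quick sketch (the index name assumes the default naming Postgres picks for CREATE INDEX ON tbl (amount)):

EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM tbl WHERE amount > 148;
-- Look for "Index Only Scan using tbl_amount_idx on tbl" and a low
-- "Heap Fetches" count to confirm the index-only scan.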

Old test for Postgres 9.1

To verify, I ran a quick test with EXPLAIN ANALYZE on a real-life table in PostgreSQL 9.1.6.

74208 of 184568 rows qualified with the condition kat_id > 50. All queries return the same result. I ran each about 10 times in turns to exclude caching effects and appended the best result as a note:

SELECT SUM((kat_id > 50)::int)                        FROM log_kat;  -- 438 ms
SELECT COUNT(NULLIF(kat_id > 50, FALSE))              FROM log_kat;  -- 437 ms
SELECT COUNT(CASE WHEN kat_id > 50 THEN 1 END)        FROM log_kat;  -- 437 ms
SELECT COUNT((kat_id > 50) OR NULL)                   FROM log_kat;  -- 436 ms
SELECT SUM(CASE WHEN kat_id > 50 THEN 1 ELSE 0 END)   FROM log_kat;  -- 432 ms

Hardly any real difference in performance.


This is my test on SQL Server 2012 RTM.

if object_id('tempdb..#temp1') is not null drop table #temp1;
if object_id('tempdb..#timer') is not null drop table #timer;
if object_id('tempdb..#bigtimer') is not null drop table #bigtimer;
GO

select a.*
into #temp1
from master..spt_values a
join master..spt_values b on b.type='p' and b.number < 1000;

alter table #temp1 add id int identity(10,20) primary key clustered;

create table #timer (
    id int identity primary key,
    which bit not null,
    started datetime2 not null,
    completed datetime2 not null,
);
create table #bigtimer (
    id int identity primary key,
    which bit not null,
    started datetime2 not null,
    completed datetime2 not null,
);
GO

--set ansi_warnings on;
set nocount on;
dbcc dropcleanbuffers with NO_INFOMSGS;
dbcc freeproccache with NO_INFOMSGS;

declare @bigstart datetime2;
declare @start datetime2, @dump bigint, @counter int;

set @bigstart = sysdatetime();
set @counter = 1;
while @counter <= 100
begin
    set @start = sysdatetime();
    select @dump = count(case when number < 100 then 1 end) from #temp1;
    insert #timer values (0, @start, sysdatetime());
    set @counter += 1;
end;
insert #bigtimer values (0, @bigstart, sysdatetime());
set nocount off;
GO

set nocount on;
dbcc dropcleanbuffers with NO_INFOMSGS;
dbcc freeproccache with NO_INFOMSGS;

declare @bigstart datetime2;
declare @start datetime2, @dump bigint, @counter int;

set @bigstart = sysdatetime();
set @counter = 1;
while @counter <= 100
begin
    set @start = sysdatetime();
    select @dump = SUM(case when number < 100 then 1 else 0 end) from #temp1;
    insert #timer values (1, @start, sysdatetime());
    set @counter += 1;
end;
insert #bigtimer values (1, @bigstart, sysdatetime());
set nocount off;
GO

Looking at individual runs and batches separately:

select which,
       min(datediff(mcs, started, completed)),
       max(datediff(mcs, started, completed)),
       avg(datediff(mcs, started, completed))
from   #timer
group  by which;

select which,
       min(datediff(mcs, started, completed)),
       max(datediff(mcs, started, completed)),
       avg(datediff(mcs, started, completed))
from   #bigtimer
group  by which;

The results after running 5 times (and repeating) are quite inconclusive.

Individual (microseconds; which 0 = COUNT, 1 = SUM)
which   min         max         avg
-----   ---------   ---------   ---------
0       93600       187201      103927
1       93600       187201      103864

Batch (microseconds)
which   min         max         avg
-----   ---------   ---------   ---------
0       10108817    10545619    10398978
1       10327219    10498818    10386498

It shows that there is far more variability in the running conditions than there is difference between the implementations, when measured at the granularity of the SQL Server timer. Either version can come out on top, and the maximum variance I ever got is 2.5%.

However, taking a different approach:

set showplan_text on;
GO
select SUM(case when number < 100 then 1 else 0 end) from #temp1;
select count(case when number < 100 then 1 end) from #temp1;

StmtText (SUM)

|--Compute Scalar(DEFINE:([Expr1003]=CASE WHEN [Expr1011]=(0) THEN NULL ELSE [Expr1012] END))
     |--Stream Aggregate(DEFINE:([Expr1011]=Count(*), [Expr1012]=SUM([Expr1004])))
          |--Compute Scalar(DEFINE:([Expr1004]=CASE WHEN [tempdb].[dbo].[#temp1].[number]<(100) THEN (1) ELSE (0) END))
               |--Clustered Index Scan(OBJECT:([tempdb].[dbo].[#temp1]))

StmtText (COUNT)

|--Compute Scalar(DEFINE:([Expr1003]=CONVERT_IMPLICIT(int,[Expr1008],0)))
     |--Stream Aggregate(DEFINE:([Expr1008]=COUNT([Expr1004])))
          |--Compute Scalar(DEFINE:([Expr1004]=CASE WHEN [tempdb].[dbo].[#temp1].[number]<(100) THEN (1) ELSE NULL END))
               |--Clustered Index Scan(OBJECT:([tempdb].[dbo].[#temp1]))

From my reading, it would appear that the SUM version does a little more. It is performing a COUNT in addition to a SUM. Having said that, COUNT(*) is different and should be faster than COUNT([Expr1004]) (skip NULLs, more logic). A reasonable optimizer will realise that [Expr1004] in SUM([Expr1004]) in the SUM version is an "int" type and so utilise an integer register.
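The difference between COUNT(*) and COUNT(expression) is just NULL handling; a minimal, hypothetical illustration:

-- COUNT(*) counts every row; COUNT(col) skips rows where col is NULL:
select count(*)   as all_rows,        -- 3
       count(col) as non_null_rows    -- 2
from   (values (1), (NULL), (2)) as t(col);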

In any case, while I still believe the COUNT version will be faster in most RDBMS, my conclusion from testing is that I am going to go with SUM(.. 1.. 0..) in the future, at least for SQL Server for no other reason than the ANSI WARNINGS being raised when using COUNT.
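For the record, this is the warning in question (a sketch against the same #temp1 table; with ANSI_WARNINGS on, the COUNT variant feeds NULLs to the aggregate and SQL Server reports it):

set ansi_warnings on;

-- Raises: Warning: Null value is eliminated by an aggregate or other SET operation.
select count(case when number < 100 then 1 end) from #temp1;

-- The SUM variant never aggregates a NULL, so no warning:
select sum(case when number < 100 then 1 else 0 end) from #temp1;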


In my experience, making a trace for both methods on a query of about 10,000,000 rows, I noticed that Sum(1) used around twice the CPU of Count(*), and Count(*) ran a bit faster. My queries are without a filter, though.

Count(*)

CPU...........: 1828
Execution time: 470 ms

Sum(1)

CPU...........: 3859
Execution time: 681 ms
You should specify which RDBMS you used to make this test. – Commented Aug 26, 2019 at 11:16
