0

There are lots of questions on this topic, but all of them seem to be more complex cases than what I'm looking at at the moment and the answers don't seem applicable.

OHDSI=> \d record_counts Table "results2.record_counts" Column | Type | Modifiers ------------------------+-----------------------+----------- concept_id | integer | schema | text | table_name | text | column_name | text | column_type | text | descendant_concept_ids | bigint | rc | numeric | drc | numeric | domain_id | character varying(20) | vocabulary_id | character varying(20) | concept_class_id | character varying(20) | standard_concept | character varying(1) | Indexes: "rc_dom" btree (domain_id, concept_id) "rcdom" btree (domain_id) "rcdomvocsc" btree (domain_id, vocabulary_id, standard_concept) 

The table has 3,133,778 records, so Postgres shouldn't be ignoring the index because of small table size.

I filter on domain_id, which is indexed, and the index is ignored:

OHDSI=> explain select * from record_counts where domain_id = 'Drug'; QUERY PLAN ------------------------------------------------------------------------ Seq Scan on record_counts (cost=0.00..76744.81 rows=2079187 width=87) Filter: ((domain_id)::text = 'Drug'::text) 

I turn off seqscan and:

OHDSI=> set enable_seqscan=false; SET OHDSI=> explain select * from record_counts where domain_id = 'Drug'; QUERY PLAN ------------------------------------------------------------------------------------- Bitmap Heap Scan on record_counts (cost=42042.13..105605.97 rows=2079187 width=87) Recheck Cond: ((domain_id)::text = 'Drug'::text) -> Bitmap Index Scan on rcdom (cost=0.00..41522.33 rows=2079187 width=0) Index Cond: ((domain_id)::text = 'Drug'::text) 

Indeed, the plan says it's going to be more expensive to use the index than not, but why? If the index lets it handle many fewer records, shouldn't it be quicker to use it?

8
  • Are your statistics up to date? Commented Jan 23, 2017 at 12:58
  • BTW: Doesn't the table have a primary key? Commented Jan 23, 2017 at 13:02
  • @MikeSherrill'CatRecall': I ran analyze record_counts. Is there anything else I need to do? Commented Jan 23, 2017 at 13:13
  • @joop: It could have one on (concept_id, schema, table_name, column_name). Would that make a difference? Commented Jan 23, 2017 at 13:14
  • 1
    Yes, it makes a lot of difference. Not for the performance per se, but for the value of your data model. (In short: a table without (at least) a unique constraint basically has no meaning) Commented Jan 23, 2017 at 13:21

1 Answer 1

4

Ok, it looks like Postgres knew what it was doing. The particular value of the indexed column I was using ('Drug') happened to account for 66% of the rows in the table. So, yes, the filter makes the row set significantly smaller, but since those rows would be scattered between pages, the index doesn't allow them to be retrieved faster.

OHDSI=> select domain_id, count(*) as rows, round((100 * count(*)::float / 3133778.0)::numeric,4) pct from record_counts group by 1 order by 2 desc; domain_id | rows | pct ---------------------+---------+--------- Drug | 2074991 | 66.2137 Condition | 466882 | 14.8984 Observation | 217807 | 6.9503 Procedure | 165800 | 5.2907 Measurement | 127239 | 4.0602 Device | 29410 | 0.9385 Spec Anatomic Site | 28783 | 0.9185 Meas Value | 10415 | 0.3323 Unit | 2350 | 0.0750 Type Concept | 2170 | 0.0692 Provider Specialty | 1957 | 0.0624 Specimen | 1767 | 0.0564 Metadata | 1689 | 0.0539 Revenue Code | 538 | 0.0172 Place of Service | 480 | 0.0153 Race | 467 | 0.0149 Relationship | 242 | 0.0077 Condition/Obs | 182 | 0.0058 Currency | 180 | 0.0057 Condition/Meas | 115 | 0.0037 Route | 81 | 0.0026 Obs/Procedure | 78 | 0.0025 Condition/Device | 52 | 0.0017 Condition/Procedure | 25 | 0.0008 Meas/Procedure | 25 | 0.0008 Gender | 19 | 0.0006 Device/Procedure | 9 | 0.0003 Meas Value Operator | 9 | 0.0003 Visit | 8 | 0.0003 Drug/Procedure | 3 | 0.0001 Spec Disease Status | 3 | 0.0001 Ethnicity | 2 | 0.0001 

When I use any other value in the where clause (including 'Condition', with 15% of the rows), Postgres does use the index.

(Somewhat surprisingly, even after I cluster the table based on the domain_id index, it still doesn't use the index when I filter on 'Drug', but the performance improvement for filtering out 34% of the rows doesn't seem worth pursuing this further.)

Sign up to request clarification or add additional context in comments.

1 Comment

For a condition that returns 66% of all rows an index lookup will never be faster then a seq scan. Because retrieval through an index requires 3-4 I/O operations per row (random access). See e.g. here: stackoverflow.com/a/5203827/330315

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.