I'm testing out the PostgreSQL Text-Search features, using the September data dump from StackOverflow as sample data. :-)

The naive approach of using LIKE predicates or POSIX regular-expression matching to search 1.2 million rows takes about 90-105 seconds (on my MacBook), because each query does a full table scan looking for the keyword.

SELECT * FROM Posts WHERE body LIKE '%postgresql%';
SELECT * FROM Posts WHERE body ~ 'postgresql';

An unindexed, ad hoc text-search query takes about 8 minutes:

SELECT * FROM Posts WHERE to_tsvector(body) @@ to_tsquery('postgresql'); 

Creating a GIN index takes about 40 minutes:

ALTER TABLE Posts ADD COLUMN PostText TSVECTOR;
UPDATE Posts SET PostText = to_tsvector(body);
CREATE INDEX PostText_GIN ON Posts USING GIN(PostText);

(I realize I could also do this in one step by defining it as an expression index.)
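
A minimal sketch of that one-step variant, assuming the same Posts table and body column and the built-in 'english' text-search configuration (the index name posts_body_fts_gin is made up for illustration). An expression index needs an immutable expression, so the two-argument form of to_tsvector with an explicit configuration is used, and a query has to repeat the same expression for the planner to consider the index:

```sql
-- Hypothetical index name; 'english' is an assumed text-search configuration.
CREATE INDEX posts_body_fts_gin ON Posts
    USING GIN (to_tsvector('english', body));

-- The WHERE clause must use the same expression as the index definition.
SELECT *
FROM Posts
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'postgresql');
```

The trade-off is that no extra column has to be kept in sync, but to_tsvector may be re-evaluated whenever the index match has to be rechecked against the row.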

Afterwards, a query assisted by a GIN index runs a lot faster -- this takes about 40 milliseconds:

SELECT * FROM Posts WHERE PostText @@ 'postgresql'; 

However, when I create a GiST index, the results are quite different. It takes less than 2 minutes to create the index:

CREATE INDEX PostText_GIST ON Posts USING GIST(PostText);

Afterwards, a query using the @@ text-search operator takes 90-100 seconds. So GiST indexes do improve an unindexed TS query from 8 minutes to 1.5 minutes. But that's no improvement over doing a full table-scan with LIKE. It's useless in a web programming environment.

Am I missing something crucial to using GiST indexes? Do the indexes need to be pre-cached in memory or something? I am using a plain PostgreSQL installation from MacPorts, with no tuning.

What is the recommended way to use GiST indexes? Or does everyone doing TS with PostgreSQL skip GiST indexes and use only GIN indexes?

PS: I do know about alternatives like Sphinx Search and Lucene. I'm just trying to learn about the features provided by PostgreSQL itself.

3 Answers

7

The docs have a nice overview of the performance differences between GiST and GIN indexes if you're interested: GiST and GIN Index Types.
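
That section of the docs describes a GiST index on a tsvector as lossy: it can return false matches, so PostgreSQL has to fetch each candidate row and recheck it. To see whether that is what is happening in a case like this, EXPLAIN ANALYZE can help. A sketch, assuming the Posts table and PostText column from the question:

```sql
-- Runs the query and reports the actual plan with timings.
-- An index or bitmap index scan node naming the GiST index means it is used;
-- a "Seq Scan" on Posts means the planner is ignoring it.
EXPLAIN ANALYZE
SELECT * FROM Posts WHERE PostText @@ to_tsquery('postgresql');
```

If the plan does use the index but most of the time is spent visiting table rows for the recheck, the slowness likely comes from that lossiness rather than from a missing or unused index.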

6

Try:

CREATE INDEX PostText_GIST ON Posts USING GIST(PostText varchar_pattern_ops); 

which creates an index suitable for prefix queries. See the PostgreSQL docs on Operator Classes and Operator Families. The @@ operator is only sensible on term vectors; the GiST index (with varchar_pattern_ops) will give excellent results with LIKE.

2 Comments

It must have taken quite some time to generate that index. :)

This cannot possibly work: varchar_pattern_ops is for type varchar, whereas PostText is of type tsvector, and the operator class is only defined for btree and hash indexes, not for GiST.

2

By the way, if this hasn't been answered to your satisfaction yet: the part where you wrote

SELECT * FROM Posts WHERE PostText @@ 'postgresql';

should have been

SELECT * FROM Posts WHERE PostText @@ to_tsquery('postgresql');
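
The difference is that a bare string literal is cast straight to tsquery without normalization, while to_tsquery runs the text through the text-search configuration. A small sketch that can be run in psql to compare the two (the sample word and the 'english' configuration are assumptions picked for illustration):

```sql
-- Cast only: the token is kept as written, with no lowercasing or stemming.
SELECT 'Databases'::tsquery;

-- to_tsquery: the token goes through the configuration's dictionaries,
-- so it is normalized the same way to_tsvector normalizes the document text.
SELECT to_tsquery('english', 'Databases');
```

Because the PostText column was built with to_tsvector, a query term normalized the same way is the one that will reliably match.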

1 Comment

Thanks for the tip, I'll try that out next time I test PostgreSQL. I've been using MySQL pretty much exclusively for a few years.
