Postgres Text Search with Additional Words/Tokens

Question

I have a table with English sentences. Given a sentence which may contain an additional word, or a distorted word, can I find the closest sentence in the table using Postgres' Text-Search capabilities?

to_tsvector('a b c') @@ plainto_tsquery('a b') returns true

to_tsvector('a b') @@ plainto_tsquery('a b c') returns false

I would like scenario 2 to return true as well.

Notes:

The sentences may be dozens of words long, so I'm looking for an efficient solution.
Other text search engines such as Elastic/Solr will successfully return the closest result.

More information regarding the performance of the trigram index:

EXPLAIN (ANALYSE, BUFFERS) SELECT similarity(title, 'electode paste composition') as sml, title FROM table WHERE title % 'electode paste composition' ORDER BY sml DESC;

returns:

Gather Merge  (cost=1880112.22..1902381.94 rows=190870 width=93) (actual time=36355.303..36356.143 rows=5 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  Buffers: shared hit=407649
  ->  Sort  (cost=1879112.20..1879350.78 rows=95435 width=93) (actual time=36344.180..36344.180 rows=2 loops=3)
        Sort Key: (similarity(title, 'electode paste composition'::text)) DESC
        Sort Method: quicksort  Memory: 25kB
        Worker 0:  Sort Method: quicksort  Memory: 25kB
        Worker 1:  Sort Method: quicksort  Memory: 25kB
        Buffers: shared hit=407649
        ->  Parallel Bitmap Heap Scan on table  (cost=2759.10..1866325.66 rows=95435 width=93) (actual time=35940.284..36344.141 rows=2 loops=3)
              Recheck Cond: (title % 'electode paste composition'::text)
              Rows Removed by Index Recheck: 14904
              Heap Blocks: exact=16199
              Buffers: shared hit=407635
              ->  Bitmap Index Scan on title_trgm  (cost=0.00..2701.84 rows=229045 width=0) (actual time=35543.907..35543.907 rows=44716 loops=1)
                    Index Cond: (title % 'electode paste composition'::text)
                    Buffers: shared hit=362988
Planning Time: 0.084 ms
Execution Time: 36356.187 ms

The same query using a tsquery takes less than 2.5 seconds.

If you want to refer to how other products do it, you should include enough information that we can actually figure out what you mean.
– jjanes, Commented Feb 11, 2020 at 22:40

Laurenz Albe · Accepted Answer · 2020-02-11 16:38:51Z

Assuming that your words are longer than one character, I'd recommend trigram indexes:

CREATE EXTENSION pg_trgm;
CREATE INDEX ON atable USING gin (textcol gin_trgm_ops);
SELECT * FROM atable WHERE textcol % 'search string';

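% is the similarity operator: it returns true when the trigram similarity of its two arguments exceeds pg_trgm.similarity_threshold (0.3 by default).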
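To see how the operator behaves on real data, it can help to look at the raw score and to experiment with the threshold. The following is only a sketch: `atable` and `textcol` are the placeholder names from the statements above, and the value 0.5 and the `LIMIT 10` are arbitrary choices for illustration, not recommendations.

```sql
-- Assumes the pg_trgm extension is installed (see CREATE EXTENSION above).

-- similarity() returns a score between 0 and 1; higher means more shared trigrams.
SELECT similarity('electrode paste composition', 'electode paste composition') AS sml;

-- The % operator compares that score against this setting.
SHOW pg_trgm.similarity_threshold;

-- Raising the threshold for the current session makes % stricter,
-- so the index scan returns fewer candidate rows (0.5 is an arbitrary example).
SET pg_trgm.similarity_threshold = 0.5;

-- Return only the closest matches, best first.
SELECT textcol, similarity(textcol, 'search string') AS sml
FROM atable
WHERE textcol % 'search string'
ORDER BY sml DESC
LIMIT 10;
```

A stricter threshold trades recall for speed, so whether it helps depends on how distorted the search strings are expected to be.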

4 Comments

Tomer Over a year ago

Unfortunately, trigram search performance is not good enough.

Laurenz Albe Over a year ago

Seems unlikely. Perhaps your search strings are too short or too general. As I wrote, words consisting of a single letter won't work, but then that is not the real-world use case, I'd expect. Hard to say anything more without - say - the EXPLAIN (ANALYZE, BUFFERS) output of such an execution that is "too slow".

Tomer Over a year ago

I've added the info you requested to the description, please take a look.

Laurenz Albe Over a year ago

Try to VACUUM the table to optimize the index, although I don't expect that to have a big impact. Of course it is slower than a full text search; it is doing more work. If the trigram index is not good enough, you are out of options.
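A rough sketch of what that maintenance could look like follows. It assumes the placeholder table name `atable` from the answer and the index name `title_trgm` that appears in the plan in the question, so adjust both to the real schema. One reason VACUUM can matter for a GIN index is that it flushes the index's pending list of not-yet-merged entries, which otherwise has to be scanned on every search.

```sql
-- Placeholder table name from the answer; replace with the real table.
VACUUM (VERBOSE, ANALYZE) atable;

-- Alternatively, flush only the GIN pending list of one index.
-- title_trgm is the index name shown in the question's plan.
SELECT gin_clean_pending_list('title_trgm'::regclass);
```

As the comment says, this may not change much. Comparing the EXPLAIN (ANALYZE, BUFFERS) output before and after is the way to find out.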
