How to index a postgres table by name, when the name can be in any language?

Question

I have a large postgres table of locations (shops, landmarks, etc.) which the user can search in various ways. When the user wants to do a search for the name of a place, the system currently does (assuming the search is on cafe):

lower(location_name) LIKE '%cafe%'

as part of the query. This is hugely inefficient. Prohibitively so. It is essential I make this faster. I've tried indexing the table on

gin(to_tsvector('simple', location_name))

and searching with

(to_tsvector('simple',location_name) @@ to_tsquery('simple','cafe'))

which works beautifully, and cuts down the search time by a couple of orders of magnitude.

However, the location names can be in any language, including languages like Chinese, which aren't whitespace delimited. This new system is unable to find any Chinese locations, unless I search for the exact name, whereas the old system could find matches to partial names just fine.

So, my question is: Can I get this to work for all languages at once, or am I on the wrong track?

willglynn · Accepted Answer · 2012-10-15 18:38:16Z

If you want to optimize arbitrary substring matches, one option is to use the pg_trgm module. Add an index:

CREATE INDEX table_location_name_trigrams_key ON table USING gin (location_name gin_trgm_ops);

This will break "Simple Cafe" into "sim", "imp", "mpl", etc., and add an entry to the index for each trigram in each row. The query planner can then automatically use this index for substring pattern matches, including:

SELECT * FROM table WHERE location_name ILIKE '%cafe%';

This query will look up "caf" and "afe" in the index, find the intersection, fetch those rows, then check each row against your pattern. (That last check is necessary since the intersection of "caf" and "afe" matches both "simple cafe" and "unsafe scaffolding", while "%cafe%" should only match one). The index becomes more effective as the input pattern gets longer since it can exclude more rows, but it's still not as efficient as indexing whole words, so don't expect a performance improvement over to_tsvector.
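To see the decomposition described above, pg_trgm ships a show_trgm() function, and EXPLAIN shows whether the planner picks the trigram index. This is a minimal sketch, not a definitive setup. It assumes a hypothetical table named locations (the placeholder name table used above is a reserved word) and PostgreSQL 9.1 or later, where pg_trgm gained index support for LIKE and ILIKE.

```sql
-- pg_trgm is a contrib module; it has to be installed once per database.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Hypothetical table standing in for the real one.
CREATE TABLE locations (location_name text);

CREATE INDEX locations_location_name_trgm_idx
    ON locations USING gin (location_name gin_trgm_ops);

-- List the trigrams pg_trgm extracts from a value
-- (the input is lower-cased and padded with blanks first).
SELECT show_trgm('Simple Cafe');

-- Check whether the planner uses the index for the substring search.
EXPLAIN SELECT * FROM locations WHERE location_name ILIKE '%cafe%';
```

On a tiny or empty table the planner will usually prefer a sequential scan, so the index only shows up in the plan once the table holds a realistic number of rows.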

The catch is that trigrams don't work at all for patterns that are under three characters. That may or may not be a deal-breaker for your application.

Edit: I initially added this as a comment.

I had another thought last night when I was mostly asleep. Make a cjk_chars function that takes an input string, regexp_matches the entire CJK Unicode ranges, and returns an array of any such characters or NULL if none. Add a GIN index on cjk_chars(location_name). Then query for:

WHERE CASE
    WHEN cjk_chars('query') IS NOT NULL THEN
        cjk_chars(location_name) @> cjk_chars('query')
        AND location_name LIKE '%query%'
    ELSE
        <tsvector/trigrams>
END

Ta-da, unigrams!
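One possible shape for the cjk_chars function and its index is sketched below. This is an illustration rather than the code the answer had in mind. It assumes a UTF8 database with standard_conforming_strings on, a hypothetical table named location, and it only covers the CJK Unified Ideographs block (U+4E00 to U+9FFF); other ranges (extensions, kana, hangul) would have to be added to the bracket expression.

```sql
-- Hypothetical helper: returns the distinct CJK characters in txt,
-- or NULL when there are none (array_agg over zero rows yields NULL).
CREATE OR REPLACE FUNCTION cjk_chars(txt text)
RETURNS text[]
LANGUAGE sql
IMMUTABLE
AS $$
    SELECT array_agg(DISTINCT m[1])
    FROM   regexp_matches(txt, '[\u4E00-\u9FFF]', 'g') AS m
$$;

-- GIN index over the per-row character arrays (the "unigrams").
CREATE INDEX location_cjk_chars_idx
    ON location USING gin (cjk_chars(location_name));

-- CJK branch of the search: the array containment test can use the index,
-- and the LIKE recheck enforces character order and adjacency.
SELECT *
FROM   location
WHERE  cjk_chars(location_name) @> cjk_chars('咖啡')
AND    location_name LIKE '%咖啡%';
```

The function has to be declared IMMUTABLE to be usable in an index expression. Choosing between this branch and the tsvector or trigram branch can also be done in application code by checking whether cjk_chars of the search term is NULL.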

Hmm... with the amount of information stored in a single Chinese character, I don't imagine it'd be unlikely for a search to contain only two. But, as I suspect the real answer will be a mix of different techniques, I'll keep this in mind.
Initial tests show that cjk-unigrams plus ts_vectors is a really good way of doing this. I very much like it. I don't know if there are any other edge-cases or character sets I'll need to cover, but now I probably know enough to handle them on my own. If you add it to your answer, or, better, create a new answer for it, I'd be happy to accept it. :)

Craig Ringer · Accepted Answer · 2012-10-11 04:12:14Z

For full text search in a multi-language environment you need to store the language each datum is in alongside the text itself. You can then use the language-specific flavours of the tsearch functions to get proper stemming, etc.

E.g., given:

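CREATE TABLE location (
    location_name          text,
    location_name_language regconfig  -- text search configuration name, e.g. 'english'; to_tsvector() does not accept a plain text column here
);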

... plus any appropriate constraints, you might write:

CREATE INDEX location_name_ts_idx ON location USING gin (to_tsvector(location_name_language, location_name));

and for search:

SELECT * FROM location WHERE to_tsvector(location_name_language, location_name) @@ to_tsquery('english', 'cafe');

Cross-language searches will be problematic no matter what you do. In practice I'd use multiple matching strategies: I'd compare the search term to the tsvector of location_name in the simple configuration and the stored language of the text. I'd possibly also use a trigram based approach like willglynn suggests, then I'd unify the results for display, looking for common terms.
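The first two strategies mentioned above could be combined in a single query along these lines. This is only a sketch: it assumes location_name_language is of type regconfig (a plain text column cannot be passed as the configuration argument), and that the search term should be parsed once with the simple configuration and once with each row's stored language.

```sql
-- Match either on unstemmed tokens ('simple') or on tokens stemmed
-- according to the language stored with each row.
SELECT location_name
FROM   location
WHERE  to_tsvector('simple', location_name) @@ to_tsquery('simple', 'cafe')
   OR  to_tsvector(location_name_language, location_name)
       @@ to_tsquery(location_name_language, 'cafe');
```

Each branch needs its own matching expression index (one on the 'simple' tsvector, one on the per-language tsvector) for the planner to avoid a sequential scan.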

It's possible you may find Pg's full text search too limited, in which case you might want to check out something like Lucene / Solr.

See:
* controlling full text search
* tsearch dictionaries

Erwin Brandstetter · Accepted Answer · 2015-06-12 00:47:45Z

Similar to what @willglynn already posted, I would consider the pg_trgm module. But preferably with a GiST index:

CREATE INDEX tbl_location_name_trgm_idx ON tbl USING gist (location_name COLLATE "C" gist_trgm_ops);

The gist_trgm_ops operator class generally ignores case, and ILIKE is just as fast as LIKE. Quoting the source code:

Caution: IGNORECASE macro means that trigrams are case-insensitive.

I use COLLATE "C" here, which is effectively no special collation (byte order instead), because you obviously have a mix of various collations in your column. Collation is relevant for ordering or ranges; for a basic similarity search, you can do without it. I would consider setting COLLATE "C" for your column to begin with.

This index would lend support to your first, simple form of the query:

SELECT * FROM tbl WHERE location_name ILIKE '%cafe%';

Very fast.
Retains capability to find partial matches.
Adds capability for fuzzy search.
Check out the % operator and set_limit().
GiST index is also very fast for queries with LIMIT n to select n "best" matches. You could add to the above query:

ORDER BY location_name <-> 'cafe' LIMIT 20

Read more about the "distance" operator <-> in the manual here.
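As a rough illustration of the % operator and set_limit() mentioned above, the following sketch lowers the similarity threshold and lists fuzzy matches with their scores. The threshold value 0.2 is an arbitrary example, and tbl is the placeholder table name used in this answer. Newer PostgreSQL releases document set_limit() as deprecated in favour of the pg_trgm.similarity_threshold setting.

```sql
-- Lower the similarity threshold used by the % operator for this session
-- (the pg_trgm default is 0.3).
SELECT set_limit(0.2);

-- Fuzzy matches for 'cafe', most similar first.
SELECT location_name,
       similarity(location_name, 'cafe') AS sim
FROM   tbl
WHERE  location_name % 'cafe'
ORDER  BY sim DESC
LIMIT  20;
```

A lower threshold returns more (and noisier) candidates; a higher one makes the % filter stricter.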

Or even:

SELECT *
FROM   tbl
WHERE  location_name ILIKE '%cafe%'        -- exact partial match
OR     location_name % 'cafe'              -- fuzzy match
ORDER  BY
       (location_name ILIKE 'cafe%') DESC  -- exact beginning first
     , (location_name ILIKE '%cafe%') DESC -- exact partial match next
     , (location_name <-> 'cafe')          -- then "best" matches
     , location_name                       -- break remaining ties (collation!)
LIMIT  20;

I use something like that in several applications for (to me) satisfactory results. Of course, it gets a bit slower with multiple features applied in combination. Find your sweet spot ...

You could go one step further and create a separate partial index for every language and use a matching collation for each:

CREATE INDEX location_name_trgm_idx ON tbl USING gist (location_name COLLATE "de_DE" gist_trgm_ops) WHERE location_name_language = 'German'; -- repeat for each language

That would only be useful if you want results for one specific language per query, in which case it would be very fast.
