Revisions to Optimize a trigram search with custom sort order

links, format, markup, fiddle

edited Aug 1, 2023 at 17:38

186.5k
28
465
639

Trigram similarity and distance operators Trigram similarity and distance operators put more weight on leading matches (prefix) automatically and to a lesser extent on trailing matches (suffix), due to the way trigrams are extracted from strings. The manual:The manual:

CREATE TABLE search (id int PRIMARY KEY, search_on text, comment text); INSERT INTO search (id, search_on, comment) VALUES ( 1, 'abc123456789', 'leading') , ( 2, '123abc456789', 'nested') , ( 3, '123456789abc', 'trailing') , ( 4, 'abc123abc456', 'leading, nested 1x') , ( 5, '123abc456abc', 'trailing,nested 1x') , ( 6, 'abcabcabc123', 'leading, nested 2x') , ( 7, '123abcabcabc', 'trailing nested 2x') , ( 8, '1abcabcabc23', 'nested 3x') , (10, 'abc12' , 'leading short') , (11, '12abc' , 'trailing short') , (12, '1abc2' , 'nested short'); CREATE INDEX index_search_search_on ON search USING gist (search_on gist_trgm_ops);

CREATE TABLE search (id int PRIMARY KEY, search_on text, comment text); INSERT INTO search (id, search_on, comment) VALUES ( 1, 'abc123456789', 'leading') , ( 2, '123abc456789', 'nested') , ( 3, '123456789abc', 'trailing') , ( 4, 'abc123abc456', 'leading, nested 1x') , ( 5, '123abc456abc', 'trailing,nested 1x') , ( 6, 'abcabcabc123', 'leading, nested 2x') , ( 7, '123abcabcabc', 'trailing nested 2x') , ( 8, '1abcabcabc23', 'nested 3x') , (10, 'abc12' , 'leading short') , (11, '12abc' , 'trailing short') , (12, '1abc2' , 'nested short') ; CREATE INDEX index_search_search_on ON search USING gist (search_on gist_trgm_ops);

Any downsides of using data type “text” for storing strings?Any downsides of using data type “text” for storing strings?

SELECT *, search_on <-> 'abc' AS distance FROM search WHERE search_on % 'abc' ORDER BY search_on <-> 'abc';

SELECT *, search_on <-> 'abc' AS distance FROM search WHERE search_on % 'abc' ORDER BY search_on <-> 'abc';

 id | search_on | comment | distance -: | :----------- | :----------------- | :------- 10 | abc12 | leading short | 0.571429 6 | abcabcabc123 | leading, nested 2x | 0.7 11 | 12abc | trailing short | 0.75 4 | abc123abc456 | leading, nested 1x | 0.769231 1 | abc123456789 | leading | 0.785714 7 | 123abcabcabc | trailing nested 2x | 0.818182 5 | 123abc456abc | trailing,nested 1x | 0.857143 3 | 123456789abc | trailing | 0.866667 12 | 1abc2 | nested short | 0.888889 8 | 1abcabcabc23 | nested 3x | 0.916667 2 | 123abc456789 | nested | 0.9375

id | search_on | comment | distance -: | :----------- | :----------------- | :------- 10 | abc12 | leading short | 0.571429 6 | abcabcabc123 | leading, nested 2x | 0.7 11 | 12abc | trailing short | 0.75 4 | abc123abc456 | leading, nested 1x | 0.769231 1 | abc123456789 | leading | 0.785714 7 | 123abcabcabc | trailing nested 2x | 0.818182 5 | 123abc456abc | trailing,nested 1x | 0.857143 3 | 123456789abc | trailing | 0.866667 12 | 1abc2 | nested short | 0.888889 8 | 1abcabcabc23 | nested 3x | 0.916667 2 | 123abc456789 | nested | 0.9375

SELECT *, search_on <-> 'abc' AS distance, search_on ILIKE 'abc%' AS prefix_match FROM search WHERE search_on % 'abc' ORDER BY search_on NOT ILIKE 'abc%' -- prefix matches first , search_on <-> 'abc'; -- then sort by distance

SELECT *, search_on <-> 'abc' AS distance, search_on ILIKE 'abc%' AS prefix_match FROM search WHERE search_on % 'abc' ORDER BY search_on NOT ILIKE 'abc%' -- prefix matches first , search_on <-> 'abc'; -- then sort by distance

 id | search_on | comment | distance | prefix -: | :----------- | :----------------- | :------- | :----- 10 | abc12 | leading short | 0.571429 | t 6 | abcabcabc123 | leading, nested 2x | 0.7 | t 4 | abc123abc456 | leading, nested 1x | 0.769231 | t 1 | abc123456789 | leading | 0.785714 | t 11 | 12abc | trailing short | 0.75 | f 7 | 123abcabcabc | trailing nested 2x | 0.818182 | f 5 | 123abc456abc | trailing,nested 1x | 0.857143 | f 3 | 123456789abc | trailing | 0.866667 | f 12 | 1abc2 | nested short | 0.888889 | f 8 | 1abcabcabc23 | nested 3x | 0.916667 | f 2 | 123abc456789 | nested | 0.9375 | f

id | search_on | comment | distance | prefix -: | :----------- | :----------------- | :------- | :----- 10 | abc12 | leading short | 0.571429 | t 6 | abcabcabc123 | leading, nested 2x | 0.7 | t 4 | abc123abc456 | leading, nested 1x | 0.769231 | t 1 | abc123456789 | leading | 0.785714 | t 11 | 12abc | trailing short | 0.75 | f 7 | 123abcabcabc | trailing nested 2x | 0.818182 | f 5 | 123abc456abc | trailing,nested 1x | 0.857143 | f 3 | 123456789abc | trailing | 0.866667 | f 12 | 1abc2 | nested short | 0.888889 | f 8 | 1abcabcabc23 | nested 3x | 0.916667 | f 2 | 123abc456789 | nested | 0.9375 | f

PostgreSQL sort by datetime asc, null first?PostgreSQL sort by datetime asc, null first?

... ORDER BY search_on NOT ILIKE 'abc%' -- prefix matches first , search_on NOT ILIKE '%abc' -- suffix matches next , search_on <-> 'abc'; -- then sort by distance

... ORDER BY search_on NOT ILIKE 'abc%' -- prefix matches first , search_on NOT ILIKE '%abc' -- suffix matches next , search_on <-> 'abc'; -- then sort by distance

db<>fiddle herefiddle

Trigram similarity and distance operators put more weight on leading matches (prefix) automatically and to a lesser extent on trailing matches (suffix), due to the way trigrams are extracted from strings. The manual:

CREATE TABLE search (id int PRIMARY KEY, search_on text, comment text); INSERT INTO search (id, search_on, comment) VALUES ( 1, 'abc123456789', 'leading') , ( 2, '123abc456789', 'nested') , ( 3, '123456789abc', 'trailing') , ( 4, 'abc123abc456', 'leading, nested 1x') , ( 5, '123abc456abc', 'trailing,nested 1x') , ( 6, 'abcabcabc123', 'leading, nested 2x') , ( 7, '123abcabcabc', 'trailing nested 2x') , ( 8, '1abcabcabc23', 'nested 3x') , (10, 'abc12' , 'leading short') , (11, '12abc' , 'trailing short') , (12, '1abc2' , 'nested short'); CREATE INDEX index_search_search_on ON search USING gist (search_on gist_trgm_ops);

Any downsides of using data type “text” for storing strings?

SELECT *, search_on <-> 'abc' AS distance FROM search WHERE search_on % 'abc' ORDER BY search_on <-> 'abc';

 id | search_on | comment | distance -: | :----------- | :----------------- | :------- 10 | abc12 | leading short | 0.571429 6 | abcabcabc123 | leading, nested 2x | 0.7 11 | 12abc | trailing short | 0.75 4 | abc123abc456 | leading, nested 1x | 0.769231 1 | abc123456789 | leading | 0.785714 7 | 123abcabcabc | trailing nested 2x | 0.818182 5 | 123abc456abc | trailing,nested 1x | 0.857143 3 | 123456789abc | trailing | 0.866667 12 | 1abc2 | nested short | 0.888889 8 | 1abcabcabc23 | nested 3x | 0.916667 2 | 123abc456789 | nested | 0.9375

SELECT *, search_on <-> 'abc' AS distance, search_on ILIKE 'abc%' AS prefix_match FROM search WHERE search_on % 'abc' ORDER BY search_on NOT ILIKE 'abc%' -- prefix matches first , search_on <-> 'abc'; -- then sort by distance

 id | search_on | comment | distance | prefix -: | :----------- | :----------------- | :------- | :----- 10 | abc12 | leading short | 0.571429 | t 6 | abcabcabc123 | leading, nested 2x | 0.7 | t 4 | abc123abc456 | leading, nested 1x | 0.769231 | t 1 | abc123456789 | leading | 0.785714 | t 11 | 12abc | trailing short | 0.75 | f 7 | 123abcabcabc | trailing nested 2x | 0.818182 | f 5 | 123abc456abc | trailing,nested 1x | 0.857143 | f 3 | 123456789abc | trailing | 0.866667 | f 12 | 1abc2 | nested short | 0.888889 | f 8 | 1abcabcabc23 | nested 3x | 0.916667 | f 2 | 123abc456789 | nested | 0.9375 | f

PostgreSQL sort by datetime asc, null first?

... ORDER BY search_on NOT ILIKE 'abc%' -- prefix matches first , search_on NOT ILIKE '%abc' -- suffix matches next , search_on <-> 'abc'; -- then sort by distance

db<>fiddle here

Trigram similarity and distance operators put more weight on leading matches (prefix) automatically and to a lesser extent on trailing matches (suffix), due to the way trigrams are extracted from strings. The manual:

CREATE TABLE search (id int PRIMARY KEY, search_on text, comment text); INSERT INTO search (id, search_on, comment) VALUES ( 1, 'abc123456789', 'leading') , ( 2, '123abc456789', 'nested') , ( 3, '123456789abc', 'trailing') , ( 4, 'abc123abc456', 'leading, nested 1x') , ( 5, '123abc456abc', 'trailing,nested 1x') , ( 6, 'abcabcabc123', 'leading, nested 2x') , ( 7, '123abcabcabc', 'trailing nested 2x') , ( 8, '1abcabcabc23', 'nested 3x') , (10, 'abc12' , 'leading short') , (11, '12abc' , 'trailing short') , (12, '1abc2' , 'nested short') ; CREATE INDEX index_search_search_on ON search USING gist (search_on gist_trgm_ops);

Any downsides of using data type “text” for storing strings?

SELECT *, search_on <-> 'abc' AS distance FROM search WHERE search_on % 'abc' ORDER BY search_on <-> 'abc';

id | search_on | comment | distance -: | :----------- | :----------------- | :------- 10 | abc12 | leading short | 0.571429 6 | abcabcabc123 | leading, nested 2x | 0.7 11 | 12abc | trailing short | 0.75 4 | abc123abc456 | leading, nested 1x | 0.769231 1 | abc123456789 | leading | 0.785714 7 | 123abcabcabc | trailing nested 2x | 0.818182 5 | 123abc456abc | trailing,nested 1x | 0.857143 3 | 123456789abc | trailing | 0.866667 12 | 1abc2 | nested short | 0.888889 8 | 1abcabcabc23 | nested 3x | 0.916667 2 | 123abc456789 | nested | 0.9375

SELECT *, search_on <-> 'abc' AS distance, search_on ILIKE 'abc%' AS prefix_match FROM search WHERE search_on % 'abc' ORDER BY search_on NOT ILIKE 'abc%' -- prefix matches first , search_on <-> 'abc'; -- then sort by distance

id | search_on | comment | distance | prefix -: | :----------- | :----------------- | :------- | :----- 10 | abc12 | leading short | 0.571429 | t 6 | abcabcabc123 | leading, nested 2x | 0.7 | t 4 | abc123abc456 | leading, nested 1x | 0.769231 | t 1 | abc123456789 | leading | 0.785714 | t 11 | 12abc | trailing short | 0.75 | f 7 | 123abcabcabc | trailing nested 2x | 0.818182 | f 5 | 123abc456abc | trailing,nested 1x | 0.857143 | f 3 | 123456789abc | trailing | 0.866667 | f 12 | 1abc2 | nested short | 0.888889 | f 8 | 1abcabcabc23 | nested 3x | 0.916667 | f 2 | 123abc456789 | nested | 0.9375 | f

PostgreSQL sort by datetime asc, null first?

... ORDER BY search_on NOT ILIKE 'abc%' -- prefix matches first , search_on NOT ILIKE '%abc' -- suffix matches next , search_on <-> 'abc'; -- then sort by distance

fiddle

Commonmark migration

Source Link

edited Jun 15, 2020 at 9:05

Community Bot

1

Test setup

###Test setup BasedBased on your table definition and example:

Queries

###Queries WeWe need a low threshold for the demo:

more instructive demo, add more

Source Link

edited May 14, 2018 at 16:54

Erwin Brandstetter

186.5k
28
465
639

Trigram similarity and distance operators put more weight on leading matches (prefix) automatically and to a lesser extent on trailingtrailing matches (suffix), due to the way trigrams are extracted from strings. The manual:

There is more, including examples. Read the manual. And consider the demodemos in my fiddle at the bottom of this answer.

 id | search_on | comment | distance | prefix_matchprefix -: | :----------- | :----------------- | :------- | :----------- 10 | abc12 | leading short | 0.571429 | t 6 | abcabcabc123 | leading, nested 2x | 0.7 | t 4 | abc123abc456 | leading, nested 1x | 0.769231 | t 1 | abc123456789 | leading | 0.785714 | t 11 | 12abc | trailing short | 0.75 | f 7 | 123abcabcabc | trailing nested 2x | 0.818182 | f 5 | 123abc456abc | trailing,nested 1x | 0.857143 | f 3 | 123456789abc | trailing | 0.866667 | f 12 | 1abc2 | nested short | 0.888889 | f 8 | 1abcabcabc23 | nested 3x | 0.916667 | f 2 | 123abc456789 | nested | 0.9375 | f

db<>fiddle here

You could sort trailing matches in a similar fashion or combine both:

db<>fiddle here

Trigram similarity and distance operators put more weight on leading matches automatically and to a lesser extent on trailing matches, due to the way trigrams are extracted from strings. The manual:

There is more, including examples. Read the manual. And consider the demo in my fiddle at the bottom.

 id | search_on | comment | distance | prefix_match -: | :----------- | :----------------- | :------- | :----------- 10 | abc12 | leading short | 0.571429 | t 6 | abcabcabc123 | leading, nested 2x | 0.7 | t 4 | abc123abc456 | leading, nested 1x | 0.769231 | t 1 | abc123456789 | leading | 0.785714 | t 11 | 12abc | trailing short | 0.75 | f 7 | 123abcabcabc | trailing nested 2x | 0.818182 | f 5 | 123abc456abc | trailing,nested 1x | 0.857143 | f 3 | 123456789abc | trailing | 0.866667 | f 12 | 1abc2 | nested short | 0.888889 | f 8 | 1abcabcabc23 | nested 3x | 0.916667 | f 2 | 123abc456789 | nested | 0.9375 | f

db<>fiddle here

You could sort trailing matches in a similar fashion or combine both:

Trigram similarity and distance operators put more weight on leading matches (prefix) automatically and to a lesser extent on trailing matches (suffix), due to the way trigrams are extracted from strings. The manual:

There is more, including examples. Read the manual. And consider the demos in my fiddle at the bottom of this answer.

 id | search_on | comment | distance | prefix -: | :----------- | :----------------- | :------- | :----- 10 | abc12 | leading short | 0.571429 | t 6 | abcabcabc123 | leading, nested 2x | 0.7 | t 4 | abc123abc456 | leading, nested 1x | 0.769231 | t 1 | abc123456789 | leading | 0.785714 | t 11 | 12abc | trailing short | 0.75 | f 7 | 123abcabcabc | trailing nested 2x | 0.818182 | f 5 | 123abc456abc | trailing,nested 1x | 0.857143 | f 3 | 123456789abc | trailing | 0.866667 | f 12 | 1abc2 | nested short | 0.888889 | f 8 | 1abcabcabc23 | nested 3x | 0.916667 | f 2 | 123abc456789 | nested | 0.9375 | f

You could sort trailing matches in a similar fashion or combine both:

db<>fiddle here