A better way to write this query?

Question

I have the following schema for my database (Postgresql 10):

CREATE TABLE "PulledTexts" ( "Id" serial PRIMARY KEY, "BaseText" TEXT, "CleanText" TEXT ); CREATE TABLE "UniqueWords" ( "Id" serial PRIMARY KEY, "WordText" TEXT ); CREATE TABLE "WordTexts" ( "Id" serial PRIMARY KEY, "TextIdId" INTEGER REFERENCES "PulledTexts", "WordIdId" INTEGER REFERENCES "UniqueWords" ); CREATE INDEX "IX_WordTexts_TextIdId" ON "WordTexts" ("TextIdId"); CREATE INDEX "IX_WordTexts_WordIdId" ON "WordTexts" ("WordIdId");

Some sample data:

INSERT INTO public."PulledTexts" ("Id", "BaseText", "CleanText") VALUES (1, 'automate business audit', null), (2, 'audit trial', null), (3, 'trial', null), (4, 'audit', null), (5, 'fresh report', null), (6, 'fresh audit', null), (7, 'automate this script', null), (8, 'im trying here', null), (9, 'automate this business', null), (10, 'lateral', null); INSERT INTO public."UniqueWords" ("Id", "WordText") VALUES (1, 'trial'), (2, 'audit'), (3, 'creation'), (4, 'business'), (5, 'automate'); INSERT INTO public."WordTexts" ("Id", "TextIdId", "WordIdId") VALUES (1, 1, 2), (2, 1, 4), (3, 1, 5), (4, 2, 1), (5, 3, 1), (6, 4, 2), (7, 6, 2), (8, 7, 5), (9, 9, 4), (10, 9, 5), (11, 2, 2);

The database itself is created through Entity framework migration for now.

I would like to know if there's a better, specifically more performance efficient way of writing this query because the WordTexts table will contain hundreds of thousands of records and eventually millions. I am also open to going a NoSql route if that would be more efficient for these kinds of queries.

SELECT * FROM "PulledTexts" WHERE "Id" IN ( SELECT "TextIdId" FROM "WordTexts" AS "wordTexts" LEFT JOIN "UniqueWords" AS "wordTexts.WordId" ON "wordTexts"."WordIdId" = "wordTexts.WordId"."Id" WHERE "wordTexts.WordId"."WordText" = 'automate' OR "TextIdId" IN ( SELECT "TextIdId" and1 from "WordTexts" AS "wordTexts" LEFT JOIN "UniqueWords" AS "wordTexts.WordId" ON "wordTexts"."WordIdId" = "wordTexts.WordId"."Id" where "wordTexts.WordId"."WordText" = 'audit' INTERSECT SELECT "TextIdId" and2 from "WordTexts" AS "wordTexts" LEFT JOIN "UniqueWords" AS "wordTexts.WordId" ON "wordTexts"."WordIdId" = "wordTexts.WordId"."Id" WHERE "wordTexts.WordId"."WordText" = 'trial' ) );

At a high level, this query is supposed to return PulledTexts whose IDs match those returned by the subqueries. The subqueries are simply meant to return the list of IDs of PulledTexts that contain lets say ('audit' AND 'trial') OR 'automate' from the UniqueWords table. This is exactly what the example query I gave does. The WordTexts table is a simple mapping of UniqueWord to PulledText.

See if that update helps you out a bit more towards getting what you want @H.Rahimy — Evan Carroll
– Evan Carroll, Commented Dec 24, 2017 at 21:15

Erwin Brandstetter · Accepted Answer · 2019-09-14 22:53:35Z

While your query is valid, there is a lot I would differently.

Don't use CaMeL-case names in Postgres if it can be avoided. Your unnamed entity framework may force this nonsense upon you, but I don't want to deal with the double-quote mess, so I tested with your schema after removing all double-quotes - effectively making all identifiers lower-cased.
- Are PostgreSQL column names case-sensitive?
Don't use illegible or illegal column and table aliases (like "wordTexts.WordId"). That's a matter of taste and style (and sanity), but you also omitted the keyword AS where you should not and kept it where you could omit it.
- Date column arithmetic in PostgreSQL query
I also formatted some more to make it easier for me to grok the query. That last part is totally optional. But use some consistent formatting style.

Arriving at this:

SELECT * FROM PulledTexts WHERE Id IN ( SELECT w.TextIdId FROM WordTexts w -- AS can be omitted for table alias LEFT JOIN UniqueWords u ON w.WordIdId = u.Id -- LEFT JOIN might be necessary here WHERE u.WordText = 'automate' OR w.TextIdId IN ( SELECT w.TextIdId -- AS and1 -- column alias only documentation here, not visible FROM WordTexts w JOIN UniqueWords u ON w.WordIdId = u.Id -- LEFT JOIN misleading here WHERE u.WordText = 'audit' INTERSECT SELECT w.TextIdId -- AS and2 -- but don't omit AS for column alias FROM WordTexts w JOIN UniqueWords u ON w.WordIdId = u.Id WHERE u.WordText = 'trial' ) );

Which can be simplified to:

SELECT * FROM ( SELECT w.TextIdId AS Id FROM WordTexts w JOIN UniqueWords u ON w.WordIdId = u.Id -- now we don't need LEFT any more WHERE u.WordText = 'automate' UNION SELECT w.TextIdId FROM WordTexts w JOIN UniqueWords u ON w.WordIdId = u.Id WHERE u.WordText = 'audit' INTERSECT SELECT w.TextIdId FROM WordTexts w JOIN UniqueWords u ON w.WordIdId = u.Id WHERE u.WordText = 'trial' ) w JOIN PulledTexts p USING (Id)

We do not need additional parentheses because, according to the manual:

INTERSECT binds more tightly than UNION. That is, A UNION B INTERSECT C will be read as A UNION (B INTERSECT C).

But this will be faster when replacing multiple intersected subqueries:

SELECT * FROM ( SELECT w.TextIdId AS Id FROM WordTexts w JOIN UniqueWords u ON w.WordIdId = u.Id WHERE u.WordText = 'automate' UNION SELECT TextIdId FROM WordTexts w1 JOIN WordTexts w2 USING (TextIdId) WHERE w1.WordIdId = (SELECT Id FROM UniqueWords WHERE WordText = 'audit') AND w2.WordIdId = (SELECT Id FROM UniqueWords WHERE WordText = 'trial') ) w JOIN PulledTexts p USING (Id)

The INTERSECT part can be cast as relational division problem. Explanation in this related answer from just yesterday:

Counting with multiple joins

db<>fiddle here

Most important for performance is to have the right indexes. You probably should have a UNIQUE constraint on (WordIdId, TextIdId) in table WordTexts, which implements the currently missing index on those two columns in this order.

Does creating a unique constraint on a Postgres column remove the need to index it?

Evan Carroll · Accepted Answer · 2017-12-24 21:14:54Z

Full Text Search

What you're doing is creating a normalized relational Full Text Search. That seems problematic and wasteful. PostgreSQL (and most other professional databases) have the functionality to handle all of this..

SELECT txt, to_tsvector(txt) FROM ( VALUES ( 'automate business audit' ), ( 'audit trial' ), ( 'trial' ), ( 'audit' ), ( 'fresh report' ), ( 'fresh audit' ), ( 'automate this script' ), ( 'im trying here' ), ( 'automate this business' ), ( 'lateral' ) ) AS t(txt); txt | to_tsvector -------------------------+------------------------------ automate business audit | 'audit':3 'autom':1 'busi':2 audit trial | 'audit':1 'trial':2 trial | 'trial':1 audit | 'audit':1 fresh report | 'fresh':1 'report':2 fresh audit | 'audit':2 'fresh':1 automate this script | 'autom':1 'script':3 im trying here | 'im':1 'tri':2 automate this business | 'autom':1 'busi':3 lateral | 'later':1

You can see here that not all of these stems make sense. That's because we're using the default algorithmic stemmer Snowball. The important thing here is you can search these terms effectively because the queries undergo the same stemming: both queries and input get reduced to lexemes. If you need something more accurate that doesn't reduce to later or autom then Hunspell can take you the full distance: it'll be massively faster than your method, but slower then the Snowball method.

Using Full Text Search can drastically simplify your schema too. If you don't actually need the internal mechanisms, you can just

CREATE TABLE pulledtext ( id int GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY, basetext text ); CREATE INDEX ON pulledtext ( to_tsvector('english', basetext) ); ## Index search like this, SELECT * FROM pulledtext WHERE to_tsvector('english', basetext) @@ to_tsquery('english', search_term');

There is also a to_tsquery that supports conditionals. It takes the exact form you used except & and | instead of AND and OR . ;)

SELECT to_tsquery('english', $$( 'audit' & 'trial' ) | 'automate'$$ ); to_tsquery ----------------------------- 'audit' & 'trial' | 'autom' (1 row)

Converting that into a query -- it'll also work on the index.

SELECT * FROM pulledtext WHERE to_tsvector('english', basetext) @@ to_tsquery( 'english', $$('audit' & 'trial') | 'automate'$$ ); id | basetext ----+------------------------- 1 | automate business audit 2 | audit trial 7 | automate this script 9 | automate this business (4 rows)

On naming, and not using double-quotes, I agree with all the advice given Erwin's answer.

Stack Exchange Network

A better way to write this query?

2 Answers 2

Full Text Search

Linked

Hot Network Questions

2 Answers 2

Linked

Related