I have the following schema for my database (Postgresql 10):
CREATE TABLE "PulledTexts" ( "Id" serial PRIMARY KEY, "BaseText" TEXT, "CleanText" TEXT ); CREATE TABLE "UniqueWords" ( "Id" serial PRIMARY KEY, "WordText" TEXT ); CREATE TABLE "WordTexts" ( "Id" serial PRIMARY KEY, "TextIdId" INTEGER REFERENCES "PulledTexts", "WordIdId" INTEGER REFERENCES "UniqueWords" ); CREATE INDEX "IX_WordTexts_TextIdId" ON "WordTexts" ("TextIdId"); CREATE INDEX "IX_WordTexts_WordIdId" ON "WordTexts" ("WordIdId"); Some sample data:
INSERT INTO public."PulledTexts" ("Id", "BaseText", "CleanText") VALUES (1, 'automate business audit', null), (2, 'audit trial', null), (3, 'trial', null), (4, 'audit', null), (5, 'fresh report', null), (6, 'fresh audit', null), (7, 'automate this script', null), (8, 'im trying here', null), (9, 'automate this business', null), (10, 'lateral', null); INSERT INTO public."UniqueWords" ("Id", "WordText") VALUES (1, 'trial'), (2, 'audit'), (3, 'creation'), (4, 'business'), (5, 'automate'); INSERT INTO public."WordTexts" ("Id", "TextIdId", "WordIdId") VALUES (1, 1, 2), (2, 1, 4), (3, 1, 5), (4, 2, 1), (5, 3, 1), (6, 4, 2), (7, 6, 2), (8, 7, 5), (9, 9, 4), (10, 9, 5), (11, 2, 2); The database itself is created through Entity framework migration for now.
I would like to know if there's a better, specifically more performance efficient way of writing this query because the WordTexts table will contain hundreds of thousands of records and eventually millions. I am also open to going a NoSql route if that would be more efficient for these kinds of queries.
SELECT * FROM "PulledTexts" WHERE "Id" IN ( SELECT "TextIdId" FROM "WordTexts" AS "wordTexts" LEFT JOIN "UniqueWords" AS "wordTexts.WordId" ON "wordTexts"."WordIdId" = "wordTexts.WordId"."Id" WHERE "wordTexts.WordId"."WordText" = 'automate' OR "TextIdId" IN ( SELECT "TextIdId" and1 from "WordTexts" AS "wordTexts" LEFT JOIN "UniqueWords" AS "wordTexts.WordId" ON "wordTexts"."WordIdId" = "wordTexts.WordId"."Id" where "wordTexts.WordId"."WordText" = 'audit' INTERSECT SELECT "TextIdId" and2 from "WordTexts" AS "wordTexts" LEFT JOIN "UniqueWords" AS "wordTexts.WordId" ON "wordTexts"."WordIdId" = "wordTexts.WordId"."Id" WHERE "wordTexts.WordId"."WordText" = 'trial' ) ); At a high level, this query is supposed to return PulledTexts whose IDs match those returned by the subqueries. The subqueries are simply meant to return the list of IDs of PulledTexts that contain lets say ('audit' AND 'trial') OR 'automate' from the UniqueWords table. This is exactly what the example query I gave does. The WordTexts table is a simple mapping of UniqueWord to PulledText.