I'm trying PostgreSQL 8.4.14 for storing triples, pieces of data of the form (String, String, String).
For speed, I'm not repeatedly storing strings but rather using two tables:
- main table
triples(subject BigInt, predicate BigInt, object BigInt) - lookup table
entities(entityId BigInt, name Varying(40000))
I've added indexes and foreign keys on triples:
- "nsubject_idx" hash (subject)
- "npredicate_idx" btree (predicate)
- "nobject_idx" hash (object)
- "triples_subject_fkey" FOREIGN KEY (subject) REFERENCES entities(entityid)
- "triples_predicate_fkey" FOREIGN KEY (predicate) REFERENCES entities(entityid)
- "triples_object_fkey" FOREIGN KEY (object) REFERENCES entities(entityid)
and also indexes on entities:
- "entities_pkey" PRIMARY KEY, btree (entityid) CLUSTER
- "name_idx" hash (name)
Now it would be reasonable to assume that lookups are fast. They aren't.
With 151M elements in triples and 44M in entities, the following query is immensely slow:
SELECT entityId FROM entities WHERE entityId in (SELECT object FROM triples LIMIT 10000); It's only 10.000 lookups, so I'd expect this to complete at high speed. Query plan:
Nested Loop (cost=288.69..3856.26 rows=43806140 width=8) (actual time=25.226..40110.699 rows=6959 loops=1) -> HashAggregate (cost=288.69..290.69 rows=200 width=8) (actual time=19.445..24.087 rows=6959 loops=1) -> Limit (cost=0.00..163.69 rows=10000 width=8) (actual time=0.013..15.792 rows=10000 loops=1) -> Seq Scan on triples (cost=0.00..2474009.68 rows=151135968 width=8) (actual time=0.012..14.101 rows=10000 loops=1) -> Index Scan using entities_pkey on entities (cost=0.00..17.82 rows=1 width=8) (actual time=5.756..5.759 rows=1 loops=6959) Index Cond: (entities.entityid = triples.object) Total runtime: 40112.383 ms What would be happening here?
Note that this is even a trick query: because of the foreign key constraint, it is actually equivalent to SELECT object FROM triples LIMIT 10000. For my use case, I'd need the actual lookup.
explain (analyze,buffers) select...to get information on how much is actually read from disk versus buffers. I would runvmstat 1during the query to also see at which rate data is pulled from disk during these 40s and how much I/O-wait occurs.buffers. Here is the output ofvmstat 1: pastebin.com/dLmuJmgY