1

I'm trying PostgreSQL 8.4.14 for storing triples, pieces of data of the form (String, String, String).
For speed, I'm not repeatedly storing strings but rather using two tables:

  • main table triples (subject BigInt, predicate BigInt, object BigInt)
  • lookup table entities (entityId BigInt, name Varying(40000))

I've added indexes and foreign keys on triples:

  • "nsubject_idx" hash (subject)
  • "npredicate_idx" btree (predicate)
  • "nobject_idx" hash (object)
  • "triples_subject_fkey" FOREIGN KEY (subject) REFERENCES entities(entityid)
  • "triples_predicate_fkey" FOREIGN KEY (predicate) REFERENCES entities(entityid)
  • "triples_object_fkey" FOREIGN KEY (object) REFERENCES entities(entityid)

and also indexes on entities:

  • "entities_pkey" PRIMARY KEY, btree (entityid) CLUSTER
  • "name_idx" hash (name)

Now it would be reasonable to assume that lookups are fast. They aren't.
With 151M elements in triples and 44M in entities, the following query is immensely slow:

SELECT entityId FROM entities WHERE entityId in (SELECT object FROM triples LIMIT 10000); 

It's only 10.000 lookups, so I'd expect this to complete at high speed. Query plan:

 Nested Loop (cost=288.69..3856.26 rows=43806140 width=8) (actual time=25.226..40110.699 rows=6959 loops=1) -> HashAggregate (cost=288.69..290.69 rows=200 width=8) (actual time=19.445..24.087 rows=6959 loops=1) -> Limit (cost=0.00..163.69 rows=10000 width=8) (actual time=0.013..15.792 rows=10000 loops=1) -> Seq Scan on triples (cost=0.00..2474009.68 rows=151135968 width=8) (actual time=0.012..14.101 rows=10000 loops=1) -> Index Scan using entities_pkey on entities (cost=0.00..17.82 rows=1 width=8) (actual time=5.756..5.759 rows=1 loops=6959) Index Cond: (entities.entityid = triples.object) Total runtime: 40112.383 ms 

What would be happening here?

Note that this is even a trick query: because of the foreign key constraint, it is actually equivalent to SELECT object FROM triples LIMIT 10000. For my use case, I'd need the actual lookup.

4
  • 2
    What do you see if you create a btree index on triples.object? Also, which PostgreSQL version do you use? Commented Oct 17, 2013 at 13:25
  • I added the btree index, removed the hash index, analyzed again but no noticeable difference. Commented Oct 17, 2013 at 19:47
  • Index lookups are reported to take 5.75ms, which indeed is slow. Try explain (analyze,buffers) select... to get information on how much is actually read from disk versus buffers. I would run vmstat 1 during the query to also see at which rate data is pulled from disk during these 40s and how much I/O-wait occurs. Commented Oct 18, 2013 at 15:27
  • @DanielVérité It's PostgreSQL 8, so unfortunately I can't do buffers. Here is the output of vmstat 1: pastebin.com/dLmuJmgY Commented Oct 18, 2013 at 16:47

1 Answer 1

2

How do the following queries perform?

SELECT e.entityId FROM entities e INNER JOIN triples t ON (t.object = e.entityId) LIMIT 10000; 

or

SELECT e.entityId FROM entities e WHERE EXISTS (SELECT 1 FROM triples t WHERE t.object = e.entityId LIMIT 10000); 
3
  • 1. I was told TEXT and VARCHAR are the same (stackoverflow.com/a/4849030/476820) 2. I understand, but are those reasons related to performance? Will try BTREE now. 3. I ran ANALYZE on both tables. 4. Those queries are equally slow unfortunately (actually, the last one performs worse). Commented Oct 17, 2013 at 15:42
  • TEXT and unconstrained VARCHAR are virtually the same, that is true. You have an upper bound on the VARCHAR length though, but I doubt that has any appreciable impact on your query. Commented Oct 17, 2013 at 16:05
  • 1
    No difference with BTREE for my query. But the first query you suggested goes way faster now! Commented Oct 17, 2013 at 19:47

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.