I have imported a copy of the IP2Location LITE DB11 database, which contains 3,319,097 rows, and I am looking to optimize a numeric range query where the low and high values are stored in separate columns of the table (ip_from, ip_to).
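For context, the ip_from/ip_to values are IPv4 addresses packed into a single integer (a*2^24 + b*2^16 + c*2^8 + d, per the IP2Location format). A quick sketch of the conversion in Python; the literal 2538629520 probed in the query plans below decodes to 151.80.105.144:

```python
import socket
import struct

def ip_to_num(ip: str) -> int:
    """Pack a dotted-quad IPv4 address into the bigint form used by ip_from/ip_to."""
    return struct.unpack("!I", socket.inet_aton(ip))[0]

def num_to_ip(num: int) -> str:
    """Inverse: unpack an IP number back into dotted-quad notation."""
    return socket.inet_ntoa(struct.pack("!I", num))

print(ip_to_num("151.80.105.144"))  # 2538629520, the value probed below
print(num_to_ip(2538629520))        # 151.80.105.144
```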
Importing the database:
CREATE TABLE ip2location_db11 (
    ip_from      bigint NOT NULL,                  -- First IP address in netblock.
    ip_to        bigint NOT NULL,                  -- Last IP address in netblock.
    country_code character(2) NOT NULL,            -- Two-character country code based on ISO 3166.
    country_name character varying(64) NOT NULL,   -- Country name based on ISO 3166.
    region_name  character varying(128) NOT NULL,  -- Region or state name.
    city_name    character varying(128) NOT NULL,  -- City name.
    latitude     real NOT NULL,                    -- City latitude. Defaults to capital city latitude if city is unknown.
    longitude    real NOT NULL,                    -- City longitude. Defaults to capital city longitude if city is unknown.
    zip_code     character varying(30) NOT NULL,   -- ZIP/Postal code.
    time_zone    character varying(8) NOT NULL,    -- UTC time zone (with DST supported).
    CONSTRAINT ip2location_db11_pkey PRIMARY KEY (ip_from, ip_to)
);

\copy ip2location_db11 FROM 'IP2LOCATION-LITE-DB11.CSV' WITH CSV QUOTE AS '"';

My first naive indexing attempt was to create separate indexes on each of those columns, which resulted in a sequential scan with query times around 400 ms:
account=> CREATE INDEX ip_from_db11_idx ON ip2location_db11 (ip_from);
account=> CREATE INDEX ip_to_db11_idx ON ip2location_db11 (ip_to);
account=> EXPLAIN ANALYZE VERBOSE SELECT * FROM ip2location_db11 WHERE 2538629520 BETWEEN ip_from AND ip_to;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on public.ip2location_db11  (cost=0.00..48930.99 rows=43111 width=842) (actual time=286.714..401.805 rows=1 loops=1)
   Output: ip_from, ip_to, country_code, country_name, region_name, city_name, latitude, longitude, zip_code, time_zone
   Filter: (('2538629520'::bigint >= ip2location_db11.ip_from) AND ('2538629520'::bigint <= ip2location_db11.ip_to))
   Rows Removed by Filter: 3319096
 Planning time: 0.155 ms
 Execution time: 401.834 ms
(6 rows)

account=> \d ip2location_db11
           Table "public.ip2location_db11"
    Column    |          Type          | Modifiers
--------------+------------------------+-----------
 ip_from      | bigint                 | not null
 ip_to        | bigint                 | not null
 country_code | character(2)           | not null
 country_name | character varying(64)  | not null
 region_name  | character varying(128) | not null
 city_name    | character varying(128) | not null
 latitude     | real                   | not null
 longitude    | real                   | not null
 zip_code     | character varying(30)  | not null
 time_zone    | character varying(8)   | not null
Indexes:
    "ip2location_db11_pkey" PRIMARY KEY, btree (ip_from, ip_to)
    "ip_from_db11_idx" btree (ip_from)
    "ip_to_db11_idx" btree (ip_to)

My second attempt was to create a multi-column btree index. The planner still chose an index scan (on ip_to_db11_idx rather than the new index), with query times around 290 ms:
account=> CREATE INDEX ip_range_db11_idx ON ip2location_db11 (ip_from, ip_to);
account=> EXPLAIN ANALYZE VERBOSE SELECT * FROM ip2location_db11 WHERE 2538629520 BETWEEN ip_from AND ip_to;
                                                                 QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using ip_to_db11_idx on public.ip2location_db11  (cost=0.43..51334.91 rows=756866 width=69) (actual time=1.109..289.143 rows=1 loops=1)
   Output: ip_from, ip_to, country_code, country_name, region_name, city_name, latitude, longitude, zip_code, time_zone
   Index Cond: ('2538629520'::bigint <= ip2location_db11.ip_to)
   Filter: ('2538629520'::bigint >= ip2location_db11.ip_from)
   Rows Removed by Filter: 1160706
 Planning time: 0.324 ms
 Execution time: 289.172 ms
(7 rows)

n4l_account=> \d ip2location_db11
           Table "public.ip2location_db11"
    Column    |          Type          | Modifiers
--------------+------------------------+-----------
 ip_from      | bigint                 | not null
 ip_to        | bigint                 | not null
 country_code | character(2)           | not null
 country_name | character varying(64)  | not null
 region_name  | character varying(128) | not null
 city_name    | character varying(128) | not null
 latitude     | real                   | not null
 longitude    | real                   | not null
 zip_code     | character varying(30)  | not null
 time_zone    | character varying(8)   | not null
Indexes:
    "ip2location_db11_pkey" PRIMARY KEY, btree (ip_from, ip_to)
    "ip_from_db11_idx" btree (ip_from)
    "ip_range_db11_idx" btree (ip_from, ip_to)
    "ip_to_db11_idx" btree (ip_to)

Update: as requested in the comments, I have re-run the query. Timings for the first 15 runs after re-creating the table were: 165 ms, 65 ms, 86 ms, 83 ms, 86 ms, 64 ms, 85 ms, 811 ms, 868 ms, 845 ms, 810 ms, 781 ms, 797 ms, 890 ms, 806 ms.
account=> EXPLAIN (ANALYZE, VERBOSE, BUFFERS, TIMING) SELECT * FROM ip2location_db11 WHERE 2538629520 BETWEEN ip_from AND ip_to;
                                                              QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.ip2location_db11  (cost=28200.29..76843.12 rows=368789 width=842) (actual time=64.866..64.866 rows=1 loops=1)
   Output: ip_from, ip_to, country_code, country_name, region_name, city_name, latitude, longitude, zip_code, time_zone
   Recheck Cond: (('2538629520'::bigint >= ip2location_db11.ip_from) AND ('2538629520'::bigint <= ip2location_db11.ip_to))
   Heap Blocks: exact=1
   Buffers: shared hit=8273
   ->  Bitmap Index Scan on ip_range_db11_idx  (cost=0.00..28108.09 rows=368789 width=0) (actual time=64.859..64.859 rows=1 loops=1)
         Index Cond: (('2538629520'::bigint >= ip2location_db11.ip_from) AND ('2538629520'::bigint <= ip2location_db11.ip_to))
         Buffers: shared hit=8272
 Planning time: 0.099 ms
 Execution time: 64.907 ms
(10 rows)

account=> EXPLAIN (ANALYZE, VERBOSE, BUFFERS, TIMING) SELECT * FROM ip2location_db11 WHERE 2538629520 BETWEEN ip_from AND ip_to;
                                                        QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on public.ip2location_db11  (cost=0.00..92906.18 rows=754776 width=69) (actual time=577.234..811.757 rows=1 loops=1)
   Output: ip_from, ip_to, country_code, country_name, region_name, city_name, latitude, longitude, zip_code, time_zone
   Filter: (('2538629520'::bigint >= ip2location_db11.ip_from) AND ('2538629520'::bigint <= ip2location_db11.ip_to))
   Rows Removed by Filter: 3319096
   Buffers: shared hit=33 read=43078
 Planning time: 0.667 ms
 Execution time: 811.783 ms
(7 rows)

Sample rows from the imported CSV file:
"0","16777215","-","-","-","-","0.000000","0.000000","-","-" "16777216","16777471","AU","Australia","Queensland","Brisbane","-27.467940","153.028090","4000","+10:00" "16777472","16778239","CN","China","Fujian","Fuzhou","26.061390","119.306110","350004","+08:00" Is there a better way to index this table that would improve the query, or is there a more efficient query that would get me the same result?