
I am trying to join two large tables (t1: 4.5 million rows, t2: 100 million rows) and collect the result in a new table. It takes ages (>24 h). I am not an expert, so I don't know how to read the query plan or what to optimize.

This is my command:

CREATE TABLE new AS
SELECT DISTINCT t1.*, t2.publn_nr, t2.publn_nr_original
FROM addresses_googleresponse t1
JOIN patstat2.tls211_pat_publn t2 ON t1.appln_id = t2.appln_id;

t1.appln_id and t2.appln_id are indexed.

Query plan:

Unique  (cost=2335238007.81..2344507241.10 rows=72699869 width=14874)
  ->  Sort  (cost=2335238007.81..2335419757.48 rows=72699869 width=14874)
        Sort Key: t1.location_id, t1.id, t1.unformattedaddress, t1.ts, t1.googleresponse, t1.parsegoogleresponsestatus, t1.numresults, t1.formattedaddress, t1.lat, t1.lng, t1.maintype, t1.types, t1.viewportarea, t1.administrative_area_level_1, t1.administr (...)
        ->  Hash Join  (cost=121.82..11099071.55 rows=72699869 width=14874)
              Hash Cond: (t1.appln_id = t2.appln_id)
              ->  Seq Scan on addresses_googleresponse t1  (cost=0.00..8410623.64 rows=52302064 width=14608)
              ->  Hash  (cost=118.34..118.34 rows=278 width=270)
                    ->  Foreign Scan on tls211_pat_publn t2  (cost=100.00..118.34 rows=278 width=270)

Answers to questions below:

t1 is not a temporary table, no (I changed the name to avoid confusion). It has an index on appln_id as described. Table definitions:

create table addresses_googleresponse (
    id int,
    unformattedAddress varchar(500),
    ts timestamp,
    googleResponse text,
    parseGoogleResponseStatus text,
    numResults integer,
    formattedAddress varchar(500),
    lat real,
    lng real,
    mainType varchar(200),
    types text,
    viewportArea real,
    administrative_area_level_1 varchar(200), administrative_area_level_2 varchar(200),
    administrative_area_level_3 varchar(200), administrative_area_level_4 varchar(200),
    administrative_area_level_5 varchar(200),
    airport varchar(200), country varchar(200), establishment varchar(200),
    floor varchar(200), locality varchar(200), natural_feature varchar(200),
    neighborhood varchar(200), park varchar(200), point_of_interest varchar(200),
    post_box varchar(200), postal_code varchar(200), postal_code_prefix varchar(200),
    postal_code_suffix varchar(200), postal_town varchar(200), premise varchar(200),
    route varchar(200), street_address varchar(200), street_number varchar(200),
    sublocality varchar(200), sublocality_level_1 varchar(200), sublocality_level_2 varchar(200),
    sublocality_level_3 varchar(200), sublocality_level_4 varchar(200), sublocality_level_5 varchar(200),
    subpremise varchar(200), ward varchar(200),
    data_source varchar(2),
    location_id int,
    appln_id int
);

CREATE TABLE patstat.tls211_pat_publn (
    pat_publn_id integer NOT NULL DEFAULT 0,
    publn_auth character(2) NOT NULL DEFAULT ''::bpchar,
    publn_nr character varying(15) NOT NULL DEFAULT ''::character varying,
    publn_nr_original character varying(100) NOT NULL DEFAULT ''::character varying,
    publn_kind character(2) NOT NULL DEFAULT ''::bpchar,
    appln_id integer NOT NULL DEFAULT 0,
    publn_date date NOT NULL DEFAULT '9999-12-31'::date,
    publn_lg character(2) NOT NULL DEFAULT ''::bpchar,
    publn_first_grant smallint DEFAULT (0)::smallint,
    publn_claims smallint NOT NULL DEFAULT (0)::smallint,
    CONSTRAINT tls211_pat_publn_pkey PRIMARY KEY (pat_publn_id)
);

The goal is to add publn_nr and publn_nr_original to the other columns in t1 based on the appln_id both tables share.

Would it be an option / faster to add publn_nr and publn_nr_original as columns to t1 and then UPDATE them, instead of creating a new table?
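That alternative would look roughly like this (a sketch only; note that if t2 has several publications per appln_id, an UPDATE keeps one arbitrary match per t1 row, whereas the join produces one output row per match, so the semantics differ):

```sql
-- Add the target columns to t1 (types taken from the t2 definition)
ALTER TABLE addresses_googleresponse
    ADD COLUMN publn_nr varchar(15),
    ADD COLUMN publn_nr_original varchar(100);

-- Fill them from t2 via the shared appln_id
UPDATE addresses_googleresponse t1
SET    publn_nr          = t2.publn_nr,
       publn_nr_original = t2.publn_nr_original
FROM   patstat2.tls211_pat_publn t2
WHERE  t1.appln_id = t2.appln_id;
```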

I am trying it without DISTINCT now, but it has already been running for some hours.

PostgreSQL version 9.6.5.

  • The distinct operation could also be hitting you. Do you need to select distinct? Commented May 28, 2018 at 13:22
  • width=14874 looks suspect. ANYWAY: please add table definitions, etc. Commented May 28, 2018 at 13:37
  • what do you mean with table definitions? Commented May 28, 2018 at 13:58
  • joop refers to the create table statements. The distinct is a bit of a code smell the way you are using it (especially on 72699869 rows). You should take a step back and describe what the real problem is you are trying to solve. Maybe there is a better way to solve this. Please edit your question and add some sample data and the expected output based on that data. Formatted text please, no screenshots. Commented May 28, 2018 at 14:29
  • Hi, several questions - the table name "addresses_googleresponse_temp_2" sounds like a temporary table. Does this table have any indexes? What version of PostgreSQL do you use, and could you add your setting of "work_mem"? Commented May 28, 2018 at 14:30

1 Answer


For a definitive answer, EXPLAIN (ANALYZE, BUFFERS) output is needed.
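That would mean running the statement with actual execution statistics enabled (be aware that ANALYZE really executes the query, so on a >24 h statement you may want to try it on a subset first):

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT t1.*, t2.publn_nr, t2.publn_nr_original
FROM addresses_googleresponse t1
JOIN patstat2.tls211_pat_publn t2 ON t1.appln_id = t2.appln_id;
```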

However, if PostgreSQL's estimates are correct, your problem is the DISTINCT: PostgreSQL has to sort 72699869 very wide rows (width=14874) to remove duplicates, and that sort is where the time is spent. You can see it dominating the cost estimates.

If you can get rid of DISTINCT, do so. If not, try increasing work_mem and see whether that makes the sort somewhat faster.
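A sketch of both suggestions combined (the work_mem value is only an example; tune it to the RAM actually available on your server):

```sql
-- Session-level setting; more memory lets the sort/hash stay off disk longer
SET work_mem = '1GB';

-- Without DISTINCT the Unique/Sort steps disappear from the plan
CREATE TABLE new AS
SELECT t1.*, t2.publn_nr, t2.publn_nr_original
FROM addresses_googleresponse t1
JOIN patstat2.tls211_pat_publn t2 ON t1.appln_id = t2.appln_id;
```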
