
We've been tasked with restoring one large database on a daily basis for analytics purposes.

Here is what we cannot change:

  • The vendor provides a database dump file daily on S3
  • We cannot control how the database gets dumped. We only get the dump file.
  • We don't have the option of replication or direct access to the source database

What we can control / do:

  • We can shut down the DB / kick the users out
  • We have full control of the hardware / VM for this process

Current process:

We use a VM hosted in VMC (VMware Cloud). When we detect a new database dump file in S3, we spin up a VM with the maximum resources one ESX host can take, allocate all of its CPU and memory to the VM, and run the restore. Once the restore completes, we shut the VM down, allocate a more "appropriate" amount of resources to it, and let users in.

I managed to optimise this down to less than 3 hours; however, we would like to move away from this process because it affects the performance of other VMs in the ESX cluster when we take over a whole host.
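
A minimal sketch of this workflow, with a hypothetical bucket name and placeholder VM-resize helpers standing in for our actual tooling:

#!/usr/bin/env bash
# Sketch of the current VM-based restore cycle; resize_vm is a placeholder
# for whatever resizes the VM (PowerCLI, govc, etc.), not a real command.
set -euo pipefail

DB=analytics
DUMP_KEY="s3://vendor-bucket/${DB}.pgdump"     # hypothetical bucket/key
LOCAL=~/"${DB}.pgdump"

# 1. fetch the newly detected dump from S3
aws s3 cp "$DUMP_KEY" "$LOCAL"

# 2. grow the restore VM to take the whole ESX host (placeholder helper)
resize_vm restore-vm --cpu max --mem max

# 3. recreate the target database and restore with high parallelism
dropdb -U postgres --if-exists "$DB"
createdb -U postgres "$DB"
pg_restore -vvv -U postgres -j 40 -d "$DB" "$LOCAL"

# 4. shrink back to an "appropriate" size and let users back in (placeholder)
resize_vm restore-vm --cpu 8 --mem 64G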

New process:

We provisioned a new bare-metal physical server with a more powerful CPU, more memory, and locally attached enterprise SSDs.

Issue:

The restore takes 5 to 6 hours when run on the physical server with the same configuration as the VM.

Server spec:

[server spec image]

Restore command:

pg_restore -vvv -U postgres -j 40 -d $DB ~/$DB.pgdump 

pg_dump TOC:

; Archive created at 2025-09-14 16:08:54 AEST
;     dbname: database
;     TOC Entries: 8221
;     Compression: none
;     Dump Version: 1.14-0
;     Format: CUSTOM
;     Integer: 4 bytes
;     Offset: 8 bytes
;     Dumped from database version: 14.15
;     Dumped by pg_dump version: 14.19 (Ubuntu 14.19-1.pgdg22.04+1)
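
For reference, the TOC header above can be printed without running a restore; a minimal sketch, assuming the dump file sits at ~/$DB.pgdump:

# list the archive's table of contents; the "; Archive created at ..." header
# lines appear at the top of the output
pg_restore -l ~/"$DB".pgdump | head -n 15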

postgresql.conf (restore mode)

listen_addresses = ''
max_connections = 98
ssl = off
shared_buffers = 16GB
work_mem = 64MB
maintenance_work_mem = 64GB
wal_level = minimal
fsync = off
synchronous_commit = off
full_page_writes = off
wal_buffers = 16MB
checkpoint_timeout = 2h
max_wal_size = 16GB
archive_mode = off
max_wal_senders = 0
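
These settings live in a separate restore-mode configuration. A minimal sketch of switching into and out of it, assuming a pg_ctl-managed instance and hypothetical paths (a systemd/pg_ctlcluster setup would do the equivalent with its own tooling):

# restart with the restore-mode settings, run the restore, then go back
# to the normal serving configuration
PGDATA=/var/lib/postgresql/14/main

pg_ctl -D "$PGDATA" -o '-c config_file=/etc/postgresql/restore.conf' restart
pg_restore -vvv -U postgres -j 40 -d "$DB" ~/"$DB".pgdump
pg_ctl -D "$PGDATA" -o '-c config_file=/etc/postgresql/serving.conf' restart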

Observations during restore:

  • The pg_restore processes seem to be the ones using most of the CPU. At times, multiple postgres processes sit at 0% CPU while multiple pg_restore processes run at 100% CPU (see the sketch after this list).
  • No storage contention; the majority of the time, I/O wait is 0.
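
For reference, the per-process CPU split described above can be watched with standard tools; a minimal sketch:

# refresh every 5 seconds, showing the busiest postgres / pg_restore processes
watch -n 5 "ps -eo pid,pcpu,comm --sort=-pcpu | grep -E 'postgres|pg_restore' | head -n 20"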

I'm at a loss as to why this is happening. I tried increasing parallelism, but it doesn't help. Any ideas on how to speed things up?

  • Use less parallelism. Commented Sep 15 at 12:13
  • Thanks for the suggestion! May I know the reasoning behind it? I'm also trying to understand why -j 40 is faster on the VM than on the physical server. Here are the results with less parallelism: -j 10: 3 hours 8 minutes; -j 20: 3 hours 6 minutes. Will try -j 5 and -j 2 shortly... Commented Sep 15 at 22:06
  • This is just guesswork. You will have to investigate this problem yourself. One of the things to try is whether less parallelism (which removes internal contention) helps. Essentially, you have to do profiling: investigate where the time is spent, probably with perf. The problem could be anywhere: hardware, operating system configuration, database. A Q&A forum is not suited for investigations like that. Commented Sep 16 at 6:06
  • -j 5: 3 hours 42 minutes; -j 2: 5 hours 3 minutes. Commented Sep 16 at 10:58
  • So -j 40 results in a 5 to 6 hour restore time, while -j 5 results in 3 hours 42 minutes. That tells you that 40 is too much; it's not the best-performing setting. Try 10 or 20, and monitor the progress to identify the bottleneck and determine where you get the best performance. Commented Sep 16 at 18:39
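
A minimal sketch of the parallelism sweep discussed in these comments, assuming the target database can be dropped and recreated between runs:

# time the restore at several -j levels; each run starts from an empty DB
for j in 2 5 10 20 40; do
    dropdb -U postgres --if-exists "$DB"
    createdb -U postgres "$DB"
    /usr/bin/time -f "-j $j: %E elapsed" \
        pg_restore -U postgres -j "$j" -d "$DB" ~/"$DB".pgdump
done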

1 Answer

The discrepancy between the two servers was caused by a scripting bug in my restore script: an external script shut down the VM prematurely.

The reason the postgres server processes used 0% CPU at various points of the restore (this happened more during the last leg of the restore) was that the server was just waiting for incoming data.

It has something to do with how the dump was taken. There is an undocumented behaviour in pg_dump which I think deserves more publicity. The following is a quote from Tom Lane:

The TOC is initially written out with zeroes for the offsets. Then the per-table data parts are written out, tracking where each one begins. At the end, if the output file is seekable, pg_dump seeks back to the start and re-writes the whole TOC section, now with data offsets populated. But output to a pipe won't be seekable. <source>

In my case, the dump was taken by piping to lbzip. Towards the end of the restore process, a bunch of pg_restore processes were consuming 100% of the CPU seeking the next object to restore.
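
A minimal sketch of the difference, with hypothetical database and file names (and assuming "lbzip" refers to the lbzip2 compressor): when pg_dump writes straight to a file it can seek back and fill in the TOC offsets, but when it writes to a pipe it cannot, so the offsets stay zeroed and a parallel pg_restore has to scan for each object.

# seekable output: pg_dump can rewrite the TOC with real data offsets
pg_dump -Fc -f /backup/analytics.pgdump analytics

# piped output: not seekable, TOC offsets remain zero
pg_dump -Fc analytics | lbzip2 > /backup/analytics.pgdump.bz2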
