Overview of postgres cluster solutions Stas Kelvich, @kelvich
Let’s define it simply as a database that uses several servers. The design space is huge, so let’s also set some constraints: Do we want to improve performance? Do we want to improve availability? Will the load be more OLAP-like or OLTP-like? Or both? What consistency guarantees do we need? These goals contradict each other! What do we mean by distributed DB or cluster? 2
Master-slave replication Postgres-XC/XC2/XL Greenplum CitusDB / pg_shard BDR / pg_logical ... Postgres clustering 3
Async. The master just streams the journal (WAL) to the slaves; a slave can lag behind. Sync. The master waits for the WAL to be synced on the replica, but we can still read old data from the replica because of the time gap between the WAL fsync and its application to the data files. Sync + remote_apply (new in 9.6). Wait for the changes to be applied on the replica. Preserves causality. Master-slave replication 4
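As a rough sketch, these modes are selected with two settings on the master; the standby name 'replica1' below is a placeholder:
    -- register a synchronous standby (name is hypothetical)
    ALTER SYSTEM SET synchronous_standby_names = 'replica1';
    SELECT pg_reload_conf();
    -- per session or transaction: wait until WAL is applied on the standby (9.6+)
    SET synchronous_commit = 'remote_apply';
    -- or wait only for the WAL flush on the standby (data may not be visible there yet)
    SET synchronous_commit = 'on';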
Pros: Improves availability. Read-only queries can be sent to the replica, which improves performance in that sense. Cons: Automatic failover is hard to achieve and external tools are required (pacemaker, patroni, etc). Read-only queries on the replica are limited: no temp tables, no catalog changes. Master-slave replication 5
Postgres-XC is a distributed DB that grew out of the Postgres sources, originally aimed at write scalability. Development now continues at Huawei (Postgres-XC2) and 2ndQuadrant (Postgres-XL). Let’s concentrate on Postgres-XL: a Postgres fork built for write scalability and OLAP performance, with distributed transactions. 3 types of nodes in the system: coordinator, datanode, GTM (+ GTM proxy). Postgres-XC/XC2/XL 6
Postgres-XL 7
Pros: now merged with 9.5, good for OLAP, good for write-heavy load. Cons: a fork, so it isn’t maintained by the Postgres core team; isn’t good for high TPS; the GTM is a hot spot. Postgres-XL 8
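For illustration, data distribution in Postgres-XL is declared per table at CREATE TABLE time; the tables below are hypothetical:
    -- shard rows across datanodes by hash of the key
    CREATE TABLE orders (id bigint, customer_id bigint, total numeric)
        DISTRIBUTE BY HASH (id);
    -- replicate a small dimension table to every datanode
    CREATE TABLE currencies (code text PRIMARY KEY, name text)
        DISTRIBUTE BY REPLICATION;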
MPP database by Pivotal. Open-sourced in 2015. Built on top of Postgres 8.3. Same niche as Postgres-XL, but more mature. Greenplum 9
Also sharding + analytics (distributed joins). Not a fork! An extension to ordinary Postgres! Isn’t good for high TPS (the master node is a hot spot). 2 types of nodes: master node and data node. No distributed transactions (and hence no proper isolation). CitusDB 10
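A minimal sketch of declaring a sharded table, using the Citus UDF name from recent versions (older pg_shard releases used different function names); the table is hypothetical:
    CREATE EXTENSION citus;                          -- on the master node
    CREATE TABLE events (device_id bigint, payload jsonb, ts timestamptz);
    -- shard the table across data nodes by device_id
    SELECT create_distributed_table('events', 'device_id');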
In-core extension to access tables on a remote node as local tables. Some distributed planning (aggregate pushdown). Can be used for sharding (along with the standard table inheritance mechanism). Neither atomic nor isolated across nodes. Postgres_fdw 11
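A sketch of inheritance-based sharding over postgres_fdw; the host, database, and table names are made up:
    CREATE EXTENSION postgres_fdw;
    CREATE SERVER shard1 FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'node1.example.com', dbname 'app');
    CREATE USER MAPPING FOR CURRENT_USER SERVER shard1
        OPTIONS (user 'app', password 'secret');
    -- local parent table; queries against it also scan the foreign children
    CREATE TABLE events (id bigint, payload jsonb);
    -- remote shard attached via the standard inheritance mechanism (9.5+)
    CREATE FOREIGN TABLE events_shard1 ()
        INHERITS (events)
        SERVER shard1 OPTIONS (table_name 'events');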
Bi-Directional Replication. A fork, but most of the needed features have been merged into Postgres core. Multi-master replication, but without conflict detection (no distributed transactions, no distributed queries). Very useful if the application itself can determine which node to access. Good for geo-replication. BDR 12
Was part of BDR, now shipped as an extension. Can decode the journal (WAL) into a sequence of logical operations in a custom format; libpq binary and JSON output formats are included. Use cases: server-side events (filter changes and push them to WebSocket connections), replication between different Postgres versions, replication to other databases! pg_logical + decoding plugin 13
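For illustration, the same logical decoding machinery can be driven from SQL; this sketch uses the contrib test_decoding plugin rather than the pg_logical output plugin, and a hypothetical table:
    -- requires wal_level = logical and a free replication slot
    SELECT pg_create_logical_replication_slot('demo_slot', 'test_decoding');
    INSERT INTO accounts VALUES (1, 100);
    -- read the decoded change stream accumulated in the slot
    SELECT * FROM pg_logical_slot_get_changes('demo_slot', NULL, NULL);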
Started about a year ago. Konstantin Knizhnik, Constantin Pan, Stas Kelvich Cluster group in PgPro 14
Started by playing with Postgres-XC. 2ndQuadrant also had a project (now finished) to port XC to 9.5. Maintaining a fork is painful; how can we bring the functionality of XC into core? Cluster group in PgPro 15
Distributed transactions - nothing in-core; Distributed planner - fdw, pg_shard, greenplum planner (?); HA/Autofailover - can be built on top of logical decoding. Distributed postgres 16
[Diagram] Tuple versions of acc1 ({5, 0, 1, 500} → {5, 42, 1, 500} → {42, 0, 1, 600}) and acc2 ({7, 0, 2, 500} → {7, 43, 1, 500} → {43, 0, 1, 400}) as T42 (r1(500) w1(600) r2(600) commit) and T43 (r2(500) w1(400) commit) execute. MVCC / snapshot isolation 17
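As a small illustration of the same idea, the xmin/xmax system columns that drive visibility checks can be inspected directly; the acc table below is hypothetical:
    CREATE TABLE acc (id int, amount int);
    INSERT INTO acc VALUES (1, 500), (2, 500);
    -- xmin = XID that created the row version, xmax = XID that deleted/updated it
    SELECT xmin, xmax, id, amount FROM acc;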
SQL-92 Isolation levels 18
(Peter Bailis, VLDB, 2013) Isolation levels 19
src server: create extension pg_dtm; select begin_glob...(); begin; update ...; commit; dst server: create extension pg_dtm; select join_glob...(xid); begin; update ...; commit; Client point of view 20
Goal: achieve proper isolation for multi-node transactions. Currently, at the start of a write transaction Postgres: acquires an XID; gets the list of running transactions; uses that info in visibility checks. Distributed transactions 21
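These steps can be observed from SQL (a sketch using the pre-10 txid_* function names):
    SELECT txid_current();            -- assigns and returns an XID for this transaction
    SELECT txid_current_snapshot();   -- xmin:xmax:list of in-progress XIDs used for visibility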
transam/clog.c: GetTransactionStatus SetTransactionStatus transam/varsup.c: GetNewTransactionId ipc/procarray.c: TransactionIdIsInProgress GetOldestXmin GetSnapshotData time/tqual.c: XidInMVCCSnapshot XTM API: vanilla 22
transam/clog.c: GetTransactionStatus SetTransactionStatus transam/varsup.c: GetNewTransactionId ipc/procarray.c: TransactionIdIsInProgress GetOldestXmin GetSnapshotData time/tqual.c: XidInMVCCSnapshot Transaction Manager XTM API: after patch 23
transam/clog.c: GetTransactionStatus SetTransactionStatus transam/varsup.c: GetNewTransactionId ipc/procarray.c: TransactionIdIsInProgress GetOldestXmin GetSnapshotData time/tqual.c: XidInMVCCSnapshot Transaction Manager pg_dtm.so XTM API: after tm load 24
Acquire XIDs centrally (DTMd, arbiter); no local transactions possible; DTMd is a bottleneck. XTM implementations GTM or snapshot sharing 25
Paper from the SAP HANA team; a central daemon is needed, but only for multi-node transactions; snapshots are replaced by a Commit Sequence Number (CSN); DTMd is still a bottleneck. XTM implementations Incremental SI 26
XIDs/CSNs are gathered from all nodes that participate in the transaction; no central service; local transactions are possible; communication can be reduced by using timestamps (Spanner, CockroachDB). XTM implementations Clock-SI or tsDTM 27
XTM implementations tsDTM scalability 28
The more nodes, the higher the probability of a failure somewhere in the system. Possible problems with nodes: a node stopped (and will not come back); a node was down for a short time (and we should bring it back into operation); network partitions (avoid split-brain). If we want to survive network partitions, we can tolerate at most ⌈N/2⌉ - 1 failed nodes, i.e. a majority must stay connected (a 3-node cluster tolerates one failure). HA/autofailover 29
Possible uses of such a system: multi-master replication; tables with meta-information in sharded databases; sharding with redundancy. HA/autofailover 30
By multi-master we mean a strongly coupled one that acts as a single database, with proper isolation and no merge conflicts. Ways to build it: impose a global order on XLOG (Postgres-R, MySQL Galera); wrap each transaction as a distributed one, which allows parallelism while applying transactions. Multimaster 31
Our implementation: built on top of pg_logical; makes use of tsDTM; a pool of workers for transaction replay; Raft-based storage for dealing with failures and for distributed deadlock detection. Multimaster 32
Our implementation: approximately half the speed of standalone Postgres; the same speed for reads; handles node autorecovery; handles network partitions (being debugged right now); can work as an extension (if the community accepts the XTM API into core). Multimaster 33

Postgres clusters