Overview of postgres cluster solutions Stas Kelvich, @kelvich
Let’s define it simply as a database that uses several servers. The design space is huge, so let’s also set some constraints: Do we want to improve performance? Do we want to improve availability? Will the load be more OLAP-like or OLTP-like? Or both? What consistency guarantees do we need? These goals contradict each other! What do we mean by distributed DB or cluster? 2
Master-slave replication Postgres-XC/XC2/XL Greenplum CitusDB / pg_shard BDR / pg_logical ... Postgres clustering 3
Async. The master just streams the journal (WAL) to the slaves; a slave can lag behind. Sync. The master waits for the WAL to be synced on the replica, but we can still read old data from the replica because of the time gap between the WAL fsync and its application to the data files. Sync + remote_apply (new in 9.6). Wait for the changes to be applied on the replica. Preserves causality. Master-slave replication 4
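As a rough sketch, these modes are selected with two settings on the master; the standby name 'replica1' below is a placeholder:
    -- register a synchronous standby (name is hypothetical)
    ALTER SYSTEM SET synchronous_standby_names = 'replica1';
    SELECT pg_reload_conf();
    -- per session or transaction: wait until WAL is applied on the standby (9.6+)
    SET synchronous_commit = 'remote_apply';
    -- or wait only for the WAL flush on the standby (data may not be visible there yet)
    SET synchronous_commit = 'on';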
Pros: Improves availability. Read-only queries can be sent to the replica, which improves performance in that sense. Cons: Automatic failover is hard to achieve and external tools are required (pacemaker, patroni, etc). Read-only queries on the replica are limited: no temp tables, no catalog changes. Master-slave replication 5
Postgres-XC is a distributed DB that grew out of the Postgres sources, originally aimed at write scalability. Development now continues at Huawei (Postgres-XC2) and 2ndQuadrant (Postgres-XL). Let’s concentrate on Postgres-XL: a Postgres fork built for write scalability and OLAP performance, with distributed transactions. 3 types of nodes in the system: coordinator, datanode, GTM (+ GTM proxy). Postgres-XC/XC2/XL 6
Postgres-XL 7
Pros: now merged with 9.5, good for OLAP, good for write-heavy load. Cons: a fork, so it isn’t maintained by the Postgres core team; isn’t good for high TPS; the GTM is a hot spot. Postgres-XL 8
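For illustration, data distribution in Postgres-XL is declared per table at CREATE TABLE time; the tables below are hypothetical:
    -- shard rows across datanodes by hash of the key
    CREATE TABLE orders (id bigint, customer_id bigint, total numeric)
        DISTRIBUTE BY HASH (id);
    -- replicate a small dimension table to every datanode
    CREATE TABLE currencies (code text PRIMARY KEY, name text)
        DISTRIBUTE BY REPLICATION;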
MPP database by Pivotal. Open-sourced in 2015. Built on top of Postgres 8.3. Same niche as Postgres-XL, but more mature. Greenplum 9
Also sharding + analytics (distributed joins). Not a fork! An extension to ordinary Postgres! Isn’t good for high TPS (the master node is a hot spot). 2 types of nodes: master node and data node. No distributed transactions (and hence no proper isolation). CitusDB 10
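A minimal sketch of declaring a sharded table, using the Citus UDF name from recent versions (older pg_shard releases used different function names); the table is hypothetical:
    CREATE EXTENSION citus;                          -- on the master node
    CREATE TABLE events (device_id bigint, payload jsonb, ts timestamptz);
    -- shard the table across data nodes by device_id
    SELECT create_distributed_table('events', 'device_id');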
In-core extension to access tables on a remote node as local tables. Some distributed planning (aggregate pushdown). Can be used for sharding (along with the standard table inheritance mechanism). Neither atomic nor isolated across nodes. Postgres_fdw 11
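A sketch of inheritance-based sharding over postgres_fdw; the host, database, and table names are made up:
    CREATE EXTENSION postgres_fdw;
    CREATE SERVER shard1 FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'node1.example.com', dbname 'app');
    CREATE USER MAPPING FOR CURRENT_USER SERVER shard1
        OPTIONS (user 'app', password 'secret');
    -- local parent table; queries against it also scan the foreign children
    CREATE TABLE events (id bigint, payload jsonb);
    -- remote shard attached via the standard inheritance mechanism (9.5+)
    CREATE FOREIGN TABLE events_shard1 ()
        INHERITS (events)
        SERVER shard1 OPTIONS (table_name 'events');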
Bi-Directional Replication. A fork, but most of the needed features have been merged into Postgres core. Multi-master replication, but without conflict detection (no distributed transactions, no distributed queries). Very useful if the application itself can determine which node to access. Good for geo-replication. BDR 12
Was part of BDR, now shipped as an extension. Can decode the journal (WAL) into a sequence of logical operations in a custom format; libpq binary and JSON output formats are included. Use cases: server-side events (filter changes and push them to WebSocket connections), replication between different Postgres versions, replication to other databases! pg_logical + decoding plugin 13
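For illustration, the same logical decoding machinery can be driven from SQL; this sketch uses the contrib test_decoding plugin rather than the pg_logical output plugin, and a hypothetical table:
    -- requires wal_level = logical and a free replication slot
    SELECT pg_create_logical_replication_slot('demo_slot', 'test_decoding');
    INSERT INTO accounts VALUES (1, 100);
    -- read the decoded change stream accumulated in the slot
    SELECT * FROM pg_logical_slot_get_changes('demo_slot', NULL, NULL);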
Started about a year ago. Konstantin Knizhnik, Constantin Pan, Stas Kelvich Cluster group in PgPro 14
Started by playing with Postgres-XC. 2ndQuadrant also had a project (now finished) to port XC to 9.5. Maintaining a fork is painful; how can we bring the functionality of XC into core? Cluster group in PgPro 15
Distributed transactions - nothing in-core; Distributed planner - fdw, pg_shard, greenplum planner (?); HA/Autofailover - can be built on top of logical decoding. Distributed postgres 16
[Diagram] Tuple versions of acc1 ({5, 0, 1, 500} → {5, 42, 1, 500} → {42, 0, 1, 600}) and acc2 ({7, 0, 2, 500} → {7, 43, 1, 500} → {43, 0, 1, 400}) as T42 (r1(500) w1(600) r2(600) commit) and T43 (r2(500) w1(400) commit) execute. MVCC / snapshot isolation 17
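As a small illustration of the same idea, the xmin/xmax system columns that drive visibility checks can be inspected directly; the acc table below is hypothetical:
    CREATE TABLE acc (id int, amount int);
    INSERT INTO acc VALUES (1, 500), (2, 500);
    -- xmin = XID that created the row version, xmax = XID that deleted/updated it
    SELECT xmin, xmax, id, amount FROM acc;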
SQL-92 Isolation levels 18
(Peter Bailis, VLDB, 2013) Isolation levels 19
src server: create extension pg_dtm; select begin_glob...(); begin; update ...; commit; dst server: create extension pg_dtm; select join_glob...(xid); begin; update ...; commit; Client point of view 20
Goal: achieve proper isolation for multi-node transactions. Currently, at the start of a write transaction Postgres: acquires an XID; gets the list of running transactions; uses that info in visibility checks. Distributed transactions 21
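These steps can be observed from SQL (a sketch using the pre-10 txid_* function names):
    SELECT txid_current();            -- assigns and returns an XID for this transaction
    SELECT txid_current_snapshot();   -- xmin:xmax:list of in-progress XIDs used for visibility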
transam/clog.c: GetTransactionStatus SetTransactionStatus transam/varsup.c: GetNewTransactionId ipc/procarray.c: TransactionIdIsInProgress GetOldestXmin GetSnapshotData time/tqual.c: XidInMVCCSnapshot XTM API: vanilla 22
transam/clog.c: GetTransactionStatus SetTransactionStatus transam/varsup.c: GetNewTransactionId ipc/procarray.c: TransactionIdIsInProgress GetOldestXmin GetSnapshotData time/tqual.c: XidInMVCCSnapshot Transaction Manager XTM API: after patch 23
transam/clog.c: GetTransactionStatus SetTransactionStatus transam/varsup.c: GetNewTransactionId ipc/procarray.c: TransactionIdIsInProgress GetOldestXmin GetSnapshotData time/tqual.c: XidInMVCCSnapshot Transaction Manager pg_dtm.so XTM API: after tm load 24
Acquire XIDs centrally (DTMd, arbiter); no local transactions possible; DTMd is a bottleneck. XTM implementations GTM or snapshot sharing 25
Paper from the SAP HANA team; a central daemon is needed, but only for multi-node transactions; snapshots are replaced by a Commit Sequence Number (CSN); DTMd is still a bottleneck. XTM implementations Incremental SI 26
XIDs/CSNs are gathered from all nodes that participate in the transaction; no central service; local transactions are possible; communication can be reduced by using timestamps (Spanner, CockroachDB). XTM implementations Clock-SI or tsDTM 27
XTM implementations tsDTM scalability 28
The more nodes, the higher the probability of a failure somewhere in the system. Possible problems with nodes: a node stopped (and will not come back); a node was down for a short time (and we should bring it back into operation); network partitions (avoid split-brain). If we want to survive network partitions, we can tolerate at most ⌈N/2⌉ - 1 failed nodes, i.e. a majority must stay connected (a 3-node cluster tolerates one failure). HA/autofailover 29
Possible uses of such a system: multi-master replication; tables with meta-information in sharded databases; sharding with redundancy. HA/autofailover 30
By multi-master we mean a strongly coupled one that acts as a single database, with proper isolation and no merge conflicts. Ways to build it: impose a global order on XLOG (Postgres-R, MySQL Galera); wrap each transaction as a distributed one, which allows parallelism while applying transactions. Multimaster 31
Our implementation: built on top of pg_logical; makes use of tsDTM; a pool of workers for transaction replay; Raft-based storage for dealing with failures and for distributed deadlock detection. Multimaster 32
Our implementation: approximately half the speed of standalone Postgres; the same speed for reads; handles node autorecovery; handles network partitions (being debugged right now); can work as an extension (if the community accepts the XTM API into core). Multimaster 33

Postgres clusters