Scaling PostgreSQL with GridSQL
Who Am I? Jim Mlodgenski Co-organizer of NYCPUG
Founder of Cirrus Technologies
Former Chief Architect of EnterpriseDB
Agenda What is GridSQL?
Architecture
Query Flow
Scaling
Limitations
What is GridSQL? “ Shared-Nothing”, distributed data architecture. Leverage the power of multiple commodity servers while appearing as a single database to the application Essentially... Open Source
Greenplum, Netezza or Teradata
GridSQL Details Designed for Parallel Querying
Not just “Read-Only”, can execute UPDATE, DELETE
Data Loader for parallel loading
Standard connectivity via PostgreSQL compatible connectors: JDBC, ODBC, ADO.NET, libpq (psql)
What GridSQL is not? A replication solution like Slony or Bucardo
A high availability solution like Streaming Replication in PostgreSQL 9.0
A scalable transactional solution like PostgresXC
An elastic, eventually consistent NoSQL database
Configuration Can be configured for multiple logical “nodes” per physical server Take advantage of multi-core processors Tables may be either replicated or partitioned
Replicated tables for static lookup data or dimensions Partitioned tables for large fact tables
Partitioning Tables may simultaneously use GridSQL Partitioning with Constraint Exclusion Partitioning Large queries scan a much smaller subset of data by using subtables
Since each subtable is also partitioned across nodes, they are scanned in parallel
Queries execute much faster
Architecture Loosely coupled, shared-nothing architecture
Data repositories Metadata database
GridSQL database GridSQL processes Central coordinator
Agents
Query Optimization Cost Based Optimizer Takes into account Row Shipping (expensive) Looks for joins with replicated tables Can be done locally

Scaling PostgreSQL With GridSQL

  • 1.
    Scaling PostgreSQL with GridSQL
  • 2.
    Who Am I?Jim Mlodgenski Co-organizer of NYCPUG
  • 3.
    Founder of CirrusTechnologies
  • 4.
    Former Chief Architectof EnterpriseDB
  • 5.
  • 6.
  • 7.
  • 8.
  • 9.
  • 10.
    What is GridSQL?“ Shared-Nothing”, distributed data architecture. Leverage the power of multiple commodity servers while appearing as a single database to the application Essentially... Open Source
  • 11.
  • 12.
    GridSQL Details Designedfor Parallel Querying
  • 13.
    Not just “Read-Only”,can execute UPDATE, DELETE
  • 14.
    Data Loader forparallel loading
  • 15.
    Standard connectivity viaPostgreSQL compatible connectors: JDBC, ODBC, ADO.NET, libpq (psql)
  • 16.
    What GridSQL isnot? A replication solution like Slony or Bucardo
  • 17.
    A high availabilitysolution like Streaming Replication in PostgreSQL 9.0
  • 18.
    A scalable transactionalsolution like PostgresXC
  • 19.
    An elastic, eventuallyconsistent NoSQL database
  • 20.
    Configuration Can beconfigured for multiple logical “nodes” per physical server Take advantage of multi-core processors Tables may be either replicated or partitioned
  • 21.
    Replicated tables forstatic lookup data or dimensions Partitioned tables for large fact tables
  • 22.
    Partitioning Tables maysimultaneously use GridSQL Partitioning with Constraint Exclusion Partitioning Large queries scan a much smaller subset of data by using subtables
  • 23.
    Since each subtableis also partitioned across nodes, they are scanned in parallel
  • 24.
  • 25.
    Architecture Loosely coupled,shared-nothing architecture
  • 26.
  • 27.
    GridSQL database GridSQLprocesses Central coordinator
  • 28.
  • 29.
    Query Optimization CostBased Optimizer Takes into account Row Shipping (expensive) Looks for joins with replicated tables Can be done locally
  • 30.
    Looks for joinsbetween tables on partitioned columns
  • 31.
    Aggregation First setof aggregates done in parallel at the nodes
  • 32.
    Like groups ofintermediate results shipped to same target node
  • 33.
  • 34.
    Coordinator streams innode results, combining on the fly and sending to client result set, performing a merge sort if ORDER BY present
  • 35.
    Two Phase AggregationSUM SUM(stat1)
  • 36.
  • 37.
    SUM2 (SUM(stat1)) /SUM2 (COUNT(stat1))
  • 38.
    Creating Tables Tablescan be partitioned or replicated CREATE TABLE region (r_regionkey INTEGER NOT NULL, r_name CHAR(25) NOT NULL, r_comment VARCHAR(152)) REPLICATED;
  • 39.
    Creating Tables CREATETABLE orders ( o_orderkey INTEGER NOT NULL, o_custkey INTEGER NOT NULL, o_orderstatus CHAR(1) NOT NULL, o_totalprice DECIMAL(15,2) NOT NULL, o_orderdate DATE NOT NULL, o_orderpriority CHAR(15) NOT NULL, o_clerk CHAR(15) NOT NULL, o_shippriority INTEGER NOT NULL, o_comment VARCHAR(79) NOT NULL) PARTITIONING KEY o_orderkey ON ALL;
  • 40.
    DBT3 : Query 1SELECT l_returnflag, l_linestatus, sum(l_quantity) as sum_qty, sum(l_extendedprice) as sum_base_price, sum(l_extendedprice * (1 - l_discount)) as sum_disc_price, sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge, avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price, avg(l_discount) as avg_disc, count(*) as count_order FROM lineitem WHERE l_shipdate <= date'1998-12-01' - interval '90 days' GROUP BY l_returnflag, l_linestatus ORDER BY l_returnflag, l_linestatus; Results l_returnflag | l_linestatus | sum_qty | sum_base_price | ... | count_order --------------+--------------+----------+----------------+ ... +------------- A | F | 37734104 | 56586654000 | ... | 1478493 N | F | 991417 | 1487505700 | ... | 38854 N | O | 74473520 | 111717540000 | ... | 2920374 R | F | 37719752 | 56567792000 | ... | 1478870 (4 rows)
  • 41.
    Query 1 –Execution (no Agents) Go to Animation Slide
  • 42.
    DBT3 : Query 7Results supp_nation | cust_nation | l_year | revenue ---------------------------+---------------------------+--------+-------------------- GERMANY | UNITED STATES | 1995 | 51883178.038909949 GERMANY | UNITED STATES | 1996 | 52528107.076993272 UNITED STATES | GERMANY | 1995 | 51546631.033109233 UNITED STATES | GERMANY | 1996 | 53108668.056805529 (4 rows) SELECT supp_nation, cust_nation, l_year, sum(volume) as revenue FROM (SELECT n1.n_name as supp_nation, n2.n_name as cust_nation, extract(year from l_shipdate) as l_year, l_extendedprice * (1 - l_discount) as volume FROM supplier, lineitem, orders, customer, nation n1, nation n2 WHERE s_suppkey = l_suppkey AND o_orderkey = l_orderkey AND c_custkey = o_custkey AND s_nationkey = n1.n_nationkey AND c_nationkey = n2.n_nationkey AND ((n1.n_name = 'GERMANY' and n2.n_name = 'UNITED STATES') or (n1.n_name = 'UNITED STATES' and n2.n_name = 'GERMANY')) AND l_shipdate between date '1995-01-01' and date '1996-12-31' ) AS shipping GROUP BY supp_nation, cust_nation, l_year ORDER BY supp_nation, cust_nation, l_year;
  • 43.
    Query 7 –Execution (with Agents) Go to Animation Slide
  • 44.
  • 45.
    Scalability A fewDBT3 queries on Amazon EC2 Using PostgreSQL 9.0
  • 46.
    Scalability SELECT l_returnflag,l_linestatus, sum(l_quantity) as sum_qty, sum(l_extendedprice) as sum_base_price, sum(l_extendedprice * (1 - l_discount)) as sum_disc_price, sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge, avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price, avg(l_discount) as avg_disc, count(*) as count_order FROM lineitem WHERE l_shipdate <= date'1998-12-01' - interval '90 days' GROUP BY l_returnflag, l_linestatus ORDER BY l_returnflag, l_linestatus;
  • 47.
    Scalability SELECT supp_nation,cust_nation, l_year, sum(volume) as revenue FROM (SELECT n1.n_name as supp_nation, n2.n_name as cust_nation, extract(year from l_shipdate) as l_year, l_extendedprice * (1 - l_discount) as volume FROM supplier, lineitem, orders, customer, nation n1, nation n2 WHERE s_suppkey = l_suppkey AND o_orderkey = l_orderkey AND c_custkey = o_custkey AND s_nationkey = n1.n_nationkey AND c_nationkey = n2.n_nationkey AND ((n1.n_name = 'GERMANY' and n2.n_name = 'UNITED STATES') or (n1.n_name = 'UNITED STATES' and n2.n_name = 'GERMANY')) AND l_shipdate between date '1995-01-01' and date '1996-12-31' ) AS shipping GROUP BY supp_nation, cust_nation, l_year ORDER BY supp_nation, cust_nation, l_year;
  • 48.
    Limitations SQL SupportUses its own parser and optimizer so: No Window Functions
  • 49.
  • 50.
  • 51.
  • 52.
    Transaction Performance Singlerow Insert, Update, or Delete are slow compared to a single PostgreSQL instance The data must make an additional network trip to be committed
  • 53.
    All partitioned rowsmust be hashed to be mapped to the proper node
  • 54.
    All replicated rowsmust be committed to all nodes Use “gs-loader” for bulk loading for better performance
  • 55.
    High Availability Noheartbeat or fail-over control in the coordinator High Availability for each PostgreSQL node must be configured separately
  • 56.
    Streaming replication canbe ideal for this Getting a consistent backup of the entire GridSQL database is difficult Must ensure there are no transaction are occurring
  • 57.
    Backup each nodeseparately
  • 58.
    Adding Nodes RequiresDowntime Data must be manually reloaded to partition the data to the new node With planning, the process can be fast with no mapping of data Run multiple PostgreSQL instances on each physical server and move the PostgreSQL instances to new hardware as needed
  • 59.
    Interesting Side NoteGridSQL scales well in a cloud environment
  • 60.
    The results aredependent on the cloud vendor
  • 61.
    Summary GridSQL canimprove performance tremendously of PostgreSQL queries
  • 62.
    GridSQL can scalelinearly as more nodes are added
  • 63.
    GridSQL is opensource so if the limitations are an issue,
  • 64.
  • 65.
    Download GridSQL at:http://sourceforge.net/projects/gridsql/ Jim Mlodgenski Email: [email_address] Twitter: @jim_mlodgenski