HeteroDB, Inc. specializes in GPU- and SSD-accelerated SQL database products, primarily the PG-Strom extension for PostgreSQL, which enhances query processing speed. Founded by experts in databases and GPU acceleration, the company provides solutions for heavy SQL workloads and in-database analytics, targeting enterprises that need efficient data processing. The technology includes auto-generation of GPU programs from SQL and direct execution from SSD to GPU, significantly improving throughput for complex queries and analytics workloads.
GPU & SSD Accelerates PostgreSQL – Challenge towards 10GB/s query processing throughput
HeteroDB, Inc. – Chief Architect & President, KaiGai Kohei <kaigai@heterodb.com>
About HeteroDB
▌Corporate Profile
Name: HeteroDB, Inc.
Location: Shinagawa, Tokyo, Japan
Established: 4th-Jul-2017
Businesses: development & sales of GPU/SSD-accelerated SQL database products; consulting service for SQL tuning with GPU
Stock share: 100% held by the founders
▌Founders' Profiles
KaiGai Kohei (Chief Architect & co-founder; President)
He has more than 10 years of experience developing the Linux kernel and the PostgreSQL database in their developer communities, contributing major features such as SELinux support in PostgreSQL, UPDATE/JOIN support for FDW, and the custom-scan interface. In 2012 he launched the PG-Strom project, which transparently accelerates SQL workloads using GPU, and has since focused on analytic SQL acceleration with heterogeneous computing architectures. Awards: Genius Programmer by METI/IPA (2006); OSS Contributor by OSSPF (2014).
Kashiwagi Takehiko (Chief Sales Engineer & co-founder; Vice-president)
He was a researcher at the NEC laboratory, working on networking technology, in-memory columnar storage, and GPGPU acceleration of database workloads, and worked with KaiGai on the PG-Strom project while both were NEC employees. He is now responsible for market / customer development at HeteroDB, in addition to running his own networking-solutions company (Parallel-Networks, Inc.).
Our core technology – PG-Strom
▌Features
Utilization of the GPU's computing capability to off-load SQL workloads; parallel execution by thousands of cores.
Auto-generation of GPU programs from the supplied SQL, with just-in-time compilation; enables transparent acceleration.
Open Source Software.
▌Functionalities
GPU parallel execution for WHERE-clause, JOIN, GROUP BY.
User-defined GPU functions (PL/CUDA).
▌Target usage
Batch, summary / reporting.
In-database analytics / machine learning.
(Diagram: PG-Strom as an extension plugged into PostgreSQL's SQL parser, query optimizer, query executor, and storage manager; the application runs the same SQL query over the same data schema.)
GPU Characteristics
An ultra-parallel computing processor with thousands of cores and hundreds-of-GB/s-class memory bandwidth on one chip.
CPU: like a passenger car; easy to use, but limited transportation capacity.
GPU: like a high-speed railway; a little troublesome to get in or out, but capable of mass transportation.

Model              | Intel Xeon E5-2699v4   | NVIDIA Tesla P40
# of transistors   | 7.2 billion            | 12 billion
# of cores         | 22 (functional)        | 3840 (simple)
Performance (FP32) | 1.2 TFLOPS (with AVX2) | 12 TFLOPS
Memory size        | max 1.5TB (DDR4)       | 24GB (GDDR5)
Memory bandwidth   | 76.8GB/s               | 347GB/s
TDP                | 145W                   | 250W
Example of GPU calculation – the reduction algorithm
Calculation of the total sum of an array by GPU: Σ i=0...N-1 item[i].
(Diagram: item[0]..item[15] summed pairwise over step 1, step 2, step 3, step 4.)
The total sum of items[] is obtained in log2(N) steps, using hardware-supported inter-core synchronization.
SELECT count(X), sum(Y), avg(Z) FROM my_table; the same logic is used internally to implement aggregate functions.
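As a concrete illustration, here is a minimal standalone CUDA sketch of the tree reduction described above; this is not PG-Strom's actual code, and the kernel and buffer names are ours:

__global__ void sum_reduction(const float *items, float *result, int n)
{
    extern __shared__ float buf[];               // one slot per thread
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    buf[tid] = (gid < n) ? items[gid] : 0.0f;
    __syncthreads();                             // hardware-supported barrier

    // halve the number of active threads each step: log2(blockDim.x) steps
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
    {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        atomicAdd(result, buf[0]);               // merge per-block partial sums
}

It would be launched as, e.g., sum_reduction<<<nblocks, 256, 256 * sizeof(float)>>>(items, result, n).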
Is an SQL workload similar to image processing?
Image data = an int/float[] array; a table with columns ID, X, Y, Z can likewise be transposed into per-column arrays.
SELECT * FROM my_table WHERE X BETWEEN 40 AND 60 → parallel execution.
The GPU's advantage: the same operation applied to a massive amount of data.
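A minimal sketch of that WHERE-clause as a CUDA kernel, one thread per row (illustrative names, not PG-Strom's generated code):

__global__ void eval_where(const float *x, bool *match, int nrows)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < nrows)                    // each thread filters exactly one row
        match[row] = (x[row] >= 40.0f && x[row] <= 60.0f);
}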
Problems PG-Strom is trying to solve (as of 2015)
Primitive PG-Strom, for data held in RAM:
✓ Heavy SQLs (JOIN, GROUP BY)
✓ Data size less than physical RAM
Problems PG-Strom is trying to solve (as of 2017)
Primitive PG-Strom, for data held in RAM:
✓ Heavy SQLs (JOIN, GROUP BY)
✓ Data size less than physical RAM
Towards more computing-intensive workloads: PL/CUDA
✓ In-database analytics and machine learning
✓ Drug discovery, anomaly detection
Towards more I/O-intensive workloads (data larger than RAM): SSD-to-GPU Direct SQL Execution
✓ Fast I/O at nearly the wired speed of the hardware
✓ Large batch, reporting, and analytics
It is my second time giving a talk at DB Tech Showcase!
DBTS2014: GPGPU Accelerates PostgreSQL → DBTS2017: GPU & SSD Accelerates PostgreSQL.
Utilization of new devices according to the expansion of the application domain.
PostgreSQL Internals
SQL Query → Parse-tree → Execution-Plan → Query Results.
SQL Parser: parses the supplied SQL and transforms it into an internal data structure (parse-tree); detects syntax errors.
Query Optimizer: estimates costs based on the statistics and constructs the most optimal query execution plan.
Query Executor: runs the workloads (Scan, Join, Sort, ...) according to the query execution plan, utilizing PostgreSQL's internal infrastructure such as the storage manager, transaction manager, and IPC (lock, latch, shmem).
How PostgreSQL constructs a query execution plan (1/2)
Scan t0, Scan t1, Scan t2 → Join t0,t1 → Join (t0,t1),t2 → GROUP BY cat → ORDER BY score → LIMIT 100.
How PostgreSQL constructs a query execution plan (2/2)
Multiple algorithms compete for each step, and the one with the lowest "cost" is chosen. For Join t0,t1 (statistics: nrows 1.2M, width 80, no index; nrows 4000, width 120, index on t1.id), the candidates are:
candidate HashJoin, cost=4000 (built-in execution path of PostgreSQL)
candidate MergeJoin, cost=12000 (built-in)
candidate NestLoop, cost=99999 (built-in)
candidate Parallel HashJoin, cost=3000 (built-in, since PostgreSQL v9.6)
candidate GpuJoin, cost=2500 – WINNER! (proposition by an extension, possible since PostgreSQL v9.5)
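A hedged sketch of how an extension can join that competition: since PostgreSQL v9.5, set_join_pathlist_hook lets an extension add its own join path, which add_path() keeps only if its cost can win. create_gpujoin_path() and its costing are hypothetical stand-ins here, not PG-Strom's real code:

#include "postgres.h"
#include "optimizer/paths.h"
#include "optimizer/pathnode.h"

/* hypothetical: builds a CustomPath carrying the GPU join plan */
extern Path *create_gpujoin_path(PlannerInfo *root, RelOptInfo *joinrel,
                                 RelOptInfo *outerrel, RelOptInfo *innerrel);

static set_join_pathlist_hook_type prev_join_pathlist_hook = NULL;

static void
gpujoin_add_paths(PlannerInfo *root, RelOptInfo *joinrel,
                  RelOptInfo *outerrel, RelOptInfo *innerrel,
                  JoinType jointype, JoinPathExtraData *extra)
{
    if (prev_join_pathlist_hook)        /* keep other extensions working */
        prev_join_pathlist_hook(root, joinrel, outerrel, innerrel,
                                jointype, extra);
    /* add_path() discards the path unless its cost beats the rivals */
    add_path(joinrel, create_gpujoin_path(root, joinrel, outerrel, innerrel));
}

void
_PG_init(void)
{
    prev_join_pathlist_hook = set_join_pathlist_hook;
    set_join_pathlist_hook = gpujoin_add_paths;
}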
Interactions between PostgreSQL and PG-Strom with CustomScan
As long as consistent results are produced, the implementation is flexible. For a plan like GroupAgg (key: cat) over CustomScan (GpuJoin) over SeqScan on t0 / SeqScan on t1, the callbacks do the following:
(*BeginCustomScan) → ExecInitGpuJoin(...): initialize the GPU context; kick asynchronous JIT compilation of the auto-generated GPU program.
(*ExecCustomScan) → ExecGpuJoin(...): read records from t0 and t1 and copy them to the DMA buffer; kick asynchronous GPU tasks; fetch results from the completed GPU tasks, then pass them to the next step (GroupAgg).
(*EndCustomScan) → ExecEndGpuJoin(...): wait for completion of the asynchronous tasks (if any); release GPU resources.
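For reference, a minimal sketch of how those callbacks are wired into PostgreSQL's CustomExecMethods table; the struct and its fields are the real API, only the wiring shown is simplified:

#include "postgres.h"
#include "nodes/extensible.h"

static void ExecInitGpuJoin(CustomScanState *node, EState *estate, int eflags);
static TupleTableSlot *ExecGpuJoin(CustomScanState *node);
static void ExecEndGpuJoin(CustomScanState *node);

static CustomExecMethods gpujoin_exec_methods = {
    .CustomName      = "GpuJoin",
    .BeginCustomScan = ExecInitGpuJoin,  /* GPU context + async JIT compile */
    .ExecCustomScan  = ExecGpuJoin,      /* DMA buffer fill, async GPU tasks */
    .EndCustomScan   = ExecEndGpuJoin,   /* wait for tasks, release GPU */
};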
SQL to GPU code auto-generation (example of a WHERE-clause)
QUERY: SELECT cat, count(*), avg(x) FROM t0 WHERE x BETWEEN y AND y + 20.0 GROUP BY cat;
E.g., the numeric formula in the WHERE-clause is transformed into CUDA C code on demand:

STATIC_FUNCTION(bool)
gpupreagg_qual_eval(kern_context *kcxt,
                    kern_data_store *kds,
                    size_t kds_index)
{
    pg_float8_t KPARAM_1 = pg_float8_param(kcxt,1);
    pg_float8_t KVAR_3 = pg_float8_vref(kds,kcxt,2,kds_index);
    pg_float8_t KVAR_4 = pg_float8_vref(kds,kcxt,3,kds_index);
    return EVAL((pgfn_float8ge(kcxt, KVAR_3, KVAR_4) &&
                 pgfn_float8le(kcxt, KVAR_3,
                               pgfn_float8pl(kcxt, KVAR_4, KPARAM_1))));
}

KVAR_3 / KVAR_4 reference the input data; the SQL expression in CUDA source code is handed to the run-time compiler and then executed in parallel.
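The "run-time compiler" step can be illustrated with CUDA's NVRTC library. Below is a minimal sketch of JIT-compiling generated source and loading it into a GPU context; PG-Strom's actual build pipeline differs, and error handling is omitted for brevity:

#include <nvrtc.h>
#include <cuda.h>
#include <vector>

CUmodule jit_compile_and_load(const char *cuda_source)
{
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, cuda_source, "gpupreagg.cu", 0, NULL, NULL);

    const char *opts[] = { "--gpu-architecture=compute_60" };  /* Pascal */
    nvrtcCompileProgram(prog, 1, opts);

    size_t ptx_len;
    nvrtcGetPTXSize(prog, &ptx_len);
    std::vector<char> ptx(ptx_len);
    nvrtcGetPTX(prog, ptx.data());
    nvrtcDestroyProgram(&prog);

    CUmodule mod;
    cuModuleLoadData(&mod, ptx.data());   /* load the PTX into this context */
    return mod;
}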
A simple micro-benchmark
▌Test Query: SELECT cat, count(*), avg(x) FROM t0 NATURAL JOIN t1 [NATURAL JOIN t2 ...] GROUP BY cat;
✓ t0 contains 100M rows; t1...t8 contain 100K rows each (like a star schema).
Query response time [sec] by number of tables joined:

# of tables joined |    2 |     3 |     4 |     5 |     6 |     7 |     8 |     9
PostgreSQL v9.6    | 8.48 | 13.23 | 18.28 | 23.42 | 28.88 | 34.50 | 40.77 | 47.16
PG-Strom 2.0devel  | 5.00 |  5.46 |  5.91 |  6.45 |  7.17 |  8.07 |  9.22 | 10.21

Environment: CPU: Xeon E5-2650v4, GPU: Tesla P40, RAM: 128GB, OS: CentOS 7.3, DB: PostgreSQL 9.6.2 + PG-Strom 2.0devel.
Problems PG-Strom is trying to solve
With small data in RAM covered, how do we move into the more I/O-intensive domain (data larger than RAM), in addition to the more computing-intensive one?
SSD-to-GPU Direct SQL Execution
Load PostgreSQL data blocks of large tables from NVMe-SSD directly to the GPU across the PCIe bus using peer-to-peer DMA (NVMe-Strom driver), then execute the SQL workloads (WHERE-clause, JOIN, GROUP BY) on the GPU to reduce the data size prior to CPU/RAM loading. This pre-processing of records according to the SQL reduces the amount of data to be loaded and processed by the CPU, compared with the existing data flow (large I/O size).
Element technology ① GPUDirect RDMA (1/2)
▌P2P data-transfer technology between the GPU and other PCIe devices, bypassing the CPU.
Originally designed for multi-node MPI over Infiniband.
Provides Linux kernel driver infrastructure for other PCIe devices, including NVMe-SSDs.
Copyright (c) NVIDIA Corporation, 2015
Element technology ① GPUDirect RDMA (2/2)
It enables GPU device memory to be mapped into the physical address space of the host system, via the PCIe BAR1 area, alongside RAM and other PCIe devices such as NVMe-SSDs and Infiniband HBAs.
Once GPU device memory has a physical address, it can be used as the source or destination address of DMA with PCIe devices.
(Diagram: GPU device memory mapped at 0xe0000000 in the PCIe BAR1 area; an NVMe DMA request of SRC: 1200th sector, LEN: 40 sectors, DST: 0xe0200000 lands directly in GPU memory.)
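On the kernel-driver side, the GPUDirect RDMA API that makes this possible is nvidia_p2p_get_pages(). A heavily abridged sketch of how a driver like NVMe-Strom might call it (an assumption for illustration, not its actual source):

#include <linux/kernel.h>
#include "nv-p2p.h"            /* ships with the NVIDIA kernel driver */

static struct nvidia_p2p_page_table *page_table;

static void gpu_mem_free_cb(void *data)   /* required; invoked if the GPU
                                           * memory is freed implicitly */
{
    page_table = NULL;
}

int map_gpu_buffer_for_dma(u64 gpu_vaddr, u64 length)
{
    /* pins the GPU memory and exposes its physical pages;
     * page_table->pages[i]->physical_address can then be set as the
     * DMA destination of an NVMe READ command, as in the figure above */
    return nvidia_p2p_get_pages(0, 0, gpu_vaddr, length,
                                &page_table, gpu_mem_free_cb, NULL);
}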
Benchmark configuration (2/2)– Key devices DB Tech Showcase 2017 - GPU/SSDがPostgreSQLを加速する-24 model HGST Ultrastar SN260 Interface PCIe 3.0 x8 (NVMe 1.2) Capacity 7.68TB Form Factor HHHL Sequential Read 6,170MB/s Sequential Write 2,200MB/s Random Read 1200K IOPS Random Write 75K IOPS model NVIDIA Tesla P40 GPU Architecture NVIDIA Pascal Interface PCIe 3.0 x16 Form Factor FHFL, Dual-Slot # of CUDA cores 3840 Performance (FP32) 12 TFLOPS GPU Memory 24GB (GDDR5) GPU Memory Band 346GB/s TDP 250W SPECIAL THANKS FOR: SPECIAL THANKS FOR:
Benchmark results (1/2)
We ran queries Q1–Q4 below against the lineorder table (351GB). Throughput is defined as (351 * 1024) / (query response time [sec]) in MB/s; e.g., a response time of about 60 sec corresponds to about 5,990MB/s.
Q1 ... SELECT count(*) FROM lineorder;
Q2 ... SELECT count(*), sum(lo_revenue), sum(lo_supplycost) FROM lineorder;
Q3 ... SELECT count(*) FROM lineorder GROUP BY lo_orderpriority;
Q4 ... SELECT count(*), sum(lo_revenue), sum(lo_supplycost) FROM lineorder GROUP BY lo_shipmode;
※ max_parallel_workers_per_gather is 24 for PostgreSQL v9.6, 4 for PG-Strom.
Query execution throughput [MB/s]:

                        |     Q1 |     Q2 |     Q3 |     Q4
PostgreSQL v9.6 + SN260 | 889.05 | 859.31 | 875.69 | 842.04
PG-Strom v2.0 + SN260   | 5988.0 | 5989.9 | 5988.8 | 5979.6
Benchmark results (2/2)
This is a full scan of a 351GB table on a system with 128GB of RAM, so the storage system obviously limits query execution performance. PG-Strom pulls the sequential-read performance of the SSD close to its catalog spec, while PostgreSQL shows non-negligible overhead around I/O.
(Chart: time-series storage read throughput [MB/s] over elapsed query execution time [sec] for Q4, measured with iostat; PG-Strom v2.0devel + SN260 vs PostgreSQL v9.6 + SN260.)

[kaigai@saba ~]$ iostat -cdm 1
  :
Device:     tps       MB_read/s  MB_wrtn/s  MB_read  MB_wrtn
sda         6.00      0.19       0.00       0        0
sdb         0.00      0.00       0.00       0        0
sdc         0.00      0.00       0.00       0        0
nvme0n1     24059.00  5928.41    0.00       5928     0
  :
Considerations on the hardware configuration
① SSD and GPU are connected via a PCIe switch (PLX): OK. HeteroDB has tested this configuration; up to 9.5GB/s was observed in raw I/O with a Xeon E5-2650v4. Pros: maximum performance. Cons: only limited hardware supports this configuration.
② A single CPU/IOH is in the SSD/GPU path: workable. Pros: supported hardware is easy to obtain. Cons: top performance is less than with a PLX switch.
③ The SSD/GPU path traverses a QPI/HT link: not supported. Very slow; only several dozen to a few hundred MB/s of throughput was observed in our tests. Pros: nothing. Cons: slow or not workable. It should not be used.
Considerations on the I/O performance of PostgreSQL
NVMe-SSDs have dramatically improved device latency, so the software overhead now stands out:
① Too-frequent system call invocations because of the small block size: PostgreSQL requests block reads via read(2) one 8kB block at a time.
② Multiple buffer copies across the block layer / filesystem: the DMA completes into the kernel's DMA buffer, is memcpy'ed to the page cache, then memcpy'ed again into PostgreSQL's shared buffer before summary processing.
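For scale, a sketch of that conventional path (illustrative, not PostgreSQL's storage-manager code): scanning a 10GB table at BLCKSZ = 8kB means roughly 1.3 million pread(2) calls, each followed by the two copies above.

#include <fcntl.h>
#include <unistd.h>

#define BLCKSZ 8192

void scan_heap_conventional(int fd, long nblocks)
{
    char buf[BLCKSZ];
    for (long i = 0; i < nblocks; i++)
    {
        /* one system call per 8kB block; the kernel DMAs into the page
         * cache and memcpy's to buf, and a real DBMS would copy once
         * more into its shared buffer before processing the block */
        pread(fd, buf, BLCKSZ, (off_t)i * BLCKSZ);
        /* ... per-block summary processing ... */
    }
}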
Consistency with the disk cache
Challenges:
① If source blocks are kept in the disk cache, they may be dirty, so the blocks on storage are not up-to-date.
② It is inefficient to issue RAM-to-GPU DMA with a small unit size (8kB).
Solution: write back the cached blocks (8kB unit size) to a userspace DMA buffer, then kick one bulk RAM-to-GPU DMA via the CUDA API (cuMemcpyHtoDAsync) at once; uncached blocks are transferred by SSD-to-GPU P2P DMA.
✓ The processing penalty is equivalent to read(2) + cuMemcpyHtoDAsync().
✓ Re-ordering of blocks in GPU memory does not affect the correctness of the operation.
(Diagram: of blocks BLK-100...BLK-109, the cached ones, BLK-101/104/105/108, are gathered into the userspace DMA buffer and copied in one transfer of BLCKSZ * NChunks, while the uncached ones go directly from SSD to GPU device memory.)
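A hedged sketch of that gather-then-bulk-copy step; the cache_ptr array and the P2P enqueue comment stand in for NVMe-Strom internals:

#include <cuda.h>
#include <string.h>

#define BLCKSZ 8192

void load_chunk(CUdeviceptr gpu_buf, char *pinned_buf,
                const char **cache_ptr,       /* page-cache copy, or NULL */
                int nblocks, CUstream stream)
{
    size_t cached_len = 0;

    for (int i = 0; i < nblocks; i++)
    {
        if (cache_ptr[i])   /* block is (possibly dirty) in the page cache */
        {
            memcpy(pinned_buf + cached_len, cache_ptr[i], BLCKSZ);
            cached_len += BLCKSZ;
        }
        /* else: enqueue SSD-to-GPU P2P DMA for this block (driver call) */
    }
    /* one bulk RAM-to-GPU DMA instead of many 8kB transfers; the blocks
     * may arrive re-ordered on the GPU, which the SQL kernels tolerate */
    cuMemcpyHtoDAsync(gpu_buf, pinned_buf, cached_len, stream);
}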
Why challenge 10GB/s with a single node?
NOT because it's there.
Why challenge 10GB/s with a single node?
A previous-generation DWH appliance: performance of 8.6GB/s at a cost of tens of millions of JPY; and when the system is replaced after 5 years, how much will the successor cost?
A small Hadoop cluster: administration of a multi-node system is usually troublesome.
What would you choose if equivalent performance were available with a single node and SQL?
Enhanced OLAP capability with GPU/SSD offers performance equivalent to the previous-generation DWH appliance at roughly 1/10 the system cost, while DBMS functionality remains identical to PostgreSQL, with its widespread and long-standing user base. HeteroServer 120GS (Mar-2018, planned).
(Ref) GTC2017 chose "SSD-to-GPU Direct SQL" as a top-5 finalist
Our R&D was chosen as a top-5 finalist among the 138 posters from all over the world at the GPU Technology Conference 2017 (8th–11th May, San Jose). (Sadly, it took second place in the voting...)
As of May-2017, the measured throughput was 4.5GB/s with a 3x SSD configuration, against a theoretical 6.6GB/s. Why?
Investigation using the visual profiler (1/2)
The time consumed by the sub-kernels launched via dynamic parallelism was too small (less than 10% of the total).
Investigation using the visual profiler (2/2)
Concurrency of the GPU kernels for SQL processing was low, because of the synchronization on the GPU sub-kernels.
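The problematic pattern, illustrated (a hedged sketch, not PG-Strom's kernels): a parent kernel that launches sub-kernels via dynamic parallelism and synchronizes after each one serializes the device, even when the sub-kernels themselves are cheap. Note that device-side cudaDeviceSynchronize() was valid in the CUDA toolkits of this era.

__global__ void sql_step(float *data, int n);   /* some SQL sub-task */

__global__ void parent_kernel(float *data, int n)
{
    sql_step<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();     // device-side sync: blocks the parent and
                                 // throttles concurrency across the GPU
    /* ...the next sub-kernel, another synchronization, and so on... */
}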
Improvement by reduction of in-GPU synchronization points
GPU resource usage came back to a reasonable level, and the storage limits the performance once again: the GPU is quite capable of processing 10GB of data per second.
Towards 10GB/s query processing throughput
▌Reduction of in-GPU kernel synchronization points: GpuScan and GpuPreAgg have already been reworked; work on GpuJoin is in progress.
▌Improvement of PG-Strom's GPU task scheduler: elimination of dynamic parallelism and utilization of unified memory allow the unnecessary context switches caused by the external GPU-server processes to be reduced.
▌GpuScan + GpuJoin + GpuPreAgg combined kernel: little CPU interaction on typical analytics / reporting workloads; execution of JOIN/GROUP BY at the wired speed of the table scan.
▌Optimization for md-raid0 (striping): to supply a 10GB/s data stream, driver performance must be improved on multi-NVMe-SSD configurations.
GpuScan + GpuJoin + GpuPreAgg Combined Kernel (1/5)
SELECT cat, count(*), avg(x) FROM t0 JOIN t1 ON t0.id = t1.id WHERE y like '%abc%' GROUP BY cat;
SCAN ... WHERE y like '%abc%' → GpuScan
JOIN ... t0 JOIN t1 ON t0.id = t1.id → GpuJoin
GROUP BY ... GROUP BY cat → GpuPreAgg
Aggregation ... count(*), avg(x) → Agg + GpuPreAgg → Results
GpuScan + GpuJoin + GpuPreAgg Combined Kernel (2/5)
A simple replacement of the existing logic leads to a ping-pong of data transfers between CPU and GPU: each of the GpuScan, GpuJoin, and GpuPreAgg kernels reads from and writes back to its own DMA buffer in host RAM before Agg (PostgreSQL) produces the results.
GpuScan + GpuJoin + GpuPreAgg Combined Kernel (4/5)
Extreme data reduction on the GPU side by GROUP BY/Aggregation, without any interaction with the CPU: the combined GpuScan + GpuJoin + GpuPreAgg kernel keeps the large data on the GPU, and only the small aggregated result is returned through the DMA buffer to Agg (PostgreSQL) for the final results.
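What such a fused kernel can look like, as a hedged sketch for the query above (hash_probe(), like_abc(), and the flat column/group layout are illustrative, not PG-Strom's real data structures):

struct agg_slot {
    unsigned long long count;
    double             sum_x;    /* avg(x) = sum_x / count on the CPU side */
};

__device__ int  hash_probe(int id);        /* probe of the inner hash table */
__device__ bool like_abc(const char *y);   /* y like '%abc%' */

__global__ void scan_join_preagg(const int *t0_id, const float *t0_x,
                                 const char **t0_y, const int *t0_cat,
                                 int nrows, struct agg_slot *groups)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= nrows)
        return;
    if (!like_abc(t0_y[row]))            /* SCAN: WHERE-clause evaluation */
        return;
    if (hash_probe(t0_id[row]) < 0)      /* JOIN: probe t1 on GPU memory */
        return;
    struct agg_slot *g = &groups[t0_cat[row]];   /* GROUP BY cat */
    atomicAdd(&g->count, 1ULL);                  /* partial aggregation */
    atomicAdd(&g->sum_x, (double)t0_x[row]);     /* double atomicAdd needs
                                                  * compute 6.0+ (Pascal) */
}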
GpuScan + GpuJoin + GpuPreAgg Combined Kernel (5/5)
Only 1.2kB was written back to the CPU side after the 10GB table scan.
Towards PG-Strom v2.0
Dec-2017: PG-Strom v2.0 beta. Mar-2018: PG-Strom v2.0 GA and HeteroServer 120GS.
Open source activity (new feature development): SSD-to-GPU Direct SQL Execution and related work; rework of the GPU task scheduler; on-GPU data store (as a foreign table); columnar cache mechanism; in-database analytics and machine-learning library. QA & release: testing / stabilization, packaging, documentation.
Business activity: early adopter development.
Idea of the intelligent storage (1/2)
Row-to-column translation on the SSD enables "write by row, read as column", bridging the OLTP and OLAP worlds: transactional data is written in row format, the data format is transformed on the SSD (row/column translation logic), and analytic data is read in column format. SQL execution on top of the column data pulls out the maximum performance of the GPU: it picks up only the referenced data and returns pre-processed records (much smaller).
Why the GPU prefers a columnar storage format
▌Row format (transactional data): random memory access. More memory transactions, lower usage ratio of the memory bus; with a 256-bit memory transaction width, only 32bit of each 256-bit transaction may be valid (12.5% usage ratio).
▌Column format (analytic data): coalesced memory access. The fewest memory transactions, maximum usage ratio of the memory bus; 32bit x 8 = 256bit of each 256-bit transaction is valid (100% usage ratio).
Transactional data leads to inefficient memory-bus usage by the GPU cores.
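The difference in a nutshell (an illustrative CUDA sketch): in the row layout, adjacent threads stride over whole records, so most of each 256-bit transaction is wasted; in the column layout, adjacent threads read adjacent addresses and the loads coalesce.

struct row_t { int id; float x; float y; float z; };   /* row format */

__global__ void sum_x_row(const struct row_t *rows, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(out, rows[i].x);   /* strided access: low bus utilization */
}

__global__ void sum_x_col(const float *x_col, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(out, x_col[i]);    /* coalesced access: full-width loads */
}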
Idea of the intelligent storage (2/2)
An "SQL-aware" NVMe-SSD holding large PostgreSQL tables feeds the GPU over the PCIe bus via SSD-to-GPU P2P DMA, with the GPU executing the WHERE-clause, JOIN, and GROUP BY:
Row-to-column transformation: extracting only the referenced attributes pulls out an "effective bandwidth" beyond the PCIe limitation.
"Ultra"-parallel SQL execution by GPU: the columnar data format, the optimal one for the GPU, pulls out its maximum computing capability for SQL processing.
Write as row data: keeps the performance of usual transactional workloads, and simultaneously allows real-time analytics / reporting.