Demystifying the Distributed Database Landscape (DevOps) (1).pdf

Demystifying the Distributed Database Landscape: A Survey of Technologies

Getting Started with ScyllaDB: Join our next ScyllaDB Virtual Workshop! scylladb.com/webinars
April 28, 2022 | 10AM PT | 1PM ET | 6PM GMT

Poll: Where are you in your NoSQL adoption?

About ScyllaDB
+ For distributed, data-intensive apps that require high performance and low latency
+ 400+ users worldwide
+ Results
  + Comcast: Reduced P99 latencies by 95%
  + FireEye: 1500% improvement in throughput
  + Discord: Reduced C* nodes from ~140 to 6
  + iFood: 9X cost reduction vs. DynamoDB
+ Open Source, Enterprise and Cloud options
+ Fully compatible with Apache Cassandra and Amazon DynamoDB
[Figure: ScyllaDB Universe of 400+ Users]

400+ Companies Use ScyllaDB
+ Seamless experiences across content + devices
+ Fast computation of flight pricing
+ Corporate fleet management
+ Real-time analytics
+ 2,000,000 SKU e-commerce management
+ Real-time location tracking for friends/family
+ Video recommendation management
+ IoT for industrial machines
+ Synchronize browser properties for millions
+ Threat intelligence service using JanusGraph
+ Real-time fraud detection across 6M transactions/day
+ Uber scale, mission critical chat & messaging app
+ Network security threat detection
+ Power ~50M X1 DVRs with billions of reqs/day
+ Precision healthcare via Edison AI
+ Inventory hub for retail operations
+ Property listings and updates
+ Unified ML feature store across the business
+ Cryptocurrency exchange app
+ Geography-based recommendations
+ Distributed storage for distributed ledger tech
+ Global operations: Avon, Body Shop + more
+ Predictable performance for on sale surges
+ GPS-based exercise tracking

Peter Corless, Director of Technical Advocacy @ ScyllaDB
+ Listen to & share user stories
+ Write blogs & case studies
+ Play (and design) strategy & roleplaying games
+ @PeterCorless on Twitter

This Next Tech Cycle: The wave of innovation we’re currently riding.

Hardware, software, and methodologies are all co-evolving to create this next tech cycle.

This Next Tech Cycle (timeline: 2000, 2010, 2020, 2025+)
+ Transistor Count and Core Count
  + Pentium 4 (2000): 42M transistors, 1 core
  + Pentium D (2005): 228M transistors, 2 cores
  + Xeon Nehalem-EX (2010): 2.3B transistors, 8 cores
  + SPARC M7 (2015): 10B transistors, 32 cores
  + Epyc Rome (2019): 39B transistors, 64 cores
  + Epyc Genoa (2022): ~60B? transistors, 96 cores
  + Epyc Bergamo (2023): ~80B? transistors, 128 cores
+ Zettabyte Era (1 ZB = 10^21 bytes)
  + 2 ZB data stored (2010)
  + 1.2 ZB IP traffic (2016)
  + 64 ZB data stored (2020)
  + ~180 ZB data stored (2025)
+ Broadband Speeds
  + 1.5 Mbps (2002), 16 Mbps (2008), 105 Mbps (2014), 1 Gbps (2018), 3 Gbps (2021)
+ Wireless Services
  + 3G (2002), 4G (2014), 5G (2018)
+ Public Cloud to Multicloud
  + AWS (2006), GCP (2008), Azure (2010), Azure Arc
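To put the data-stored figures on this timeline in perspective, the growth they imply can be worked out as a compound annual growth rate. The snippet below is only a back-of-the-envelope sketch: the `cagr` helper is a hypothetical name introduced here, and the ~180 ZB value for 2025 is the slide's projection, not a measurement.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate between two data points."""
    return (end_value / start_value) ** (1 / years) - 1


# Zettabytes of data stored, as quoted on the slide.
growth_2010_2020 = cagr(2, 64, 10)    # 2 ZB (2010) to 64 ZB (2020)
growth_2020_2025 = cagr(64, 180, 5)   # 64 ZB (2020) to ~180 ZB (2025, projected)

print(f"2010-2020: {growth_2010_2020:.1%} per year")
print(f"2020-2025: {growth_2020_2025:.1%} per year")
```

On these numbers, stored data grew by roughly 41% per year over 2010-2020, and the 2025 projection implies growth slowing to roughly 23% per year.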

Hardware Still Vertically Scaling
+ Compute
  + From 100+ cores → 1,000+ cores per server
  + From multicore CPUs → full System on a Chip (SoC) designs (CPU, GPU, Cache, Memory)
+ Memory
  + Terabyte-scale RAM per server
  + DDR5: 4600 MHz in 2020, 8000 MHz by 2024
  + DDR6: 9600 MHz by 2025
  + Persistent memory: memory mode
+ Storage
  + Petabyte-scale storage per server
  + NVMe 2.0 [2021]: separation of base and transport
  + Persistent memory: app direct (storage) mode

Hybrid & Multi-cloud is Now-ish
+ Azure Arc

Methodologies Still Evolving
+ Agile [c. 2000]
+ Microservices Architecture [2005]
+ CI/CD = CI [1991] + CD [2009]
+ DevOps [2009]
+ Chaos Monkey [2011]
+ Kubernetes [2014]
+ GitOps [2017]
+ DevSecOps [2018]
[Figure panels: How It Started, How It Evolved, How It’s Going]

Poll Question: How much data do you have under management in your own transactional database systems?
+ <1 terabyte
+ 1 to 50 terabytes
+ 50-100 terabytes
+ >100 terabytes

The Distributed Database Landscape: Here there be monstrous databases!

Distributed Database Landscape 2021
+ SQL
  + Distributed SQL
  + NewSQL
+ NoSQL
  + Key-value
  + Document
  + Wide-column
  + Graph
+ Multi-model
  + SQL + NoSQL
  + Multiple NoSQL
+ Production Environments
  + On-premises
  + Co-location
  + Public cloud
  + Private cloud
  + Hybrid cloud
  + Multicloud
  + Edge
  + IoT / Embedded
+ Business / Use Models
  + Open Source License
  + Enterprise License
  + OEM License
  + Service Agreements
+ Use Cases
  + OLTP
  + OLAP
  + HTAP
  + Time Series

Top 100 Databases (and Database-like systems) on DB-Engines.com ranking [as of April 2022]
+ 47 SQL
+ 25 NoSQL
+ 11 Multimodel (multiple NoSQL models and/or SQL + NoSQL)
+ 7 Search Engines
+ 5 Time Series
+ 5 Other

Are all of the Top 100 “Distributed Databases?” “Well…”

What do you mean by a “Distributed Database?”
+ Clustering & Distribution Strategies
  + Local clustering: multiple nodes in the same datacenter share updates
  + Cross-cluster updates: multiple clusters can share data between them
  + Multi-datacenter clustering: geographically, even globally dispersed, but same logical cluster
+ Node Roles, High Availability & Failover Strategies
  + Primary-replica (Active-passive; writes to primary only; read-only replicas; “hot standby” modes)
  + Peer-to-peer, leaderless (Active-Active, multi primaries; can write to any replica; no SPOF)
  + Load balancing (client side or service in front of database)
+ Data Replication & Sharding Strategies
  + Replication Factors & Consistency Levels
  + Horizontal Scalability: Manual vs. Auto-sharding
  + Topology Awareness: Rack-awareness, Datacenter-awareness
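How replication factors and consistency levels interact can be illustrated with the usual overlap rule for replicated systems: if the number of replicas that acknowledge a write (W) plus the number consulted on a read (R) exceeds the replication factor (RF), every read set shares at least one replica with the latest write set. The sketch below is generic, not any one database's implementation; the function names are hypothetical.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def reads_overlap_writes(rf: int, write_acks: int, read_acks: int) -> bool:
    """True when any read set must share a replica with any write set (R + W > RF)."""
    return read_acks + write_acks > rf


rf = 3
q = quorum(rf)

# Quorum writes plus quorum reads always overlap on at least one replica.
print(reads_overlap_writes(rf, write_acks=q, read_acks=q))

# Writing to one replica and reading from one replica can miss the latest value.
print(reads_overlap_writes(rf, write_acks=1, read_acks=1))
```

The trade-off is latency and availability: higher R and W mean more replicas must respond before an operation completes.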

The Short List: Systems of Interest
+ SQL + NewSQL: PostgreSQL, CockroachDB
+ NoSQL: MongoDB, Redis, ScyllaDB (Cassandra)
Just in case: SQL vs. NoSQL

Clustering & Locality: Topology Awareness
+ Local Clustering
  + Non-Rack Aware: can have multiple nodes in same rack
  + Rack Aware: distribute nodes evenly across all available racks in a datacenter
+ Cross-Cluster Updates or Multi-Datacenter Clustering
  + Zone and Datacenter Aware
    + Provides survivability across geography
    + Reduces local latencies
    + Considerations for data localization
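The rack-aware idea, spreading replicas evenly across racks so that losing one rack does not take out every copy, can be sketched as a simple round-robin over racks. This is an illustrative assumption, not how any particular database places replicas; `place_replicas` and the rack and node names are hypothetical.

```python
from itertools import cycle


def place_replicas(nodes_by_rack: dict[str, list[str]], rf: int) -> list[str]:
    """Pick rf nodes, taking one from each rack in turn before reusing a rack."""
    pools = {rack: list(nodes) for rack, nodes in nodes_by_rack.items() if nodes}
    if rf > sum(len(nodes) for nodes in pools.values()):
        raise ValueError("replication factor exceeds number of nodes")
    chosen: list[str] = []
    for rack in cycle(sorted(pools)):
        if len(chosen) == rf:
            break
        if pools[rack]:
            chosen.append(pools[rack].pop(0))
    return chosen


racks = {"rack1": ["n1", "n2"], "rack2": ["n3"], "rack3": ["n4", "n5"]}
print(place_replicas(racks, rf=3))  # one replica per rack
```

A non-rack-aware placement could put all three replicas on `n1`, `n2` and another node without regard to racks, leaving two of three copies exposed to a single rack failure.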

Support Multi-Datacenter Clustering?
+ SQL + NewSQL: PostgreSQL, CockroachDB
+ NoSQL: MongoDB, Redis, ScyllaDB (Cassandra)
+ Legend: Designed for multi-datacenter | Designed for single-datacenter; capable of multi-datacenter

Clustering & Replication: Primary-Replica Set vs. Peer-to-Peer (Leaderless)
+ Primary-Replica (Multiple Replicas), e.g. MongoDB
  + Only primary accepts writes; secondaries are read-only
  + Replication is “fan out” from primary to replicas
  + Write-heavy workloads can tax the primary
+ Peer-to-Peer Active-Active (Multi-Datacenter), e.g. ScyllaDB
  + Each node accepts reads+writes
  + Inherently better load balancing
  + Deals better w/ write-heavy or mixed read-write workloads
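The load-balancing difference between the two models can be made concrete with a toy count of which node accepts each client write. This is a deliberately simplified sketch: round-robin routing for the leaderless case is an assumption (real drivers use smarter policies), the node names are hypothetical, and replication still applies every write to every replica in both models.

```python
from itertools import cycle


def accepted_writes(nodes: list[str], writes: int, leaderless: bool) -> dict[str, int]:
    """Count how many client writes each node accepts."""
    counts = {node: 0 for node in nodes}
    # Leaderless: any node may accept a write (round-robin is an assumption here).
    # Primary-replica: only the first node, acting as primary, accepts writes.
    targets = cycle(nodes) if leaderless else cycle(nodes[:1])
    for _ in range(writes):
        counts[next(targets)] += 1
    return counts


nodes = ["node1", "node2", "node3"]
print(accepted_writes(nodes, 9, leaderless=False))
print(accepted_writes(nodes, 9, leaderless=True))
```

In the primary-replica case all nine writes land on one node; in the leaderless case they spread evenly, which is why write-heavy workloads tend to tax a single primary.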

Support Active-Active vs. Primary-Replica
+ SQL + NewSQL: PostgreSQL, CockroachDB
+ NoSQL: MongoDB, Redis, ScyllaDB (Cassandra)
+ Legend: Active-Active | Primary-Replica; active-active only through optional solutions

Clustering: Cross-Cluster Updates vs. Multi-Datacenter Replication
+ Primary-Replica (Multiple Replicas) [Source: MongoDB]
  + Only primary accepts writes; secondaries are read-only
+ Peer-to-Peer Active-Active (Multi-Datacenter) [Source: ScyllaDB]
  + Each node can accept reads and writes; leaderless
  + RF=3, RF=2
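Assuming the RF=3 and RF=2 labels refer to two datacenters of one multi-datacenter cluster (the diagram itself is not recoverable from the text), the per-datacenter arithmetic looks like the sketch below. The datacenter names are hypothetical; the local quorum formula, floor(RF / 2) + 1 per datacenter, is the standard majority rule used by Cassandra-style `LOCAL_QUORUM`.

```python
def local_quorum(rf: int) -> int:
    """Majority of the replicas within a single datacenter."""
    return rf // 2 + 1


# Hypothetical datacenter names with the replication factors from the slide.
replication = {"dc_a": 3, "dc_b": 2}

summary = {
    dc: {
        "replicas": rf,
        "local_quorum": local_quorum(rf),
        # Replicas in this datacenter that can be down while local-quorum
        # operations still succeed.
        "tolerated_down": rf - local_quorum(rf),
    }
    for dc, rf in replication.items()
}
total_copies = sum(replication.values())

for dc, info in summary.items():
    print(dc, info)
print("total copies of each row:", total_copies)
```

Note the asymmetry: with RF=3 a datacenter can lose one replica and still serve local-quorum requests, while with RF=2 a local quorum needs both replicas.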

Topology Awareness
+ SQL + NewSQL: PostgreSQL, CockroachDB
+ NoSQL: MongoDB, Redis, ScyllaDB (Cassandra)
+ Legend: Topology Aware | Not built-in, but can be added

PostgreSQL: distributed SQL
+ Clustering & Distribution Strategies
  + Local clustering: multiple nodes in the same datacenter share updates
  + Cross-cluster updates: multiple clusters can share data between them
  + Multi-datacenter clustering: geographically, even globally dispersed, but same logical cluster
+ Node Roles, High Availability & Failover Strategies
  + Primary-replica (Active-passive; writes to primary only; read-only replicas; “hot standby” modes)
  + Peer-to-peer, leaderless (Active-Active, multi primaries; can write to any replica; no SPOF)
  + Load balancing (client side or service in front of database)
+ Data Replication & Sharding Strategies
  + Replication Factors & Consistency Levels
  + Horizontal Scalability: Manual Sharding vs. Auto-sharding
  + Topology Awareness: Rack-awareness, Datacenter-awareness
+ Legend: Part of base offering | Can be added, but not part of base

CockroachDB: NewSQL
+ Clustering & Distribution Strategies
  + Local clustering: multiple nodes in the same datacenter share updates
  + Cross-cluster updates: multiple clusters can share data between them
  + Multi-datacenter clustering: geographically, even globally dispersed, but same logical cluster
+ Node Roles, High Availability & Failover Strategies
  + Primary-replica (Active-passive; writes to primary only; read-only replicas; “hot standby” modes)
  + Peer-to-peer, leaderless (Active-Active, multi primaries; can write to any replica; no SPOF)
  + Load balancing (client side or service in front of database)
+ Data Replication & Sharding Strategies
  + Replication Factors & Consistency Levels
  + Horizontal Scalability: Manual vs. Auto-sharding
  + Topology Awareness: Rack-awareness*, Datacenter-awareness
* Can be manually configured using localities
+ Legend: Part of base offering | Can be added, but not part of base

MongoDB: the leading document store
+ Clustering & Distribution Strategies
  + Local clustering: multiple nodes in the same datacenter share updates
  + Cross-cluster updates: multiple clusters can share data between them
  + Multi-datacenter clustering: geographically, even globally dispersed, but same logical cluster
+ Node Roles, High Availability & Failover Strategies
  + Primary-replica (Active-passive; writes to primary only; read-only replicas; “hot standby” modes)
  + Peer-to-peer, leaderless (Active-Active, multi primaries; can write to any replica; no SPOF)
  + Load balancing (client side or service in front of database)
+ Data Replication & Sharding Strategies
  + Replication Factors & Consistency Levels
  + Horizontal Scalability: Manual vs. Auto-sharding
  + Topology Awareness: Rack-awareness, Datacenter-awareness
+ Legend: Part of base offering | Can be added, but not part of base

Redis: key-value in-memory DB/cache
+ Clustering & Distribution Strategies
  + Local clustering: multiple nodes in the same datacenter share updates
  + Cross-cluster updates: multiple clusters can share data between them
  + Multi-datacenter clustering: geographically, even globally dispersed, but same logical cluster*
+ Node Roles, High Availability & Failover Strategies
  + Primary-replica (Active-passive; writes to primary only; read-only replicas; “hot standby” modes)
  + Peer-to-peer, leaderless (Active-Active, multi primaries; can write to any replica; no SPOF)*
  + Load balancing (client side or service in front of database)
+ Data Replication & Sharding Strategies
  + Replication Factors & Consistency Levels (e.g., strong locally; causal consistency in active-active*)
  + Horizontal Scalability: Manual vs. Auto-sharding
  + Topology Awareness: Rack-awareness, Datacenter-awareness
* Redis Enterprise feature
+ Legend: Part of base offering | Can be added, but not part of base

ScyllaDB
+ Clustering & Distribution Strategies
  + Local clustering: multiple nodes in the same datacenter share updates
  + Cross-cluster updates: multiple clusters can share data between them
  + Multi-datacenter clustering: geographically, even globally dispersed, but same logical cluster
+ Node Roles, High Availability & Failover Strategies
  + Primary-replica (Active-passive; writes to primary only; read-only replicas; “hot standby” modes)
  + Peer-to-peer, leaderless (Active-Active, multi primaries; can write to any replica; no SPOF)
  + Load balancing (client side or service in front of database*)
+ Data Replication & Sharding Strategies
  + Replication Factors & Consistency Levels
  + Horizontal Scalability: Manual vs. Auto-sharding
  + Topology Awareness: Rack-awareness, Datacenter-awareness
* For DynamoDB-compatible API
+ Legend: Part of base offering
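Auto-sharding in Cassandra-style systems is commonly explained with a token ring: each partition key is hashed to a token, and the node owning the next token on the ring stores it, so adding a node moves only a fraction of the keys. The sketch below illustrates that idea under stated assumptions; the `TokenRing` class, the MD5 hash and the eight virtual nodes per server are illustrative choices, not ScyllaDB's actual partitioner or token allocation.

```python
import bisect
import hashlib


class TokenRing:
    """Toy consistent-hashing ring with virtual nodes."""

    def __init__(self, nodes: list[str], vnodes: int = 8):
        # Each node owns several tokens so data spreads more evenly.
        self._ring = sorted(
            (self._token(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._tokens = [token for token, _ in self._ring]

    @staticmethod
    def _token(key: str) -> int:
        # MD5 is a stand-in hash for illustration only.
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def owner(self, partition_key: str) -> str:
        """Node owning the first token after the key's token, wrapping around."""
        idx = bisect.bisect_right(self._tokens, self._token(partition_key))
        return self._ring[idx % len(self._ring)][1]


keys = [f"user-{i}" for i in range(1000)]
before = TokenRing(["a", "b", "c"])
after = TokenRing(["a", "b", "c", "d"])
moved = [k for k in keys if before.owner(k) != after.owner(k)]
print(f"{len(moved)} of {len(keys)} keys moved after adding node d")
```

Every key that moves goes to the new node `d`; keys never shuffle between the existing nodes, which is what makes this kind of scale-out manageable without manual resharding.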

But for now, let’s move on...

Where are Distributed Databases Headed Next? Time to read the tea leaves

The Trend for SQL
+ Google Trends interest in “SQL” is at 22% of its 2004 level
+ Book citations for “SQL” peaked in 2008 and were down to 28% of that rate by 2019
  + Back to 1995 levels of interest, basically
+ Still dwarfs other database terms like “NoSQL” or “NewSQL” or “RDBMS”
+ No single term or technology sums up the distributed database market anymore

Where are Distributed Databases Going?
+ Cambrian Explosion will Continue: “What is a database anyway?”
  + Distributed Databases of all kinds
  + Distributed Streaming: “Kafka as a database?” (kSQL says “Yes!”)
  + Distributed Ledgers: “Blockchains/DAGs as a database?”
  + Further fragmentation of the market
+ NoSQL + SQL blending increasingly
  + Evolution of NoSQL back to SQL assumptions
  + Adding back Strong Consistency, Schema Constraints, Strict Typing

Further Trends in Distributed Databases
+ “Cloud Native”: What does that mean to you?
  + Elasticity: Faster provisioning/decommissioning, autoscaling
  + Serverless: “I don’t want to manage hardware; just give me an API.”
  + Uncoupling Compute from Storage: Tiered Storage, Plug-in Storage
+ Data over Time
  + Built for Event Streaming, Time Series
+ Data over Space
  + Geospatial queries, Geoindexing
  + Geographic / political boundaries: GDPR, data localization regulatory compliance

Database as a Development Platform
+ Increasing Focus on Developer Enablement and Developer Experience (DX)
  + APIs for extensibility: extensions, plugins, modules, add-ons, integration layers
    + Database Specific: PostgreSQL extensions, Redis modules
    + Cross-industry: GraphQL, OpenAPI (Swagger), etc.
+ AI/ML integration and incorporation into databases
  + “Building models where your data resides” (Martin Heller, Apr 2021)
  + Amazon Redshift ML
  + BigQuery ML
  + Oracle, Db2, Microsoft SQL Server

Data Teaming
+ Tighter Coupling of Data Engineering + Data Sciences + Operations
  + Repairing rifts of the past decade
  + Bridging huge divides between people and systems
+ From “Data Pipelining” (production-oriented) to...
  + “Data Supply Chains” (consumption-oriented)
  + Like “Software Supply Chain,” but for data and data products.

We Need Different Kinds of “Easy”
+ Specializing databases to run in the cloud (and cloud-only)
  + Providing “concierge” services
  + Ecosystem: can integrate into cloud vendor’s (or partners’) offerings
  + Managed for you, at a price
+ Making Open Source databases easier to run on infrastructural level
  + Making self-managed operations simpler
  + Flexibility: can run on premises or in the cloud
  + Self-service model, so long as you have the skillz

Hope You Enjoyed Your Trip! http://slack.scylladb.com/

Thanks: People who helped educate me
+ Kostja Osipov
+ Serge Leontiev
Disclaimer: Any errors, omissions, misinterpretations, misrepresentations or misunderstandings are purely my own. Please send suggestions and corrections to peter@scylladb.com

Poll: How much data do you have under management in your own transactional database?

Thank You! Stay in touch!
Q&A
Join our Next Virtual Workshop! scylladb.com/webinars
www.scylladb.com | @scylladb | @PeterCorless
United States: 2445 Faber St, Suite #200, Palo Alto, CA USA 94303
Israel: Maskit 4, Herzliya, Israel 4673304
