Introduction to Apache Kafka and Real-Time ETL, for DBAs and others who are interested in new ways of working with relational databases
About Myself • Gwen Shapira – System Architect @ Confluent • Committer @ Apache Kafka, Apache Sqoop • Author of “Hadoop Application Architectures”, “Kafka – The Definitive Guide” • Previously: • Software Engineer @ Cloudera • Oracle ACE Director • Senior Consultant @ Pythian • DBA @ Mercury Interactive • Find me: • gwen@confluent.io • @gwenshap
There’s a Book on That!
Apache Kafka is publish-subscribe messaging rethought as a distributed commit log… turned into a stream data platform. An Optical Illusion
We’ll talk about: • Write-ahead logs • So what is Kafka? • Awesome use cases for Kafka • Data streams and real-time ETL • Where you can learn more
Write-Ahead Logging (WAL): “a standard method for ensuring data integrity… changes to data files… must be written only after those changes have been logged… in the event of a crash we will be able to recover the database using the log.”
Important Point: The write-ahead log is the only reliable source of information about the current state of the database.
WAL is used to: • Recover a consistent state of the database • Replicate the database (Streaming Replication, Hot Standby) If you look far enough into the archived logs, you can reconstruct the entire database.
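For the Postgres-minded, the logged changes can even be read back programmatically. A minimal sketch, assuming PostgreSQL with wal_level=logical and the psycopg2 driver; the slot name "wal_demo" and connection string are made up for illustration:

```python
# Sketch: peek at row changes recorded in the WAL via logical decoding.
# Assumes PostgreSQL with wal_level=logical; slot name and connection
# string are illustrative.
import psycopg2

conn = psycopg2.connect("dbname=shop user=postgres")
conn.autocommit = True
cur = conn.cursor()

# Create a logical replication slot using the built-in test_decoding plugin
cur.execute(
    "SELECT * FROM pg_create_logical_replication_slot('wal_demo', 'test_decoding')"
)

# ... after some INSERT/UPDATE/DELETE activity, read the logged changes
cur.execute(
    "SELECT lsn, xid, data FROM pg_logical_slot_peek_changes('wal_demo', NULL, NULL)"
)
for lsn, xid, data in cur.fetchall():
    print(lsn, xid, data)  # e.g. table public.orders: INSERT: id[integer]:42 ...
```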
That’s nice, but what is Kafka?
Kafka provides a fast, distributed, highly scalable, highly available publish-subscribe messaging system, based on the tried-and-true log structure. In turn, this solves part of a much harder problem: communication and integration between components of large software systems.
The Basics • Messages are organized into topics • Producers push messages • Consumers pull messages • Kafka runs as a cluster; nodes are called brokers
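A minimal sketch of a producer pushing a message, assuming the confluent-kafka Python client and a local broker; the topic name and payload are illustrative:

```python
# A minimal producer sketch (confluent-kafka Python client).
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

# Push a message to a topic; the key influences which partition it lands on
producer.produce("page-views", key="user-42", value='{"page": "/pricing"}')
producer.flush()  # block until the brokers confirm delivery
```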
Topics, Partitions and Logs
Each partition is a log
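To make “each partition is a log” concrete, here is a toy Python model (not Kafka code): records get monotonically increasing offsets, nothing is updated in place, and readers track their own position:

```python
# Toy model of a partition: an append-only list where each record's
# index is its offset. Readers keep their own position; the log itself
# never tracks who has read what.
class PartitionLog:
    def __init__(self):
        self._records = []

    def append(self, record) -> int:
        """Append a record and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset: int):
        """Yield (offset, record) pairs starting at a reader-supplied offset."""
        for i in range(offset, len(self._records)):
            yield i, self._records[i]

log = PartitionLog()
log.append("order created")
log.append("order shipped")
for offset, record in log.read_from(0):
    print(offset, record)
```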
Each broker has many partitions [Diagram: three brokers, each holding a mix of partitions 0, 1 and 2, replicated across the cluster]
Producers load balance between partitions [Diagram: a client producing across partitions 0, 1 and 2 on three brokers]
Consumers [Diagram: a Kafka cluster with topic partitions A, B and C as files, read by consumer groups X and Y] Order is retained within a partition, but not across partitions. Offsets are kept per consumer group.
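A minimal consumer sketch to match, again assuming the confluent-kafka client; the group.id decides which set of committed offsets is used, so each group keeps its own position:

```python
# Minimal consumer sketch (confluent-kafka); offsets are tracked per group.
from confluent_kafka import Consumer, KafkaException

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "reporting",          # offsets are stored per consumer group
    "auto.offset.reset": "earliest",  # where to start with no committed offset
})
consumer.subscribe(["page-views"])

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None:
        continue
    if msg.error():
        raise KafkaException(msg.error())
    print(msg.partition(), msg.offset(), msg.value())
```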
Kafka “Magic” – why is it so fast? • 250M events per second on one node at 3 ms latency • Scales to any number of consumers • Stores data for a set amount of time, without tracking who read what data • Replicates, but with no need to sync to disk • Zero-copy writes from memory/disk to network
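Retention is a per-topic setting rather than per-consumer bookkeeping. A hedged sketch using the confluent-kafka AdminClient; the topic name, partition and replica counts, and the 7-day retention are illustrative:

```python
# Sketch: create a topic that keeps data for 7 days regardless of who
# has read it. Names and sizes are illustrative.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})
topic = NewTopic(
    "page-views",
    num_partitions=3,
    replication_factor=3,
    config={"retention.ms": str(7 * 24 * 60 * 60 * 1000)},  # 7 days
)
futures = admin.create_topics([topic])
futures["page-views"].result()  # raises on failure
```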
How do people use Kafka? • As a message bus • As a buffer for replication systems • As a reliable feed for event processing • As a buffer for event processing • To decouple apps from databases
But really, how do they use Kafka?
Raise your hand if this sounds familiar: “My next project was to get a working Hadoop setup… Having little experience in this area, we naturally budgeted a few weeks for getting data in and out, and the rest of our time for implementing fancy algorithms.” -- Jay Kreps, Kafka PMC
Data pipelines start like this. [Diagram: one source feeding one client]
Then we reuse them. [Diagram: one source feeding several clients]
Then we add consumers to the existing sources. [Diagram: clients pulling from multiple backends]
Then it starts to look like this. [Diagram: many clients and many backends, cross-connected]
With maybe some of this. [Diagram: the same tangle, with even more cross-connections]
Queues decouple systems: adding new systems doesn’t require changing existing systems.
This is where we are trying to get: Kafka decouples data pipelines. [Diagram: source systems → producers → Kafka brokers → consumers → Hadoop, security systems, real-time monitoring, data warehouse]
Important notes: • Producers and consumers don’t need to know about each other • Performance issues on consumers don’t impact producers • Consumers are protected from herds of producers • Lots of flexibility in handling load • Messages are available to anyone, which enables lots of new use cases: monitoring, audit, troubleshooting http://www.slideshare.net/gwenshap/queues-pools-caches
My Favorite Use Cases • Shops consume inventory updates • Clicking around an online shop? Your clicks go to Kafka and recommendations come back • Flagging credit card transactions as fraudulent • Flagging game interactions as abuse • Least favorite: surge pricing in Uber • Huge list of users at kafka.apache.org
Got it! But what about real-time ETL?
Remember this? Kafka is smack in the middle of all data pipelines. [Diagram: source systems → producers → Kafka brokers → consumers → Hadoop, security systems, real-time monitoring, data warehouse]
If data flies into Kafka in real time, why wait 24 hours before pulling it into a DWH?
Why does Kafka make real-time ETL better? • It can integrate with any data source: RDBMS, NoSQL, applications, web applications, logs • Consumers can be real-time, but they don’t have to be • Reading and writing to/from Kafka is cheap, so it is a great place to store intermediate state • You can fix mistakes by rereading some of the data: same data, in the same order (see the sketch below) • Adding more pipelines/aggregations has no impact on source systems = low risk
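One hedged sketch of “fix mistakes by rereading”: rewind a consumer to the offsets that correspond to a wall-clock time and reprocess the same data in the same order. Assumes the confluent-kafka client (which exposes offsets_for_times) and a 3-partition topic; the names are illustrative:

```python
# Sketch: reprocessing by rewinding to the offsets as of "one hour ago".
import time
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "etl-backfill",
    "enable.auto.commit": False,   # we manage our position explicitly
})

# Ask the brokers which offset each partition had at that timestamp
one_hour_ago_ms = int(time.time() * 1000) - 3_600_000
wanted = [TopicPartition("page-views", p, one_hour_ago_ms) for p in range(3)]
start = consumer.offsets_for_times(wanted)

consumer.assign(start)  # rewind: re-read the same data in the same order
```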
It is all valuable data. [Diagram: raw data flows through clean, enriched, filtered, and aggregated stages, feeding dashboards, reports, data scientists, and alerts]
OK, but how does my data get into Kafka? • Producers • Log4J • REST Proxy • Bottled Water • Kafka Connect and its connector ecosystem (see the sketch below) • Other ecosystem tools
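For the Kafka Connect route, a connector is registered by POSTing JSON to the Connect REST API. A sketch, assuming a Connect worker on localhost:8083 and the Confluent JDBC source connector; the connector name and settings are illustrative:

```python
# Sketch: register a JDBC source connector so table changes flow into
# Kafka without custom producer code. Settings are illustrative.
import requests

connector = {
    "name": "orders-source",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:postgresql://localhost:5432/shop",
        "mode": "incrementing",               # track new rows by a growing ID
        "incrementing.column.name": "id",
        "table.whitelist": "orders",
        "topic.prefix": "db-",                # rows land in topic "db-orders"
    },
}
resp = requests.post("http://localhost:8083/connectors", json=connector)
resp.raise_for_status()
```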
But wait, how do we process the data? • However you want: you just consume data, modify it, and produce it back (see the sketch below) • Built into Kafka: KProcessor and KStream • Popular choices: Storm, Spark Streaming
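“Consume, modify, produce back” really is that simple. A sketch, assuming the confluent-kafka client; topic names and the transformation are illustrative:

```python
# The simplest possible stream processor: consume, transform, produce.
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "enricher",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["page-views"])
producer = Producer({"bootstrap.servers": "localhost:9092"})

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    event["processed"] = True          # your transformation goes here
    producer.produce("page-views-enriched", json.dumps(event))
```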
One more thing...
Schema is a MUST HAVE for data integration
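For instance, with an explicit schema every consumer reads the data the same way instead of re-guessing field names and types. A sketch using Avro via the fastavro library; the schema fields are illustrative:

```python
# Sketch: an agreed Avro schema makes records self-describing for every
# consumer. Field names and types are illustrative.
import io
from fastavro import parse_schema, schemaless_writer, schemaless_reader

schema = parse_schema({
    "type": "record",
    "name": "PageView",
    "fields": [
        {"name": "user_id", "type": "long"},
        {"name": "page", "type": "string"},
        {"name": "ts", "type": "long"},
    ],
})

buf = io.BytesIO()
schemaless_writer(buf, schema, {"user_id": 42, "page": "/pricing", "ts": 1})
buf.seek(0)
print(schemaless_reader(buf, schema))  # round-trips with the agreed schema
```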
Need More Kafka? • https://kafka.apache.org/documentation.html • My video tutorial: http://shop.oreilly.com/product/0636920038603.do • http://www.michael-noll.com/blog/2014/08/18/apache-kafka-training-deck-and-tutorial/ • Our website: http://confluent.io • Oracle guide to real-time ETL: http://www.oracle.com/technetwork/middleware/data-integrator/overview/best-practices-for-realtime-data-wa-132882.pdf


Editor's Notes

  • #14 Topics are partitioned; each partition is ordered and immutable. Messages in a partition have an ID, called the offset. The offset uniquely identifies a message within a partition.
  • #15 Kafka retains all messages for a fixed amount of time. It does not wait for acks from consumers. The only metadata retained per consumer is its position in the log, the offset. So adding many consumers is cheap. On the other hand, consumers have more responsibility and are more challenging to implement correctly, and “batching” consumers is not a problem.
  • #16 3 partitions, each replicated 3 times.
  • #17 The producer chooses how many replicas must ACK a message before it is considered committed. This is the tradeoff between speed and reliability.
  • #19 Consumers can read from one or more partition leaders. You can’t have two consumers in the same group reading the same partition. Leaders obviously do more work, but they are balanced between nodes. We reviewed the basic components of the system, and it may seem complex. In the next section we’ll see how simple it actually is to get started with Kafka.
  • #25 Then we end up adding clients to use that source.
  • #26 But as we start to deploy our applications, we realize that clients need data from a number of sources. So we add them as needed.
  • #28 But over time, particularly if we are segmenting services by function, we have stuff all over the place, and the dependencies are a nightmare. This makes for a fragile system.
  • #30 Kafka is a pub/sub messaging system that can decouple your data pipelines. Most of you are probably familiar with its history at LinkedIn, where they use it as a high-throughput, relatively low-latency commit log. It allows sources to push data without worrying about which clients are reading it. Note that producers push and consumers pull. Kafka itself is a cluster of brokers, which handles both persisting data to disk and serving that data to consumer requests.
  • #40 Logical decoding output client API (how Bottled Water reads changes out of PostgreSQL)
  • #44 Sorry, but “schema on read” is kind of B.S. We admit that there is a schema, but we want to “ingest fast,” so we shift the burden to the readers. But the data is written once and read many, many times by many different people. Do they each need to figure this out on their own? This makes no sense. Also, how are you going to validate the data without a schema?
  • #45 https://github.com/schema-repo/schema-repo There’s no data dictionary for Kafka
  • #53 There are many options for handling excess user requests. The only thing that is not an option: throw everything at the database and let the DB queue the excessive load.