Streaming Data Integration with Apache Kafka
Presented by: David Tucker | Dir. Partner Engineering
partners@confluent.io | david@confluent.io
Today's Discussion
• The evolving world of data integration
• Design considerations
• The Kafka solution
• Kafka Connect
• Logical architecture
• Core components and execution model
• Connector examples
• Wrap-up and questions
Explosion of Operational Data Stores and Processing Frameworks
Abstract View: Many Ad Hoc Pipelines
[Diagram: point-to-point pipelines connecting sources such as user tracking, operational logs, operational metrics, application databases (Espresso, Cassandra, Oracle) and storage interfaces to consumers such as search, security, fraud detection, Hadoop, data warehouse and monitoring applications]
Re-imagined Architecture: Streaming Platform with Kafka
[Diagram: the same sources and consumers connected through a central Kafka cluster]
Kafka is:
✓ Distributed
✓ Fault tolerant
✓ Stores messages
✓ Processes streams (Kafka Streams)
Design Considerations: These Things Matter
• Reliability and delivery semantics – losing data is (usually) not OK
  • Exactly once vs. at least once vs. (very rarely) at most once
• Timeliness
  • Push vs. pull
• High throughput, varying throughput
  • Compression, parallelism, back pressure
• Data formats
  • Flexibility, structure
• Security
• Error handling
(A producer configuration sketch touching several of these knobs follows below.)
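To make the reliability and throughput trade-offs above concrete, here is a minimal sketch of producer settings. The property names are standard Kafka producer configuration, but the specific values are illustrative assumptions, and idempotence requires Kafka 0.11 or later; tune everything for your own workload.

```properties
# Illustrative producer settings (values are assumptions, not recommendations)

# Reliability / delivery semantics: wait for all in-sync replicas and keep retrying
acks=all
retries=2147483647
# Idempotent producer (Kafka 0.11+) de-duplicates retried sends; a building block for exactly-once
enable.idempotence=true

# Throughput: compress and batch records before sending
compression.type=snappy
linger.ms=20
batch.size=65536
```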
Introducing Kafka Connect
Simplified, scalable data integration via Apache Kafka
Kafka Connect: Separation of Concerns
Kafka Connect: Logical Model
[Diagram: Kafka Connect workers, the Apache Kafka brokers, and the Schema Registry]
(A worker configuration sketch tying these components together follows below.)
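As an illustration of how the three components in the diagram are wired together, here is a hedged sketch of a distributed Connect worker configuration; the hostnames and topic names are assumptions for the example.

```properties
# Sketch of a distributed Kafka Connect worker configuration (hostnames are assumptions)

# The Kafka brokers the Connect cluster reads from and writes to
bootstrap.servers=kafka1:9092,kafka2:9092
# Workers sharing a group.id form one distributed Connect cluster
group.id=connect-cluster

# Converters translate between Connect's internal records and bytes in Kafka;
# the Avro converters register and look up schemas in the Schema Registry
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://schema-registry:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://schema-registry:8081

# Internal topics where Connect stores connector configs, source offsets and status
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets
status.storage.topic=connect-status
```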
How is Connect different from a producer or consumer?
• Producers and consumers give you total flexibility: data can be published and processed in any way you choose.
• That flexibility also means you do everything yourself.
• Kafka Connect's simple framework allows:
  • developers to create connectors that copy data to/from other systems
  • operators/users to run those connectors just by writing configuration files and submitting them to Connect, with no code necessary (a minimal example follows below)
  • community and 3rd-party engineers to build reliable plugins for common data sources and sinks
  • deployments to deliver scalability, fault tolerance and automated load balancing out of the box
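For example, a complete "no code" connector deployment can be as small as the properties file below, using the FileStreamSource example connector that ships with Apache Kafka; the file path and topic name are assumptions.

```properties
# connect-file-source.properties -- a minimal source connector, no code required
# (file path and topic name are example assumptions)
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=/tmp/input.txt
topic=file-input-topic
```

In standalone mode this file is passed to bin/connect-standalone.sh alongside a worker properties file; in distributed mode the same settings are submitted as JSON to the Connect REST API.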
Connector Hub: http://www.confluent.io/product/connectors
• Confluent-supported connectors (included in CP), such as JDBC
• Partner/community-written connectors (just a sampling)
Kafka Connect Example: MySQL to Hive Pipeline
• Blog: http://confluent.io/blog/how-to-build-a-scalable-etl-pipeline-with-kafka-connect/
MySQL to Hive Pipeline: Step by Step
• Configure the JDBC Source Connector with the MySQL details
  • User authentication
  • Tables to replicate; polling interval for change data capture
• Configure the HDFS Sink Connector with the Hadoop details
  • Target HDFS directory
  • Hive metastore details
  • Partitioning details (optional)
• Watch it go!
• What you can't see
  • Source and sink scalability
  • Table metadata changes are captured in Schema Registry
(Configuration sketches for both connectors follow below.)
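The sketches below show roughly what the two connector configurations might look like, loosely following the steps above; the connection URLs, credentials, table, topic and host names, and sizing values are all assumptions, and the full option lists live in the Confluent connector documentation.

```properties
# mysql-source.properties -- JDBC Source Connector (values are example assumptions)
name=mysql-jdbc-source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
tasks.max=2
# User authentication embedded in the JDBC URL for this sketch
connection.url=jdbc:mysql://mysql-host:3306/demo?user=connect_user&password=secret
# Tables to replicate and how often to poll for changes
table.whitelist=users,orders
mode=timestamp+incrementing
timestamp.column.name=updated_at
incrementing.column.name=id
poll.interval.ms=10000
topic.prefix=mysql-
```

```properties
# hdfs-sink.properties -- HDFS Sink Connector with Hive integration (values are example assumptions)
name=hdfs-hive-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=2
topics=mysql-users,mysql-orders
# Target HDFS location
hdfs.url=hdfs://namenode:8020
topics.dir=/topics
flush.size=1000
# Hive metastore details
hive.integration=true
hive.metastore.uris=thrift://hive-metastore:9083
schema.compatibility=BACKWARD
# Optional time-based partitioning
partitioner.class=io.confluent.connect.hdfs.partitioner.TimeBasedPartitioner
partition.duration.ms=3600000
path.format='year'=YYYY/'month'=MM/'day'=dd/'hour'=HH
locale=en
timezone=UTC
```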
Thank You
Questions?
