Slide 1

Slide 1 text

1 Real-time Data Integration at Scale with Kafka Connect
Robin Moffatt, Partner Technology Evangelist, EMEA @ Confluent
@rmoff

Slide 2

Slide 2 text

2

Slide 3

Slide 3 text

3

Slide 4

Slide 4 text

4

Slide 5

Slide 5 text

5 Kafka Connect in the Apache Kafka ecosystem

Slide 6

Slide 6 text

6 Kafka Connect : Separation of Concerns

Slide 7

Slide 7 text

7

Slide 8

Slide 8 text

8 Single Message Transform (SMT) -- Extract, TRANSFORM, Load…
• Modify events before storing in Kafka:
  • Mask/drop sensitive information
  • Set partitioning key
  • Store lineage
• Modify events going out of Kafka:
  • Route high priority events to faster data stores
  • Direct events to different Elasticsearch indexes
  • Cast data types to match destination
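As a rough illustration of the source-side bullets, the sketch below builds a connector configuration that chains three transforms shipped with Apache Kafka (MaskField, ValueToKey, InsertField). The connector name, connection URL, column names, and static lineage value are hypothetical placeholders, not taken from the talk.

```python
import json

# Hypothetical JDBC source connector: the name, URL, and column names are
# placeholders chosen for illustration only.
source_config = {
    "name": "jdbc-source-customers",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:postgresql://db.example.com:5432/crm",
        "mode": "incrementing",
        "incrementing.column.name": "id",
        "topic.prefix": "crm-",
        # Transforms are applied in the order listed here.
        "transforms": "maskCard,setKey,addLineage",
        # Mask sensitive information: blank out the card_number field.
        "transforms.maskCard.type": "org.apache.kafka.connect.transforms.MaskField$Value",
        "transforms.maskCard.fields": "card_number",
        # Set the partitioning key from a field of the record value.
        "transforms.setKey.type": "org.apache.kafka.connect.transforms.ValueToKey",
        "transforms.setKey.fields": "id",
        # Store lineage: stamp each record with a static source-system field.
        "transforms.addLineage.type": "org.apache.kafka.connect.transforms.InsertField$Value",
        "transforms.addLineage.static.field": "source_system",
        "transforms.addLineage.static.value": "crm-postgres",
    },
}


def transform_chain(connector):
    """Return the (alias, class) pairs of a connector's SMT chain, in order."""
    cfg = connector["config"]
    aliases = [a.strip() for a in cfg.get("transforms", "").split(",") if a.strip()]
    return [(alias, cfg[f"transforms.{alias}.type"]) for alias in aliases]


print(json.dumps(source_config, indent=2))
for alias, cls in transform_chain(source_config):
    print(alias, "->", cls)
```

On the sink side the same `transforms.*` pattern applies. RegexRouter (rewriting the topic name) and Cast (changing field types) are the bundled transforms that most closely match the routing and casting bullets.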

Slide 9

Slide 9 text

9 Kafka Connect API Library of Connectors Databases Analytics Applications / Other Datastore/File Store https://www.confluent.io/product/connectors/

Slide 10

Slide 10 text

10 Streaming Application Data to Kafka
• Applications are a rich source of events
• Modifying applications is not always possible or desirable
• And what if the data gets changed within the database or by other apps?
• JDBC is one option for extracting data
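To make the JDBC option concrete, here is a minimal sketch of query-based polling on an incrementing column, using an in-memory SQLite table as a stand-in for the application database. It is a simplification for illustration, not the JDBC connector's implementation; the `orders` table and its columns are hypothetical.

```python
import sqlite3


def poll_new_rows(conn, last_id):
    """Fetch rows whose id is greater than the last id already seen."""
    rows = conn.execute(
        "SELECT id, status FROM orders WHERE id > ? ORDER BY id", (last_id,)
    ).fetchall()
    new_last_id = rows[-1][0] if rows else last_id
    return rows, new_last_id


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO orders (id, status) VALUES (?, ?)", [(1, "NEW"), (2, "NEW")]
)

# First poll picks up both existing rows.
first_batch, offset = poll_new_rows(conn, 0)

# The application then updates one row, deletes another, and inserts a third.
conn.execute("UPDATE orders SET status = 'SHIPPED' WHERE id = 1")
conn.execute("DELETE FROM orders WHERE id = 2")
conn.execute("INSERT INTO orders (id, status) VALUES (3, 'NEW')")

# Second poll only sees the insert: the update and the delete are invisible
# to a query that filters on an incrementing id.
second_batch, offset = poll_new_rows(conn, offset)

print(first_batch)
print(second_batch)
```

The second poll returns only the newly inserted row. Polling on a timestamp column can catch updates too, but a deleted row simply stops appearing in query results.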

Slide 11

Slide 11 text

11 Liberate Application Data into Kafka with CDC
• Relational databases use transaction logs to ensure Durability of data
• Change-Data-Capture (CDC) mines the log to get raw events from the database
• CDC tools that integrate with Kafka Connect include:
  • Debezium
  • DBVisit
  • GoldenGate
  • Attunity
  • + more
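The sketch below shows why log-based capture gives a complete picture: every insert, update, and delete arrives as its own event, so a consumer can rebuild the table state. The event shape (an `op` code plus `before` and `after` row images) is an assumption loosely modelled on the envelope Debezium uses; the rows themselves are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChangeEvent:
    op: str                  # "c" = create, "u" = update, "d" = delete
    before: Optional[dict]   # row image before the change (None for creates)
    after: Optional[dict]    # row image after the change (None for deletes)


def apply_events(events):
    """Rebuild the current table state, keyed by id, from change events."""
    state = {}
    for event in events:
        if event.op == "d":
            state.pop(event.before["id"], None)
        else:
            state[event.after["id"]] = event.after
    return state


# Hypothetical events mined from a transaction log, in commit order.
log = [
    ChangeEvent("c", None, {"id": 1, "status": "NEW"}),
    ChangeEvent("c", None, {"id": 2, "status": "NEW"}),
    ChangeEvent("u", {"id": 1, "status": "NEW"}, {"id": 1, "status": "SHIPPED"}),
    ChangeEvent("d", {"id": 2, "status": "NEW"}, None),
]

print(apply_events(log))
```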

Slide 12

Slide 12 text

12 Kafka Connect Common Patterns – Data Integration into Data Lake for batch analytics Oracle DB2 MS SQL Postgres MySQL Cassandra MongoDB Couchbase HBase S3 / Athena HDFS BigQuery Elasticsearch Solr CRM ERP WebApp Twitter IRC Bloomberg … Kafka Connect Mainframe (e.g. VSAM)

Slide 13

Slide 13 text

13 Common Patterns – Event-Driven microservices CRM WebApp Orders Service Stock Service Cassandra MongoDB Couchbase HBase S3 / Athena HDFS BigQuery Elasticsearch Solr Kafka Connect Oracle DB2 MS SQL Postgres MySQL Twitter IRC Bloomberg … Kafka Connect Mainframe (e.g. VSAM) ERP

Slide 14

Slide 14 text

14 Common Patterns – Event-Driven microservices & audit/search/storage CRM WebApp Orders Service Stock Service Cassandra MongoDB Couchbase HBase S3 / Athena HDFS BigQuery Elasticsearch Solr Kafka Connect Oracle DB2 MS SQL Postgres MySQL Twitter IRC Bloomberg … Kafka Connect Mainframe (e.g. VSAM) ERP

Slide 15

Slide 15 text

15 The Numerous Benefits of Kafka Connect
• Restart capabilities (offset management)
• Distributed workers
• Parallelism (for throughput)
• Load balancing
• Fault tolerance
• Schema preservation
• Data serialisation
• Centralised management and configuration

Slide 16

Slide 16 text

16 Kafka Connect – under the covers
• Each Kafka Connect node is a worker
• Each worker executes one or more tasks
• Tasks do the actual work of pulling data from sources / landing it to sinks
• Kafka Connect manages the distribution and execution of tasks
• Parallelism, fault-tolerance, load balancing all handled automatically
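The worker/task relationship can be pictured with a toy round-robin assignment. This is only a sketch of the idea; Kafka Connect's real rebalancing protocol is more involved, and the worker and task names here are hypothetical.

```python
def assign_tasks(tasks, workers):
    """Spread tasks across workers round-robin (illustrative only)."""
    if not workers:
        raise ValueError("at least one worker is required")
    assignment = {worker: [] for worker in workers}
    for index, task in enumerate(tasks):
        assignment[workers[index % len(workers)]].append(task)
    return assignment


# Four source tasks and two sink tasks, all placeholder names.
tasks = [f"jdbc-source-{i}" for i in range(4)] + ["es-sink-0", "es-sink-1"]

# Three healthy workers share the six tasks.
before = assign_tasks(tasks, ["worker-1", "worker-2", "worker-3"])

# worker-2 fails: the same tasks are redistributed over the survivors.
after = assign_tasks(tasks, ["worker-1", "worker-3"])

print(before)
print(after)
```

Every task still has an owner after the worker is lost, which is the behaviour the fault-tolerance and load-balancing bullets describe.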

Slide 19

Slide 19 text

19 Kafka Connect – Standalone vs Distributed
• Kafka Connect has two modes: standalone or distributed
• Distributed: scale-out and fault tolerance are easy, since you just add more workers
  • Can run on one node!
• Standalone: useful where the data source is machine-specific (e.g. single-node log files)
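The two modes differ mainly in where the worker keeps its state, which shows up in the worker properties. The sketch below renders a minimal properties file for each mode; the broker address, group id, topic names, and file path are placeholder values, and a real deployment would need further settings.

```python
def to_properties(settings):
    """Render a dict as Java-style key=value properties text."""
    return "\n".join(f"{key}={value}" for key, value in settings.items())


# Settings shared by both modes (placeholder broker address).
common = {
    "bootstrap.servers": "localhost:9092",
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
}

# Distributed: workers with the same group.id form a cluster and keep
# connector configs, offsets, and status in Kafka topics.
distributed = {
    **common,
    "group.id": "connect-cluster",
    "config.storage.topic": "connect-configs",
    "offset.storage.topic": "connect-offsets",
    "status.storage.topic": "connect-status",
}

# Standalone: a single process that keeps source offsets in a local file.
standalone = {
    **common,
    "offset.storage.file.filename": "/tmp/connect.offsets",
}

print("# distributed worker")
print(to_properties(distributed))
print()
print("# standalone worker")
print(to_properties(standalone))
```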

Slide 20

Slide 20 text

20 Kafka Connect - Converters
• Data from the source system is in its own format (e.g. ResultSet from JDBC)
• Kafka Connect’s Converters provide reusable functionality to serialise data into JSON or Avro
• The Confluent Schema Registry is used to store schemas of ingested data
http://docs.confluent.io/current/connect/concepts.html#converters
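To give a feel for what a converter does, the sketch below serialises a source row into JSON with its schema attached. It approximates the schema-plus-payload envelope that the JSON converter produces when schemas are enabled, but it is not the converter itself: the type mapping, the `orders` schema name, and the row are assumptions for illustration.

```python
import json

# Assumed mapping from Python types to Connect-style JSON schema type names.
TYPE_NAMES = {bool: "boolean", int: "int64", float: "double", str: "string"}


def to_envelope(name, row):
    """Wrap a row (dict) in a schema-plus-payload envelope."""
    fields = [
        {"field": column, "type": TYPE_NAMES[type(value)], "optional": False}
        for column, value in row.items()
    ]
    schema = {"type": "struct", "name": name, "optional": False, "fields": fields}
    return {"schema": schema, "payload": row}


# Hypothetical row as it might come back from a source query.
row = {"id": 42, "status": "SHIPPED", "total": 19.99}

# Serialise to the bytes that would be written to Kafka, then read them back.
message = json.dumps(to_envelope("orders", row)).encode("utf-8")
decoded = json.loads(message)

print(json.dumps(decoded, indent=2))
```

Embedding the schema in every message is verbose. With Avro, the schema is stored in the Schema Registry instead of travelling with each record.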

Slide 21

Slide 21 text

21 Configuring Kafka Connect - REST API
• Configure & control Kafka Connect through REST API
  • Validate connector configuration
  • Create connectors
  • List available plugins
  • Query connector & task state
  • Pause, resume, restart connectors + tasks
• Configuration is persisted through a Kafka topic
• Reference: http://docs.confluent.io/current/connect/restapi.html
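The operations listed above map onto a handful of REST endpoints. The sketch below only builds the requests and prints them, so it runs without a Connect cluster; the base URL assumes a worker on the default port 8083, and the connector name and configuration are hypothetical.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:8083"  # assumption: worker on the default REST port


def connect_request(method, path, body=None):
    """Build (but do not send) a request against the Connect REST API."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Accept": "application/json"}
    if data is not None:
        headers["Content-Type"] = "application/json"
    return Request(BASE_URL + path, data=data, headers=headers, method=method)


# Hypothetical connector name and configuration.
name = "jdbc-source-customers"
config = {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:postgresql://db.example.com:5432/crm",
    "mode": "incrementing",
    "incrementing.column.name": "id",
    "topic.prefix": "crm-",
}

requests_to_send = [
    # List available plugins.
    connect_request("GET", "/connector-plugins"),
    # Validate connector configuration.
    connect_request("PUT", "/connector-plugins/JdbcSourceConnector/config/validate", config),
    # Create the connector.
    connect_request("POST", "/connectors", {"name": name, "config": config}),
    # Query connector and task state.
    connect_request("GET", f"/connectors/{name}/status"),
    # Pause, resume, and restart the connector, then restart one task.
    connect_request("PUT", f"/connectors/{name}/pause"),
    connect_request("PUT", f"/connectors/{name}/resume"),
    connect_request("POST", f"/connectors/{name}/restart"),
    connect_request("POST", f"/connectors/{name}/tasks/0/restart"),
]

for request in requests_to_send:
    print(request.get_method(), request.full_url)
```

To send one of these for real, pass it to `urllib.request.urlopen` against a running worker.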

Slide 22

Slide 22 text

22 Configure Kafka Connect with Confluent Control Center

Slide 23

Slide 23 text

23 Monitor Your Data Pipeline from End to End with Confluent Control Center

Slide 24

Slide 24 text

25

Slide 25

Slide 25 text

26 Confluent: a Streaming Platform based on Apache Kafka™ Database Changes Log Events IoT Data Web Events … CRM Data Warehouse Database Hadoop Data Integration … Monitoring Analytics Custom Apps Transformations Real-time Applications … Apache Open Source Confluent Open Source Confluent Enterprise Confluent Platform Confluent Platform Apache Kafka™ Core | Connect| Streams Data Compatibility Schema Registry Monitoring & Administration Confluent Control Center Operations Replicator | Auto Data Balancing Development and Connectivity Clients | Connectors | REST Proxy

Slide 26

Slide 26 text

27 Kafka Connect – Getting Started
• Docs: http://docs.confluent.io/current/connect/
  • Includes Quickstart and full Connect documentation including Architecture + Internals
• Official Confluent Platform Docker images available
  • http://docs.confluent.io/current/cp-docker-images/docs/quickstart.html#kafka-connect
• List of connectors
  • https://www.confluent.io/product/connectors/
  • Also search on GitHub: https://github.com/search?q=kafka-connect
https://www.confluent.io/download/
@rmoff