Real-time Data Integration at Scale with Kafka Connect - Dublin Apache Kafka Meetup 04 Jul 2017

Apache Kafka is a streaming data platform. It enables integration of data across the enterprise, and ships with its own stream processing capabilities. But how do we get data in and out of Kafka in an easy, scalable, and standardised manner? Enter Kafka Connect.
Part of Apache Kafka since 0.9, Kafka Connect defines an API that enables the integration of data from multiple sources, including MQTT, common NoSQL stores, and CDC from relational databases such as Oracle. By "turning the database inside out" we can enable an event-driven architecture in our business that reacts to changes made by applications writing to a database, without having to modify those applications themselves. As well as ingest, Kafka Connect has connectors with support for numerous targets, including HDFS, S3, and Elasticsearch.
This presentation will briefly recap the purpose of Kafka, and then dive into Kafka Connect, with practical examples of data pipelines that can be built with it and are in production at companies around the world already. We'll also look at the Single Message Transform (SMT) capabilities introduced with Kafka 0.10.2 and how they can make Kafka Connect even more flexible and powerful.

Robin Moffatt

July 04, 2017

Transcript

  1. 1
    Real-time Data Integration at
    Scale with Kafka Connect
    Robin Moffatt
    Partner Technology Evangelist, EMEA @ Confluent
    @rmoff

  2. 5
    Kafka Connect in the Apache Kafka ecosystem

  3. 6
    Kafka Connect : Separation of Concerns

  4. 8
    Single Message Transform (SMT) -- Extract, TRANSFORM, Load…
    • Modify events before storing in Kafka:
      • Mask/drop sensitive information
      • Set partitioning key
      • Store lineage
    • Modify events going out of Kafka:
      • Route high priority events to faster data stores
      • Direct events to different Elasticsearch indexes
      • Cast data types to match destination
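
A minimal sketch of how an SMT chain might be expressed in a connector configuration, for the "before storing in Kafka" cases above. The transform class names are the ones shipped with Apache Kafka; the connection URL, column names, topic prefix, and the helper `with_transforms` are assumptions made up for illustration.

```python
import json


def with_transforms(base_config, transforms):
    """Return a copy of base_config with a Single Message Transform chain added.

    transforms is an ordered list of (alias, class_name, settings) tuples.
    """
    config = dict(base_config)
    # The "transforms" key lists the aliases in the order they are applied.
    config["transforms"] = ",".join(alias for alias, _, _ in transforms)
    for alias, class_name, settings in transforms:
        prefix = f"transforms.{alias}."
        config[prefix + "type"] = class_name
        for key, value in settings.items():
            config[prefix + key] = value
    return config


# Hypothetical JDBC source settings; URL, column and prefix are placeholders.
base = {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:postgresql://localhost:5432/shop",
    "mode": "incrementing",
    "incrementing.column.name": "id",
    "topic.prefix": "shop-",
}

smt_chain = [
    # Mask sensitive information (field name is an assumption).
    ("maskCard", "org.apache.kafka.connect.transforms.MaskField$Value",
     {"fields": "card_number"}),
    # Set the partitioning key from a value field.
    ("setKey", "org.apache.kafka.connect.transforms.ValueToKey",
     {"fields": "customer_id"}),
    # Store lineage as a static field on every record.
    ("lineage", "org.apache.kafka.connect.transforms.InsertField$Value",
     {"static.field": "source_system", "static.value": "shop-db"}),
]

connector_config = with_transforms(base, smt_chain)
print(json.dumps(connector_config, indent=2))
```

The order of aliases in `transforms` matters, since each transform receives the record produced by the previous one.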

  5. 9
    Kafka Connect API Library of Connectors
    Databases
    Analytics Applications / Other
    Datastore/File Store
    https://www.confluent.io/product/connectors/

  6. 10
    Streaming Application Data to Kafka
    • Applications are a rich source of events
    • Modifying applications is not always possible or desirable
    • And what if the data gets changed within the database or by other apps?
    • JDBC is one option for extracting data
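
To make the query-based option concrete, here is a simplified sketch of polling a table on an incrementing column. It uses an in-memory SQLite database rather than JDBC and is not the JDBC connector's implementation; the table, columns, and the function `poll_new_rows` are hypothetical.

```python
import sqlite3


def poll_new_rows(conn, table, id_column, last_seen_id):
    """Fetch rows whose incrementing id is greater than last_seen_id.

    Returns the rows plus the new offset to remember for the next poll.
    Table and column names are trusted here; this is illustration only.
    """
    rows = conn.execute(
        f"SELECT * FROM {table} WHERE {id_column} > ? ORDER BY {id_column}",
        (last_seen_id,),
    ).fetchall()
    new_offset = rows[-1][id_column] if rows else last_seen_id
    return rows, new_offset


conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # allows access to columns by name
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT, qty INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "book", 1), (2, "pen", 3)])

first_batch, offset = poll_new_rows(conn, "orders", "id", 0)

# Another application updates an existing row and inserts a new one.
conn.execute("UPDATE orders SET qty = 5 WHERE id = 1")
conn.execute("INSERT INTO orders VALUES (3, 'ink', 1)")

second_batch, offset = poll_new_rows(conn, "orders", "id", offset)
print([row["id"] for row in second_batch])
```

The second poll returns only the newly inserted row: the update to row 1 is invisible to a poll that tracks only an incrementing id, and deletes would be missed as well. That gap is part of the motivation for log-based change capture.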

  7. 11
    Liberate Application Data into Kafka with CDC
    • Relational databases use transaction logs to ensure Durability of data
    • Change-Data-Capture (CDC) mines the log to get raw events from the database
    • CDC tools that integrate with Kafka Connect include:
      • Debezium
      • Dbvisit
      • GoldenGate
      • Attunity
      • + more
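
As a rough illustration of what a consumer can do with raw change events, the sketch below replays a stream of inserts, updates, and deletes into an in-memory copy of a table. The event shape (`op`, `before`, `after`) is a simplified assumption loosely modelled on Debezium's envelope; real events carry more metadata and differ between CDC tools.

```python
def apply_change(table, event, key="id"):
    """Apply one change event to a dict keyed by primary key."""
    op = event["op"]
    if op in ("c", "u"):  # create or update: keep the new row image
        row = event["after"]
        table[row[key]] = row
    elif op == "d":  # delete: drop the row identified by the old image
        table.pop(event["before"][key], None)
    else:
        raise ValueError(f"unknown operation: {op!r}")
    return table


# Hypothetical events, in the order they were written to the log.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "item": "book", "qty": 1}},
    {"op": "c", "before": None, "after": {"id": 2, "item": "pen", "qty": 3}},
    {"op": "u", "before": {"id": 1, "item": "book", "qty": 1},
     "after": {"id": 1, "item": "book", "qty": 5}},
    {"op": "d", "before": {"id": 2, "item": "pen", "qty": 3}, "after": None},
]

replica = {}
for event in events:
    apply_change(replica, event)

print(replica)
```

Because every change is captured from the log, updates and deletes made by any application reach downstream consumers without touching the applications themselves.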

  8. 12
    Common Patterns – Data Integration into Data Lake for batch analytics
    [Diagram labels: Oracle, DB2, MS SQL, Postgres, MySQL; Cassandra, MongoDB, Couchbase, HBase; S3 / Athena, HDFS, BigQuery; Elasticsearch, Solr; CRM, ERP, WebApp; Twitter, IRC, Bloomberg; Mainframe (e.g. VSAM); Kafka Connect (shown twice)]

  9. 13
    Common Patterns – Event-Driven microservices
    [Diagram labels: CRM, WebApp, ERP; Orders Service, Stock Service; Cassandra, MongoDB, Couchbase, HBase; S3 / Athena, HDFS, BigQuery; Elasticsearch, Solr; Oracle, DB2, MS SQL, Postgres, MySQL; Twitter, IRC, Bloomberg; Mainframe (e.g. VSAM); Kafka Connect (shown twice)]

  10. 14
    Common Patterns – Event-Driven microservices & audit/search/storage
    [Diagram labels: CRM, WebApp, ERP; Orders Service, Stock Service; Cassandra, MongoDB, Couchbase, HBase; S3 / Athena, HDFS, BigQuery; Elasticsearch, Solr; Oracle, DB2, MS SQL, Postgres, MySQL; Twitter, IRC, Bloomberg; Mainframe (e.g. VSAM); Kafka Connect (shown twice)]

  11. 15
    The Numerous Benefits of Kafka Connect
    • Restart capabilities (offset management)
    • Distributed workers
    • Parallelism (for throughput)
    • Load balancing
    • Fault tolerance
    • Schema preservation
    • Data serialisation
    • Centralised management and configuration

  12. 16
    Kafka Connect – under the covers
    • Each Kafka Connect node is a worker
    • Each worker executes one or more tasks
    • Tasks do the actual work of pulling data from sources / landing it to sinks
    • Kafka Connect manages the distribution and execution of tasks
    • Parallelism, fault-tolerance, load balancing all handled automatically
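
The idea of tasks being spread over workers, and redistributed when a worker disappears, can be pictured with a toy model. This is a plain round-robin sketch for intuition only, not Kafka Connect's actual assignment algorithm; the worker and task names are made up.

```python
def assign_tasks(workers, tasks):
    """Spread tasks over workers round-robin and return worker -> task list."""
    if not workers:
        raise ValueError("at least one worker is required")
    assignment = {worker: [] for worker in workers}
    for index, task in enumerate(tasks):
        assignment[workers[index % len(workers)]].append(task)
    return assignment


# Hypothetical cluster: three workers, six tasks from two connectors.
workers = ["worker-1", "worker-2", "worker-3"]
tasks = [f"jdbc-source-{i}" for i in range(4)] + ["hdfs-sink-0", "hdfs-sink-1"]

before = assign_tasks(workers, tasks)

# worker-2 fails: the same tasks are redistributed over the survivors.
survivors = [w for w in workers if w != "worker-2"]
after = assign_tasks(survivors, tasks)

for worker, assigned in after.items():
    print(worker, assigned)
```

With three workers each one runs two tasks; after the failure the remaining two workers run three tasks each, so no task is lost.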

  15. 19
    Kafka Connect – Standalone vs Distributed
    • Kafka Connect has two modes: standalone or distributed
    • Distributed: scale-out and fault tolerance are easy; just add more workers
      • Can run on one node!
    • Standalone: useful where the data source is machine-specific (e.g. single-node log files)
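
One practical difference between the two modes shows up in the worker configuration: a standalone worker keeps source offsets in a local file, while distributed workers share a group id and keep configuration, offsets, and status in Kafka topics. The sketch below renders both as properties text. The property keys are standard worker settings; the broker address, file path, group id, and topic names are placeholder assumptions.

```python
COMMON = {
    "bootstrap.servers": "localhost:9092",  # assumed broker address
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
}


def worker_properties(mode):
    """Render a minimal worker configuration for the given mode."""
    props = dict(COMMON)
    if mode == "standalone":
        # Offsets live in a file on the single machine running the worker.
        props["offset.storage.file.filename"] = "/tmp/connect.offsets"
    elif mode == "distributed":
        # Workers with the same group.id form one Connect cluster and
        # share their state through Kafka topics.
        props["group.id"] = "connect-cluster"
        props["config.storage.topic"] = "connect-configs"
        props["offset.storage.topic"] = "connect-offsets"
        props["status.storage.topic"] = "connect-status"
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return "\n".join(f"{key}={value}" for key, value in props.items())


standalone = worker_properties("standalone")
distributed = worker_properties("distributed")
print(distributed)
```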

  16. 20
    Kafka Connect - Converters
    • Data from source system is in its own format (e.g. ResultSet from JDBC)
    • Kafka Connect’s Converters provide reusable functionality to serialise data into JSON or Avro
    • The Confluent Schema Registry is used to store schemas of ingested data
    http://docs.confluent.io/current/connect/concepts.html#converters
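
For a feel of what "schema preservation" means with JSON, the sketch below hand-builds a message in the schema-plus-payload envelope shape that the JSON converter uses when schemas are enabled. It only mimics the shape from a flat Python dict and is not produced by Kafka Connect; the record, the schema name, and the Python-to-Connect type mapping are assumptions for illustration.

```python
import json

# Assumed mapping from Python types to Connect JSON schema type names.
TYPE_NAMES = {bool: "boolean", int: "int64", float: "double", str: "string"}


def to_json_envelope(record, schema_name):
    """Wrap a flat dict in a {"schema": ..., "payload": ...} envelope."""
    fields = [
        {"field": name, "type": TYPE_NAMES[type(value)], "optional": False}
        for name, value in record.items()
    ]
    schema = {
        "type": "struct",
        "name": schema_name,
        "optional": False,
        "fields": fields,
    }
    return {"schema": schema, "payload": record}


# Hypothetical row read from a source system.
order = {"id": 42, "item": "book", "price": 9.99, "shipped": False}
envelope = to_json_envelope(order, "shop.orders")
print(json.dumps(envelope, indent=2))
```

With Avro the schema is not repeated in every message; it is stored once in the Schema Registry and referenced from the message instead.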

  17. 21
    Configuring Kafka Connect - REST API
    • Configure & control Kafka Connect through REST API
      • Validate connector configuration
      • Create connectors
      • List available plugins
      • Query connector & task state
      • Pause, resume, restart connectors + tasks
    • Configuration is persisted through a Kafka topic
    • Reference: http://docs.confluent.io/current/connect/restapi.html
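
A sketch of the REST calls behind the operations listed above, using only the Python standard library. The requests are built but not sent, so the snippet runs without a Connect cluster. The base URL assumes a worker on localhost with the default REST port 8083, and the connector name and file path are hypothetical.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:8083"  # assumption: local worker, default port


def connect_request(method, path, body=None):
    """Build (but do not send) a request against the Kafka Connect REST API."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Accept": "application/json"}
    if data is not None:
        headers["Content-Type"] = "application/json"
    return Request(BASE_URL + path, data=data, headers=headers, method=method)


# Hypothetical connector definition.
name = "demo-file-source"
config = {
    "connector.class": "FileStreamSource",
    "tasks.max": "1",
    "file": "/tmp/demo.txt",
    "topic": "demo",
}

requests = {
    "list plugins": connect_request("GET", "/connector-plugins"),
    "validate": connect_request(
        "PUT", "/connector-plugins/FileStreamSource/config/validate", config),
    "create": connect_request("POST", "/connectors",
                              {"name": name, "config": config}),
    "status": connect_request("GET", f"/connectors/{name}/status"),
    "pause": connect_request("PUT", f"/connectors/{name}/pause"),
    "resume": connect_request("PUT", f"/connectors/{name}/resume"),
    "restart connector": connect_request("POST", f"/connectors/{name}/restart"),
    "restart task": connect_request("POST", f"/connectors/{name}/tasks/0/restart"),
}

for label, req in requests.items():
    print(f"{label:18} {req.get_method():4} {req.full_url}")
```

To execute one of them against a running worker, pass the request to `urllib.request.urlopen` and decode the JSON response.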

  18. 22
    Configure Kafka Connect with Confluent Control Center

  19. 23
    Monitor Your Data Pipeline from End to End with Confluent Control Center

  20. 26
    Confluent: a Streaming Platform based on Apache Kafka™
    [Diagram labels: Database Changes, Log Events, IoT Data, Web Events, …; CRM, Data Warehouse, Database, Hadoop; Data Integration, Monitoring, Analytics, Custom Apps, Transformations, Real-time Applications]
    Legend: Apache Open Source | Confluent Open Source | Confluent Enterprise
    Confluent Platform:
    • Apache Kafka™: Core | Connect | Streams
    • Data Compatibility: Schema Registry
    • Monitoring & Administration: Confluent Control Center
    • Operations: Replicator | Auto Data Balancing
    • Development and Connectivity: Clients | Connectors | REST Proxy

  21. 27
    Kafka Connect – Getting Started
    • Docs: http://docs.confluent.io/current/connect/
      • Includes Quickstart and full Connect documentation including Architecture + Internals
    • Official Confluent Platform Docker images available
      • http://docs.confluent.io/current/cp-docker-images/docs/quickstart.html#kafka-connect
    • List of connectors
      • https://www.confluent.io/product/connectors/
      • Also search on GitHub: https://github.com/search?q=kafka-connect
    https://www.confluent.io/download/
    @rmoff
