Kafka Summit London 2018 - Look Ma, no Code! Building Streaming Data Pipelines with Apache Kafka

Have you ever thought that you needed to be a programmer to do stream processing and build streaming data pipelines? Think again!

Companies new and old are all recognising the importance of a low-latency, scalable, fault-tolerant data backbone in the form of the Apache Kafka streaming platform. With Kafka, developers can integrate multiple sources and systems, enabling low-latency analytics, event-driven architectures, and the population of multiple downstream systems. These data pipelines can be built using configuration alone.

In this talk, we'll see how easy it is to stream data from a database such as Oracle into Kafka using the Kafka Connect API. In addition, we'll use KSQL to filter, aggregate and join it to other data, and then stream this from Kafka out into multiple targets such as Elasticsearch and MySQL. All of this can be accomplished without a single line of code!

Why should Java geeks have all the fun?

Robin Moffatt

April 24, 2018

Transcript

  1. Look Ma, no Code! 

    Building Streaming Data
    Pipelines with Apache Kafka
    Kafka Summit London
    24 Apr 2018 / Robin Moffatt
    @rmoff [email protected]
    https://speakerdeck.com/rmoff/


  2. 2
    Let’s take a trip back in time. Each application has its
    own database for storing information. But we want
    that information elsewhere for analytics and
    reporting.


  3. 3
    We don't want to query the transactional system, so
    we create a process to extract from the source to a
    data warehouse / lake.


  4. 4
    Let’s take a trip back in time
    We want to unify data from multiple systems, so
    create conformed dimensions and batch processes
    to federate our data. This is all batch-driven, so
    latency is built in by design.


  5. 5
    Let’s take a trip back in time
    As well as our data warehouse, we want to use our
    transactional data to populate search replicas, graph
    databases, NoSQL stores…all introducing more point-
    to-point dependencies in our system.


  6. 6
    Let’s take a trip back in time
    Ultimately we end up with a spaghetti architecture. It
    can't scale easily, it's tightly coupled, it's generally
    batch-driven and we can't get data when we want it
    where we want it.


  7. 7
    But…there's hope!


  8. 8
    Apache Kafka, a distributed streaming platform,
    enables us to decouple all our applications creating
    data from those utilising it. We can create low-
    latency streams of data, transformed as necessary.


  9. 9
    But…to use stream processing, we need to be Java
    coders…don't we?


  10. 10
    Happy days! We can actually build streaming data
    pipelines using just our bare hands, configuration
    files, and SQL.


  11. 11
    Streaming ETL, with Apache Kafka and Confluent Platform


  12. 12
    $ whoami
    • Partner Technology Evangelist @ Confluent
    • Working in data & analytics since 2001
    • Oracle ACE Director & Dev Champion
    • Blogging: http://rmoff.net & https://www.confluent.io/blog/author/robin/
    • Twitter: @rmoff
    • Geek stuff
    • Beer & Fried Breakfasts
    https://speakerdeck.com/rmoff/


  13. 13


  14. 14


  15. 15
    Build Pipelines (diagram): App Server → raw logs (Kafka) → HDFS / S3


  16. 16
    Build Complex Pipelines (diagram): App Server → raw logs (Kafka) → HDFS / S3;
    raw logs → Stream Processing → error logs → Elasticsearch


  17. 17
    Build Applications (diagram): App Server → raw logs (Kafka) → Stream Processing → SLA breaches → Alert App


  18. 18
    Build Applications + Pipelines (diagram): App Server → raw logs (Kafka) → HDFS / S3;
    raw logs → Stream Processing → error logs → Elasticsearch;
    Stream Processing → SLA breaches → Alert App


  19. 19


  20. 20
    The Connect API of Apache Kafka®
    ✓ Centralized management and configuration
    ✓ Support for hundreds of technologies including
    RDBMS, Elasticsearch, HDFS, S3
    ✓ Supports CDC ingest of events from RDBMS
    ✓ Preserves data schema
    ✓ Fault tolerant and automatically load balanced
    ✓ Extensible API
    ✓ Single Message Transforms
    ✓ Part of Apache Kafka, included in Confluent Open Source
    Reliable and scalable integration of Kafka
    with other systems – no coding required.
    {
      "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
      "connection.url": "jdbc:mysql://localhost:3306/demo?user=rmoff&password=foo",
      "table.whitelist": "sales,orders,customers"
    }
    https://docs.confluent.io/current/connect/


  21. 21
    Kafka Connect (diagram): Sources → Kafka Connect (Workers running Tasks) → Kafka Brokers → Kafka Connect → Sinks
    Example sources and sinks: Amazon S3, syslog, flat file, CSV, JSON
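    On the sink side the pattern is the same: configuration, not code. As a hedged sketch, an Elasticsearch sink connector config might look like this (the hostname, topic name, and type name are illustrative assumptions, not taken from the talk):
    {
      "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
      "connection.url": "http://localhost:9200",
      "topics": "orders",
      "type.name": "kafka-connect",
      "key.ignore": "true"
    }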


  22. 22
    Streaming Application Data to Kafka
    • Modifying applications is not always possible or desirable
    • And what if the data gets changed within the database or
    by other apps?
    • Query-based CDC with Kafka Connect's JDBC source is one
    option for extracting data (see the sketch below)
    • Polls the source DB → latency & load implications
    • Requires a key/timestamp column in the schema
    • Can't capture deletes
    https://www.confluent.io/blog/no-more-silos-how-to-integrate-your-databases-with-apache-kafka-and-cdc
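    As a sketch of the query-based approach described above, a JDBC source config using timestamp+incrementing mode might look like this; the table, column, and topic-prefix values are illustrative assumptions:
    {
      "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
      "connection.url": "jdbc:mysql://localhost:3306/demo?user=rmoff&password=foo",
      "table.whitelist": "orders",
      "mode": "timestamp+incrementing",
      "timestamp.column.name": "update_ts",
      "incrementing.column.name": "id",
      "topic.prefix": "mysql-"
    }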


  23. 23
    Liberate Application Data into Kafka with log-based CDC
    • Relational databases use transaction logs to
    ensure Durability of data
    • Log-based Change-Data-Capture (CDC) mines the
    log to get raw events from the database
    • CDC tools that integrate with Kafka Connect
    include:
    • Debezium
    • GoldenGate
    • Attunity
    • + more
    https://www.confluent.io/blog/no-more-silos-how-to-integrate-your-databases-with-apache-kafka-and-cdc
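    Log-based CDC is likewise configuration-only. A minimal sketch of a Debezium MySQL source connector, assuming hypothetical hostnames, credentials, and table names:
    {
      "connector.class": "io.debezium.connector.mysql.MySqlConnector",
      "database.hostname": "localhost",
      "database.port": "3306",
      "database.user": "debezium",
      "database.password": "dbz",
      "database.server.id": "42",
      "database.server.name": "demo",
      "table.whitelist": "demo.ratings",
      "database.history.kafka.bootstrap.servers": "localhost:9092",
      "database.history.kafka.topic": "dbhistory.demo"
    }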


  24. 24
    Single Message Transform (SMT) -- Extract, TRANSFORM, Load…
    • Modify events before storing in Kafka:
    • Mask/drop sensitive information
    • Set partitioning key
    • Store lineage
    • Modify events going out of Kafka:
    • Route high priority events to faster
    data stores
    • Direct events to different
    Elasticsearch indexes
    • Cast data types to match destination
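    As a sketch, masking a sensitive field uses the MaskField transform that ships with Apache Kafka; the transform alias and field name below are illustrative assumptions, added to a connector's config:
    "transforms": "maskCard",
    "transforms.maskCard.type": "org.apache.kafka.connect.transforms.MaskField$Value",
    "transforms.maskCard.fields": "card_number"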


  25. 25
    But I need to
    join…aggregate…filter…


  26. KSQL is a Declarative Stream Processing Language


  27. KSQL is the Streaming SQL Engine for Apache Kafka


  28. 28
    KSQL: a Streaming SQL Engine for Apache Kafka™ from Confluent
    • Enables stream processing with zero coding required
    • The simplest way to process streams of data in real-time
    • Powered by Kafka: scalable, distributed, battle-tested
    • All you need is Kafka: no complex deployments of bespoke systems for stream processing
    ksql>
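    For example, registering an existing Avro topic and querying it interactively might look like this (the topic name is an illustrative assumption; with Avro, KSQL can infer the columns from the Schema Registry):
    ksql> CREATE STREAM ratings WITH (KAFKA_TOPIC='ratings', VALUE_FORMAT='AVRO');
    ksql> SELECT * FROM ratings;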


  29. KSQL for Streaming ETL
    CREATE STREAM vip_actions AS
      SELECT userid, page, action
      FROM clickstream c
      LEFT JOIN users u
        ON c.userid = u.user_id
      WHERE u.level = 'Platinum';
    Joining, filtering, and aggregating streams of event data


  30. KSQL for Anomaly Detection
    CREATE TABLE possible_fraud AS
      SELECT card_number, count(*)
      FROM authorization_attempts
      WINDOW TUMBLING (SIZE 5 SECONDS)
      GROUP BY card_number
      HAVING count(*) > 3;
    Identifying patterns or anomalies in real-time data, surfaced in milliseconds


  31. KSQL for Real-Time Monitoring
    • Log data monitoring, tracking and alerting
    • syslog data
    • Sensor / IoT data
    CREATE STREAM SYSLOG_INVALID_USERS AS
    SELECT HOST, MESSAGE
    FROM SYSLOG
    WHERE MESSAGE LIKE '%Invalid user%';
    http://cnfl.io/syslogs-filtering / http://cnfl.io/syslog-alerting


  32. KSQL for Data Transformation
    Make simple derivations of existing topics from the command line
    CREATE STREAM views_by_userid
      WITH (PARTITIONS=6, REPLICAS=5,
            VALUE_FORMAT='AVRO',
            TIMESTAMP='view_time') AS
      SELECT * FROM clickstream
      PARTITION BY user_id;


  33. 33
    Streaming ETL, powered by Apache Kafka and Confluent Platform
    KSQL


  34. Demo Time!
    34


  35. Demo pipeline (diagram): MySQL → Debezium (Kafka Connect) → ratings stream in Kafka → Kafka Connect → Elasticsearch
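    A minimal sketch of the kind of KSQL derivation such a demo might add in the middle of this pipeline (the stream and column names are assumptions, not necessarily the talk's actual code):
    CREATE STREAM poor_ratings AS
      SELECT * FROM ratings
      WHERE stars < 3;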


  36.

  37. Demo pipeline (diagram): MySQL → Debezium (Kafka Connect) → ratings stream in Kafka → Kafka Connect → Elasticsearch


  38. Demo pipeline (diagram): MySQL → Debezium (Kafka Connect) → ratings stream in Kafka → Kafka Connect → Elasticsearch, plus slack_notify.py consuming from Kafka


  39. 39
    Streaming ETL, powered by Apache Kafka and Confluent Platform
    KSQL


  40. 40
    Confluent Open Source: Apache Kafka with a bunch of cool stuff! For free!
    Sources: Database Changes | Log Events | IoT Data | Web Events | …
    Destinations and uses: CRM | Data Warehouse | Database | Hadoop | Data Integration | Monitoring | Analytics | Custom Apps | Transformations | Real-time Applications
    Confluent Platform (diagram):
    • Apache Open Source: Apache Kafka® (Core | Connect API | Streams API)
    • Confluent Open Source: Schema Registry (Data Compatibility); KSQL (SQL Stream Processing); Clients | Connectors | REST Proxy | CLI (Development and Connectivity)
    • Confluent Enterprise: Confluent Control Center | Security (Monitoring & Administration); Replicator | Auto Data Balancing (Operations)


  41. 41
    Streaming ETL, powered by Apache Kafka and Confluent Platform
    @rmoff [email protected]
    https://slackpass.io/confluentcommunity
    https://www.confluent.io/download/


  42. 42
    Kafka Summit San Francisco 2018
    Dates: October 16-17, 2018
    Location: Pier 27, San Francisco
    CFP Opens: April 30, 2018
    www.kafka-summit.org


  43. Look Ma, no Code! 

    Building Streaming Data
    Pipelines with Apache Kafka
    Kafka Summit London
    24 Apr 2018 / Robin Moffatt
    @rmoff [email protected]
    https://speakerdeck.com/rmoff/
