Rethinking Stream Processing with Apache Kafka: Applications vs. Clusters, Streams vs. Databases - Zurich Apache Kafka Meetup 19 September 2017

My talk at the Apache Kafka meetup in Zurich, Switzerland, on September 19, 2017.

https://www.meetup.com/Zurich-Apache-Kafka-Meetup-by-Confluent/events/242063921/

Abstract:
Modern businesses have data at their core, but this data is changing continuously. How can you harness this torrent of information in real time? The answer: stream processing.

The core platform for streaming data is Apache Kafka, and thousands of companies are using Kafka to transform and reshape their industries, including Netflix, Uber, PayPal, Airbnb, Goldman Sachs, Cisco, and Oracle. Unfortunately, today’s common architectures for real-time data processing at scale suffer from complexity: to succeed, many technologies need to be stitched and operated together, and each individual technology is often complex by itself. This has led to a strong discrepancy between how we engineers would like to work and how we actually end up working in practice.

Michael Noll explains how Apache Kafka helps you radically simplify your data processing architectures by building normal applications to serve your real-time processing needs rather than building clusters or similar special-purpose infrastructure—while still benefiting from properties typically associated exclusively with cluster technologies, like high scalability, distributed computing, and fault tolerance. Michael also covers Kafka’s Streams API, its abstractions for streams and tables, and its recently introduced interactive queries functionality. Along the way, Michael shares common use cases that demonstrate that stream processing in practice often requires database-like functionality and how Kafka allows you to bridge the worlds of streams and databases when implementing your own core business applications (for example, in the form of event-driven, containerized microservices). As you’ll see, Kafka makes such architectures equally viable for small-, medium-, and large-scale use cases.

Michael G. Noll

September 19, 2017

Transcript

  1. 1
    Rethinking Stream Processing
    with Apache Kafka:
    Applications vs. Clusters,
    Streams vs. Databases
    An introduction to Kafka’s Streams API
    Target audience: technical staff, developers, architects
    Expected duration: 40 minutes


  2. 2
    Apache Kafka: birthed as a messaging system, now a streaming platform
    2012: 0.7, cluster mirroring
    2013: 0.8, intra-cluster replication
    2015: 0.9, data integration (Connect API)
    2016: 0.10, data processing (Streams API)
    2017: 0.11, exactly-once semantics


  3. 13
    (Does NOT run inside
    the Kafka brokers!)


  4. 14
    (Does NOT run inside
    the Kafka brokers!)


  5. 18
    http://docs.confluent.io/current/streams/kafka-streams-examples/docs/index.html


  6. 21
    Before
    With Kafka’s
    Streams API


  7. 22
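    // DSL example. `builder` is a KStreamBuilder (Kafka 0.10/0.11 API).
    // Integer keys and values are an assumption; the slide omits the types.
    KStream<Integer, Integer> input =
        builder.stream("numbers-topic");

    // Stateless computation: each output depends only on the current record
    KStream<Integer, Integer> doubled =
        input.mapValues(v -> v * 2);

    // Stateful computation: a running sum kept in the "sum-of-odds" store
    KTable<Integer, Integer> sumOfOdds = input
        .filter((k, v) -> v % 2 != 0)   // keep odd values only
        .selectKey((k, v) -> 1)         // route all records to a single key
        .groupByKey()
        .reduce((v1, v2) -> v1 + v2, "sum-of-odds");

    // Processor API example: prints every value it receives
    class PrintToConsoleProcessor<K, V>
        implements Processor<K, V> {
      @Override
      public void init(ProcessorContext context) {}

      @Override
      public void process(K key, V value) {
        System.out.println("Got value " + value);
      }

      @Override
      public void punctuate(long timestamp) {}

      @Override
      public void close() {}
    }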
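
To make the two DSL computations on this slide concrete, here is a minimal sketch in plain Java that mimics their semantics on an in-memory list, with no Kafka dependency. The class `SumOfOddsSketch` and its methods are hypothetical and not part of Kafka's API; the real Streams API processes records continuously and keeps the running sum in a fault-tolerant state store, and record caching may coalesce some of the intermediate updates shown here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the slide's topology; not part of Kafka's API.
class SumOfOddsSketch {

    // Stateless: each output depends only on the current input value,
    // in the spirit of input.mapValues(v -> v * 2).
    static List<Integer> doubled(List<Integer> input) {
        List<Integer> out = new ArrayList<>();
        for (int v : input) {
            out.add(v * 2);
        }
        return out;
    }

    // Stateful: the running sum must be remembered between records,
    // in the spirit of filter(odd) followed by groupByKey().reduce(sum).
    // Returns the sequence of updates the resulting table would see.
    static List<Integer> sumOfOddsUpdates(List<Integer> input) {
        List<Integer> updates = new ArrayList<>();
        int sum = 0;
        for (int v : input) {
            if (v % 2 != 0) {   // same oddness test as on the slide
                sum += v;
                updates.add(sum);
            }
        }
        return updates;
    }
}
```

For the input 1, 2, 3, 4, 5 the doubled stream is 2, 4, 6, 8, 10, and the table's value for the single key moves through 1, 4, 9.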


  8. 24
    Linux Windows


  9. 30
    http://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple
    https://kafka.apache.org/documentation/streams#streams_duality
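
The stream-table duality that the second link describes can be sketched in a few lines of plain Java. This is an illustration under simplifying assumptions, not Kafka code: the class `DualitySketch` is hypothetical, a `Map` stands in for a table, a list of key-value pairs stands in for a changelog stream, and deletions (tombstone records) are ignored.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of stream-table duality; no Kafka involved.
class DualitySketch {

    // Stream -> table: replay the changelog, keeping the latest value per key.
    static Map<String, Integer> toTable(List<Map.Entry<String, Integer>> changelog) {
        Map<String, Integer> table = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> record : changelog) {
            table.put(record.getKey(), record.getValue());
        }
        return table;
    }

    // Table -> stream: emit every current row as a change record.
    static List<Map.Entry<String, Integer>> toStream(Map<String, Integer> table) {
        List<Map.Entry<String, Integer>> stream = new ArrayList<>();
        for (Map.Entry<String, Integer> row : table.entrySet()) {
            stream.add(new SimpleEntry<>(row.getKey(), row.getValue()));
        }
        return stream;
    }
}
```

Replaying the records (alice, 1), (bob, 1), (alice, 2) yields a table with alice = 2 and bob = 1, and turning that table back into a stream and replaying it reproduces the same table.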


  10. 43
    …and many more…


  11. 44
    …and many more…


  12. 47
    Kafka Streams API in the wild
    Timeline: first release of Kafka’s Streams API (0.10.0.0) in 2016; today (2017)
    Kafka 0.10.2.1
    In production at LINE Corp., Japan
    220+ million active users, processing millions of messages per second
    “Applying Kafka Streams for internal message delivery pipeline”
    https://engineering.linecorp.com/en/blog/detail/80


  13. 49
    Supported since Apache Kafka 0.11 (June 2017)
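
In the Streams API, exactly-once processing is switched on through a single configuration setting, `processing.guarantee`, whose default is `at_least_once`. The sketch below builds such a configuration with plain `java.util.Properties`. The class name, the application id, and the broker address are placeholder assumptions; in a real application these properties would be passed to a `KafkaStreams` instance, and the brokers must also run 0.11 or later.

```java
import java.util.Properties;

// Hypothetical helper; only the property keys and the "exactly_once"
// value are real Kafka Streams settings.
class ExactlyOnceConfigSketch {

    static Properties streamsProperties() {
        Properties props = new Properties();
        props.put("application.id", "my-streams-app");    // placeholder name
        props.put("bootstrap.servers", "localhost:9092");  // placeholder address
        // Opt in to exactly-once processing (default: "at_least_once").
        props.put("processing.guarantee", "exactly_once");
        return props;
    }
}
```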


  14. 58
    …and more…


  15. 60
    $ curl -sXGET http://localhost:7070/kafka-music/charts/top-five
    [
    {
    "artist": "Subhumans",
    "album": "Live In A Dive",
    "name": "All Gone Dead",
    "plays": 126
    },
    {
    "artist": "Wheres The Pope?",
    "album": "PSI",
    "name": "Fear Of God",
    "plays": 115
    },
    ...
    ]
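
The request above reads application state over HTTP. As a rough illustration of the kind of lookup that could sit behind such an endpoint, the sketch below ranks songs by play count from an in-memory map. The map is only a stand-in for a local state store, the class and method names are hypothetical, and this is not the code of the kafka-music example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical ranking helper; a Map stands in for a local state store.
class TopChartsSketch {

    // Returns the names of the n most played songs, most played first.
    static List<String> topN(Map<String, Long> playsBySong, int n) {
        List<Map.Entry<String, Long>> entries =
            new ArrayList<>(playsBySong.entrySet());
        // Sort by play count, descending.
        entries.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }
}
```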


  16. 64
    https://kafka.apache.org/documentation/streams
    http://docs.confluent.io/current/streams/
    https://www.confluent.io/downloads/


  17. 65
    KSQL: a Streaming SQL Engine for Apache Kafka™ from Confluent
    KSQL is the simplest way to process streams of data in real time
    ✓ No coding required, all you need is SQL
    ✓ No separate processing cluster required
    ✓ Powered by Kafka: elastic, scalable, distributed, battle-tested
    ✓ Perfect for streaming ETL, anomaly detection, event monitoring, and more
    ✓ Part of Confluent Open Source
    CREATE TABLE possible_fraud AS
      SELECT card_number, count(*)
      FROM authorization_attempts
      WINDOW TUMBLING (SIZE 5 SECONDS)
      GROUP BY card_number
      HAVING count(*) > 3;
    CREATE STREAM vip_actions AS
      SELECT userid, page, action
      FROM clickstream c
      LEFT JOIN users u
      ON c.userid = u.userid
      WHERE u.level = 'Platinum';
    https://github.com/confluentinc/ksql
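
The `possible_fraud` query on this slide counts authorization attempts per card in tumbling 5-second windows and keeps the cards with more than three attempts. The sketch below reproduces that logic in plain Java over in-memory arrays, as an illustration of tumbling-window counting rather than of how KSQL is implemented. The class, the method, and the `card@windowStart` result format are hypothetical, and timestamps are assumed to be non-negative milliseconds.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of a tumbling-window count; not KSQL internals.
class TumblingWindowSketch {

    static final long WINDOW_SIZE_MS = 5_000L;

    // cards[i] attempted an authorization at timestampsMs[i] (>= 0).
    // Returns "card@windowStart" for every window with more than 3 attempts.
    static Set<String> possibleFraud(String[] cards, long[] timestampsMs) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < cards.length; i++) {
            // Tumbling windows do not overlap: each timestamp falls into
            // exactly one window, identified by its start time.
            long windowStart = (timestampsMs[i] / WINDOW_SIZE_MS) * WINDOW_SIZE_MS;
            counts.merge(cards[i] + "@" + windowStart, 1, Integer::sum);
        }
        Set<String> flagged = new TreeSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > 3) {   // HAVING count(*) > 3
                flagged.add(e.getKey());
            }
        }
        return flagged;
    }
}
```

Four attempts for card A at 0, 1000, 2000, and 4999 ms fall into the window starting at 0 and are flagged as `A@0`; a fifth attempt at 5000 ms starts a new window and is counted separately.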
