
Introduction to Kafka Streams with a Real-Life Example


Alexis Seigneurin

January 11, 2017



Transcript

  1. INTRODUCTION TO
    KAFKA STREAMS
    WITH A REAL-LIFE EXAMPLE
    Alexis Seigneurin


  2. Who I am
    • Software engineer for 15+ years
    • Consultant at Ippon USA, previously at Ippon France
    • Favorite subjects: Spark, Kafka, Machine Learning, Cassandra
    • Spark certified
    • @aseigneurin


  3. • 200 software engineers in France, the US and Australia
    • In the US: offices in DC, NYC and Richmond, Virginia
    • Digital, Big Data and Cloud applications
    • Java & Agile expertise
    • Open-source projects: JHipster, Tatami, etc.
    • @ipponusa


  4. The project


  5. The project
    • Analyze records from customers → Give feedback to the customer on their data
    • High volume of data
    • 25 million records per day (average)
    • Need to keep at least 60 days of history = 1.5 billion records
    • Seasonal peaks...
    • Need a hybrid platform
    • Batch processing for some types of analysis
    • Streaming for other analyses
    • Hybrid team
    • Data Scientists: more familiar with Python
    • Software Engineers: Java


  6. Architecture - Real time platform
    • New use cases are implemented by Data Scientists all the time
    • Need the implementations to be independent from each other
    • One Spark Streaming job per use case
    • Microservice-inspired architecture
    • Diamond-shaped
    • Upstream jobs are written in Scala
    • Core is made of multiple Python jobs, one per use case
    • Downstream jobs are written in Scala
    • Plumbing between the jobs → Kafka
    1/2


  7. Architecture - Real time platform 2/2


  8. Modularity
    • One Spark job per use case
    • Hot deployments: can roll out new use cases (= new jobs) without
    stopping existing jobs
    • Can roll out updated code without affecting other jobs
    • Able to measure the resources consumed by a single job
    • Shared services are provided by upstream and
    downstream jobs


  9. Kafka - Reminder


  10. • Publish-subscribe messaging system
    • Originally written at LinkedIn, now an Apache project
    • Works in a cluster (one or more Kafka nodes)
    • ZooKeeper for cluster coordination
    • Can handle massive amounts of messages


  11. Kafka - Key features
    • Messages are organized by topics
    • Topics are divided into partitions
    • Order is preserved within a partition
    • Partitions are distributed within the cluster (horizontal scaling)
    • Partitions are replicated on multiple nodes (configurable)
    1/4


  12. Kafka - Key features
    • Messages are arrays of bytes
    • Avro recommended!
    • Schema Registry recommended!
    • Messages have a key and a value
    • Key is optional
    • Key is often used to determine the partition the message is written to
    • Messages are persisted on the disk
    • Duration of retention is configurable
    2/4
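
    A minimal producer sketch to illustrate these points (assuming String serializers for simplicity, where the deck recommends Avro, and a hypothetical "records" topic):

    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    KafkaProducer<String, String> producer = new KafkaProducer<>(props);
    // The key ("customer-42") is hashed by the default partitioner to pick the partition,
    // so records sharing a key always land in the same partition and keep their order.
    producer.send(new ProducerRecord<>("records", "customer-42", "{\"amount\": 12.5}"));
    producer.close();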


  13. Kafka - Key features
    • Producers publish messages - Consumers read them
    • Consumers are responsible for keeping track of offsets
    • An offset is the position within a partition
    • Need to store the offsets reliably
    3/4
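
    A sketch of a consumer that tracks offsets itself (assuming String messages and a hypothetical "records" topic; auto-commit is disabled so nothing is marked consumed before it has been processed):

    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("group.id", "my-consumer-group");            // hypothetical consumer group
    props.put("enable.auto.commit", "false");              // we commit offsets ourselves
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

    KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
    consumer.subscribe(Collections.singletonList("records"));
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(1000);
        for (ConsumerRecord<String, String> record : records) {
            // record.offset() is the position of this message within its partition
            process(record.value());                       // process() is a hypothetical handler
        }
        consumer.commitSync();                             // store the offsets only after processing
    }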


  14. Kafka - Key features
    • Consumers can be part of a consumer group
    • Each consumer within a consumer group is assigned some partitions
    • Reassignment of partitions occurs:
    • When a new consumer joins the group
    • When a consumer dies
    • Allows the consumers to balance the load
    4/4


  15. Kafka - Remember
    • Partitioning is key
    • It determines how fast you can push data to Kafka
    • It determines how many consumers can consume in parallel


  16. Kafka 101
    • See my post:
    • Kafka, Spark and Avro - Part 1, Kafka 101
    • aseigneurin.github.io/2016/03/02/kafka-spark-avro-kafka-101.html or
    bit.ly/kafka101


  17. Consuming Kafka messages with Spark Streaming
    (and why you probably shouldn’t do it)


  18. Spark + Kafka?
    • Spark has become the de-facto processing framework
    • Provides APIs for multiple programming languages
    • Python → Data Scientists
    • Scala/Java → Software Engineers
    • Supports batch jobs and streaming jobs, incl. support for Kafka…


  19. Consuming from Kafka
    • Connecting Spark to Kafka, 2 methods:
    • Receiver-based approach: not ideal for parallelism
    • Direct approach: better for parallelism but have to deal with Kafka
    offsets
    Spark + Kafka problems


  20. Dealing with Kafka offsets
    • Default: consumes from the end of the Kafka topic (or the
    beginning)
    • Documentation → Use checkpoints
    • Tasks have to be Serializable (not always possible: dependent libraries)
    • Harder to deploy the application (classes are serialized) → run a new instance
    in parallel and kill the first one (harder to automate; messages consumed
    twice)
    • Requires a shared file system (HDFS, S3) → high latency on these file systems, which forces you to increase the micro-batch interval
    1/2
    Spark + Kafka problems
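
    For reference, a sketch of the checkpoint-based approach the documentation recommends (the checkpoint directory and batch interval are hypothetical): the context is rebuilt from the checkpoint on restart, which is what requires Serializable tasks and a shared file system.

    JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(
        "hdfs:///checkpoints/my-job",                       // shared checkpoint directory (HDFS, S3...)
        () -> {
            // conf is a SparkConf built elsewhere
            JavaStreamingContext context = new JavaStreamingContext(conf, Durations.seconds(5));
            context.checkpoint("hdfs:///checkpoints/my-job");
            // build the Kafka DStream and the processing pipeline here
            return context;
        });
    jssc.start();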


  21. Dealing with Kafka offsets
    • Solution: deal with offsets in the Spark Streaming application
    • Write the offsets to a reliable storage: ZooKeeper, Kafka…
    • Write after processing the data
    • Read the offsets on startup (if no offsets, start from the end)
    • blog.ippon.tech/spark-kafka-achieving-zero-data-loss/
    2/2
    Spark + Kafka problems
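
    A sketch of that approach in Java (directStream is assumed to come from KafkaUtils.createDirectStream; storeOffsets() and process() are hypothetical helpers): offsets are captured from each micro-batch and persisted only after the data has been processed.

    directStream.foreachRDD(rdd -> {
        // The direct approach exposes the Kafka offsets of each micro-batch
        OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();

        rdd.foreachPartition(partition -> {
            partition.forEachRemaining(record -> process(record));  // hypothetical handler
        });

        // Only once processing has succeeded, persist the offsets (ZooKeeper, Kafka...)
        storeOffsets(offsetRanges);                                  // hypothetical helper
    });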


  22. Micro-batches
    Spark Streaming processes events in micro-batches
    • Impact on latency
    • Spark Streaming micro-batches → hard to achieve sub-second latency
    • See spark.apache.org/docs/latest/streaming-programming-guide.html#task-launching-overheads
    • Total latency of the system = sum of the latencies of each stage
    • In this use case, events are independent from each other - no need for windowing computation → a
    real streaming framework would be more appropriate
    • Impact on memory usage
    • Kafka+Spark using the direct approach = 1 RDD partition per Kafka partition
    • If you start the Spark job with lots of unprocessed data in Kafka, RDD partitions can exceed the size of the available memory
    Spark + Kafka problems


  23. Allocation of resources in Spark
    • With Spark Streaming, resources (CPU & memory) are allocated per job
    • Resources are allocated when the job is submitted and cannot be updated on the
    fly
    • Have to allocate 1 core to the Driver of the job → unused resource
    • Have to allocate extra resources to each job to handle variations in traffic →
    unused resources
    • For peak periods, easy to add new Spark Workers, but jobs have to be restarted
    • Idea to be tested:
    • Over-allocation of real resources, e.g. let Spark know it has 6 cores on a 4-core server
    Spark + Kafka problems


  24. Python code in production
    • Data Scientists know Python → They can contribute
    • But shipping code written by Data Scientists is not ideal
    • Need production-grade code (error handling, logging…)
    • Code is less tested than Scala code
    • Harder to deploy than a JAR file → Python Virtual Environments
    • blog.cloudera.com/blog/2015/09/how-to-prepare-your-apache-hadoop-cluster-for-pyspark-jobs/
    Spark + Kafka problems


  25. Resilience of Spark jobs
    • Spark Streaming application = 1 Driver + 1 Application
    • Application = N Executors
    • If an Executor dies → restarted (seamless)
    • If the Driver dies, the whole Application must be restarted
    • Scala/Java jobs → “supervised” mode
    • Python jobs → not supported with Spark Standalone
    Spark + Kafka problems


  26. Writing to Kafka
    • Spark Streaming comes with a library to read from Kafka
    but none to write to Kafka!
    • Flink or Kafka Streams do that out-of-the-box
    • Cloudera provides an open-source library:
    • github.com/cloudera/spark-kafka-writer
    • (Has been removed by now!)
    Spark + Kafka problems
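
    Without a writer library, a common workaround is to open a producer inside each partition (a sketch; stream is assumed to be a JavaDStream<String>, and the output topic and serializer settings are assumptions):

    stream.foreachRDD(rdd -> rdd.foreachPartition(partition -> {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // One producer per partition and per micro-batch: simple, but costly
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        partition.forEachRemaining(value -> producer.send(new ProducerRecord<>("output-topic", value)));
        producer.close();
    }));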


  27. Kafka Streams


  28. Kafka Streams
    docs.confluent.io/3.1.0/streams/index.html
    • “powerful, easy-to-use library for building highly scalable, fault-tolerant, distributed
    stream processing applications on top of Apache Kafka”
    • Works with Kafka ≥ 0.10
    • No cluster needed: Kafka is the cluster manager (consumer groups)
    • Natively consumes messages from Kafka (and handles offsets)
    • Natively pushes produced messages to Kafka
    • Processes messages one at a time → low latency, low footprint


  29. • Read text from a topic
    • Process the text:
    • Only keep messages containing the “a” character
    • Capitalize the text
    • Output the result to another topic
    Quick example 1/4


  30. • Create a regular Java application (with a main)
    • Add the Kafka Streams dependency:

    <dependency>
      <groupId>org.apache.kafka</groupId>
      <artifactId>kafka-streams</artifactId>
      <version>0.10.1.0</version>
    </dependency>


    • Add the Kafka Streams code (next slide)
    • Build and run the JAR
    Quick example 2/4


  31. Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "text-transformer");
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(StreamsConfig.ZOOKEEPER_CONNECT_CONFIG, "localhost:2181");
    props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, "8");
    props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

    KStreamBuilder builder = new KStreamBuilder();
    builder.stream(Serdes.String(), Serdes.String(), "text-input")
           .filter((key, value) -> value.contains("a"))
           .mapValues(text -> text.toUpperCase())
           .to(Serdes.String(), Serdes.String(), "text-output");

    KafkaStreams streams = new KafkaStreams(builder, props);
    streams.start();
    Quick example 3/4
    • Application ID = Kafka consumer group
    • Threads for parallel processing (relates to partitions)
    • Topic to read from + key/value deserializers
    • Transformations: map, filter…
    • Topic to write to + key/value serializers


  32. Quick example
    • Create topics:
    • kafka-topics --zookeeper localhost:2181 --create --topic text-input --partitions 1 --replication-factor 1
    • kafka-topics --zookeeper localhost:2181 --create --topic text-output --partitions 1 --replication-factor 1
    • Launch a consumer to display the output:
    • kafka-console-consumer --zookeeper localhost:2181 --topic text-output
    • Launch a producer and type some text:
    • kafka-console-producer --broker-list localhost:9092 --topic text-input
    4/4


  33. Processor topology
    • Need to define one or more processor
    topologies
    • Two APIs to define topologies:
    • DSL (preferred): map(), filter(), to()…
    • Processor API (low level): implement the
    Processor interface then connect source processors,
    stream processors and sink processors together
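
    A sketch of the same text-transformer with the low-level Processor API (assuming the default key/value serdes are set to String serdes in the StreamsConfig properties; the class and processor names are illustrative):

    public class TextProcessor implements Processor<String, String> {
        private ProcessorContext context;
        @Override public void init(ProcessorContext context) { this.context = context; }
        @Override public void process(String key, String value) {
            if (value.contains("a")) {
                context.forward(key, value.toUpperCase());   // send downstream (to the sink)
            }
        }
        @Override public void punctuate(long timestamp) { }   // no scheduled work
        @Override public void close() { }
    }

    TopologyBuilder builder = new TopologyBuilder();
    builder.addSource("Source", "text-input")                        // source processor
           .addProcessor("Transform", TextProcessor::new, "Source")  // stream processor
           .addSink("Sink", "text-output", "Transform");             // sink processor
    KafkaStreams streams = new KafkaStreams(builder, props);         // props as in the quick example
    streams.start();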


  34. Parallelism (one process)
    • Kafka Streams creates 1 task per partition in the input topic
    • A task is an instance of the topology
    • Tasks are independent from each other
    • The number of processing threads is determined by the
    developer
    props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, "8");
    • Tasks are distributed between threads
    1/2


  35. Parallelism (one process)
    • 3 partitions → 3 tasks
    • The tasks are distributed
    to the 2 threads
    2/2


  36. Parallelism (multiple processes)
    • With multiple processes (multiple instances of the JVM),
    each consumer process is assigned a portion of the
    partitions
    → Consumer group
    • Reassignment of partitions occurs:
    • When a new consumer joins the group
    • When a consumer dies
    → Tasks are created/deleted accordingly
    1/2


  37. Parallelism (multiple processes)
    • Partitions are assigned to
    2 consumers
    • 3 partitions → 3 tasks
    (as before)
    • Each thread has one task
    → Improved parallelism
    2/2


  38. KStream vs KTable
    KStream is a stream of records
    • Records are independent from each other
    • (Do not use log compaction)
    Example:
    KStreamBuilder builder = new KStreamBuilder();
    KStream<String, String> stream = builder.stream(Serdes.String(), Serdes.String(), "input-topic");

    Example (inspired from the documentation):
    • Sum values as records arrive
    • Records:
    • (alice, 1) = 1
    • (charlie, 1) = 2
    • (alice, 3) = 5
    → Adds to (alice, 1)


  39. KStream vs KTable
    KTable is a change log stream
    • New records with the same key are an
    update of previously received records
    for the same key
    • Keys are required
    • Requires a state store
    Example:
    KStreamBuilder builder = new KStreamBuilder();
    KTable<String, String> table = builder.table(Serdes.String(), Serdes.String(), "input-topic", "store-name");

    Example (inspired from the documentation):
    • Sum values as records arrive
    • Records:
    • (alice, 1) = 1
    • (charlie, 1) = 2
    • (alice, 3) = 4
    → Replaces (alice, 1)
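
    A sketch of this running-sum example in code (assuming Long values and a hypothetical "values" topic; each variant would run as its own application):

    // KStream variant: every record adds to the total → (alice,1)=1, (charlie,1)=2, (alice,3)=5
    KStreamBuilder streamBuilder = new KStreamBuilder();
    streamBuilder.stream(Serdes.String(), Serdes.Long(), "values")
                 .groupBy((key, value) -> "total", Serdes.String(), Serdes.Long())
                 .reduce((v1, v2) -> v1 + v2, "stream-sum-store")
                 .to(Serdes.String(), Serdes.Long(), "stream-sums");

    // KTable variant: a new record replaces the previous value for its key, so the
    // subtractor first removes the old value → (alice,3) gives 1 + 1 - 1 + 3 = 4
    KStreamBuilder tableBuilder = new KStreamBuilder();
    tableBuilder.table(Serdes.String(), Serdes.Long(), "values", "values-store")
                .groupBy((key, value) -> new KeyValue<>("total", value), Serdes.String(), Serdes.Long())
                .reduce((agg, v) -> agg + v,     // adder
                        (agg, v) -> agg - v,     // subtractor
                        "table-sum-store")
                .to(Serdes.String(), Serdes.Long(), "table-sums");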


  40. API
    • map / mapValues: Apply a transformation to the records
    • flatMap / flatMapValues: Apply a transformation to the records and create 0/1/n records per input record
    • filter: Apply a predicate
    • groupBy / groupByKey: Group the records; followed by a call to reduce, aggregate or count
    • join / leftJoin / outerJoin: Windowed joins of 2 KStreams / KTables
    • to: Writes the records to a Kafka topic
    • through: Writes the records to a Kafka topic and builds a new KStream / KTable from this topic
    1/2


  41. State stores
    • Some operations need to store state
    • KTables (by definition, they need to keep previously received values)
    • Aggregations (groupBy / groupByKey)
    • Windowing operations
    • One state store per task (RocksDB or a hash map)
    • Backed by internal topics for recovery → fault tolerance
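
    A sketch of an aggregation backed by a state store (assuming a hypothetical "page-views" topic keyed by page name): the counts live in a store named "view-counts", kept in RocksDB and recoverable from an internal changelog topic.

    KStreamBuilder builder = new KStreamBuilder();
    KTable<String, Long> viewCounts = builder
        .stream(Serdes.String(), Serdes.String(), "page-views")
        .groupByKey(Serdes.String(), Serdes.String())
        .count("view-counts");    // "view-counts" names the state store (and its changelog)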


  42. Deploying and running
    • Assemble a JAR (maven-shade plugin)
    • Run the JAR as a regular Java application (java -cp …)
    • Make sure all instances are in the same consumer group
    (same application ID)


  43. Running
    Topic “AUTH_JSON” with 4 partitions
    Application ID = “auth-converter”
    Log on the first instance:
    11:00:22,331 ...AbstractCoordinator - Successfully joined group auth-converter with generation 1
    11:00:22,332 ...ConsumerCoordinator - Setting newly assigned partitions [AUTH_JSON-2, AUTH_JSON-1,
    AUTH_JSON-3, AUTH_JSON-0] for group auth-converter


  44. Running - Scaling up
    Start a new instance:
    Log on the first instance:
    11:01:31,402 ...AbstractCoordinator - Successfully joined group auth-converter with generation 2
    11:01:31,404 ...ConsumerCoordinator - Setting newly assigned partitions [AUTH_JSON-2, AUTH_JSON-3] for group
    auth-converter
    11:01:31,390 ...ConsumerCoordinator - Revoking previously assigned partitions [AUTH_JSON-2, AUTH_JSON-1,
    AUTH_JSON-3, AUTH_JSON-0] for group auth-converter
    11:01:31,401 ...ConsumerCoordinator - Setting newly assigned partitions [AUTH_JSON-1, AUTH_JSON-0] for group
    auth-converter


  45. Running - Scaling down
    Kill one of the instances
    Log on the remaining instance:
    11:02:13,410 ...ConsumerCoordinator - Revoking previously assigned partitions [AUTH_JSON-1, AUTH_JSON-0] for
    group auth-converter
    11:02:13,415 ...ConsumerCoordinator - Setting newly assigned partitions [AUTH_JSON-2, AUTH_JSON-1,
    AUTH_JSON-3, AUTH_JSON-0] for group auth-converter


  46. Delivery semantics
    • At least once
    • No messages will be lost
    • Messages can be processed a second time when failure happens
    → Make your system idempotent
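
    One way to make downstream processing idempotent (a sketch; extractId() and transform() are hypothetical helpers, and builder is a KStreamBuilder): key every result by a stable record identifier and write it to a log-compacted topic, so a record processed twice simply overwrites itself instead of producing a duplicate.

    builder.stream(Serdes.String(), Serdes.String(), "events")
           .map((key, value) -> new KeyValue<>(extractId(value), transform(value)))
           .to(Serdes.String(), Serdes.String(), "results");    // "results" is a log-compacted topic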


  47. Kafka Streams on this project


  48. Migration
    • Conversion of Spark / Scala code
    • Upgraded from Scala 2.10 to 2.11 and enabled the -Xexperimental flag of
    the Scala compiler so that Scala lambdas are converted into Java lambdas
    (SAM support)
    • Removed lots of specific code to read from / write to Kafka (supported
    out-of-the-box with Kafka Streams)
    • API similar to the RDD API → Very straightforward conversion (no need to
    call foreachRDD, so even better!)
    • Conversion of Spark / Python code: not attempted


  49. Metrics
    • Kafka Streams doesn’t have a UI to display metrics (e.g. number of records
    processed)
    • Used Dropwizard Metrics (metrics.dropwizard.io)
    • Java API to calculate metrics and send them to various sinks
    • Used InfluxDB to store the metrics
    • Graphite compatible
    • Used Grafana to display the metrics as graphs
    • Each instance reports its own metrics → Need to aggregate metrics


  50. (image-only slide)

  51. Metrics aggregation
    • Specific reporter to send Dropwizard Metrics to a Kafka topic
    • Kafka topic to collect metrics
    • 1 partition
    • Key = instance ID (e.g. app-1, app-2…)
    • Value = monotonic metric
    • Kafka Streams app to aggregate metrics
    • Input is a KTable (new values replace previous values)
    • Send aggregated metrics to InfluxDB


  52. Kafka Streams app to aggregate metrics
    KTable<String, CounterMetric> metricsStream = builder.table(appIdSerde, metricSerde, "metrics", "raw-metrics");

    KStream<String, CounterMetric> metricValueStream = metricsStream
        .groupBy((key, value) -> new KeyValue<>(value.getName(), value), metricNameSerde, metricSerde)
        .reduce(CounterMetric::add, CounterMetric::subtract, "aggregates")
        .toStream();
    metricValueStream.to(metricNameSerde, metricSerde, "metrics-agg");

    // --- Second topology

    GraphiteReporter graphite = GraphiteReporter.builder()
        .hostname("localhost")
        .port(2003)
        .build();

    KStream<String, CounterMetric> aggMetricsStream = builder.stream(metricNameSerde, metricSerde, "metrics-agg");
    aggMetricsStream.foreach((key, metric) -> graphite.send(metric));


  53. Metrics
    Aggregated metric


  54. Send data into Kafka (1M records)
    Start consumer 1
    Start consumer 2
    Aggregated metric (from consumers 1 and 2)
    Stop consumer 2
    Delta = records processed twice


  55. Results
    • Pros
    • Simpler code (no manual handling of offsets)
    • Simpler packaging (no dependencies to exclude, fewer dependency version conflicts)
    • Much lower latency: from seconds to milliseconds
    • Reduced memory footprint
    • Easier scaling
    • Improved stability when restarting the application
    • Cons
    • No UI
    • No centralized logs → Use ELK or equivalent…
    • No centralized metrics → Aggregate metrics
    • Have to know on which servers the application is running


  56. Summary
    &
    Conclusion


  57. Summary
    • Very easy to build pipelines on top of Kafka
    • Great fit for micro-services
    • Compared to Spark Streaming:
    • Better for realtime apps than Spark Streaming
    • Lower latency, lower memory footprint, easier scaling
    • Lower level: good for prod, lacks a UI for dev
    • Compared to a standard Kafka consumer:
    • Higher level: faster to build a sophisticated app
    • Less control for very fine-grained consumption


  58. Thank you!
    @aseigneurin
