• In the US: offices in DC, NYC and Richmond, Virginia
• Digital, Big Data and Cloud applications
• Java & Agile expertise
• Open-source projects: JHipster, Tatami, etc.
• @ipponusa
to the customer on their data
• High volume of data
  • 25 million records per day (average)
  • Need to keep at least 60 days of history: 25 million × 60 = 1.5 billion records
  • Seasonal peaks...
• Need a hybrid platform
  • Batch processing for some types of analysis
  • Streaming for other analyses
• Hybrid team
  • Data Scientists: more familiar with Python
  • Software Engineers: Java
implemented by Data Scientists all the time
• Need the implementations to be independent from each other
• One Spark Streaming job per use case
• Microservice-inspired architecture
• Diamond-shaped
  • Upstream jobs are written in Scala
  • Core is made of multiple Python jobs, one per use case
  • Downstream jobs are written in Scala
• Plumbing between the jobs → Kafka
deployments: can roll out new use cases (= new jobs) without stopping existing jobs
• Can roll out updated code without affecting other jobs
• Able to measure the resources consumed by a single job
• Shared services are provided by upstream and downstream jobs
• Topics are divided into partitions
• Order is preserved within a partition
• Partitions are distributed within the cluster (horizontal scaling)
• Partitions are replicated on multiple nodes (configurable)
• Avro recommended!
• Kafka Registry recommended!
• Messages have a key and a value
  • Key is optional
  • Key is often used to determine the partition the message is written to
• Messages are persisted on disk
  • Retention period is configurable
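To illustrate how a key can determine the partition, here is a minimal sketch. It is a simplification, not Kafka's actual code: the class and method names are hypothetical, and `Arrays.hashCode` stands in for the hash function a real client uses (Kafka's default partitioner applies a different hash to the serialized key). What carries over is the principle that the same key always lands in the same partition, so ordering is preserved per key.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Simplified sketch of key-based partition selection (hypothetical class, not Kafka code). */
final class KeyPartitionerSketch {

    /**
     * Maps a message key to a partition index in [0, numPartitions).
     * Assumption: Arrays.hashCode is only a stand-in for the real hash function.
     */
    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        int hash = Arrays.hashCode(keyBytes);
        // Math.floorMod keeps the result non-negative even when the hash is negative.
        return Math.floorMod(hash, numPartitions);
    }

    public static void main(String[] args) {
        int partitions = 4;
        for (String key : new String[] {"alice", "charlie", "alice"}) {
            System.out.println(key + " -> partition " + partitionFor(key, partitions));
        }
    }
}
```

Both "alice" records map to the same partition, which is why a key is a natural way to keep related messages in order.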
read them
• Consumers are responsible for keeping track of offsets
  • An offset is the position within a partition
  • Need to store the offsets reliably
a consumer group
• Each consumer within a consumer group is assigned some partitions
• Reassignment of partitions occurs:
  • When a new consumer joins the group
  • When a consumer dies
• Allows the consumers to balance the load
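The reassignment behaviour can be pictured with a small simulation. This is only a sketch under assumptions: the class name, the consumer names and the plain round-robin strategy are illustrative, and Kafka's real assignors and rebalance protocol are more involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simulates partition assignment inside one consumer group (hypothetical, simplified). */
final class GroupAssignmentSketch {

    /** Spreads partitions 0..numPartitions-1 over the consumers in round-robin order. */
    static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String consumer : consumers) {
            assignment.put(consumer, new ArrayList<>());
        }
        if (consumers.isEmpty()) {
            return assignment; // nobody to assign partitions to
        }
        for (int partition = 0; partition < numPartitions; partition++) {
            String owner = consumers.get(partition % consumers.size());
            assignment.get(owner).add(partition);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // One consumer owns everything, a second one joins, then the first one dies.
        System.out.println(assign(Arrays.asList("c1"), 4));
        System.out.println(assign(Arrays.asList("c1", "c2"), 4));
        System.out.println(assign(Arrays.asList("c2"), 4));
    }
}
```

Each call to `assign` plays the role of one rebalance: the full assignment is recomputed whenever the membership of the group changes.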
framework
• Provides APIs for multiple programming languages
  • Python → Data Scientists
  • Scala/Java → Software Engineers
• Supports batch jobs and streaming jobs, incl. support for Kafka…
Spark + Kafka problems
• Receiver-based approach: not ideal for parallelism
• Direct approach: better for parallelism, but you have to deal with Kafka offsets
of the Kafka topic (or the beginning)
• Documentation → Use checkpoints
  • Tasks have to be Serializable (not always possible: dependent libraries)
  • Harder to deploy the application (classes are serialized) → run a new instance in parallel and kill the first one (harder to automate; messages consumed twice)
  • Requires a shared file system (HDFS, S3) → high latency on these file systems, which forces you to increase the micro-batch interval
Solution: deal with offsets in the Spark Streaming application
• Write the offsets to a reliable storage: ZooKeeper, Kafka…
• Write after processing the data
• Read the offsets on startup (if no offsets, start from the end)
• blog.ippon.tech/spark-kafka-achieving-zero-data-loss/
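The steps above can be sketched without Spark or Kafka. In this hypothetical simulation, a partition is an in-memory list, `OffsetStore` stands in for the reliable storage (ZooKeeper, Kafka…), and all names are assumptions made for illustration. The point is the ordering: process first, then persist the offset, and read the stored offset on startup.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simulates manual offset management for one consumer (hypothetical, simplified). */
final class OffsetTrackingSketch {

    /** Stand-in for a reliable offset storage such as ZooKeeper or Kafka. */
    static final class OffsetStore {
        private final Map<Integer, Long> nextOffsetByPartition = new HashMap<>();

        /** Returns the next offset to read, or null when nothing has been stored yet. */
        Long read(int partition) {
            return nextOffsetByPartition.get(partition);
        }

        void write(int partition, long nextOffset) {
            nextOffsetByPartition.put(partition, nextOffset);
        }
    }

    /**
     * Processes at most maxRecords records of one partition (modelled as a list)
     * and returns the records that were processed during this run.
     */
    static List<String> run(List<String> partitionLog, int partition,
                            OffsetStore store, int maxRecords) {
        Long stored = store.read(partition);
        // No stored offset: start from the end, i.e. skip the records already in the log.
        long position = (stored != null) ? stored : partitionLog.size();
        List<String> processed = new ArrayList<>();
        while (position < partitionLog.size() && processed.size() < maxRecords) {
            processed.add(partitionLog.get((int) position)); // process the record first...
            position++;
            store.write(partition, position);                // ...then persist the offset
        }
        return processed;
    }
}
```

Because the offset is written after the processing step, a crash between the two makes the record be processed again on restart: nothing is lost, but duplicates are possible.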
latency
• Spark Streaming micro-batches → hard to achieve sub-second latency
  • See spark.apache.org/docs/latest/streaming-programming-guide.html#task-launching-overheads
• Total latency of the system = sum of the latencies of each stage
• In this use case, events are independent from each other: no need for windowing computation → a real streaming framework would be more appropriate
• Impact on memory usage
  • Kafka + Spark using the direct approach = 1 RDD partition per Kafka partition
  • If you start the Spark job with lots of unprocessed data in Kafka, RDD partitions can exceed the available memory
(CPU & memory) are allocated per job
• Resources are allocated when the job is submitted and cannot be updated on the fly
• Have to allocate 1 core to the Driver of the job → unused resource
• Have to allocate extra resources to each job to handle variations in traffic → unused resources
• For peak periods, easy to add new Spark Workers but jobs have to be restarted
• Idea to be tested:
  • Over-allocation of real resources, e.g. let Spark know it has 6 cores on a 4-core server
They can contribute
• But shipping code written by Data Scientists is not ideal
  • Need production-grade code (error handling, logging…)
  • Code is less tested than Scala code
  • Harder to deploy than a JAR file → Python Virtual Environments
  • blog.cloudera.com/blog/2015/09/how-to-prepare-your-apache-hadoop-cluster-for-pyspark-jobs/
Driver + 1 Application
• Application = N Executors
• If an Executor dies → restarted (seamless)
• If the Driver dies, the whole Application must be restarted
  • Scala/Java jobs → “supervised” mode
  • Python jobs → not supported with Spark Standalone
to read from Kafka but none to write to Kafka!
• Flink and Kafka Streams do that out-of-the-box
• Cloudera provides an open-source library:
  • github.com/cloudera/spark-kafka-writer
  • (It has since been removed!)
scalable, fault-tolerant, distributed stream processing applications on top of Apache Kafka”
• Works with Kafka ≥ 0.10
• No cluster needed: Kafka is the cluster manager (consumer groups)
• Natively consumes messages from Kafka (and handles offsets)
• Natively pushes produced messages to Kafka
• Processes messages one at a time → low latency, low footprint
Quick example
• Add the Kafka Streams dependency:
    <dependency>
      <groupId>org.apache.kafka</groupId>
      <artifactId>kafka-streams</artifactId>
      <version>0.10.1.0</version>
    </dependency>
• Add the Kafka Streams code (next slide)
• Build and run the JAR
topologies
• Two APIs to define topologies:
  • DSL (preferred): map(), filter(), to()…
  • Processor API (low level): implement the Processor interface then connect source processors, stream processors and sink processors together
partition in the input topic
• A task is an instance of the topology
• Tasks are independent from each other
• The number of processing threads is determined by the developer
    props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, "8");
• Tasks are distributed between threads
the JVM), each consumer process is assigned a portion of the partitions → Consumer group
• Reassignment of partitions occurs:
  • When a new consumer joins the group
  • When a consumer dies
→ Tasks are created/deleted accordingly
Records are independent from each other
• (Do not use log compaction)
Example:
    KStreamBuilder builder = new KStreamBuilder();
    KStream<String, String> stream = builder.stream(Serdes.String(), Serdes.String(), "input-topic");
Example (inspired by the documentation):
• Sum values as records arrive
• Records (with the running sum):
  • (alice, 1) = 1
  • (charlie, 1) = 2
  • (alice, 3) = 5 → Adds to (alice, 1)
New records with the same key are an update of previously received records for the same key
• Keys are required
• Requires a state store
Example:
    KStreamBuilder builder = new KStreamBuilder();
    KTable<String, String> table = builder.table(Serdes.String(), Serdes.String(), "input-topic", "store-name");
Example (inspired by the documentation):
• Sum values as records arrive
• Records (with the running sum):
  • (alice, 1) = 1
  • (charlie, 1) = 2
  • (alice, 3) = 4 → Replaces (alice, 1)
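The difference between the two interpretations of the same records can be reproduced in plain Java. This sketch does not use the Kafka Streams API: the class and method names are hypothetical, and a list of key/value pairs stands in for the input topic.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Contrasts stream semantics and table semantics on the same records (hypothetical sketch). */
final class StreamVsTableSketch {

    /** KStream-like reading: every record is an independent fact, so all values are added. */
    static int sumAsStream(List<Map.Entry<String, Integer>> records) {
        int sum = 0;
        for (Map.Entry<String, Integer> record : records) {
            sum += record.getValue();
        }
        return sum;
    }

    /** KTable-like reading: a record replaces the previous value for the same key. */
    static int sumAsTable(List<Map.Entry<String, Integer>> records) {
        Map<String, Integer> latestByKey = new HashMap<>();
        for (Map.Entry<String, Integer> record : records) {
            latestByKey.put(record.getKey(), record.getValue()); // update in place
        }
        int sum = 0;
        for (int value : latestByKey.values()) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> records = new ArrayList<>();
        records.add(new SimpleEntry<String, Integer>("alice", 1));
        records.add(new SimpleEntry<String, Integer>("charlie", 1));
        records.add(new SimpleEntry<String, Integer>("alice", 3));
        System.out.println(sumAsStream(records)); // prints 5
        System.out.println(sumAsTable(records));  // prints 4
    }
}
```

The `latestByKey` map is a tiny version of the state store that a KTable requires: without it, there would be no previous value to replace.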
API
• / flatMapValues: Apply a transformation to the records and create 0/1/n records per input record
• filter: Apply a predicate
• groupBy / groupByKey: Group the records. Followed by a call to reduce, aggregate or count
• join / leftJoin / outerJoin: Join 2 KStreams / KTables (windowed joins)
• to: Writes the records to a Kafka topic
• through: Writes the records to a Kafka topic and builds a new KStream / KTable from this topic
• KTables (by definition, they need to keep previously received values)
• Aggregations (groupBy / groupByKey)
• Windowing operations
• One state store per task (RocksDB or a hash map)
• Backed by internal topics for recovery → fault tolerance
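How an internal topic makes a state store recoverable can be sketched as follows. This is a plain-Java illustration under assumptions, not Kafka Streams code: the class name is hypothetical, a list stands in for the internal changelog topic, and a hash map stands in for the local store.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** A local key/value state whose changes are also appended to a changelog (hypothetical sketch). */
final class ChangelogStoreSketch {

    /** In-memory list standing in for the internal changelog topic. */
    private final List<Map.Entry<String, Long>> changelog;

    /** Local state: a hash map here, where a real deployment might use RocksDB. */
    private final Map<String, Long> state = new HashMap<>();

    ChangelogStoreSketch(List<Map.Entry<String, Long>> changelog) {
        this.changelog = changelog;
        // Recovery: replay every change recorded so far to rebuild the local state.
        for (Map.Entry<String, Long> change : changelog) {
            state.put(change.getKey(), change.getValue());
        }
    }

    /** Updates the local state and records the change so that it survives a crash. */
    void put(String key, long value) {
        state.put(key, value);
        changelog.add(new SimpleEntry<String, Long>(key, value));
    }

    /** Returns the current value for the key, or null if the key is unknown. */
    Long get(String key) {
        return state.get(key);
    }
}
```

Creating a second `ChangelogStoreSketch` on the same list mimics a task restarting on another instance: its state is rebuilt from the changelog, not from the lost local store.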
Log on the first instance:
    11:00:22,331 ...AbstractCoordinator - Successfully joined group auth-converter with generation 1
    11:00:22,332 ...ConsumerCoordinator - Setting newly assigned partitions [AUTH_JSON-2, AUTH_JSON-1, AUTH_JSON-3, AUTH_JSON-0] for group auth-converter
on the remaining instance:
    11:02:13,410 ...ConsumerCoordinator - Revoking previously assigned partitions [AUTH_JSON-1, AUTH_JSON-0] for group auth-converter
    11:02:13,415 ...ConsumerCoordinator - Setting newly assigned partitions [AUTH_JSON-2, AUTH_JSON-1, AUTH_JSON-3, AUTH_JSON-0] for group auth-converter
from Scala 2.10 to 2.11 and enabled the -Xexperimental flag of the Scala compiler so that Scala lambdas are converted into Java lambdas (SAM support)
• Removed lots of specific code to read from / write to Kafka (supported out-of-the-box with Kafka Streams)
• API similar to the RDD API → Very straightforward conversion (no need to call foreachRDD, so even better!)
• Conversion of Spark / Python code: not attempted
metrics (e.g. number of records processed)
• Used Dropwizard Metrics (metrics.dropwizard.io)
  • Java API to calculate metrics and send them to various sinks
• Used InfluxDB to store the metrics
  • Graphite compatible
• Used Grafana to display the metrics as graphs
• Each instance reports its own metrics → Need to aggregate metrics
offsets)
• Simpler packaging (no dependencies to exclude, fewer dependency version conflicts)
• Much lower latency: from seconds to milliseconds
• Reduced memory footprint
• Easier scaling
• Improved stability when restarting the application
• Cons
  • No UI
  • No centralized logs → Use ELK or equivalent…
  • No centralized metrics → Aggregate metrics
  • Have to know on which servers the application is running
Kafka
• Great fit for micro-services
• Compared to Spark Streaming:
  • Better for real-time apps
  • Lower latency, lower memory footprint, easier scaling
  • Lower level: good for prod, lacks a UI for dev
• Compared to a standard Kafka consumer:
  • Higher level: faster to build a sophisticated app
  • Less control for very fine-grained consumption