From Messaging to Logs with Apache Kafka - OUGN17

#ougn17 messaging → logs @apachekafka Jorge Quilcate Otoya @jeqo89

#ougn17 About me Jorge Quilcate Otoya Back-end/Integration Developer at Sysco
Middleware @jeqo89 | github.com/jeqo | jeqo.github.io

#ougn17 Context

#ougn17 Context: Messaging
“Technology that enables asynchronous communication … Channels, also known as queues, are the logical pathways that connect the programs and convey messages … A sender or producer is a program that sends a message by writing the message to a channel. A receiver or consumer is a program that receives a message by reading (and deleting) it from a channel.”
Enterprise Integration Patterns - Gregor Hohpe and Bobby Woolf http://www.enterpriseintegrationpatterns.com/patterns/messaging/Introduction.html

#ougn17 Message Channels: Point-to-Point, Pub/Sub

#ougn17 Context: Logs
Records appended to the end of the Log... Each record has a Key… Records are ordered… Order defines a notion of “time”... Content is not important at this point, could be anything … They record what happened and when.
The Log: What every software engineer should know about real-time data's unifying abstraction - Jay Kreps https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
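
The log described on this slide can be sketched as a toy in-memory structure. This is only an illustration of "append at the end, read by position"; the `Log` class and its method names are hypothetical and have nothing to do with Kafka's actual storage format or API.

```python
class Log:
    """Toy append-only log: records are only ever added at the end."""

    def __init__(self):
        self._records = []

    def append(self, key, value):
        """Append a record and return its offset (its position in the log)."""
        self._records.append((key, value))
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records (offset, key, value) tuples starting at offset."""
        chunk = self._records[offset:offset + max_records]
        return [(offset + i, key, value) for i, (key, value) in enumerate(chunk)]


log = Log()
first = log.append("user-1", "logged in")
second = log.append("user-2", "logged in")
third = log.append("user-1", "logged out")
entries = log.read(0)
```

The offsets play the role of the "notion of time" mentioned above: a record with a lower offset was appended earlier, whatever its content is.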

#ougn17 Logs… Logs everywhere How does your database store data
on disk reliably? It uses a log. How does one database replica synchronise with another replica? It uses a log. How does activity data get recorded in a system like Apache Kafka? It uses a log. How will the data infrastructure of your application remain robust at scale? Guess what… Using logs to build a solid data infrastructure (or why dual writes are a bad idea) - Martin Kleppmann https://www.confluent.io/blog/using-logs-to-build-a-solid-data-infrastructure-or-why-dual-writes-are-a-bad-idea/ https://www.confluent.io/blog/turning-the-database-inside-out-with-apache-samza/

#ougn17 Log-Centric Architecture (a.k.a. Kappa) “A system that assumes an
external log is present allows the individual systems to relinquish a lot of their own complexity and rely on the shared log.” The Log: What every software engineer should know about real-time data's unifying abstraction - Jay Kreps https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying http://milinda.pathirage.org/kappa-architecture.com/

#ougn17 Use Cases

#ougn17 Messaging use-case: Job Queues
Fire and Forget. Store and Forward (a.k.a. Push Model). Broker in charge of the delivery.
Event sourcing and stream processing at scale - Martin Kleppmann https://martin.kleppmann.com/2016/01/29/event-sourcing-stream-processing-at-ddd-europe.html
Implementations: JMS/AMQP

#ougn17 Messaging Challenges Out-of-order when messages are retried Risk of
inconsistencies in different clients (producers and/or consumers)

#ougn17 Solving Messaging Challenges with Logs Ordering and Reprocessing

#ougn17 Logs use-case: Event Log
Pull Model. Ordered stream of Events. Consumers in control of message consumption.
Event sourcing and stream processing at scale - Martin Kleppmann https://martin.kleppmann.com/2016/01/29/event-sourcing-stream-processing-at-ddd-europe.html
Implementations: Apache Kafka, Amazon Kinesis Streams, Apache DistributedLog (incubating - Twitter)
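
A rough way to see the difference between the two models: in a queue, receiving a message removes it; in a log, records stay put and every consumer keeps its own position. The sketch below is a toy model under that assumption; `PullConsumer`, `poll` and `seek` are illustrative names here, not the Kafka client API.

```python
from collections import deque

# Messaging (queue) model: receiving a message deletes it from the channel.
queue = deque(["e1", "e2", "e3"])
received = queue.popleft()  # "e1" is now gone for every other receiver

# Log (pull) model: records are never removed by reading them.
event_log = ["e1", "e2", "e3"]


class PullConsumer:
    """Toy consumer that tracks its own position in a shared log."""

    def __init__(self, log):
        self.log = log
        self.position = 0

    def poll(self, max_records=1):
        """Fetch the next records and advance this consumer's position."""
        batch = self.log[self.position:self.position + max_records]
        self.position += len(batch)
        return batch

    def seek(self, position):
        """Move this consumer's position, e.g. back to 0 to reprocess."""
        self.position = position


billing = PullConsumer(event_log)
audit = PullConsumer(event_log)

billing_first = billing.poll(2)   # billing reads ahead
audit_first = audit.poll(1)       # audit is unaffected by billing's progress
billing.seek(0)                   # rewind
billing_replay = billing.poll(3)  # and reprocess everything
```

Because each consumer owns its position, one slow or rewinding consumer does not disturb the others, which is what makes reprocessing cheap in the log model.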

#ougn17 Apache Kafka A Distributed Streaming Platform

#ougn17 Apache Kafka: Facts
➔ Born from the necessity to solve the data pipeline problem at LinkedIn.
➔ First use-cases: collecting system metrics and monitoring user activity.
2010: Open-sourced
2011: Apache project
2012: Graduated from incubator in October
2014: Confluent Inc. founded
Kafka: The Definitive Guide - Neha Narkhede, Gwen Shapira & Todd Palino

#ougn17 Apache Kafka: Use-cases ➔ Activity Tracking ➔ Messaging ➔
Metrics/Logging ➔ Commit Log ➔ Change Data Capture (CDC) ➔ Stream Processing ➔ Cloud Adoption ➔ …

#ougn17 Messaging Batch Database

#ougn17 Apache Kafka Tour (v0.10.2.0) Kafka Cluster Log Records Kafka
Producer API Kafka Consumer API Kafka Streams API Kafka Connect API Kafka ++

#ougn17 Kafka Core

#ougn17 Kafka Cluster

#ougn17 Kafka Topology: Why Zookeeper?
Centralized coordination service: consensus, group management, presence protocols, atomic broadcast
Kafka’s internal “source of truth”
Used for:
➔ Master election
➔ Replica propagation (ISR)
➔ And more
Distributed Consensus Reloaded: Apache Zookeeper and Replication in Kafka - Flavio Junqueira https://www.confluent.io/blog/distributed-consensus-reloaded-apache-zookeeper-and-replication-in-kafka/

#ougn17 Balance Availability and Consistency
Use case #1: Activity Tracking
➔ Retention: 3 days
➔ More Partitions
➔ Lower Replication Factor
➔ Availability is most important
Use case #2: Inventory adjustments
➔ Retention: 6 months
➔ Fewer Partitions
➔ Higher Replication Factor
➔ Consistency is most important
Streaming in Practice: Putting Kafka in Production - Roger Hoover https://www.confluent.io/apache-kafka-talk-series/Streaming-in-Practice-Putting-Kafka-in-Production/

#ougn17

#ougn17 Log Record

#ougn17 from Topics to Partitions http://kafka.apache.org/documentation

#ougn17 from Partitions to Segments https://www.confluent.io/apache-kafka-talk-series/deep-dive-into-apache-kafka/ https://www.confluent.io/apache-kafka-talk-series/

#ougn17 from Segments to Records https://www.confluent.io/apache-kafka-talk-series/deep-dive-into-apache-kafka/ https://www.confluent.io/apache-kafka-talk-series/

#ougn17 Log Unit: Record https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol

#ougn17 Schema Evolution: Why Avro?
Reader’s schema and writer’s schema do not have to be the same
Forward/Backward compatibility
➔ Add/remove fields with default values
➔ Explicit `null` type (no optional or required markers)
➔ Change data types
➔ Change names (using aliases)
Designing Data-Intensive Applications - Martin Kleppmann
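
The idea that reader and writer schemas may differ can be illustrated with a toy resolution step. This sketch does not use the Avro library; `resolve`, `REQUIRED` and the field names are hypothetical, and it only mimics three rules: missing fields fall back to the reader's default, fields unknown to the reader are ignored, and reader-side aliases match renamed fields.

```python
REQUIRED = object()  # marker: the reader declares no default for this field


def resolve(record, reader_fields, aliases=None):
    """Project a decoded writer record onto the reader's fields.

    reader_fields maps field name -> default value (or REQUIRED).
    aliases maps a reader field name -> list of older/other names it accepts.
    """
    aliases = aliases or {}
    result = {}
    for name, default in reader_fields.items():
        for candidate in [name] + aliases.get(name, []):
            if candidate in record:
                result[name] = record[candidate]
                break
        else:
            # Writer did not provide the field: only a default can save us.
            if default is REQUIRED:
                raise ValueError(f"no value or default for field {name!r}")
            result[name] = default
    return result


# Old data read with a newer schema (field renamed, field added with default).
old_record = {"id": 1, "name": "Ada"}
new_reader = {"id": REQUIRED, "full_name": REQUIRED, "email": None}
upgraded = resolve(old_record, new_reader, aliases={"full_name": ["name"]})

# New data read with an older schema (extra field is simply ignored).
new_record = {"id": 2, "full_name": "Grace", "email": "grace@example.org"}
old_reader = {"id": REQUIRED, "name": REQUIRED}
downgraded = resolve(new_record, old_reader, aliases={"name": ["full_name"]})
```

The first call corresponds to backward compatibility (new reader, old data), the second to forward compatibility (old reader, new data).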

#ougn17 Lab: Kafka Cluster Scalability: Cluster and Brokers Topics: Partitions,
Replication, ISR Cleaning up: Compaction and Retention
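
The two clean-up strategies named in this lab can be sketched on a toy list of records. This is a simplification under stated assumptions: records are `(offset, timestamp, key, value)` tuples, and the functions `compact` and `retain` are hypothetical. Real Kafka applies both policies per log segment rather than per record.

```python
def compact(records):
    """Compaction: keep only the latest record per key; offsets are preserved."""
    latest = {}
    for record in records:
        key = record[2]
        latest[key] = record  # a later record for the same key wins
    return sorted(latest.values())  # offsets are unique, so this sorts by offset


def retain(records, now, retention_seconds):
    """Time-based retention: drop records older than the retention period."""
    return [r for r in records if now - r[1] <= retention_seconds]


records = [
    (0, 100, "sku-1", 5),
    (1, 200, "sku-2", 7),
    (2, 300, "sku-1", 3),
    (3, 400, "sku-3", 1),
]

compacted = compact(records)
retained = retain(records, now=450, retention_seconds=200)
```

Compaction keeps the last known value for every key regardless of age, while retention keeps everything recent regardless of key.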

#ougn17 Lab: Log Record Record Structure: Key/Value Serialization/Deserialization Metadata: Offset/Timestamp

#ougn17 Kafka Clients API

#ougn17 Kafka Clients survey https://www.confluent.io/blog/first-annual-state-apache-kafka-client-use-survey

#ougn17 Kafka Producer API

#ougn17 Batching and Compression
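
A small experiment shows why batching and compression go together: similar records compress far better as a group than one by one. The sketch below uses `zlib` from the standard library as a stand-in codec and made-up JSON records; in the Kafka producer the corresponding knobs are `batch.size`, `linger.ms` and `compression.type`.

```python
import json
import zlib

# 100 small, similar records (made-up data for illustration).
records = [
    json.dumps({"user": f"user-{i % 10}", "action": "page_view"}).encode("utf-8")
    for i in range(100)
]


def batches(items, batch_size):
    """Split items into consecutive batches of at most batch_size."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Option 1: compress every record on its own.
individually = sum(len(zlib.compress(record)) for record in records)

# Option 2: group records into batches, then compress each batch as one unit.
batched = sum(len(zlib.compress(b"\n".join(batch))) for batch in batches(records, 25))

# A compressed batch can be expanded back into the original records.
restored = zlib.decompress(zlib.compress(b"\n".join(records[:25]))).split(b"\n")
```

The trade-off is latency: waiting to fill a batch delays the first record in it, which is the reason batching is configurable rather than fixed.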

#ougn17 Acknowledgment: Latency vs Durability Ack=0 → No network delay
→ some data loss

#ougn17 Acknowledgment: Latency vs Durability Ack=1 → 1 network round-trip
→ little data loss

#ougn17 Acknowledgment: Latency vs Durability Ack=all (-1) → 2 network
round-trips → no data loss (in combination with `min.insync.replicas`)
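
The three acknowledgment levels can be summarised as "how many replica writes does the producer wait for". The function below is a toy model, not broker code: `writes_awaited` and `NotEnoughReplicas` are names chosen for this sketch, and the only rule it encodes is that with acks=all the write is rejected when the in-sync replica set is smaller than `min.insync.replicas`.

```python
class NotEnoughReplicas(Exception):
    """Raised in this sketch when acks=all cannot meet min.insync.replicas."""


def writes_awaited(acks, isr_size, min_insync_replicas=1):
    """Return the number of replica writes the producer waits for."""
    if acks == 0:
        return 0  # fire and forget: no acknowledgment at all
    if acks == 1:
        return 1  # leader only: lost if the leader dies before replicating
    if acks in ("all", -1):
        if isr_size < min_insync_replicas:
            raise NotEnoughReplicas(
                f"{isr_size} in-sync replicas, {min_insync_replicas} required"
            )
        return isr_size  # every replica currently in sync
    raise ValueError(f"unsupported acks value: {acks!r}")


no_ack = writes_awaited(0, isr_size=3)
leader_only = writes_awaited(1, isr_size=3)
all_replicas = writes_awaited("all", isr_size=3, min_insync_replicas=2)
```

Note that acks=all on its own waits only for whatever replicas are in sync at that moment, which may be just the leader; `min.insync.replicas` is what puts a floor under that number.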

#ougn17 Lab: Kafka Producer Batching and Compression Acknowledgements

#ougn17 Kafka Consumer API

#ougn17 Multiple Consumers
➔ Consumer Groups as Logical Subscribers
➔ Offset by Consumer instance (group member)
➔ Consumer Groups as base of parallelism, with Partitions
➔ Ordering ensured by partition (+ keyed topics is normally enough)
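
Why keyed topics are "normally enough" for ordering can be shown with a toy partitioner: all records with the same key land in the same partition, and each partition has exactly one owner inside a consumer group. In this sketch `zlib.crc32` stands in for Kafka's default key hash (which is murmur2), and `partition_for` and `assign` are hypothetical helpers.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a key to a partition; the same key always maps to the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign(partition_ids, members):
    """Round-robin sketch: every partition gets exactly one owner in the group."""
    return {p: members[i % len(members)] for i, p in enumerate(partition_ids)}


NUM_PARTITIONS = 3
partitions = {p: [] for p in range(NUM_PARTITIONS)}

events = [
    ("order-1", "created"),
    ("order-2", "created"),
    ("order-1", "paid"),
    ("order-1", "shipped"),
]
for key, value in events:
    partitions[partition_for(key, NUM_PARTITIONS)].append((key, value))

# All events of order-1 sit in one partition, in the order they were produced.
order_1_partition = partitions[partition_for("order-1", NUM_PARTITIONS)]
order_1_events = [value for key, value in order_1_partition if key == "order-1"]

ownership = assign(list(partitions), ["consumer-a", "consumer-b"])
```

Ordering across different keys is not guaranteed in this model, only ordering within one key, which is the guarantee most use cases actually need.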

#ougn17 At-Most-Once Delivery
➔ Scenario: the consumer process crashes after saving its position but before saving the output of its message processing.
➔ Result: in this case the process that took over processing would start at the saved position even though a few messages prior to that position had not been processed.

#ougn17 At-Least-Once Delivery
➔ Scenario: the consumer process crashes after processing messages but before saving its position.
➔ Result: in this case when the new process takes over the first few messages it receives will already have been processed.
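
The at-most-once and at-least-once scenarios differ only in the order of two steps: saving the position and saving the output. The simulation below is a toy model of that ordering with an injected crash; `consume`, `run_with_one_crash` and the `Crash` exception are names invented for this sketch.

```python
class Crash(Exception):
    """Simulated process crash between the two save steps."""


def consume(messages, state, commit_first, crash_on=None):
    """Process messages from the saved position, optionally crashing once."""
    while state["position"] < len(messages):
        offset = state["position"]
        if commit_first:
            state["position"] = offset + 1            # save position first...
            if offset == crash_on:
                raise Crash()                         # ...crash before saving output
            state["output"].append(messages[offset])
        else:
            state["output"].append(messages[offset])  # save output first...
            if offset == crash_on:
                raise Crash()                         # ...crash before saving position
            state["position"] = offset + 1


def run_with_one_crash(messages, commit_first, crash_on):
    """Run a consumer that crashes once, then let a new process take over."""
    state = {"position": 0, "output": []}
    try:
        consume(messages, state, commit_first, crash_on)
    except Crash:
        pass
    consume(messages, state, commit_first)  # restart from the saved position
    return state["output"]


messages = ["m0", "m1", "m2"]
at_most_once = run_with_one_crash(messages, commit_first=True, crash_on=1)
at_least_once = run_with_one_crash(messages, commit_first=False, crash_on=1)
```

With the position saved first, `m1` is lost; with the output saved first, `m1` is processed twice. Neither ordering alone gives exactly-once.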

#ougn17 Exactly-Once Delivery “Exactly-once delivery requires co-operation with the destination
storage system …” Coming soon (KIP-98/KIP-129): • Idempotent Producer Guarantees • Transactional Guarantees • Streams Exactly-Once semantics
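
The idempotent producer idea from KIP-98 can be sketched as deduplication on the partition side: each producer tags its records with an ID and an increasing sequence number, and a retried record with an already-seen sequence is dropped. The `Partition` class below is a toy model of that idea only; it ignores the proposal's other parts (epochs, out-of-order errors, transactions).

```python
class Partition:
    """Toy partition that rejects duplicate (producer_id, sequence) writes."""

    def __init__(self):
        self.records = []
        self.last_sequence = {}  # producer id -> highest sequence appended

    def append(self, producer_id, sequence, value):
        """Append the record unless it was already written; return True if appended."""
        if sequence <= self.last_sequence.get(producer_id, -1):
            return False  # duplicate caused by a retry: ignore it
        self.records.append(value)
        self.last_sequence[producer_id] = sequence
        return True


partition = Partition()
first_write = partition.append("producer-1", 0, "a")
second_write = partition.append("producer-1", 1, "b")
# The acknowledgment for "b" was lost, so the producer retries the same record.
retry_write = partition.append("producer-1", 1, "b")
```

This turns at-least-once sending into an effectively-once append on a single partition, without the producer needing to know whether its earlier attempt succeeded.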

#ougn17 Lab: Kafka Consumer Consumer Groups: Parallelism Rewind Offsets: Control
and reprocessing (https://jeqo.github.io/post/2017-01-31-kafka-rewind-consumers-offset/)

#ougn17 Kafka Streams API & Kafka Connector API

#ougn17 Kafka Streams API & Kafka Connector API Unifying Stream
Processing and Interactive Queries in Apache Kafka - Eno Thereska https://www.confluent.io/blog/unifying-stream-processing-and-interactive-queries-in-apache-kafka/

#ougn17 Kafka Streams https://twitter.com/lcrsilveira/status/829615803133730816 https://twitter.com/jessetanderson/status/830113106277785600

#ougn17 Kafka Connect HDFS, JDBC, GoldenGate, Elasticsearch, Couchbase, DataStax, Cassandra,
Attunity, Azure IoTHub, SAP Hana, VoltDb, FTP, JMS, JMX, MongoDB, Solr, Splunk, RethinkDB, SQS, S3, MQTT, Redis, InfluxDB, HBase, Hazelcast, Twitter, and more...

#ougn17 Lab: Kafka Streams & Kafka Connector Twitter/File Connectors “Simplified
Consumer” Stream/Table Duality Stateful processing (Time Window)*
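
Stream/Table duality can be shown in a few lines: a table is what you get by replaying a stream of updates, and every change to a table can be emitted as a stream. The sketch below is plain Python, not the Kafka Streams API; `count_updates` and `table_from_stream` are hypothetical helper names.

```python
def count_updates(words):
    """Table -> stream: maintain a count table and emit every update to it."""
    counts = {}
    updates = []
    for word in words:
        counts[word] = counts.get(word, 0) + 1
        updates.append((word, counts[word]))
    return counts, updates


def table_from_stream(changelog):
    """Stream -> table: replay updates, keeping the latest value per key."""
    table = {}
    for key, value in changelog:
        table[key] = value
    return table


counts, updates = count_updates(["kafka", "logs", "kafka"])
rebuilt = table_from_stream(updates)
```

Rebuilding the table from its changelog is also how state can be restored after a failure: replay the stream and the table reappears.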

#ougn17 Kafka++

#ougn17 Confluent Platform: Apache Kafka Enterprise Edition

#ougn17 Integration with Kafka Integration Platforms: ➔ Camel http://camel.apache.org/kafka.html ➔
Akka Streams http://doc.akka.io/docs/akka-stream-kafka/current/home.html ➔ Oracle Service Bus http://www.ateam-oracle.com/osb-transport-for-apache-kafka-part-1/

#ougn17 What’s in discussion and/or coming soon?
Exactly-once Delivery / Txn Messaging (adopted - wip) https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging
Headers support (additional metadata) (vote) https://cwiki.apache.org/confluence/display/KAFKA/KIP-82+-+Add+Record+Headers
ZStandard Compression support (discussion) https://cwiki.apache.org/confluence/display/KAFKA/KIP-110%3A+Add+Codec+for+ZStandard+Compression
Reset Offset tool (vote) https://cwiki.apache.org/confluence/display/KAFKA/KIP-122%3A+Add+a+tool+to+Reset+Consumer+Group+Offsets
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals

#ougn17 How NOT to use Kafka
Top 5:
➔ No consideration of data on the inside vs outside
➔ Schema not externally defined
➔ Same config for every client/topic
➔ 128 partitions as default
➔ Running on 8 overloaded nodes
Kafka Summit 2016: 101 ways to config Kafka - Badly https://www.confluent.io/kafka-summit-2016-101-ways-to-configure-kafka-badly
https://cwiki.apache.org/confluence/display/KAFKA/Operations

#ougn17 Further reading

#ougn17 Thanks!!! Twitter: @jeqo89 GitHub: /jeqo Blog: jeqo.github.io Code: github.com/jeqo/talk-kafka-messaging-logs
