From Messaging to Logs with Apache Kafka - OUGN17

Presented at OUGN 2017

Jorge Quilcate

March 10, 2017

Transcript

  1. #ougn17 About me Jorge Quilcate Otoya Back-end/Integration Developer at Sysco

    Middleware @jeqo89 | github.com/jeqo | jeqo.github.io
  2. #ougn17 “Technology that enables asynchronous communication … Channels, also known

    as queues, are the logical pathways that connect the programs and convey messages … A sender or producer is a program that sends a message by writing the message to a channel A receiver or consumer is a program that receives a message by reading (and deleting) it from a channel.” Context: Messaging Enterprise Integration Patterns - Gregor Hohpe and Bobby Woolf http://www.enterpriseintegrationpatterns.com/patterns/messaging/Introduction.html
  3. #ougn17 Context: Logs Records appended to the end of the

    Log... Each record has a Key… Records are ordered… Order defines a notion of “time”... Content is not important at this point, could be anything … They record what happened and when. The Log: What every software engineer should know about real-time data's unifying abstraction - Jay Kreps https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
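The log described on this slide can be sketched in a few lines. This is a minimal in-memory model, not Kafka's implementation: the `Log` class and its `append`/`read` methods are hypothetical names chosen for illustration. It shows the two properties the slide stresses: records are only ever appended, and the position (offset) gives them an order.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional, Tuple


@dataclass
class Log:
    """Append-only sequence of (key, value) records; the offset is the list index."""
    records: List[Tuple[Optional[str], Any]] = field(default_factory=list)

    def append(self, key: Optional[str], value: Any) -> int:
        """Add a record at the end of the log and return its offset."""
        self.records.append((key, value))
        return len(self.records) - 1

    def read(self, offset: int, max_records: int = 10):
        """Return (offset, key, value) tuples from `offset` on; reading deletes nothing."""
        end = min(offset + max_records, len(self.records))
        return [(i, *self.records[i]) for i in range(offset, end)]


log = Log()
log.append("user-1", "logged_in")
log.append("user-2", "logged_in")
last_offset = log.append("user-1", "logged_out")

first_read = log.read(0)
second_read = log.read(0)  # the same records again: the log is not a queue
```

Unlike the messaging channel on the previous slide, reading does not remove anything, so any number of readers can replay the same history.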
  4. #ougn17 Logs… Logs everywhere How does your database store data

    on disk reliably? It uses a log. How does one database replica synchronise with another replica? It uses a log. How does activity data get recorded in a system like Apache Kafka? It uses a log. How will the data infrastructure of your application remain robust at scale? Guess what… Using logs to build a solid data infrastructure (or why dual writes are a bad idea) - Martin Kleppmann https://www.confluent.io/blog/using-logs-to-build-a-solid-data-infrastructure-or-why-dual-writes-are-a-bad-idea/ https://www.confluent.io/blog/turning-the-database-inside-out-with-apache-samza/
  5. #ougn17 Log-Centric Architecture (a.k.a. Kappa) “A system that assumes an

    external log is present allows the individual systems to relinquish a lot of their own complexity and rely on the shared log.” The Log: What every software engineer should know about real-time data's unifying abstraction - Jay Kreps https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying http://milinda.pathirage.org/kappa-architecture.com/
  6. #ougn17 Messaging use-case: Job Queues Fire and Forget Store and

    Forward (a.k.a. Push Model) Broker in charge of the delivery Event sourcing and stream processing at scale - Martin Kleppmann https://martin.kleppmann.com/2016/01/29/event-sourcing-stream-processing-at-ddd-europe.html Implementations: JMS/AMQP
  7. #ougn17 Messaging Challenges Out-of-order when messages are retried Risk of

    inconsistencies in different clients (producers and/or consumers)
  8. #ougn17 Logs use-case: Event Log Pull Model Ordered stream of

    Events Consumers in control of message consumption Event sourcing and stream processing at scale - Martin Kleppmann https://martin.kleppmann.com/2016/01/29/event-sourcing-stream-processing-at-ddd-europe.html Implementations: Apache Kafka, Amazon Kinesis Streams, Apache DistributedLog (incubating - Twitter)
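A rough sketch of what "consumers in control of message consumption" means in the pull model. The `PullConsumer` class is a hypothetical stand-in, not the Kafka Consumer API; the assumption is simply that each consumer owns its read position and the log itself is never modified by reads.

```python
class PullConsumer:
    """Reads from a shared event list at its own pace and tracks its own position."""

    def __init__(self, log):
        self.log = log      # shared ordered events; consumers never modify it
        self.position = 0   # next offset to read, owned by this consumer

    def poll(self, max_records=2):
        """Fetch up to `max_records` events and advance the position."""
        batch = self.log[self.position:self.position + max_records]
        self.position += len(batch)
        return batch

    def seek(self, offset):
        """Move the position, for example back to 0 to reprocess everything."""
        self.position = offset


events = ["created", "paid", "shipped"]
fast, slow = PullConsumer(events), PullConsumer(events)

fast_batch = fast.poll(max_records=3)  # one consumer reads everything at once
slow_batch = slow.poll(max_records=1)  # another lags behind without affecting the first

fast.seek(0)                           # rewinding is a local decision
replayed = fast.poll(max_records=3)
```

Because the broker does not push or delete per consumer, a slow consumer only delays itself.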
  9. #ougn17 Apache Kafka: Facts ➔ Born from necessity to solve

    the data pipeline problem at LinkedIn. ➔ First use-cases: collecting system metrics and monitoring user activity. 2010: Open-sourced 2011: Apache project 2012: Graduated from incubator in October 2014: Confluent Inc. founded Kafka: The Definitive Guide - Neha Narkhede, Gwen Shapira & Todd Palino
  10. #ougn17 Apache Kafka: Use-cases ➔ Activity Tracking ➔ Messaging ➔

    Metrics/Logging ➔ Commit Log ➔ Change Data Capture (CDC) ➔ Stream Processing ➔ Cloud Adoption ➔ …
  11. #ougn17 Apache Kafka Tour (v0.10.2.0) Kafka Cluster Log Records Kafka

    Producer API Kafka Consumer API Kafka Streams API Kafka Connect API Kafka ++
  12. #ougn17 Centralized coordination service: consensus, group management, presence protocols, atomic

    broadcast Kafka’s internal “source of truth” Used for: ➔ Master election ➔ Replica propagation (ISR) ➔ And more Kafka Topology: Why Zookeeper? Distributed Consensus Reloaded: Apache Zookeeper and Replication in Kafka - Flavio Junqueira https://www.confluent.io/blog/distributed-consensus-reloaded-apache-zookeeper-and-replication-in-kafka/
  13. #ougn17 Balance Availability and Consistency Use case #1 Activity Tracking

    ➔ Retention: 3 days ➔ More Partitions ➔ Lower Replication Factor ➔ Availability is most important Use case #2 Inventory adjustments ➔ Retention: 6 months ➔ Fewer Partitions ➔ Higher Replication Factor ➔ Consistency is most important Streaming in Practice: Putting Kafka in Production - Roger Hoover https://www.confluent.io/apache-kafka-talk-series/Streaming-in-Practice-Putting-Kafka-in-Production/
  14. #ougn17 Schema Evolution: Why Avro? Reader’s schema and writer’s schema

    do not have to be the same Forward/Backward compatibility ➔ Add/remove fields with default values ➔ Explicit `null` type (no optional or required markers) ➔ Change data types ➔ Change names (i.e. aliases) Designing Data-Intensive Applications - Martin Kleppmann
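The "add/remove fields with default values" rule can be illustrated with a simplified model of schema resolution. This sketch uses plain dictionaries rather than the Avro library; `resolve`, `REQUIRED` and the field names are hypothetical, and type promotion and aliases are left out.

```python
REQUIRED = object()  # marker for a reader field that has no default


def resolve(written: dict, reader_fields: dict) -> dict:
    """Project a record written with one schema onto a reader's schema.

    `reader_fields` maps field name -> default value (or REQUIRED).
    Fields the reader does not know are dropped; missing fields take the default.
    """
    result = {}
    for name, default in reader_fields.items():
        if name in written:
            result[name] = written[name]
        elif default is not REQUIRED:
            result[name] = default
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return result


old_reader = {"id": REQUIRED}
new_reader = {"id": REQUIRED, "email": None}  # new field, default null

# Backward compatibility: a new reader handles data written with the old schema.
upgraded = resolve({"id": 1}, new_reader)

# Forward compatibility: an old reader handles data written with the new schema.
downgraded = resolve({"id": 1, "email": "a@example.org"}, old_reader)
```

Adding a field without a default would make `resolve` fail on old data, which is why the slide ties compatibility to default values.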
  15. #ougn17 Lab: Kafka Cluster Scalability: Cluster and Brokers Topics: Partitions,

    Replication, ISR Cleaning up: Compaction and Retention
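The two clean-up policies named in this lab can be modelled roughly as follows. This is a simplification under stated assumptions: records are `(timestamp, key, value)` tuples, the function names are made up for the example, and real brokers work on whole log segments and keep tombstones for a configurable period before dropping them.

```python
def apply_retention(records, now, retention_seconds):
    """Time-based retention: drop records older than the retention window."""
    return [r for r in records if now - r[0] <= retention_seconds]


def compact(records):
    """Compaction: keep only the latest record per key.

    A record whose value is None acts as a tombstone and removes the key.
    """
    latest = {}
    for index, (timestamp, key, value) in enumerate(records):
        latest[key] = (index, timestamp, value)
    survivors = sorted(latest.items(), key=lambda item: item[1][0])  # keep log order
    return [
        (timestamp, key, value)
        for key, (index, timestamp, value) in survivors
        if value is not None
    ]


records = [
    (0, "a", 1),
    (10, "b", 2),
    (20, "a", 3),      # supersedes the first value of "a"
    (30, "b", None),   # tombstone for "b"
]

retained = apply_retention(records, now=35, retention_seconds=20)
compacted = compact(records)
```

Retention bounds the log by age, while compaction bounds it by the number of distinct keys.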
  16. #ougn17 Acknowledgment: Latency vs Durability Ack=all (-1) → 2 network

    round-trips → no data loss (in combination with `min.insync.replicas`)
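How `acks=all` interacts with `min.insync.replicas` can be summarised as a small decision function. This is a simplified model of the broker-side check, not broker code; the function name and its parameters are assumptions made for the example.

```python
def write_accepted(acks: str, isr_size: int, min_insync_replicas: int) -> bool:
    """Decide whether a produce request can be acknowledged.

    With acks="all" the write is rejected when fewer than `min_insync_replicas`
    replicas are in sync. With acks="0" or "1" only the leader is needed, so the
    minimum is not enforced and durability is weaker.
    """
    if acks == "all":
        return isr_size >= min_insync_replicas
    return isr_size >= 1  # the leader itself counts as an in-sync replica


# Replication factor 3, min.insync.replicas=2, one follower has fallen behind:
healthy = write_accepted("all", isr_size=2, min_insync_replicas=2)

# Two followers behind: the producer gets an error instead of a weak acknowledgment.
degraded = write_accepted("all", isr_size=1, min_insync_replicas=2)

# The same situation with acks=1 is acknowledged, at the risk of losing the record.
leader_only = write_accepted("1", isr_size=1, min_insync_replicas=2)
```

The trade-off on the slide follows directly: the stricter setting adds latency and can refuse writes, but an acknowledged record exists on more than one broker.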
  17. #ougn17 ➔ Consumer Groups as Logical Subscribers ➔ Offset by

    Consumer instance (group member) ➔ Consumer Groups as base of parallelism, with Partitions ➔ Ordering ensured by partition (+ keyed topics is normally enough) Multiple Consumers
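A sketch of why keyed records plus consumer groups give both ordering and parallelism. Two assumptions are worth stating: `zlib.crc32` stands in for the hash used by Kafka's default partitioner, and the round-robin `assign` function is a simplified stand-in for the group's partition assignment strategy.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition; equal keys always land in the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign(partitions, members):
    """Spread partitions over group members; each partition has exactly one owner."""
    assignment = {member: [] for member in members}
    for i, partition in enumerate(partitions):
        assignment[members[i % len(members)]].append(partition)
    return assignment


# All events of one order go to one partition, so their relative order is kept.
same_partition = partition_for("order-42", 4) == partition_for("order-42", 4)

two_members = assign([0, 1, 2, 3], ["c1", "c2"])
# More members than partitions: the extra member stays idle.
three_members = assign([0, 1], ["c1", "c2", "c3"])
```

The number of partitions therefore caps the useful parallelism of a single consumer group.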
  18. #ougn17 At-Most-Once Delivery ➔ Scenario: the consumer process crashes after

    saving its position but before saving the output of its message processing. ➔ Result: the process that takes over starts at the saved position, even though a few messages prior to that position have not been processed.
  19. #ougn17 At-Least-Once Delivery ➔ Scenario: the consumer process crashes after

    processing messages but before saving its position. ➔ Result: when the new process takes over, the first few messages it receives will already have been processed.
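The at-most-once and at-least-once scenarios differ only in whether the position is saved before or after processing. The simulation below is a toy model with hypothetical names (`run`, `commit_first`, `crash_at`); it injects a single crash and then restarts from the saved position.

```python
def run(messages, commit_first, crash_at):
    """Process `messages`, crashing once while handling index `crash_at`.

    Returns everything processed across the first run and the restart.
    """
    processed = []
    committed = 0  # saved position: the next offset to read after a restart

    for i in range(len(messages)):
        if commit_first:
            committed = i + 1              # save the position first ...
            if i == crash_at:
                break                      # ... then crash before processing
            processed.append(messages[i])
        else:
            processed.append(messages[i])  # process first ...
            if i == crash_at:
                break                      # ... then crash before saving the position
            committed = i + 1

    # A new process takes over from the saved position.
    for i in range(committed, len(messages)):
        processed.append(messages[i])
    return processed


at_most_once = run(["a", "b", "c"], commit_first=True, crash_at=1)
at_least_once = run(["a", "b", "c"], commit_first=False, crash_at=1)
```

In the first case `"b"` is never processed; in the second it is processed twice, which is why at-least-once consumers are usually paired with idempotent processing.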
  20. #ougn17 Exactly-Once Delivery “Exactly-once delivery requires co-operation with the destination

    storage system …” Coming soon (KIP-98/KIP-129): • Idempotent Producer Guarantees • Transactional Guarantees • Streams Exactly-Once semantics
  21. #ougn17 Lab: Kafka Consumer Consumer Groups: Parallelism Rewind Offsets: Control

    and reprocessing (https://jeqo.github.io/post/2017-01-31-kafka-rewind-consumers-offset/)
  22. #ougn17 Kafka Streams API & Kafka Connector API Unifying Stream

    Processing and Interactive Queries in Apache Kafka - Eno Thereska https://www.confluent.io/blog/unifying-stream-processing-and-interactive-queries-in-apache-kafka/
  23. #ougn17 Kafka Connect HDFS, JDBC, GoldenGate, Elasticsearch, Couchbase, DataStax, Cassandra,

    Attunity, Azure IoTHub, SAP Hana, VoltDb, FTP, JMS, JMX, MongoDB, Solr, Splunk, RethinkDB, SQS, S3, MQTT, Redis, InfluxDB, HBase, Hazelcast, Twitter, and more...
  24. #ougn17 Lab: Kafka Streams & Kafka Connector Twitter/File Connectors “Simplified

    Consumer” Stream/Table Duality Stateful processing (Time Window)*
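The stream/table duality and time-windowed processing from this lab can be illustrated without the Kafka Streams API. The sketch assumes a changelog of `(key, value)` pairs and events of `(timestamp, key)`; both function names are made up, and the window is a simple tumbling window.

```python
from collections import defaultdict


def to_table(changelog):
    """Replay a changelog stream; the table is the latest value per key."""
    table = {}
    for key, value in changelog:
        table[key] = value
    return table


def windowed_counts(events, window_size):
    """Count events per key in fixed, non-overlapping (tumbling) time windows."""
    counts = defaultdict(int)
    for timestamp, key in events:
        window_start = timestamp - (timestamp % window_size)
        counts[(window_start, key)] += 1
    return dict(counts)


changelog = [("alice", "Oslo"), ("bob", "Lima"), ("alice", "Bergen")]
table = to_table(changelog)

events = [(1, "alice"), (4, "alice"), (7, "bob"), (12, "alice")]
counts = windowed_counts(events, window_size=10)
```

Going the other way, every update to the table is itself an event, so a table can always be turned back into a stream of changes.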
  25. #ougn17 Integration with Kafka Integration Platforms: ➔ Camel http://camel.apache.org/kafka.html ➔

    Akka Streams http://doc.akka.io/docs/akka-stream-kafka/current/home.html ➔ Oracle Service Bus http://www.ateam-oracle.com/osb-transport-for-apache-kafka-part-1/
  26. #ougn17 What’s in discussion and/or coming soon? Exactly-once Delivery /

    Txn Messaging (adopted - wip) https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging Headers support (additional metadata) (vote) https://cwiki.apache.org/confluence/display/KAFKA/KIP-82+-+Add+Record+Headers ZStandard Compression support (discussion) https://cwiki.apache.org/confluence/display/KAFKA/KIP-110%3A+Add+Codec+for+ZStandard+Compression Reset Offset tool (vote) https://cwiki.apache.org/confluence/display/KAFKA/KIP-122%3A+Add+a+tool+to+Reset+Consumer+Group+Offsets https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals
  27. #ougn17 How NOT to use Kafka Top 5: ➔ No

    consideration of data on the inside vs outside ➔ Schema not externally defined ➔ Same config for all clients/topics ➔ 128 partitions as default ➔ Running on 8 overloaded nodes Kafka Summit 2016: 101 ways to config Kafka - Badly https://www.confluent.io/kafka-summit-2016-101-ways-to-configure-kafka-badly https://cwiki.apache.org/confluence/display/KAFKA/Operations