SF Kafka Summit 2019: Cross the Streams Thanks to Kafka and Flink

Over the last five years, Kafka and Flink have matured into technologies that let us embrace the streaming paradigm. You can bet on them to build reliable and efficient applications. They are active projects backed by companies that use them in production, with healthy communities contributing and sharing experience and knowledge. Kafka and Flink are solid choices if you want to build a data platform that your data scientists or developers can use to collect, process, and distribute data. You can put Kafka Connect, Kafka, and Flink together. First, you will take care of their deployment. Then, for each use case, you will set up each part and, of course, develop the Flink job so it integrates easily with the rest. Sounds like a challenging but exciting project, doesn't it? In this session, you will learn how to build such a data platform, the nitty-gritty details of each part, and how to plug them together, in particular how to plug Flink into the Kafka ecosystem. You will also learn the common pitfalls to avoid and what it takes to deploy it all on Kubernetes. Even if you are not familiar with all the technologies, there will be enough introduction for you to follow. Come and learn how we can actually cross the streams!

Code example: https://github.com/toch/sf-kafka-summit-2019/

Christophe Philemotte

September 30, 2019

Transcript

  1. 10+ years ago: DIY. Timeline 2000-2010: Stream Processing Papers, Stratosphere, Kafka, Lambda architecture, Kappa architecture
  2. 10+ years ago: DIY. Resource Management? Collection? Distribution? Consistency? Planning DAG? Distributed Processing? Exactly-Once Semantics? Stateful Continuous Streaming?
  3. Agenda & Takeaways: • Sandbox: toch/sf-kafka-summit-2019 • Lessons From Production • Integration How-to of Kafka and Flink Ecosystems • SQL Streaming How-to
  4.-8. Kafka & Co. ⚠Warning (progressive build): ➔ Stateful & Stateless Pods → k8s operator ➔ Service Dependencies → init container ➔ Storage Type → persistent volume storage class ➔ JVM Heap & Container Memory → mem. limits & requests (see the check below)
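The last arrow is worth making concrete: the heap given to the JVM must leave headroom under the container's memory limit, or the kernel's OOM killer terminates the pod. Below is a minimal sanity check (a hypothetical helper, not part of the deck's sandbox) you can run inside a container to compare the JVM's effective max heap with the limit set in the pod's resources:

```java
// HeapCheck.java: print the max heap this JVM will use, to compare against
// the container's memory limit (resources.limits.memory in the pod spec).
public class HeapCheck {
    public static void main(String[] args) {
        long maxHeapMiB = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("JVM max heap: " + maxHeapMiB + " MiB");
        // Rule of thumb: container limit >= heap + metaspace + other off-heap
        // allocations, plus page-cache headroom (Kafka leans on the page cache).
    }
}
```

Setting requests equal to limits for the brokers additionally gives the pods guaranteed QoS, so the scheduler cannot over-commit the node.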
  9.-13. Flink ⚠Warning (progressive build): ➔ JVM Heap & RocksDB Memory & Container Memory → explicit allocation ➔ State Backend → e.g. HDFS ➔ HA Setup → e.g. HDFS ➔ Rootless Container with random UID → Build your own Docker Image
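In code, the explicit-allocation and state-backend choices come down to a few lines at job setup. A sketch assuming a Flink 1.9-era job with the RocksDB state backend and HDFS-backed checkpoints; the path and interval are placeholders:

```java
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class JobSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Checkpoint every 60 s so state survives a failure and restart.
        env.enableCheckpointing(60_000);
        // RocksDB keeps state off-heap; `true` enables incremental checkpoints,
        // written to a durable filesystem such as HDFS.
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true));
        // ... define sources, transformations, sinks, then env.execute(...) ...
    }
}
```

Because RocksDB allocates outside the JVM heap, the container limit must cover the heap plus RocksDB's block cache and write buffers, which is exactly why the deck insists on explicit allocation.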
  14. PostgreSQL ⚠Warning: ➔ Stateful Pod & replication → k8s operator ➔ Storage Type → fast, cache-less persistent volume storage class
  15. Seed the data into PostgreSQL (seed.sql): ghosts(id INT, name TEXT) = {(2, Slimer)}; movies(id INT, name TEXT, year INT) = {(1, Ghostbusters, 1984), (2, Ghostbusters II, 1989)}; ghosts_in_movies(ghost_id INT, movie_id INT, id INT) = {(2, 1, 2), (2, 2, 3)}
  16.-18. Kafka Connect Setup ⚠Warning (progressive build): ➔ Rebalancing at each Config Update (fixed in 2.3) → upgrade ➔ Connector JARs Location → dependency JARs close to the plugin JAR
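For context, registering the JDBC source that snapshots the PostgreSQL tables is one call to the Connect worker's REST API. A sketch using only the JDK; the connector class and config keys belong to the Confluent JDBC source connector, while the host names, database, and table list are placeholders:

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class RegisterJdbcSource {
    public static void main(String[] args) throws Exception {
        // Connector config, POSTed as JSON to the Connect REST API.
        String payload = "{"
            + "\"name\": \"postgres-source\","
            + "\"config\": {"
            + "\"connector.class\": \"io.confluent.connect.jdbc.JdbcSourceConnector\","
            + "\"connection.url\": \"jdbc:postgresql://postgres:5432/ghostbusters\","
            + "\"mode\": \"incrementing\","
            + "\"incrementing.column.name\": \"id\","
            + "\"table.whitelist\": \"ghosts,movies,ghosts_in_movies\","
            + "\"topic.prefix\": \"db-\""
            + "}}";
        HttpURLConnection conn =
            (HttpURLConnection) new URL("http://connect:8083/connectors").openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(payload.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("Connect replied: " + conn.getResponseCode());
    }
}
```

Each whitelisted table then lands in its own topic named after topic.prefix plus the table name (here db-ghosts, db-movies, db-ghosts_in_movies).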
  19. Main Lessons: • Memory Allocation • Rootless Containers • Stateful vs Stateless • Throughput vs Latency
  20. Query Result Example: same ghosts, movies, and ghosts_in_movies data as slide 15, shown as the input to the join query.
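The query behind this result is plain ANSI SQL over the three tables. A sketch assuming a Flink 1.9-era StreamTableEnvironment in which the topics have already been registered as tables named after the PostgreSQL schema:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.StreamTableEnvironment;

public class GhostsInMoviesQuery {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);
        // Assumes ghosts, movies, and ghosts_in_movies were registered from the
        // Kafka topics fed by the JDBC source connector.
        Table result = tEnv.sqlQuery(
            "SELECT m.name AS movie, m.`year` AS movie_year, g.name AS ghost "
            + "FROM ghosts_in_movies AS gm "
            + "JOIN ghosts AS g ON gm.ghost_id = g.id "
            + "JOIN movies AS m ON gm.movie_id = m.id");
        // Non-windowed (global) join: Flink keeps both sides in managed state,
        // so a new row on any topic updates the result continuously.
        // ... emit `result` to a sink, then env.execute(...) ...
    }
}
```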
  21.-24. Job (image-only slides)
  25. Main Lessons: • Transparent State Management • Transparent Fault-Tolerance • Undocumented Rich Test Helpers (see the sketch below) • Required Topic
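The "undocumented rich test helpers" are the operator test harnesses shipped in the test-jar of flink-streaming-java. A minimal sketch, using a trivial uppercase map operator, of how a harness drives an operator without any cluster:

```java
import org.apache.flink.streaming.api.operators.StreamMap;
import org.apache.flink.streaming.runtime.streamrecord.StreamRecord;
import org.apache.flink.streaming.util.OneInputStreamOperatorTestHarness;

public class UppercaseOperatorTest {
    public static void main(String[] args) throws Exception {
        // The harness manages the operator's lifecycle, state, and output.
        OneInputStreamOperatorTestHarness<String, String> harness =
            new OneInputStreamOperatorTestHarness<>(
                new StreamMap<String, String>(String::toUpperCase));
        harness.open();
        harness.processElement(new StreamRecord<>("slimer", 1L));
        // The collected output now holds "SLIMER" at timestamp 1.
        System.out.println(harness.getOutput());
    }
}
```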
  26. Main Lessons, Operations Pitfalls: • JVM & RocksDB Memory in Docker Container • K8s Stateful Pod and Scaling • Service Dependencies (init container)
  27. Main Lessons, Kafka Connect Pitfalls: • Rebalancing • JARs Location • Development Environment: docker run --net=host lensesio/fast-data-dev
  28. What have we achieved? Kafka Connect JDBC Source, Res. Mgt, SQL Query & Fault Tolerance, State Mgt
  29. Why not KSQL? • Global Window for INNER JOIN • ANSI SQL • Partitioning Freedom