SF Kafka Summit 2019: Cross the Streams Thanks to Kafka and Flink

Over the last 5 years, Kafka and Flink have become mature technologies that have allowed us to embrace the streaming paradigm. You can bet on them to build reliable and efficient applications. They are active projects backed by companies using them in production. They have a good community that contributes and shares experience and knowledge. Kafka and Flink are solid choices if you want to build a data platform that your data scientists or developers can use to collect, process, and distribute data. You can put together Kafka Connect, Kafka, and Flink. First, you will take care of their deployment. Then, for each case, you will set up each part and, of course, develop the Flink job so it can integrate easily with the rest. Looks like a challenging but exciting project, doesn’t it? In this session, you will learn how you can build such a data platform, what the nitty-gritty of each part is, how you can plug them together (in particular, how to plug Flink into the Kafka ecosystem), what common pitfalls to avoid, and what it takes to deploy it on Kubernetes. Even if you are not familiar with all the technologies, there will be enough introduction for you to follow. Come and learn how we can actually cross the streams!

Code example: https://github.com/toch/sf-kafka-summit-2019/

Christophe Philemotte

September 30, 2019

Transcript

  1. None
  2. Cross the Streams thanks to Kafka and Flink Christophe Philemotte

  3. Hello, I am Christophe. CTO, digazu. toch, @_toch, ibakesoftware.com

  4. None
  5. Crossing the Streams

  6. Crossing the Streams Motivations

  7. 10+ years ago: DIY. Timeline 2000 to 2010: Stream Processing Papers, Stratosphere, Kafka, Lambda architecture, Kappa architecture
  8. 10+ years ago: DIY

  9. 10+ years ago: DIY. Resource Management? Collection? Distribution? Consistency? Planning DAG? Distributed Processing? Only Once Semantic? Stateful Continuous Streaming?
  10. None
  11. Cross the Tables!

  12. Today

  13. Today: Ecosystem Integration? Deployment? Resource Scaling? Stateful Operation?

  14. Our goal: Kafka Connect JDBC Source

  15. Our goal: Run Any Query

  16. Agenda • Stack deployment • Flink Job • Outro

  17. Agenda Takeaways • Sandbox: toch/sf-kafka-summit-2019 • Lessons From Production • Integration How-to of Kafka and Flink Ecosystems • SQL Streaming How-to
  18. Crossing the Streams Stack deployment

  19. Kafka Connect JDBC Source

  20. Agenda Kafka & Co.

  21. Agenda Kafka & Co.

  22. Agenda Kafka & Co.

  23. Agenda Kafka & Co.: ⚠ Warning ➔ Stateful & Stateless Pods ➔ Service Dependencies ➔ Storage Type ➔ JVM Heap & Container Memory

  24. Agenda Kafka & Co.: ⚠ Warning ➔ Stateful & Stateless Pods → k8s operator ➔ Service Dependencies ➔ Storage Type ➔ JVM Heap & Container Memory

  25. Agenda Kafka & Co.: ⚠ Warning ➔ Stateful & Stateless Pods → k8s operator ➔ Service Dependencies → init container ➔ Storage Type ➔ JVM Heap & Container Memory

  26. Agenda Kafka & Co.: ⚠ Warning ➔ Stateful & Stateless Pods → k8s operator ➔ Service Dependencies → init container ➔ Storage Type → persistent volume storage class ➔ JVM Heap & Container Memory

  27. Agenda Kafka & Co.: ⚠ Warning ➔ Stateful & Stateless Pods → k8s operator ➔ Service Dependencies → init container ➔ Storage Type → persistent volume storage class ➔ JVM Heap & Container Memory → mem. limits & requests
  28. Flink

  29. Flink

  30. Flink: ⚠ Warning ➔ JVM Heap & RocksDB Memory & Container Memory ➔ State Backend ➔ HA Setup ➔ Rootless Container with random UID

  31. Flink: ⚠ Warning ➔ JVM Heap & RocksDB Memory & Container Memory → explicit allocation ➔ State Backend ➔ HA Setup ➔ Rootless Container with random UID

  32. Flink: ⚠ Warning ➔ JVM Heap & RocksDB Memory & Container Memory → explicit allocation ➔ State Backend → e.g. HDFS ➔ HA Setup ➔ Rootless Container with random UID

  33. Flink: ⚠ Warning ➔ JVM Heap & RocksDB Memory & Container Memory → explicit allocation ➔ State Backend → e.g. HDFS ➔ HA Setup → e.g. HDFS ➔ Rootless Container with random UID

  34. Flink: ⚠ Warning ➔ JVM Heap & RocksDB Memory & Container Memory → explicit allocation ➔ State Backend → e.g. HDFS ➔ HA Setup → e.g. HDFS ➔ Rootless Container with random UID → Build your own Docker Image
  35. PostgreSQL

  36. PostgreSQL: ⚠ Warning ➔ Stateful Pod & replication ➔ Storage Type

  37. PostgreSQL: ⚠ Warning ➔ Stateful Pod & replication → k8s operator ➔ Storage Type

  38. PostgreSQL: ⚠ Warning ➔ Stateful Pod & replication → k8s operator ➔ Storage Type → fast without cache pv storage class
  39. Seed the data into PostgreSQL

  40. Seed the data into PostgreSQL (seed.sql)

    ghosts (id: INT, name: TEXT): (2, Slimer)

    movies (id: INT, name: TEXT, year: INT): (1, Ghostbusters, 1984), (2, Ghostbusters II, 1989)

    ghosts_in_movies (ghost_id: INT, movie_id: INT, id: INT): (2, 1, 2), (2, 2, 3)
  41. Feed the Streams

  42. Feed the Streams

  43. Kafka Connect Setup: ⚠ Warning ➔ Rebalancing at each Config Update (fixed in 2.3) ➔ Connector JARs Location

  44. Kafka Connect Setup: ⚠ Warning ➔ Rebalancing at each Config Update (fixed in 2.3) → upgrade ➔ Connector JARs Location

  45. Kafka Connect Setup: ⚠ Warning ➔ Rebalancing at each Config Update (fixed in 2.3) → upgrade ➔ Connector JARs Location → Deps JARs close to plugin JAR
  46. Kafka Connect JDBC Source

  47. Main Lessons • Memory Allocation • Rootless container • Stateful vs Stateless • Throughput vs Latency
  48. Crossing the Streams Flink Job

  49. Query Example

  50. Query Result Example

    ghosts (id: INT, name: TEXT): (2, Slimer)

    movies (id: INT, name: TEXT, year: INT): (1, Ghostbusters, 1984), (2, Ghostbusters II, 1989)

    ghosts_in_movies (ghost_id: INT, movie_id: INT, id: INT): (2, 1, 2), (2, 2, 3)

  51. Query Result Example

    ghost_appearances (name: TEXT, movie: TEXT): (Slimer, Ghostbusters), (Slimer, Ghostbusters II)
  52. Unless You Have No Choice

  53. Main Lessons Unless You Have No Choice

  54. Or …

  55. Or …

  56. Main Lessons Or you have Flink

  57. Unit test

  58. Unit test

  59. Unit test

  60. Job: 1. TableEnvironment 2. TableSource 3. sqlQuery 4. TableSink 5. Execute
  61. Job

  62. Job

  63. Job

  64. Job

  65. Submit Job

  66. Submit Job

  67. Submit Job

  68. Submit Job

  69. Main Lessons • Transparent State Management • Transparent Fault-Tolerance • Undocumented Rich Test Helpers • Required Topic
  70. Crossing the Streams Outro

  71. Main Lessons: Operations Pitfalls • JVM & RocksDB Memory in Docker Container • K8s Stateful Pod and Scaling • Service Dependencies (init container)

  72. Main Lessons: Operations Pitfalls • Deployment as Code • Throughput vs Latency

  73. Main Lessons: Kafka Connect Pitfalls • Rebalancing • JARs Location • Development Environment: docker run --net=host lensesio/fast-data-dev

  74. Main Lessons: Flink Job Pitfalls • Required Kafka Topic

  75. Result

  76. What have we achieved? Kafka Connect JDBC Source, Resource Management, SQL Query & Fault Tolerance, State Management
  77. toch @_toch ibakesoftware.com Thanks! toch/sf-kafka-summit-2019

  78. None
  79. Main Lessons: Why not KSQL? • Global Window for INNER JOIN • ANSI SQL • Partitioning Freedom

  80. Main Lessons: Why not KSQL? • Temporal Table • CEP
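
The query example on slides 49 to 51 joins the three seeded tables into ghost_appearances. The slides do not show the SQL itself, so the following is only a minimal sketch of what such a query could look like. It assumes an in-memory SQLite database as a stand-in for PostgreSQL and Flink SQL, and it assumes the ghosts_in_movies rows read as (ghost_id, movie_id, id) = (2, 1, 2) and (2, 2, 3). The ORDER BY is there only to make the printed output deterministic; it is not part of the talk.

```python
import sqlite3

# Assumption: SQLite in memory stands in for PostgreSQL (seed) and Flink SQL (query).
conn = sqlite3.connect(":memory:")

# Seed data as listed on slide 40; the row layout of ghosts_in_movies is an assumption.
conn.executescript("""
    CREATE TABLE ghosts (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE movies (id INTEGER PRIMARY KEY, name TEXT, year INTEGER);
    CREATE TABLE ghosts_in_movies (
        id INTEGER PRIMARY KEY,
        ghost_id INTEGER,
        movie_id INTEGER
    );
    INSERT INTO ghosts (id, name) VALUES (2, 'Slimer');
    INSERT INTO movies (id, name, year) VALUES
        (1, 'Ghostbusters', 1984),
        (2, 'Ghostbusters II', 1989);
    INSERT INTO ghosts_in_movies (ghost_id, movie_id, id) VALUES
        (2, 1, 2),
        (2, 2, 3);
""")

# Hypothetical query: an INNER JOIN across the three tables,
# producing the (name, movie) columns of ghost_appearances.
QUERY = """
    SELECT g.name AS name, m.name AS movie
    FROM ghosts_in_movies AS gm
    INNER JOIN ghosts AS g ON g.id = gm.ghost_id
    INNER JOIN movies AS m ON m.id = gm.movie_id
    ORDER BY m.year
"""

ghost_appearances = conn.execute(QUERY).fetchall()
for name, movie in ghost_appearances:
    print(name, movie)
```

In the talk's setup, the same kind of statement would be handed to Flink's sqlQuery step rather than run against a database directly, with the tables fed from Kafka topics populated by the Kafka Connect JDBC source.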