SF Kafka Summit 2019: Cross the Streams Thanks to Kafka and Flink

Over the last five years, Kafka and Flink have matured into technologies that let us embrace the streaming paradigm. You can rely on them to build reliable and efficient applications. Both are active projects backed by companies that run them in production, with strong communities that contribute and share experience and knowledge. Kafka and Flink are solid choices if you want to build a data platform that your data scientists or developers can use to collect, process, and distribute data. You can put together Kafka Connect, Kafka, and Flink. First, you take care of their deployment. Then, for each use case, you set up each part and develop the Flink job so that it integrates easily with the rest. It looks like a challenging but exciting project, doesn't it? In this session, you will learn how to build such a data platform, what the nitty-gritty details of each part are, how to plug the parts together (in particular, how to plug Flink into the Kafka ecosystem), which common pitfalls to avoid, and what it takes to deploy everything on Kubernetes. Even if you are not familiar with all of these technologies, there will be enough of an introduction for you to follow along. Come and learn how we can actually cross the streams!

Code example: https://github.com/toch/sf-kafka-summit-2019/

Christophe Philemotte

September 30, 2019

Transcript

  2. Cross the Streams
    thanks to Kafka and Flink
    Christophe Philemotte

  3. Hello,
    I am Christophe
    CTO, digazu
    toch
    @_toch
    ibakesoftware.com

  5. Crossing
    the Streams

  6. Crossing
    the Streams
    Motivations

  7. 10+ years ago: DIY
    2000 2010
    Stream Processing Papers
    Stratosphere
    Kafka
    Lambda architecture
    Kappa architecture

  8. 10+ years ago: DIY

  9. 10+ years ago: DIY
    Resource Management?
    Collection? Distribution?
    Consistency?
    Planning DAG?
    Distributed Processing?
    Exactly-Once Semantics?
    Stateful Continuous Streaming?

  11. Cross the Tables!

  12. Today

  13. Ecosystem Integration?
    Deployment?
    Resource Scaling?
    Stateful Operation?
    Today

  14. Kafka
    Connect
    JDBC Source
    Our goal

  15. Our goal Run Any Query

  16. Agenda
    ● Stack deployment
    ● Flink Job
    ● Outro

  17. ● Sandbox:
    toch/sf-kafka-summit-2019
    ● Lessons From Production
    ● Integration How-to of Kafka and Flink Ecosystems
    ● SQL Streaming How-to
    Agenda
    Takeaways

  18. Crossing
    the Streams
    Stack deployment

  19. Kafka
    Connect
    JDBC Source

  20. Agenda
    Kafka & Co.

  21. Agenda
    Kafka & Co.

  22. Agenda
    Kafka & Co.

  23. ⚠Warning
    ➔ Stateful & Stateless Pods
    ➔ Service Dependencies
    ➔ Storage Type
    ➔ JVM Heap & Container Memory
    Agenda
    Kafka & Co.

  24. ⚠Warning
    ➔ Stateful & Stateless Pods → k8s operator
    ➔ Service Dependencies
    ➔ Storage Type
    ➔ JVM Heap & Container Memory
    Agenda
    Kafka & Co.

  25. ⚠Warning
    ➔ Stateful & Stateless Pods → k8s operator
    ➔ Service Dependencies → init container
    ➔ Storage Type
    ➔ JVM Heap & Container Memory
    Agenda
    Kafka & Co.

  26. ⚠Warning
    ➔ Stateful & Stateless Pods → k8s operator
    ➔ Service Dependencies → init container
    ➔ Storage Type → persistent volume storage class
    ➔ JVM Heap & Container Memory
    Agenda
    Kafka & Co.

  27. ⚠Warning
    ➔ Stateful & Stateless Pods → k8s operator
    ➔ Service Dependencies → init container
    ➔ Storage Type → persistent volume storage class
    ➔ JVM Heap & Container Memory → mem. limits & requests
    Agenda
    Kafka & Co.
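
The "Service Dependencies → init container" point above can be sketched as a small wait loop that an init container might run before the main container starts. This is only an illustration under stated assumptions: the function name `wait_for_port` and its parameters are hypothetical and are not taken from the talk or its repository.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=60.0, interval_s=1.0):
    """Return True once a TCP connection to host:port succeeds.

    Returns False if no connection could be made before timeout_s elapsed.
    (Hypothetical helper: an init container could exit non-zero on False.)
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means the dependency accepts connections.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Connection refused, unreachable host, or timeout: retry later.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

For example, a broker pod could call `wait_for_port("zookeeper", 2181)` before starting, where `zookeeper` is an assumed service name. A TCP check only shows that the port is open, not that the service is fully ready.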

  28. Flink

  29. Flink

  30. ⚠Warning
    ➔ JVM Heap & RocksDB Memory & Container Memory
    ➔ State Backend
    ➔ HA Setup
    ➔ Rootless Container with random UID
    Flink

  31. ⚠Warning
    ➔ JVM Heap & RocksDB Memory & Container Memory → explicit allocation
    ➔ State Backend
    ➔ HA Setup
    ➔ Rootless Container with random UID
    Flink

  32. ⚠Warning
    ➔ JVM Heap & RocksDB Memory & Container Memory → explicit allocation
    ➔ State Backend → e.g. HDFS
    ➔ HA Setup
    ➔ Rootless Container with random UID
    Flink

  33. ⚠Warning
    ➔ JVM Heap & RocksDB Memory & Container Memory → explicit allocation
    ➔ State Backend → e.g. HDFS
    ➔ HA Setup → e.g. HDFS
    ➔ Rootless Container with random UID
    Flink

  34. ⚠Warning
    ➔ JVM Heap & RocksDB Memory & Container Memory → explicit allocation
    ➔ State Backend → e.g. HDFS
    ➔ HA Setup → e.g. HDFS
    ➔ Rootless Container with random UID → Build your own Docker Image
    Flink
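
The "explicit allocation" advice above can be illustrated with a little arithmetic: the container memory limit has to cover the JVM heap, RocksDB's off-heap memory, and some headroom for everything else. The sketch below is an assumption-laden example, not the talk's method: the function `memory_budget` and the default fractions are hypothetical values chosen only to show the split.

```python
def memory_budget(container_limit_mib, heap_fraction=0.5, rocksdb_fraction=0.3):
    """Split a container memory limit into heap, RocksDB, and headroom (MiB).

    The fractions are illustrative assumptions; tune them for a real workload.
    """
    if not 0 < heap_fraction + rocksdb_fraction < 1:
        raise ValueError("heap and RocksDB fractions must leave some headroom")
    heap = int(container_limit_mib * heap_fraction)
    rocksdb = int(container_limit_mib * rocksdb_fraction)
    # Whatever is left covers metaspace, thread stacks, network buffers, etc.
    headroom = container_limit_mib - heap - rocksdb
    return {
        "heap_mib": heap,
        "rocksdb_mib": rocksdb,
        "headroom_mib": headroom,
        "jvm_flag": f"-Xmx{heap}m",  # standard JVM maximum-heap flag
    }


if __name__ == "__main__":
    print(memory_budget(4096))
```

The point of making the split explicit is that the JVM heap is only one consumer of the container's memory; if the heap alone is sized to the container limit, RocksDB's native allocations can push the pod past its limit.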

  35. PostgreSQL

  36. ⚠Warning
    ➔ Stateful Pod & replication
    ➔ Storage Type
    PostgreSQL

  37. ⚠Warning
    ➔ Stateful Pod & replication → k8s operator
    ➔ Storage Type
    PostgreSQL

  38. ⚠Warning
    ➔ Stateful Pod & replication → k8s operator
    ➔ Storage Type → fast, uncached persistent volume storage class
    PostgreSQL

  39. Seed the data into PostgreSQL

  40. ghosts
    id: INT | name: TEXT
    2       | Slimer
    movies
    id: INT | name: TEXT       | year: INT
    1       | Ghostbusters     | 1984
    2       | Ghostbusters II  | 1989
    ghosts_in_movies
    ghost_id: INT | movie_id: INT | id: INT
    2             | 1             | 2
    2             | 2             | 3
    Seed the data into PostgreSQL (seed.sql)

  41. Feed the Streams

  42. Feed the Streams

  43. ⚠Warning:
    ➔ Rebalancing at each Config Update (fixed in 2.3)
    ➔ Connector JARs Location
    Kafka Connect Setup

  44. ⚠Warning:
    ➔ Rebalancing at each Config Update (fixed in 2.3) → upgrade
    ➔ Connector JARs Location
    Kafka Connect Setup

  45. ⚠Warning:
    ➔ Rebalancing at each Config Update (fixed in 2.3) → upgrade
    ➔ Connector JARs Location → Deps JARs close to plugin JAR
    Kafka Connect Setup
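
To make the Kafka Connect setup more concrete, here is a minimal sketch of how a JDBC source connector could be registered through the Kafka Connect REST API (`POST /connectors`). The configuration keys are those of the Confluent JDBC source connector. Everything else is an assumption for illustration: the helper names, the connector name, the connection values, and the choice of `incrementing` mode on the `id` column are not taken from the talk.

```python
import json
import urllib.request


def jdbc_source_payload(name, jdbc_url, user, password, tables, topic_prefix):
    """Build the JSON body expected by POST /connectors (hypothetical helper)."""
    return {
        "name": name,
        "config": {
            "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
            "connection.url": jdbc_url,
            "connection.user": user,
            "connection.password": password,
            "table.whitelist": ",".join(tables),
            # Assumption: detect new rows through a strictly increasing id.
            "mode": "incrementing",
            "incrementing.column.name": "id",
            # Each table is published to the topic <topic_prefix><table>.
            "topic.prefix": topic_prefix,
            "tasks.max": "1",
        },
    }


def build_request(connect_url, payload):
    """Prepare (but do not send) the HTTP request that creates the connector."""
    return urllib.request.Request(
        url=connect_url.rstrip("/") + "/connectors",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    payload = jdbc_source_payload(
        name="ghostbusters-source",  # placeholder connector name
        jdbc_url="jdbc:postgresql://postgres:5432/ghostbusters",  # placeholder
        user="postgres",
        password="change-me",
        tables=["ghosts", "movies", "ghosts_in_movies"],
        topic_prefix="pg-",
    )
    print(json.dumps(payload, indent=2))
```

Passing the prepared request to `urllib.request.urlopen` would submit it to a running Connect worker, which listens on port 8083 by default.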

  46. Kafka
    Connect
    JDBC Source

  47. Main Lessons
    ● Memory Allocation
    ● Rootless container
    ● Stateful vs Stateless
    ● Throughput vs Latency

  48. Crossing
    the Streams
    Flink Job

  49. Query Example

  50. ghosts
    id: INT | name: TEXT
    2       | Slimer
    movies
    id: INT | name: TEXT       | year: INT
    1       | Ghostbusters     | 1984
    2       | Ghostbusters II  | 1989
    ghosts_in_movies
    ghost_id: INT | movie_id: INT | id: INT
    2             | 1             | 2
    2             | 2             | 3
    Query Result Example

  51. ghost_appearances
    name: TEXT | movie: TEXT
    Slimer     | Ghostbusters
    Slimer     | Ghostbusters II
    Query Result Example
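
The transcript does not include the query itself, so the following is only a sketch of a join that would produce the `ghost_appearances` result from the three seeded tables. It assumes a plain inner join across `ghosts_in_movies`, and it uses an in-memory SQLite database as a stand-in so the example runs on its own; in the talk the query runs as streaming SQL in Flink over Kafka topics.

```python
import sqlite3

# Seed data copied from the slides; the DDL is an assumption about the schema.
SEED = """
CREATE TABLE ghosts (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE movies (id INTEGER PRIMARY KEY, name TEXT, year INTEGER);
CREATE TABLE ghosts_in_movies (
    id INTEGER PRIMARY KEY, ghost_id INTEGER, movie_id INTEGER
);
INSERT INTO ghosts (id, name) VALUES (2, 'Slimer');
INSERT INTO movies (id, name, year) VALUES
    (1, 'Ghostbusters', 1984), (2, 'Ghostbusters II', 1989);
INSERT INTO ghosts_in_movies (ghost_id, movie_id, id) VALUES
    (2, 1, 2), (2, 2, 3);
"""

# Assumed query: join each appearance to its ghost and its movie.
QUERY = """
SELECT g.name AS name, m.name AS movie
FROM ghosts_in_movies AS gim
JOIN ghosts AS g ON g.id = gim.ghost_id
JOIN movies AS m ON m.id = gim.movie_id
ORDER BY m.year
"""


def ghost_appearances():
    """Seed an in-memory database and return the joined rows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(SEED)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for name, movie in ghost_appearances():
        print(name, "|", movie)
```

A batch database evaluates the join once over complete tables. A streaming engine has to keep state for each input so that late-arriving rows can still be matched, which is why state management and fault tolerance matter for this kind of query.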

  52. Unless You Have No Choice

  53. Main Lessons
    Unless You Have No Choice

  54. Or …

  55. Or …

  56. Main Lessons
    Or you have Flink

  57. Unit test

  58. Unit test

  59. Unit test

  60. 1. TableEnvironment
    2. TableSource
    3. sqlQuery
    4. TableSink
    5. Execute
    Job

  61. Job

  62. Job

  63. Job

  64. Job

  65. Submit Job

  66. Submit Job

  67. Submit Job

  68. Submit Job

  69. ● Transparent State Management
    ● Transparent Fault-Tolerance
    ● Undocumented Rich Test Helpers
    ● Required Topic
    Main Lessons

  70. Crossing
    the Streams
    Outro

  71. ● JVM & RocksDB Memory in Docker Container
    ● K8s Stateful Pod and Scaling
    ● Service Dependencies (init container)
    Main Lessons
    Operations Pitfalls

  72. ● Deployment as Code
    ● Throughput vs Latency
    Main Lessons
    Operations Pitfalls

  73. ● Rebalancing
    ● JARs Location
    ● Development Environment
    docker run \
    --net=host lensesio/fast-data-dev
    Main Lessons
    Kafka Connect Pitfalls

  74. Main Lessons
    Flink Job Pitfalls
    ● Required Kafka Topic

  75. Result

  76. What have we achieved?
    Kafka
    Connect
    JDBC Source
    Resource Management
    SQL Query & Fault Tolerance
    State Management

  77. toch
    @_toch
    ibakesoftware.com
    Thanks!
    toch/sf-kafka-summit-2019

  79. ● Global Window for INNER JOIN
    ● ANSI SQL
    ● Partitioning Freedom
    Main Lessons
    Why not KSQL?

  80. ● Temporal Table
    ● CEP
    Main Lessons
    Why not KSQL?
