Quix Streams—a Python-Kafka Library for Data-Intensive Workloads (Tomas Neubauer, Quix) | RTA Summit 2023

This talk will introduce Quix Streams, an open-source Python library for data-intensive workloads on Kafka.

We will discuss the unique problems that this library is designed to solve, and how it was shaped by the challenges of building a Kafka-based solution for Formula 1 cars at McLaren: a solution that needed to process a colossal firehose of sensor data coming in at thousands of samples per second. We'll also explain why we decided to combine a Kafka API approach with a stream processing library and provide developers with a familiar Pandas DataFrame-like interface.

You’ll also see the library in action with a sentiment analysis demo.

StarTree

May 23, 2023

Transcript

  1. Quix streams / RTASummit 2023 Hello, nice to meet you!

    Tomas Neubauer, CTO and co-founder at Quix. Previously technical lead at McLaren.
  2. Racing background

    Roots in real-time data processing in the most extreme, time-critical environment
    • 50,000 channels per car
    • 1.5 kHz per channel
    • 1,000s of real-time models and simulations
  3. Content

    • Streaming vs batch
    • ML deployment
    • Streaming landscape
    • How it works
    • Demo: Let's build it
  4. Streaming vs batch

    An overview of data processing approaches
  5. [Architecture diagram]

    Components shown: phone-data, API gateway/websocket, Kinesis, EMR, Step Function, SageMaker, analysis & training, batch trained model, API gateway/websocket, crash detection alerts.
  6. Streaming

    [Architecture diagram] Components shown: phone-data, websocket gateway, streaming, SageMaker, analysis & training, trained model, websocket gateway, crash detection alerts.
  7. Data processing: batch

    Data is collected over time into a database. At some point, data is loaded from the database into the processing system.
    • Operations are easy to compute: all historic data is present
    • Results are not available in real time

    t    Gx    Gy    Gz    GT    ΔG
    t1   0.1   1.0   0.2   1.3
    t2   0.2   1.1   0.1   1.4   0.1
    t3   0.1   0.9   0.1   1.1   -0.3
    tn   0.3   1.0   0     1.3   1.3 − GTn-1
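
As a rough illustration of why batch operations are easy, here is a minimal Python sketch that recomputes the GT and ΔG columns from the first three rows of the batch table. It assumes GT is the sum of the three components, which is inferred from the slide's numbers rather than stated; the names `rows`, `gts` and `deltas` are made up for this example.

```python
# Batch: the whole table is already loaded, so every value is one pass over it.
# Rows are (t, Gx, Gy, Gz), copied from the slide.
rows = [
    ("t1", 0.1, 1.0, 0.2),
    ("t2", 0.2, 1.1, 0.1),
    ("t3", 0.1, 0.9, 0.1),
]

# Assumption: GT is the sum of the three components (matches the slide values).
gts = [round(gx + gy + gz, 1) for _, gx, gy, gz in rows]

# Delta between consecutive GT values; the first row has no predecessor.
deltas = [None] + [round(curr - prev, 1) for prev, curr in zip(gts, gts[1:])]

for (t, *_), gt, delta in zip(rows, gts, deltas):
    print(t, gt, delta)
```

Because every row is in memory, the previous GT is always just the neighbouring list element; no separate state is needed.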
  8. Data processing: streaming

    Data is collected over time into a streaming broker (like a Kafka topic). The processing system consumes the data as soon as it's published to the topic.
    • Operations are not always easy to compute: state is used to keep the needed historic data
    • Results are available in real time

    t    Gx    Gy    Gz    GT    ΔG             State (t, GT)
    t1   0.1   1.0   0.2   1.3                  t1, 1.3
    t2   0.2   1.1   0.1   1.4   0.1            t2, 1.4
    t3   0.1   0.9   0.1   1.1   -0.3
    tn   0.3   1.0   0.1   1.3   1.3 − GTn-1    tn-1, GTn-1
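
The same calculation in streaming form might look like the sketch below, where each message is handled on arrival and only the previous GT is kept as state. This is plain Python, not the Quix Streams API; `DeltaState` and `process_message` are hypothetical names, and GT is again assumed to be the sum of the components.

```python
class DeltaState:
    """Holds the one piece of history the calculation needs: the previous GT."""

    def __init__(self):
        self.previous_gt = None


def process_message(message, state):
    """Process a single message using only the message and the stored state."""
    # Assumption: GT is the sum of the three components.
    gt = round(message["Gx"] + message["Gy"] + message["Gz"], 1)
    delta = None if state.previous_gt is None else round(gt - state.previous_gt, 1)
    state.previous_gt = gt  # update state for the next message
    return {"t": message["t"], "GT": gt, "dG": delta}


state = DeltaState()
incoming = [
    {"t": "t1", "Gx": 0.1, "Gy": 1.0, "Gz": 0.2},
    {"t": "t2", "Gx": 0.2, "Gy": 1.1, "Gz": 0.1},
    {"t": "t3", "Gx": 0.1, "Gy": 0.9, "Gz": 0.1},
]
# The list stands in for messages arriving one at a time from a topic.
outputs = [process_message(m, state) for m in incoming]
```

The result for each message is available immediately, but correctness now depends on the state surviving between messages.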
  9. ML deployment with API

    The app sends an API request to the REST API and receives an API response.
    Request: gX 0.5, gY 0.3, gZ 0.1, gTotal 0.9
    Response: gX 0.5, gY 0.3, gZ 0.1, gTotal 0.9, Crash 1
  10. Problems with REST API

    [Diagram: APP, API request (gX, gY, gZ, gTotal), REST API]
    - CPU overhead
    - Introduces delay
    - Requests get lost in case of service downtime or slow performance
  11. Problems with REST API

    [Diagram: APP, API request (gX, gY, gZ, gTotal), REST API]
  12. Problems with REST API

    [Diagram: APP, API request (gX, gY, gZ, gTotal), REST API]
  13. Problems with REST API

    [Diagram: APP, API, APP, API, API requests (gX, gY, gZ, gTotal)]
  14. Stream processing applications

    An overview of stream processing approaches
  15. Stream processing applications

    When you are building stream processing applications with Kafka, there are two options:
    1. Build an application that uses the Kafka producer and consumer APIs directly
    2. Adopt a full-fledged stream processing framework (Flink, Spark Streaming, Beam, etc.)
  16. Kafka producer and consumer APIs

    • Work for simple tasks like one-message-at-a-time processing
    • No external dependencies like the JVM
    • Get very complicated when stateful processing is needed, like calculating aggregations or joining multiple streams
  17. Stream processing frameworks

    • Full-fledged stream processing frameworks solve stateful, more complex operations
    • But this comes at the cost of increased complexity in many dimensions:
      ◦ Java dependency
      ◦ Deployment gets difficult because the code does not run on its own but in a server-side cluster (Flink cluster or Spark cluster)
      ◦ Debugging is difficult
      ◦ Performance optimization is difficult
      ◦ It gets even worse when synchronous and asynchronous architectures are combined in one application
  18. IP UDFs are nasty

    • Poor development experience
      ◦ Logs are only accessible from the server, no debugging possible
    • Performance hit caused by the interface between the JVM and Python
  19. Is there a third way?

    • Combining the Kafka API approach with a stream processing library
    • Abstraction from the key-value messages of the Kafka API to virtual tables
    • Standalone library that runs:
      ◦ Locally for development and debugging
      ◦ In Docker or in Kubernetes for production deployments at scale
  20. Stateful processing with Pub&Sub client libraries

    1. Messages in topic
    2. Split messages into individual streams
    3. Messages converted to tables
    4. Messages decomposed into rows
    5. Memory state updated from incoming rows / series
    6. State persistence
    7. State and incoming data are combined into output that is sent to the output topic
    Commit offsets
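
The seven steps above can be sketched in plain Python with in-memory stand-ins for Kafka and the state store. Everything here is an assumption made for illustration: the message layout, the `speed` column, the running-mean aggregation and all variable names are invented, and no real client library is used.

```python
import json
from collections import defaultdict

# 1. Messages in the topic: (offset, stream key, JSON payload). Values are made up.
input_topic = [
    (0, "car-1", '{"t": [1, 2], "speed": [10.0, 12.0]}'),
    (1, "car-2", '{"t": [1], "speed": [7.0]}'),
    (2, "car-1", '{"t": [3], "speed": [11.0]}'),
]

memory_state = defaultdict(lambda: {"count": 0, "total": 0.0})
persisted_state = {}    # stand-in for a durable state store
output_topic = []       # stand-in for the output topic
committed_offset = -1   # stand-in for the consumer group's committed offset

for offset, stream_key, payload in input_topic:
    # 2. Split messages into individual streams: state is looked up per key.
    state = memory_state[stream_key]
    # 3. Message converted to a table (column name -> list of values).
    table = json.loads(payload)
    # 4. Table decomposed into rows.
    rows = [dict(zip(table, values)) for values in zip(*table.values())]
    # 5. Memory state updated from the incoming rows.
    for row in rows:
        state["count"] += 1
        state["total"] += row["speed"]
    # 6. State persistence.
    persisted_state[stream_key] = json.dumps(state)
    # 7. State and incoming data combined into output for the output topic.
    mean_speed = state["total"] / state["count"]
    for row in rows:
        output_topic.append((stream_key, {**row, "mean_speed": mean_speed}))
    # Commit the offset only after state and output have been written.
    committed_offset = offset
```

Even in this toy form, most of the code is bookkeeping around one line of actual logic, which is the overhead the slide is pointing at.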
  21. Quix streams

    1. Messages in topic
    2. Messages decomposed as rows available via pandas API
    3. Messages processed through a pipeline defined as pandas operations. Output streamed to the output topic.
    - Automatic state management
    - Automatic checkpointing
    - Automatic message serialization/deserialization
  22. Our approach to stream processing

    Containers: Containers running in Kubernetes, scaling hand in hand with Kafka for compute scalability.
    Kafka: Handle your data reliably and efficiently in memory with Kafka, using Kafka partitions, the replica system and persistence to deliver scalability and robustness.
    Python: Python gives you flexibility. It lets you transform data, not just query it. From simple filtering to ML use cases like video processing.
  23. Processing with streaming

    The APP subscribes (SUB) to the INPUT TOPIC and publishes (PUB) to the OUTPUT TOPIC.
    Input: gForce X 0.5, gForce Y 0.3, gForce Z 0.1
    Output: gForce X 0.5, gForce Y 0.3, gForce Z 0.1, gForce Total 0.9, Crash 1
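
A minimal sketch of the processing step in this diagram: read a message, add the total and a crash flag, and write the enriched message out. The lists stand in for the input and output topics, and `placeholder_model` with its 0.5 cut-off is a hypothetical substitute for the trained model, which the slides do not describe. The total is assumed to be the sum of the components, as the slide's 0.9 suggests.

```python
def placeholder_model(g_total):
    """Hypothetical stand-in for the trained crash model; the 0.5 cut-off is arbitrary."""
    return 1 if g_total > 0.5 else 0


def process(message, model=placeholder_model):
    """Enrich one input message with the total g-force and a crash flag."""
    # Assumption: the total is the sum of the three components.
    g_total = round(
        message["gForce X"] + message["gForce Y"] + message["gForce Z"], 1
    )
    return {**message, "gForce Total": g_total, "Crash": model(g_total)}


# In-memory lists stand in for the Kafka input and output topics.
input_topic = [{"gForce X": 0.5, "gForce Y": 0.3, "gForce Z": 0.1}]
output_topic = [process(message) for message in input_topic]
```

Unlike the REST variant, the app never waits for a response: the result is simply another message that downstream consumers can pick up.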
  24. Scale

    [Diagram: INPUT TOPIC, SUB, PUB, OUTPUT TOPIC]
    Input: gForce X 0.5, gForce Y 0.3, gForce Z 0.1
    Output: gForce X 0.5, gForce Y 0.3, gForce Z 0.1, gForce Total 0.9, Crash 1
  25. Fault tolerant

    [Diagram: INPUT TOPIC, SUB, PUB, OUTPUT TOPIC]
    Input: gForce X 0.5, gForce Y 0.3, gForce Z 0.1
    Output: gForce X 0.5, gForce Y 0.3, gForce Z 0.1, gForce Total 0.9, Crash 1
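
One way to picture the fault tolerance shown here is a replica that crashes mid-stream and a replacement that resumes from the last checkpoint. The sketch below is a simplification under stated assumptions: the topic is a list whose indices play the role of offsets, the `checkpoint` dict stands in for committed offsets plus persisted state, the running total is an invented workload, and the output and checkpoint writes are not atomic as they would need to be in a real system.

```python
# Made-up values; the list index plays the role of the Kafka offset.
topic = [0.1, 0.4, 0.2, 0.3]


def run_replica(topic, checkpoint, output, crash_before=None):
    """Process messages after the checkpointed offset, optionally simulating a crash."""
    for offset in range(checkpoint["offset"] + 1, len(topic)):
        if offset == crash_before:
            raise RuntimeError("replica crashed")
        total = round(checkpoint["total"] + topic[offset], 1)
        output.append(total)
        # Persist state and commit the offset together after each message.
        checkpoint.update(total=total, offset=offset)


checkpoint = {"offset": -1, "total": 0.0}  # survives the crash
output = []

try:
    run_replica(topic, checkpoint, output, crash_before=2)  # first replica dies
except RuntimeError:
    pass

run_replica(topic, checkpoint, output)  # replacement resumes at offset 2
```

Because the broker keeps the messages and the checkpoint records how far processing got, the replacement neither skips nor repeats any input in this sketch.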