Quix Streams—a Python-Kafka Library for Data-Intensive Workloads (Tomas Neubauer, Quix) | RTA Summit 2023

Quix streams / RTASummit 2023
Quix Streams: Real-time stream processing in Python

Hello, nice to meet you!
Tomas Neubauer
CTO and co-founder at Quix
Previously McLaren technical lead

Racing background
Roots in real-time data processing in the most extreme, time-critical environment
• 50,000 channels per car
• 1.5 kHz per channel
• 1,000s of real-time models and simulations

Now, raise your hand if you already knew about …
• Streaming
• Kafka

Goal
Diagram: phone-data, Crash detection, Fitness app crashes

Content
• Streaming vs batch
• ML deployment
• Streaming landscape
• How it works
• Demo: Let's build it

1 Streaming vs Batch
An overview of data processing approaches

Diagram labels: phone-data; API gateway/websocket; Kinesis; SageMaker; Step Function; EMR; ANALYSIS & TRAINING; Batch trained model; API gateway/websocket; Crash detection alerts

Streaming
Diagram labels: phone-data; Websocket gateway; Streaming; ANALYSIS & TRAINING; SageMaker; trained model; Websocket gateway; Crash detection alerts

Data Processing: Batch
Data is collected over time into a database. At some point, data is loaded from the database into the processing system.
• Operations are easy to compute: all historic data is present
• Results are not in real time

t    Gx   Gy   Gz   GT   ΔG
t1   0.1  1.0  0.2  1.3
t2   0.2  1.1  0.1  1.4  0.1
t3   0.1  0.9  0.1  1.1  -0.3
tn   0.3  1.0  0    1.3  1.3 - GTn-1
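
The batch calculation above can be sketched in plain Python. This is a minimal illustration, not code from the talk: it assumes, from the slide's numbers, that GT is the sum Gx + Gy + Gz and that ΔG is the difference from the previous row's GT. The function name `batch_totals` is made up for this example.

```python
def batch_totals(rows):
    """Compute GT and delta-G for a complete, already-collected table.

    rows: list of (t, gx, gy, gz) tuples in time order.
    Assumption: GT = gx + gy + gz, which matches the slide's numbers.
    """
    # All rows are available up front, so totals can be computed in one pass...
    totals = [round(gx + gy + gz, 1) for _, gx, gy, gz in rows]
    # ...and each delta can simply look at the neighbouring total.
    deltas = [None] + [round(cur - prev, 1) for prev, cur in zip(totals, totals[1:])]
    return [(row[0], gt, delta) for row, gt, delta in zip(rows, totals, deltas)]


rows = [
    ("t1", 0.1, 1.0, 0.2),
    ("t2", 0.2, 1.1, 0.1),
    ("t3", 0.1, 0.9, 0.1),
]
batch_results = batch_totals(rows)
```

Because the whole table is in memory, no state has to be carried between rows; the cost is that results only exist once the batch has been loaded.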

Data Processing: Streaming
Data is collected over time into a streaming broker (like a Kafka topic). The processing system consumes the data as soon as it's published to the topic.
• Operations are not always easy to compute: state is used to keep the needed historic data
• Real-time results

Incoming messages, processed one at a time:
t    Gx   Gy   Gz   GT   ΔG
t1   0.1  1.0  0.2  1.3
t2   0.2  1.1  0.1  1.4  0.1
t3   0.1  0.9  0.1  1.1  -0.3
tn   0.3  1.0  0.1  1.3  1.3 - GTn-1

STATE (t, GT) as shown on the slide: (t1, 1.3), (t2, 1.4), (tn-1, GTn-1)
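
The same calculation in streaming form needs state, because each message arrives alone. The sketch below is a plain-Python illustration under the same assumption as the slide's numbers (GT = Gx + Gy + Gz). The class name `DeltaState` and the output field names are hypothetical, and no Kafka client is involved.

```python
class DeltaState:
    """Keeps only the previous GT, the minimum history needed for delta-G."""

    def __init__(self):
        self.previous_gt = None  # nothing seen yet

    def process(self, t, gx, gy, gz):
        """Handle one incoming message and return the enriched row."""
        gt = round(gx + gy + gz, 1)
        if self.previous_gt is None:
            delta = None  # first message: no previous total to compare with
        else:
            delta = round(gt - self.previous_gt, 1)
        self.previous_gt = gt  # update state for the next message
        return {"t": t, "GT": gt, "dG": delta}


state = DeltaState()
stream_results = [
    state.process("t1", 0.1, 1.0, 0.2),
    state.process("t2", 0.2, 1.1, 0.1),
    state.process("t3", 0.1, 0.9, 0.1),
]
```

Each result is available as soon as its message is processed; the price is that the previous GT has to be kept, and in a real deployment persisted, between messages.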

2 ML Deployment
REST API vs Streaming

ML Deployment with API
API REQUEST (APP to REST API):
gX   gY   gZ   gTotal
0.5  0.3  0.1  0.9
API RESPONSE (REST API to APP):
gX   gY   gZ   gTotal  Crash
0.5  0.3  0.1  0.9     1

2.1 Issues with REST APIs
REST API vs Streaming

Problems with REST API
- CPU overhead
- Introduces delay
- Requests get lost in case of service downtime or slow performance

Diagram (build frames): APP sends API REQUEST (gX, gY, gZ, gTotal) to REST API; the final frame shows several APP and API instances.

3 Stream processing applications
An overview of stream processing approaches

Stream processing applications
When you are building stream processing applications with Kafka, there are two options:
1. Build an application that uses the Kafka producer and consumer APIs directly
2. Adopt a full-fledged stream processing framework (Flink, Spark Streaming, Beam, etc.)

Kafka producer and consumer APIs
• Works for simple cases like one-message-at-a-time processing
• No external dependencies like the JVM
• Gets very complicated when stateful processing is needed, such as calculating aggregations or joining multiple streams
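
One-message-at-a-time processing is easy precisely because the handler needs nothing but the message itself. The sketch below illustrates that with in-memory lists standing in for the input and output topics. The JSON message shape, the `handle_message` name and the `CRASH_THRESHOLD` value are assumptions for illustration; with a real Kafka client, a consume loop would call the handler for each message and produce the result.

```python
import json

CRASH_THRESHOLD = 0.8  # hypothetical value, not taken from the talk


def handle_message(raw_value: bytes) -> bytes:
    """Stateless transform: one input message in, one output message out."""
    reading = json.loads(raw_value)
    # Assumption: gTotal is the sum of the three axes, as in the slide's numbers.
    g_total = round(reading["gX"] + reading["gY"] + reading["gZ"], 1)
    reading["gTotal"] = g_total
    reading["Crash"] = 1 if g_total > CRASH_THRESHOLD else 0
    return json.dumps(reading).encode("utf-8")


# In-memory stand-ins for an input topic and an output topic.
input_topic = [json.dumps({"gX": 0.5, "gY": 0.3, "gZ": 0.1}).encode("utf-8")]
output_topic = [handle_message(value) for value in input_topic]
```

As soon as the handler needs anything from earlier messages (an aggregate, a join), this simplicity is gone, which is the complication the last bullet refers to.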

Stream processing frameworks
• Full-fledged stream processing frameworks solve stateful, more complex operations
• But this comes at the cost of increased complexity in many dimensions:
◦ Java dependency
◦ Deployment gets difficult because code does not run on its own but in a server-side cluster (Flink cluster or Spark cluster)
◦ Debugging is difficult
◦ Performance optimization is difficult
◦ It gets even worse when we combine synchronous and asynchronous architecture in one application

JAR files…

Connecting Flink to Kafka is difficult

SQL looks easy to use but…

UDFs are nasty
• Poor development experience
◦ Logs only accessible from the server, no debugging possible
• Performance hit caused by the interface between the JVM and Python

DEBUGGING!!!

Is there a third way?
• Combining the Kafka API approach with a stream processing library
• Abstraction from key-value messages of the Kafka API to virtual tables
• Standalone library that runs:
◦ Locally for development and debugging
◦ In Docker or in Kubernetes for production deployments at scale

Stateful processing with Pub&Sub client libraries
1. Messages in topic
2. Split messages into individual streams
3. Messages converted to tables
4. Messages decomposed into rows
5. Memory state updated from incoming rows / series
6. State persistence
7. State and incoming data are combined into output that is sent to the output topic
Commit offsets
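
The steps above can be sketched as one in-memory loop. This is only an illustration of the order of operations, not how any particular library implements it. The message shape `(offset, stream_id, rows)`, the dict used as a persisted state store, and the running mean used as the aggregate are all assumptions chosen for the example.

```python
def run_pipeline(messages, state_store):
    """Process messages from one partition and return (output, committed_offset).

    messages: list of (offset, stream_id, rows); rows is a list of dicts
              with a numeric "gTotal" field (hypothetical shape).
    state_store: dict standing in for persisted state (hypothetical).
    """
    memory_state = dict(state_store)  # load persisted state into memory
    output = []
    committed_offset = None
    for offset, stream_id, rows in messages:           # steps 1-3: per-stream tables
        count, total = memory_state.get(stream_id, (0, 0.0))
        for row in rows:                               # step 4: rows
            count += 1
            total += row["gTotal"]
        memory_state[stream_id] = (count, total)       # step 5: update memory state
        state_store[stream_id] = (count, total)        # step 6: persist state
        mean = round(total / count, 2) if count else None
        output.append({"stream": stream_id, "mean_gTotal": mean})  # step 7: output
        committed_offset = offset                      # commit only after the above
    return output, committed_offset


store = {}
messages = [
    (0, "phone-a", [{"gTotal": 1.0}, {"gTotal": 2.0}]),
    (1, "phone-b", [{"gTotal": 4.0}]),
    (2, "phone-a", [{"gTotal": 3.0}]),
]
pipeline_output, last_committed = run_pipeline(messages, store)
```

Committing the offset last means a crash mid-message leads to reprocessing rather than data loss; keeping state and offsets consistent is the bookkeeping that makes the hand-written approach laborious.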

Quix streams
1. Messages in topic
2. Messages decomposed as rows available via pandas API
3. Messages processed through a pipeline defined as pandas operations. Output streamed to output topic.
- Automatic state management
- Automatic checkpointing
- Automatic message serialization/deserialization

4 How it works?
Kafka, Kubernetes, Python

Our approach to stream processing
Containers: Containers running in Kubernetes scale hand in hand with Kafka for compute scalability.
Kafka: Handle your data reliably and efficiently in memory with Kafka. Kafka partitions, the replica system and persistence deliver scalability and robustness.
Python: Python gives you flexibility. It lets you transform data, not just query it, from simple filtering to ML use cases like video processing.
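
How partitions let compute scale with the number of containers can be sketched as follows. This is a generic illustration, not Quix or Kafka code: CRC32 is used only because it is deterministic and in the standard library (Kafka clients use their own hash functions), and the round-robin assignment is a simplification of what a consumer group does. Both function names are hypothetical.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition so one stream always lands in one place."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(num_partitions: int, num_replicas: int) -> dict:
    """Spread partitions round-robin across application replicas (containers)."""
    assignment = {replica: [] for replica in range(num_replicas)}
    for partition in range(num_partitions):
        assignment[partition % num_replicas].append(partition)
    return assignment


two_replicas = assign_partitions(num_partitions=4, num_replicas=2)
four_replicas = assign_partitions(num_partitions=4, num_replicas=4)
```

Going from two replicas to four halves the partitions each container handles, while every key keeps mapping to the same partition, so per-stream state stays with a single consumer.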

Processing with streaming
Diagram: INPUT TOPIC, SUB, APP, PUB, OUTPUT TOPIC
Input message:
gForce X  gForce Y  gForce Z
0.5       0.3       0.1
Output message:
gForce X  gForce Y  gForce Z  gForce Total  Crash
0.5       0.3       0.1       0.9           1

Scale
Diagram: INPUT TOPIC, SUB, PUB, OUTPUT TOPIC. Input message: gForce X 0.5, gForce Y 0.3, gForce Z 0.1. Output message: gForce X 0.5, gForce Y 0.3, gForce Z 0.1, gForce Total 0.9, Crash 1.

Fault tolerant
Diagram: INPUT TOPIC, SUB, PUB, OUTPUT TOPIC. Input message: gForce X 0.5, gForce Y 0.3, gForce Z 0.1. Output message: gForce X 0.5, gForce Y 0.3, gForce Z 0.1, gForce Total 0.9, Crash 1.
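
One common way a topic-based consumer survives a crash is by resuming from the last committed offset. The sketch below simulates that generic pattern with a list standing in for the topic; it is an assumption-laden illustration, not a description of Quix Streams internals, and the `consume` function and its parameters are hypothetical.

```python
def consume(log, committed_offset, crash_before_commit_at=None):
    """Process messages after committed_offset; return (processed, committed_offset).

    crash_before_commit_at: offset at which the simulated process dies after
    processing the message but before committing it.
    """
    processed = []
    for offset in range(committed_offset + 1, len(log)):
        processed.append(log[offset])
        if offset == crash_before_commit_at:
            return processed, committed_offset  # crash: this commit never happens
        committed_offset = offset               # commit after successful processing
    return processed, committed_offset


log = ["m0", "m1", "m2", "m3"]
# First run dies while handling offset 2, so only offsets 0 and 1 are committed.
first_run, committed = consume(log, committed_offset=-1, crash_before_commit_at=2)
# The restarted process resumes right after the last committed offset.
second_run, committed = consume(log, committed_offset=committed)
```

Nothing is lost, because the messages stay in the topic, but `m2` is handled twice. That at-least-once behaviour is why output and state updates should tolerate replays.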

Demo
Let's build it.
