Slide 1

Slide 1 text

5-minute Practical Streaming Techniques that can Save You Millions Zhenzhong Xu Cofounder & CTO @ claypot.ai Sept, 2023

Slide 2

Slide 2 text

Use case drivers for streaming pipelines
● Trust & safety: fintech, cybersecurity, social media, ecommerce
● Dynamic pricing: ecommerce, gaming
● Recommender systems: ecommerce, social media, entertainment
● Ads optimization: ecommerce, social media
● In-the-moment coordination: logistics, healthcare, customer support

Slide 3

Slide 3 text

Data staleness, even by just one hour, costs money
● LinkedIn's Real-time Anti-abuse (2022): moving from an offline pipeline (hours) to a real-time pipeline (minutes) led to +30% in bad actors caught online and +21% in fake account detection.
● Instacart: The Journey to Real-Time Machine Learning (2022): the real-time pipeline directly reduces millions in fraud-related costs annually.
● How WhatsApp catches and fights abuse (2022 | slides): a few hundred milliseconds of delay can increase spam by 20-30%.
● How Pinterest Leverages Realtime User Actions to Boost Engagement (2022): "One of our most impactful innovations recently, increasing Home feed engagement by 11% while reducing Pinner hide volume by 10%."
● Airbnb: Real-time Personalization using Embeddings for Search Ranking (2018): moving from offline scoring to online scoring grew bookings by +5.1%.

Slide 4

Slide 4 text

Optimization Goals
Dimensions: correctness, low cost, low latency
1. Fast & Correct
2. Cheap & Correct
3. Fast & Cheap
Reference: Open Problems in Stream Processing: A Call To Action, Tyler Akidau (2019)

Slide 5

Slide 5 text

Latency: the big picture
[Architecture diagram: Services A/B/…/X -> data processing -> online and offline stores -> inference and training, with the DWH used for backfill and an in-memory DataFrame for local experimentation]
● Network latency between systems: tens of ms
● Computation latency: 10 ms to hours
● Online serving latency: tens of ms
● Offline serving latency: 1-10s of seconds
● Backfill catch-up latency: minutes to hours

Slide 6

Slide 6 text

How to measure latencies

Computation latency
● Latency marker: latency = sum(marker_latency_per_operator) + emission_cadence
● Event-time lag: event time lag = current processing time - current watermark

Backfill latency
● Backfill latency is the time from when the user initiates back-processing until the pipeline catches up and stabilizes, i.e. until event-time lag returns to within a reasonable bound.
● Track watermark progression against wall-clock time.
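To make the two metrics concrete, here is a minimal sketch in Python. It only restates the formulas above; the function names and the 60-second lag bound are my own assumptions, not part of any specific streaming framework.

import time

def computation_latency_ms(marker_latencies_ms, emission_cadence_ms):
    # Latency-marker estimate: sum of per-operator marker latencies plus the emission cadence.
    return sum(marker_latencies_ms) + emission_cadence_ms

def event_time_lag_ms(current_watermark_ms):
    # Event-time lag: how far the pipeline's watermark trails wall-clock processing time.
    processing_time_ms = time.time() * 1000
    return processing_time_ms - current_watermark_ms

def backfill_caught_up(current_watermark_ms, lag_bound_ms=60_000):
    # Backfill is considered caught up once event-time lag stays within a reasonable bound.
    return event_time_lag_ms(current_watermark_ms) <= lag_bound_ms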

Slide 7

Slide 7 text

Cost: the big picture
[Same architecture diagram, annotated with where cost accrues]
● Computation cost: data processing, serving, training, in-memory experimentation
● Storage cost: online store, offline store, data storage in the DWH
● Data movement cost: between services, processing, and stores (network latency: tens of ms)

Slide 8

Slide 8 text

Factors impacting cost

Computation

Processing:
● Event ingestion speed and variance/spike pattern
● Complexity of transformation/aggregation
● Result emission frequency and percentage of data recomputed
● High-level computation overhead such as serialization/deserialization
● Low-level computation overhead such as disk <> memory <> CPU register <> instruction bus utilization
● Network latency between systems

State management:
● Per-event size
● Window length
● Total keyspace and in-state keyspace
● I/O latency in the state store

Slide 9

Slide 9 text

Factors impacting cost

Storage
● Access pattern / data structure
● Volume/scale
● Keyspace
● Cold vs. hot storage

Data movement
● Data close to compute vs. data close to storage, or both
● Data transmitted within the same VPC is cheaper; data transmitted over the internet backbone (cross-cloud provider, cross-region) is more expensive
● Whether data transfer requires TLS encryption
● Whether data transfer requires privacy governance

A back-of-envelope estimator combining these cost factors is sketched below.
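The sketch below is illustrative only: the unit prices and the 200 bytes/event figure are placeholder assumptions you would replace with your own cloud pricing and measured event sizes; only the 1,000 tx/sec and 2-hour window come from the talk's scenario.

def monthly_cost_estimate(
    cpu_cores,
    state_bytes,
    egress_gb_per_month,
    price_per_core_month=35.0,    # assumed vCPU price, USD/month
    price_per_gb_month_hot=0.10,  # assumed hot storage price, USD/GB-month
    price_per_gb_egress=0.09,     # assumed cross-region/internet egress price, USD/GB
):
    # Very rough monthly cost: computation + state storage + data movement.
    compute = cpu_cores * price_per_core_month
    storage = (state_bytes / 1e9) * price_per_gb_month_hot
    movement = egress_gb_per_month * price_per_gb_egress
    return compute + storage + movement

# Example with the talk's simulation rate (1,000 tx/sec) and a 2-hour window:
# 1,000 tx/sec * 7,200 s ≈ 7.2M buffered events; at an assumed ~200 bytes/event ≈ 1.44 GB of state.
events_in_window = 1_000 * 2 * 3_600
print(monthly_cost_estimate(cpu_cores=2, state_bytes=events_in_window * 200, egress_gb_per_month=0))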

Slide 10

Slide 10 text

Consider optimization knobs systematically

To speed things up
● Use more power, so you can do more in less time
● Process less data, so you can finish on time
● Spend less time waiting, so you can finish earlier
● Use a better execution plan, so you get work done faster and smarter

To lower cost
● Do fewer redundant things, so you spend less on compute and storage
● Use more cost-effective hardware/technology, so you can shift expensive work onto cheaper resources

Slide 11

Slide 11 text

Example: Fraud Detection Feature Optimization

Scenario:
● 10MM active credit cards
● 30% daily active cards
● 1-2% hourly active cards during peak
● Active cards average 10 swipes per day

Simulation parameters:
● 1,000 transactions per second
● 200k unique cards within an hour

Transaction schema (field, type, description):
● cc_num (String): credit card number
● amt (Float): transaction amount
● unix_time (Int64): Unix timestamp of the transaction
● merchant (String): merchant name
● zipcode (String): zip code of the transaction location
● category (String): category of the transaction
● city (String): city of the transaction location
● state (String): state of the transaction location

A small generator sketch for this simulated stream follows.
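The generator below is not from the talk; it is a minimal sketch that matches the schema and roughly approximates the stated rates (1,000 tx/sec from a pool of ~200k cards). The merchant, category, and location values are hypothetical placeholders.

import random
import time

CARD_POOL = [str(random.randrange(10**15, 10**16)) for _ in range(200_000)]
MERCHANTS = ["acme_grocers", "fuel_stop", "coffee_hut"]  # hypothetical values
CATEGORIES = ["grocery", "gas", "food"]

def make_transaction():
    # One synthetic transaction matching the schema above.
    return {
        "cc_num": random.choice(CARD_POOL),  # ~200k unique cards circulating
        "amt": round(random.uniform(1.0, 500.0), 2),
        "unix_time": int(time.time()),
        "merchant": random.choice(MERCHANTS),
        "zipcode": str(random.randrange(10000, 99999)),
        "category": random.choice(CATEGORIES),
        "city": "san_jose",  # placeholder location fields
        "state": "CA",
    }

def generate(tps=1_000, seconds=10):
    # Emit roughly `tps` transactions per second for `seconds` seconds.
    for _ in range(seconds):
        yield [make_transaction() for _ in range(tps)]
        time.sleep(1)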

Slide 12

Slide 12 text

Let’s compute the average transaction amount over a twelve-month window up to now, per user/credit card. We want results emitted no later than one second after each transaction event is received.

Slide 13

Slide 13 text

SELECT
  cc_num,
  AVG(amt) AS avg_amt
FROM TABLE(
  HOP(TABLE transactions, DESCRIPTOR(unix_time), INTERVAL '1' SECONDS, INTERVAL '2' HOURS)
)
GROUP BY cc_num, window_start, window_end

Slide 14

Slide 14 text

Windowing table-valued function (Hop/Sliding windows)

SELECT
  cc_num,
  AVG(amt) AS avg_amt
FROM TABLE(
  HOP(TABLE transactions, DESCRIPTOR(unix_time), INTERVAL '1' SECONDS, INTERVAL '2' HOURS)
)
GROUP BY cc_num, window_start, window_end

Slide 15

Slide 15 text

Windowing table-valued function (Hop/Sliding windows)

SELECT
  cc_num,
  AVG(amt) AS avg_amt
FROM TABLE(
  HOP(TABLE transactions, DESCRIPTOR(unix_time), INTERVAL '1' SECONDS, INTERVAL '2' HOURS)
)
GROUP BY cc_num, window_start, window_end

[Diagram: per-key hopping windows for cc 1, cc 2, cc 3]

Slide 16

Slide 16 text

This will cost about $8,400 / month!
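A quick back-of-envelope calculation (plain arithmetic on the stated parameters, not figures from the slides) shows why the hop window is so expensive: with a 2-hour window and a 1-second slide, every event belongs to 7,200 overlapping windows, and each 1-second firing re-emits an aggregate for every card active within the window.

window_size_s = 2 * 3600   # 2-hour window
slide_s = 1                # 1-second slide
tps = 1_000                # input transactions per second
active_cards_per_hour = 200_000

windows_per_event = window_size_s // slide_s          # 7,200 overlapping windows per event
results_per_firing = active_cards_per_hour            # roughly one aggregate per card active in the window
emissions_per_second = results_per_firing // slide_s  # ~200k emissions/sec, vs ~1k input events/sec

print(windows_per_event, emissions_per_second)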

Slide 17

Slide 17 text

Inaccuracy: none of the window firings capture all 5 events!

Slide 18

Slide 18 text

We can compensate by shortening the window slide.

Slide 19

Slide 19 text

Drill into window firings
[Diagram: events for cc: 123 and cc: 124 on a timeline, covered by sliding windows 1, 2, and 3]

Slide 20

Slide 20 text

Drill into state access
[Diagram: per-key window state on a timeline; e.g. key {"cc": 123, "zip code": 94040} holds 4 buffered events while cc: 124 holds 3]

Slide 21

Slide 21 text

Challenges due to skew
[Diagram: same state layout with a hot key; cc: 123 holds ~1M buffered events while cc: 124 holds 2]

Slide 22

Slide 22 text

Over-aggregation window

SELECT
  cc_num,
  AVG(amt) OVER w AS avg_amt
FROM `transactions`
WINDOW w AS (
  PARTITION BY cc_num
  ORDER BY unix_time
  RANGE BETWEEN INTERVAL '2' HOURS PRECEDING AND CURRENT ROW
)

Slide 23

Slide 23 text

$8,400 / month -> $69 / month

Slide 24

Slide 24 text

Accuracy Challenges
Inaccuracy at this point in time.

Slide 25

Slide 25 text

Alternatives?

SELECT
  cc_num,
  AVG(amt) OVER w AS avg_amt
FROM `transactions`
WINDOW w AS (
  PARTITION BY cc_num
  ORDER BY unix_time
  RANGE BETWEEN INTERVAL '2' HOURS PRECEDING AND CURRENT ROW
)

Slide 26

Slide 26 text

Sliding Window vs. Over Aggregation

Performance
● 95th percentile latency (ms): 55,986 vs. 74.4
● Mean lag (events): 22,770 vs. 788
● Mean emissions per second: 179,242 vs. 1,003

Cost
● CPU cores: 224 vs. 2
● CPU utilization: 90.70% vs. 0.49%
● Amortized cost: $8,435.91 vs. $69.02

Correctness
● Point-in-time accuracy: reasonably accurate vs. worse when changes are slower

Slide 27

Slide 27 text

Sliding Window <- spectrum -> Over Aggregation

Performance
● 95th percentile freshness (ms): 55,986 <- spectrum -> 74.4
● Mean lag (events): 22,770 vs. 788
● Mean emissions per second: 179,242 vs. 1,003

Cost
● CPU cores: 224 vs. 2
● CPU utilization: 90.70% vs. 0.49%
● Amortized cost: $8,435.91 <- spectrum -> $69.02

Correctness
● Point-in-time accuracy: reasonably accurate <- spectrum -> worse when changes are slower within the window bound

As a user, what knobs do I have to make appropriate tradeoffs between computation freshness, cost, and accuracy?

Slide 28

Slide 28 text

Sliding Window <- spectrum -> Over Aggregation

Performance
● 95th percentile freshness (ms): 55,986 <- spectrum -> 74.4
● Mean lag (events): 22,770 vs. 788
● Mean emissions per second: 179,242 vs. 1,003

Cost
● CPU cores: 224 vs. 2
● CPU utilization: 90.70% vs. 0.49%
● Amortized cost: $8,435.91 <- spectrum -> $69.02

Correctness
● Point-in-time accuracy: accurate <- spectrum -> worse when changes are slower within the window bound

A hybrid that unions both approaches:

(
  SELECT
    cc_num,
    AVG(amt) OVER w AS avg_amt
  FROM `transactions`
  WINDOW w AS (
    PARTITION BY cc_num
    ORDER BY unix_time
    RANGE BETWEEN INTERVAL '2' HOURS PRECEDING AND CURRENT ROW
  )
)
UNION
(
  SELECT
    cc_num,
    AVG(amt) AS avg_amt
  FROM TABLE(
    HOP(TABLE transactions, DESCRIPTOR(unix_time), INTERVAL '60' SECONDS, INTERVAL '2' HOURS)
  )
  GROUP BY cc_num, window_start, window_end
);

● Inaccuracy tolerance of 60 seconds
● Results are always fresh after a triggering event

Slide 29

Slide 29 text

No content

Slide 30

Slide 30 text

More advanced optimizations
● Optimizing across a composable data fabric

Slide 31

Slide 31 text

No content

Slide 32

Slide 32 text

Workload -> Compiler/Optimizer -> Deployment

Python/SQL interchangeable declarations compile down to a relational expression:

@transformation
def transaction_count(tx: Transactions):
    return tx[tx.status == "failed"].groupby("account_id").rolling().count()
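The slide's @transformation decorator belongs to an unspecified framework; purely as an illustration of the idea of capturing a declaration so a compiler/optimizer can plan and deploy it later, a minimal hypothetical sketch might look like this. The registry, class names, and behavior are assumptions, not the actual API.

from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Declaration:
    name: str
    fn: Callable

REGISTRY: Dict[str, Declaration] = {}

def transformation(fn: Callable) -> Callable:
    # Hypothetical decorator: instead of executing eagerly, register the function as a
    # declaration that a compiler/optimizer could later lower to a relational plan
    # and deploy to a stream or batch engine.
    REGISTRY[fn.__name__] = Declaration(name=fn.__name__, fn=fn)
    return fn

@transformation
def transaction_count(tx):
    # Same declaration as the slide, expressed against a DataFrame-like `tx`.
    return tx[tx.status == "failed"].groupby("account_id").rolling().count()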

Slide 33

Slide 33 text

Unified Processing
[Diagram: workload declaration -> IR -> plan with Scan, Filter, Union, and Join operators spanning the online and offline stores]

Slide 34

Slide 34 text

Distributed Predicate Push Down/Up
[Same diagram, highlighting distributed predicate push-down/push-up across the online and offline stores]

Slide 35

Slide 35 text

Unified Streaming Data Lakehouse
[Same diagram, framing the online and offline stores as a unified streaming data lakehouse]

Slide 36

Slide 36 text

Composable & Unified Processing
[Same diagram, highlighting composable and unified processing across the stores]

Slide 37

Slide 37 text

A modern data fabric for ML can benefit from an intelligent, distributed, yet intuitive optimization layer. https://zhenzhongxu.com/

Slide 38

Slide 38 text

Need an invisible interface to plug into compute ecosystems
[Diagram: compute ecosystems on a spectrum from local/single machine to remote/distributed]

Slide 39

Slide 39 text

Need an invisible interface to plug into storage ecosystems
[Diagram: storage ecosystems on a spectrum from streaming-leaning to batch-leaning]

Slide 40

Slide 40 text

Data Fabric for a Streaming Pipeline

Slide 41

Slide 41 text

Data Fabric for a Unified Backfill Pipeline

Slide 42

Slide 42 text

Training dataset backfill requires point-in-time correctness
[Diagram: a timeline of prediction events interleaved with feature data updates; each prediction event must see only the feature data available at that point in time]

Slide 43

Slide 43 text

Point-in-time joins to generate training data

Given a spine (entity keys + timestamp + label), join features to generate training data:

train_df = pitc_join_features(
    spine_df,
    features=[
        "tx_max_1h",
        "user_unique_ip_30d",
    ],
)

spine_df (inference_ts, tid, cc_num, user_id, is_fraud):
● 21:30, 0122, 2, 1, 0
● 21:40, 0298, 4, 1, 0
● 21:55, 7539, 6, 3, 1

cc_num_tx_max_1h (ts, cc_num, tx_max_1h):
● 9:20, 2, …
● 10:24, 2, …
● 20:00, 4, …

user_unique_id_30d (ts, user_id, unique_ip_30d):
● 6:00, 1, …
● 6:00, 3, …
● 6:00, 5, …

train_df (inference_ts, tid, cc_num, user_id, is_fraud, tx_max_1h, user_unique_ip_30d):
● 21:30, 0122, 2, 1, 1, …, …
● 21:40, 0298, 4, 1, 1, …, …
● 21:55, 7539, 6, 3, 3, …, …
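pitc_join_features is the framework-specific call shown on the slide; as an illustration only, the same point-in-time semantics can be sketched for small DataFrames with pandas merge_asof, which for each spine row attaches the latest feature row at or before the inference timestamp. Column and table names mirror the slide; the helper function itself is an assumption.

import pandas as pd

def pitc_join(spine: pd.DataFrame, feature_df: pd.DataFrame, key: str) -> pd.DataFrame:
    # For each spine row, attach the most recent feature value observed at or before
    # inference_ts (never a future value), per entity key.
    # Timestamps should be comparable types (e.g. pandas datetimes) and sorted.
    spine = spine.sort_values("inference_ts")
    feature_df = feature_df.sort_values("ts")
    return pd.merge_asof(
        spine,
        feature_df,
        left_on="inference_ts",
        right_on="ts",
        by=key,
        direction="backward",  # only look back in time: point-in-time correctness
    ).drop(columns=["ts"])

# Example usage with the slide's tables:
# train_df = pitc_join(
#     pitc_join(spine_df, cc_num_tx_max_1h, key="cc_num"),
#     user_unique_id_30d,
#     key="user_id",
# )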