[2023] 5-minute Practical Streaming Techniques That Can Save You Millions

Zhenzhong Xu, Cofounder & CTO @ claypot.ai
Sept 2023

Use case drivers for streaming pipelines
• Trust & safety: fintech, cybersecurity, social media, ecommerce
• Dynamic pricing: ecommerce, gaming
• Recommender system: ecommerce, social media, entertainment
• Ads optimization: ecommerce, social media
• In-the-moment coordination: logistics, healthcare, customer support

Data staleness, even by just one hour, costs money
• LinkedIn's Real-time Anti-abuse (2022): Moving from an offline pipeline (hours) to a real-time pipeline (minutes) led to +30% in bad actors caught online and +21% in fake account detection.
• Instacart: The Journey to Real-Time Machine Learning (2022): The real-time pipeline directly reduces fraud-related costs by millions annually.
• How WhatsApp catches and fights abuse (2022 | slides): A delay of a few hundred milliseconds can increase spam by 20-30%.
• How Pinterest Leverages Realtime User Actions to Boost Engagement (2022): "One of our most impactful innovations recently, increasing Home feed engagement by 11% while reducing Pinner hide volume by 10%."
• Airbnb: Real-time Personalization using Embeddings for Search Ranking (2018): Moving from offline scoring to online scoring grows bookings by +5.1%.

Optimization Goals
Correctness / Low cost / Low latency
1. Fast & Correct
2. Cheap & Correct
3. Fast & Cheap
Reference: Open Problems in Stream Processing: A Call To Action, Tyler Akidau (2019)

Latency: the big picture
[Diagram labels: Service A, Service B, …, Service X; Tx DWH (Backfill); Data Processing; Online Store; Offline Store; Inference; Training; In-memory DataFrame; Local Experimentation]
• Network latency: 10s ms
• Computation latency: 10 ms to hours
• Online serving latency: 10s ms
• Offline serving latency: 1-10s sec
• Backfill catch-up latency: mins to hours
• 1-10 mins (label not attributable from the extracted text)

How to measure latencies
Computation latency
• Latency marker: Latency = Sum(marker_latency_per_operator) + emission_cadence
• Event time lag: Event time lag = current processing time - current watermark
Backfill latency
• Backfill latency is the time from when the user initiates back-processing until the pipeline catches up and stabilizes to the point that event time lag is within a reasonable bound.
• Watermark progression / wall clock time
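The three measurements above can be written down directly. The following is a minimal sketch, not tied to any particular engine; the function names and the sample numbers are hypothetical and chosen only for illustration.

```python
def marker_latency_ms(per_operator_ms, emission_cadence_ms):
    """Latency = Sum(marker_latency_per_operator) + emission_cadence."""
    return sum(per_operator_ms) + emission_cadence_ms


def event_time_lag_ms(processing_time_ms, watermark_ms):
    """Event time lag = current processing time - current watermark."""
    return processing_time_ms - watermark_ms


def watermark_progress_rate(wm_start_ms, wm_end_ms, wall_start_ms, wall_end_ms):
    """Watermark progression divided by wall clock time.

    A value above 1 means event time advances faster than wall clock
    time, i.e. a backfill is catching up.
    """
    return (wm_end_ms - wm_start_ms) / (wall_end_ms - wall_start_ms)


# Hypothetical sample values.
latency = marker_latency_ms([3.0, 12.5, 4.5], emission_cadence_ms=1000.0)
lag = event_time_lag_ms(1_700_000_060_000, 1_700_000_000_000)
rate = watermark_progress_rate(0, 3_600_000, 0, 600_000)

print(latency)  # 1020.0 ms end to end
print(lag)      # 60000 ms behind
print(rate)     # 6.0: one hour of event time processed in ten minutes
```

Under this reading, a backfill can be considered caught up once the rate settles near 1 and the event time lag stays within the chosen bound.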

Cost: the big picture
[Diagram labels: Service A, Service B, …, Service X; Tx DWH (Backfill); Data Processing; Online Store; Offline Store; Serving; Training; In-memory DataFrame; Experimentation]
Cost categories annotated on the diagram:
• Computation Cost
• Storage Cost (Data Storage)
• Data Movement

Factors impacting cost
Computation
Processing:
• Event ingestion speed and variance/spike pattern
• Complexity of transformation/aggregation
• Result emission frequency and percentage of data recomputation
• High-level computation overhead such as serialization/deserialization
• Low-level computation overhead such as disk <> memory <> CPU register <> instruction set bus utilization
• Network latency between systems
State management:
• Per-event size
• Window length
• Total keyspace and in-state keyspace
• IO latency in state store

Factors impacting cost
Storage
• Access pattern / data structure
• Volume/scale
• Keyspace
• Cold vs. hot storage
Data Movement
• Data close to compute vs. data close to storage, or both.
• Data transmitted within the same VPC will be cheaper. Data transmitted over the internet backbone will be more expensive (cross-cloud provider, cross-region).
• Whether data transfer requires TLS encryption.
• Whether data transfer requires privacy governance.

Consider optimization knobs systematically
To speed things up
• Use more power, so you get to do more in less time
• Process less, so you get to finish on time
• Spend less time waiting, so you get to finish earlier
• Use a better execution plan, so you get work done faster and smarter
To lower cost
• Do fewer redundant things, so you can spend less on compute and storage
• Use more cost-effective hardware/technology, so you can move work from expensive resources to cheap ones

Example: Fraud Detection Feature Optimization
Scenario:
• 10MM active credit cards
• 30% Daily Active Cards
• 1-2% Hourly Active Cards during peak
• Active cards have an average of 10 swipes per day
Simulation parameters:
• 1000 transactions per second
• 200k unique cards within an hour

Field      Type    Description
cc_num     String  Credit card number
amt        Float   Transaction amount
unix_time  Int64   Unix timestamp of the transaction
merchant   String  Merchant name
zipcode    String  Zip code of the transaction location
category   String  Category of the transaction
city       String  City of the transaction location
state      String  State of the transaction location
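The scenario numbers can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch; the variable names are made up here, and the comparison between the scenario average and the simulated rate is an observation derived from the stated figures, not a claim from the talk.

```python
TOTAL_CARDS = 10_000_000          # 10MM active credit cards
DAILY_ACTIVE_PCT = 30             # 30% daily active cards
PEAK_HOURLY_PCT = (1, 2)          # 1-2% hourly active cards during peak
SWIPES_PER_ACTIVE_CARD = 10       # average swipes per active card per day

SIM_TPS = 1000                    # simulation: transactions per second
SIM_CARDS_PER_HOUR = 200_000      # simulation: unique cards within an hour

daily_active_cards = TOTAL_CARDS * DAILY_ACTIVE_PCT // 100
daily_transactions = daily_active_cards * SWIPES_PER_ACTIVE_CARD
avg_tps = daily_transactions / 86_400
peak_hourly_cards = tuple(TOTAL_CARDS * p // 100 for p in PEAK_HOURLY_PCT)

# How the simulation compares with the scenario.
sim_vs_avg = SIM_TPS / avg_tps
sim_tx_per_card_hour = SIM_TPS * 3600 / SIM_CARDS_PER_HOUR

print(daily_active_cards)     # 3000000
print(daily_transactions)     # 30000000
print(round(avg_tps))         # 347
print(peak_hourly_cards)      # (100000, 200000)
print(round(sim_vs_avg, 2))   # 2.88
print(sim_tx_per_card_hour)   # 18.0
```

So the simulated 1000 transactions per second is roughly three times the scenario's daily average rate, and 200k unique cards per hour matches the upper end of the peak-hour estimate.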

Let's compute: the average transaction amount over a twelve-month window up to now, per user/credit card. We want the results emitted no later than one second after each transaction event is received.

SELECT cc_num, AVG(amt) AS avg_amt
FROM TABLE(
  HOP(TABLE transactions, DESCRIPTOR(unix_time),
      INTERVAL '1' SECONDS, INTERVAL '2' HOURS)
)
GROUP BY cc_num, window_start, window_end

Windowing Table-Valued Function (Hop/Sliding windows)
[Figure: hop/sliding windows for cc 1, cc 2, cc 3]

This will cost about $8,400 / month!
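A rough way to see where the cost comes from is to count how many hop windows a single event belongs to, and how many rows each firing emits. The sketch below is an illustration under stated assumptions: `hop_windows_for` is a hypothetical helper that mirrors the usual hop-window assignment rule (windows are `[start, end)` and start on multiples of the slide), and the emission model (one output row per card with data in the window, per firing) is a simplification that ignores engine-level optimizations.

```python
def hop_windows_for(ts_s, size_s, slide_s):
    """Return every hop window [start, end) that contains timestamp ts_s."""
    last_start = ts_s - (ts_s % slide_s)
    starts = range(last_start, ts_s - size_s, -slide_s)
    return [(start, start + size_s) for start in starts]


SIZE_S = 2 * 60 * 60   # INTERVAL '2' HOURS
SLIDE_S = 1            # INTERVAL '1' SECONDS

# Each event falls into size / slide overlapping windows.
windows_per_event = len(hop_windows_for(1_700_000_000, SIZE_S, SLIDE_S))

# Assumed emission model: every firing emits one row per card that has
# data in the window. 200k unique cards per hour is the simulation
# parameter, used here as a lower bound for a two-hour window.
CARDS_IN_WINDOW = 200_000
sliding_rows_per_second = CARDS_IN_WINDOW * (1 / SLIDE_S)

# Emitting once per incoming transaction instead would produce roughly
# the ingestion rate.
per_event_rows_per_second = 1000

print(windows_per_event)                 # 7200
print(sliding_rows_per_second)           # 200000.0
print(hop_windows_for(11, 4, 2))         # [(10, 14), (8, 12)]
```

Under these assumptions the sliding window emits on the order of 200k rows per second versus about 1000 for a per-event emission, which is the same order of magnitude as the mean emissions per second reported later in the deck (179242 vs. 1003).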

Inaccuracy: none of the window firings capture all 5 events!

We can compensate by shortening the window slide.

Drill into window firings
[Figure: timeline of events for cc: 123 and cc: 124, …, covered by sliding Window 1, Window 2, Window 3]

Drill into State Access
[Figure: events such as { "cc": 123, "zip code": 94040 } on a timeline, with per-key state counts cc: 123 Ct: 4, cc: 124 Ct: 3, …]

Challenges Due to Skew
[Figure: events such as { "cc": 123, "zip code": 94040 } on a timeline, with skewed per-key state counts cc: 123 Ct: 1M, cc: 124 Ct: 2, …]

Over-aggregation window

SELECT cc_num, AVG(amt) OVER w AS avg_amt
FROM `transactions`
WINDOW w AS (
  PARTITION BY cc_num
  ORDER BY unix_time
  RANGE BETWEEN INTERVAL '2' HOURS PRECEDING AND CURRENT ROW
)
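To make the difference from the sliding window concrete, here is a minimal Python sketch of what an over aggregation does conceptually: one result per incoming event, computed incrementally from per-card state. The class `RollingAverage` is hypothetical and assumes events arrive in timestamp order per card; it is not how any specific engine implements the query.

```python
from collections import defaultdict, deque


class RollingAverage:
    """Per-key average over the trailing window, updated once per event."""

    def __init__(self, window_s):
        self.window_s = window_s
        self.events = defaultdict(deque)   # key -> deque of (ts, amt)
        self.sums = defaultdict(float)     # key -> running sum of amt

    def on_event(self, key, ts, amt):
        q = self.events[key]
        q.append((ts, amt))
        self.sums[key] += amt
        # RANGE BETWEEN ... PRECEDING AND CURRENT ROW is inclusive, so
        # only rows strictly older than ts - window are evicted.
        while q[0][0] < ts - self.window_s:
            _, old_amt = q.popleft()
            self.sums[key] -= old_amt
        return self.sums[key] / len(q)


agg = RollingAverage(window_s=2 * 60 * 60)
results = [
    agg.on_event("cc1", 0, 10.0),
    agg.on_event("cc1", 3600, 30.0),
    agg.on_event("cc1", 7200, 20.0),   # event at ts=0 is still in range
    agg.on_event("cc1", 7201, 40.0),   # event at ts=0 is now evicted
    agg.on_event("cc2", 7201, 5.0),
]
print(results)  # [10.0, 20.0, 20.0, 30.0, 5.0]
```

Work is proportional to the number of incoming events rather than to the number of window firings times the number of cards. The trade-off is that a card's value is only refreshed when that card sees a new event.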

$8,400 -> $69 / month

Accuracy Challenges
Inaccuracy at this point in time.

Alternatives to the over-aggregation window?

                             Sliding Window        Over Aggregation
Performance
  95th latency (ms)          55986                 74.4
  Mean lag (events)          22770                 788
  Mean emissions per second  179242                1003
Cost
  CPU cores                  224                   2
  CPU utilization (%)        90.70%                0.49%
  Amortized cost             $8,435.91             $69.02
Correctness
  Point-in-time accuracy     Reasonably accurate   Worse when changes are slower

Each dimension is a spectrum between Sliding Window and Over Aggregation:
• 95th freshness (ms): 55986 <-> 74.4
• Amortized cost: $8,435.91 <-> $69.02
• Point-in-time accuracy: Reasonably accurate <-> Worse when changes are slower within window bound

As a user, what knobs do I have to make appropriate tradeoffs between computation freshness, cost, and accuracy?

One point on the spectrum: UNION the over aggregation with a sliding window that slides every 60 seconds.

(
  SELECT cc_num, AVG(amt) OVER w AS avg_amt
  FROM `transactions`
  WINDOW w AS (
    PARTITION BY cc_num
    ORDER BY unix_time
    RANGE BETWEEN INTERVAL '2' HOURS PRECEDING AND CURRENT ROW)
)
UNION
(
  SELECT cc_num, AVG(amt) AS avg_amt
  FROM TABLE(HOP(TABLE transactions, DESCRIPTOR(unix_time),
                 INTERVAL '60' SECONDS, INTERVAL '2' HOURS))
  GROUP BY cc_num, window_start, window_end
);

• Inaccuracy tolerance of 60 seconds
• Results always fresh after a triggering event
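The combined behavior can be approximated in plain Python: emit a fresh value on every event, and additionally refresh on a coarse timer so that expired events leave the average even when a card goes quiet. This is a conceptual sketch for a single card; `HybridAverage` and its methods are hypothetical names, and the timer-based refresh only approximates the 60-second hop window (window boundary semantics differ slightly).

```python
from collections import deque


class HybridAverage:
    """Average over a trailing window, refreshed by events and by a timer."""

    def __init__(self, window_s):
        self.window_s = window_s
        self.q = deque()      # (ts, amt) in arrival order
        self.total = 0.0

    def _evict(self, now):
        # Drop events that are older than the trailing window.
        while self.q and self.q[0][0] < now - self.window_s:
            self.total -= self.q.popleft()[1]

    def _avg(self):
        return self.total / len(self.q) if self.q else None

    def on_event(self, ts, amt):
        """Event path: result is fresh right after the triggering event."""
        self.q.append((ts, amt))
        self.total += amt
        self._evict(ts)
        return self._avg()

    def on_timer(self, now):
        """Timer path: bounds staleness to the timer interval."""
        self._evict(now)
        return self._avg()


h = HybridAverage(window_s=2 * 60 * 60)
a = h.on_event(0, 10.0)      # 10.0
b = h.on_event(100, 30.0)    # 20.0
c = h.on_timer(7260)         # event at ts=0 expired -> 30.0
d = h.on_timer(7320)         # event at ts=100 expired -> None
print(a, b, c, d)
```

With only the event path, the value 20.0 would stay visible indefinitely after the last event. Firing the timer every 60 seconds bounds that staleness to about 60 seconds, at the cost of one extra evaluation per card per minute.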

More advanced optimizations • Optimizing across composable data fabric

Workload, Compiler/Optimizer, Deployment
Relational Expression

    @transformation
    def transaction_count(tx: Transactions):
        return tx[tx.status == "failed"].groupby("account_id").rolling().count()

Python/SQL Interchangeable Declarations

[Diagram labels: Workload Declaration, IR, Online store, Offline store, Scan, Scan, Filter, Union, Join, Unified Processing]
The same diagram is then annotated in turn with:
• Distributed Predicate Push Down/Up
• Unified Streaming Data Lakehouse
• Composable & Unified Processing

A modern data fabric for ML can benefit from an intelligent, distributed, yet intuitive optimization layer.
https://zhenzhongxu.com/

Need an invisible interface to plug into compute ecosystems
• Local/Single Machine
• Remote/Distributed

Need an invisible interface to plug into storage ecosystems
• Streaming Leaning
• Batch Leaning

Data Fabric for a Streaming Pipeline

Data Fabric for a Unified Backfill Pipeline

Training dataset backfill requires point-in-time correctness
[Figure: timeline with repeated feature data updates and prediction events]

Point-in-time joins to generate training data
Given a spine (entity keys + timestamp + label), join features to generate training data.

    train_df = pitc_join_features(
        spine_df,
        features=[
            "tx_max_1h",
            "user_unique_ip_30d",
        ],
    )

spine_df
inference_ts  tid   cc_num  user_id  is_fraud
21:30         0122  2       1        0
21:40         0298  4       1        0
21:55         7539  6       3        1

cc_num_tx_max_1h
ts     cc_num  tx_max_1h
9:20   2       …
10:24  2       …
20:00  4       …

user_unique_id_30d
ts    user_id  unique_ip_30d
6:00  1        …
6:00  3        …
6:00  5        …

train_df
inference_ts  tid   cc_num  user_id  is_fraud  tx_max_1h  user_unique_ip_30d
21:30         0122  2       1        1         …          …
21:40         0298  4       1        1         …          …
21:55         7539  6       3        3         …          …
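The core rule of a point-in-time join is that each spine row may only see the latest feature value whose timestamp is not after the row's inference timestamp. The sketch below shows that rule with the standard library only. It is not the `pitc_join_features` implementation from the slide: `pit_join` is a hypothetical helper, timestamps are encoded as minutes since midnight, and the feature values (elided as "…" on the slide) and the extra 22:00 row are invented for illustration.

```python
from bisect import bisect_right


def pit_join(spine, feature_rows, key, feature):
    """Attach to each spine row the latest feature value with ts <= inference_ts."""
    by_key = {}
    for row in sorted(feature_rows, key=lambda r: r["ts"]):
        ts_list, values = by_key.setdefault(row[key], ([], []))
        ts_list.append(row["ts"])
        values.append(row[feature])

    joined = []
    for row in spine:
        ts_list, values = by_key.get(row[key], ([], []))
        # Index of the first feature row strictly after inference_ts.
        i = bisect_right(ts_list, row["inference_ts"])
        joined.append({**row, feature: values[i - 1] if i else None})
    return joined


def hm(h, m):
    """Minutes since midnight."""
    return h * 60 + m


spine = [
    {"inference_ts": hm(21, 30), "tid": "0122", "cc_num": 2, "is_fraud": 0},
    {"inference_ts": hm(21, 40), "tid": "0298", "cc_num": 4, "is_fraud": 0},
    {"inference_ts": hm(21, 55), "tid": "7539", "cc_num": 6, "is_fraud": 1},
]

# Feature values are hypothetical; the 22:00 row is added to show that
# values from the future are never joined.
tx_max_1h = [
    {"ts": hm(9, 20), "cc_num": 2, "tx_max_1h": 50.0},
    {"ts": hm(10, 24), "cc_num": 2, "tx_max_1h": 75.0},
    {"ts": hm(20, 0), "cc_num": 4, "tx_max_1h": 20.0},
    {"ts": hm(22, 0), "cc_num": 2, "tx_max_1h": 999.0},
]

train = pit_join(spine, tx_max_1h, key="cc_num", feature="tx_max_1h")
print([r["tx_max_1h"] for r in train])  # [75.0, 20.0, None]
```

Card 6 has no feature history, so it gets `None`. Whether a feature row stamped exactly at the inference time should be visible is a design choice; this sketch includes it.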
