Zhenzhong Xu
November 16, 2023

[2023] 5-minute Practical Streaming Techniques That Can Save You Millions

Please read the blog post (https://medium.com/data-engineer-things/5-minute-practical-streaming-techniques-that-can-save-you-millions-6d6b49400308) that accompanies this deck.

Companies are looking for ways to reduce streaming infrastructure costs in the current macroeconomic environment. However, this is a difficult task for two reasons. First, cutting costs without sacrificing latency or correctness requires deep knowledge of engine implementation details and a keen eye for opportunities. Second, optimization techniques are less accessible when working with high-level language abstractions such as SQL, because these techniques are often coupled with engine query planning, which requires even deeper expertise. Many Data Engineers and Data Scientists prefer to avoid dealing with Intermediate Representations (IR) and optimization rules. They also may not care too deeply about the details of applying streaming watermarks to reduce the runtime complexity of Point-In-Time-Correct join queries.

In this talk, I will share some simple optimization techniques you can apply in just a few minutes with streaming SQL that can cut costs by 10x or even 100x. Then, we’ll gradually dive deeper into some novel optimization techniques that can be applied across your distributed storage and compute stacks.

By the end of this talk, if you are a Data Engineer or a Data Scientist who is looking to build real-time streaming workloads but has concerns about cost, I hope you’ll be able to walk away with some tricks so you can check that box on your product ROI OKR :) If you are a platform engineer, I hope you will learn how to apply optimization abstractions across various computing and storage engines in your platform.


Transcript

  1. Use case drivers for streaming pipelines

    • Trust & safety: fintech, cybersecurity, social media, ecommerce
    • Dynamic pricing: ecommerce, gaming
    • Recommender system: ecommerce, social media, entertainment
    • Ads optimization: ecommerce, social media
    • In-the-moment coordination: logistics, healthcare, customer support
  2. Data staleness, even by just one hour, costs money

    • LinkedIn’s Real-time Anti-abuse (2022): Moving from an offline pipeline (hours) to a real-time pipeline (minutes) led to +30% in bad actors caught online and +21% in fake account detection
    • Instacart: The Journey to Real-Time Machine Learning (2022): Real-time pipeline directly reduces millions of fraud-related costs annually
    • How WhatsApp catches and fights abuse (2022): A delay of a few hundred milliseconds can increase spam by 20-30%
    • How Pinterest Leverages Realtime User Actions to Boost Engagement (2022): “One of our most impactful innovations recently, increasing Home feed engagement by 11% while reducing Pinner hide volume by 10%”
    • Airbnb: Real-time Personalization using Embeddings for Search Ranking (2018): Moving from offline scoring to online scoring grows bookings by +5.1%
  3. Optimization Goals

    Three competing goals: Correctness, Low cost, Low latency. Pairwise combinations:
    1. Fast & Correct
    2. Cheap & Correct
    3. Fast & Cheap
    Reference: Open Problems in Stream Processing: A Call To Action, Tyler Akidau (2019)
  4. Latency: the big picture

    [Diagram] Components: Service A, Service B, …, Service X; Tx; DWH (Backfill); Data Processing; Online Store; Offline Store; Inference; Training; In-memory DataFrame; Local Experimentation.
    Latency labels in the diagram:
    • Network latency: 10s ms
    • Online serving latency: 10s ms
    • Offline serving latency: 1-10s sec
    • Computation latency: 10 ms to hours
    • Backfill catch-up latency: mins to hours
    • 1-10 mins (label position not recoverable from the transcript)
  5. How to measure Latencies

    Computation Latency
    • Latency marker: Latency = Sum(marker_latency_per_operator) + emission_cadence
    • Event time lag: Event time lag = current processing time - current watermark
    Backfill Latency
    • Backfill latency is the time difference between when the user initiates back-processing and when the pipeline catches up and stabilizes to the point that event time lag is within a reasonable bound.
    • Watermark progression / wall clock time
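
The formulas on the slide above can be sketched in a few lines of Python. The function names and sample numbers are hypothetical and chosen only for illustration. Reading a watermark-to-wall-clock ratio above 1 as "the backfill is catching up" is an interpretation, not something stated on the slide.

```python
def marker_latency_ms(per_operator_ms, emission_cadence_ms):
    """Latency = Sum(marker_latency_per_operator) + emission_cadence."""
    return sum(per_operator_ms) + emission_cadence_ms


def event_time_lag_ms(processing_time_ms, watermark_ms):
    """Event time lag = current processing time - current watermark."""
    return processing_time_ms - watermark_ms


def watermark_progress_ratio(wm_start_ms, wm_end_ms, wall_start_ms, wall_end_ms):
    """Watermark progression divided by elapsed wall-clock time."""
    return (wm_end_ms - wm_start_ms) / (wall_end_ms - wall_start_ms)


# Hypothetical sample values (milliseconds).
total_latency = marker_latency_ms([3, 12, 5], emission_cadence_ms=1000)
lag = event_time_lag_ms(1_700_000_060_000, 1_700_000_058_500)
# Watermark advanced 10 minutes of event time during 1 minute of wall-clock time.
ratio = watermark_progress_ratio(0, 600_000, 0, 60_000)
```

With these sample inputs, the pipeline would report 1020 ms of marker latency and 1500 ms of event time lag, while the backfill advances event time ten times faster than the wall clock.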
  6. Cost: the big picture

    [Diagram] Components: Service A, Service B, …, Service X; Tx; DWH (Backfill); Data Processing; Online Store; Offline Store; Serving; Training; In-memory DataFrame; Experimentation.
    Cost labels in the diagram: Computation Cost, Storage Cost, Data Storage, Data Movement (network latency: 10s ms).
  7. Factors impacting cost: Computation

    Processing:
    • Events ingestion speed and variance/spike pattern
    • Complexity of transformation/aggregation
    • Results emission frequency and percentage of data recomputation
    • High-level computation overhead such as serialization/deserialization
    • Low-level computation overhead such as disk <> memory <> CPU register <> instruction set bus utilization
    • Network latency between systems
    State management:
    • Per event size
    • Window length
    • Total keyspace and in-state keyspace
    • IO latency in state store
  8. Factors impacting cost: Storage and Data Movement

    Storage
    • Access pattern / Data structure
    • Volume/scale
    • Keyspace
    • Cold vs. hot storage
    Data Movement
    • Data close to compute vs. data close to storage, or both.
    • Data transmitted within the same VPC will be cheaper. Data transmitted over the internet backbone will be more expensive (cross-cloud provider, cross-region).
    • Whether data transfer requires TLS encryption.
    • Whether data transfer requires privacy governance.
  9. Consider optimization knobs systematically

    To speed things up
    • Use more power, so you get to do more in less time
    • Process less, so you get to finish on time
    • Spend less time waiting, so you get to finish earlier
    • Use a better execution plan, so you get work done faster and smarter
    To lower cost
    • Do fewer redundant things, so you can spend less on compute and storage
    • Use more cost-effective hardware/technology, so you can shift expensive costs to cheaper ones
  10. Example: Fraud Detection Feature Optimization

    Scenario:
    • 10MM active credit cards
    • 30% Daily Active Cards
    • 1-2% Hourly Active Cards during peak
    • Active cards have an average of 10 swipes per day
    Simulation parameters:
    • 1000 transactions per second
    • 200k unique cards within an hour

    Field      Type     Description
    cc_num     String   Credit card number
    amt        Float    Transaction amount
    unix_time  Int64    Unix timestamp of the transaction
    merchant   String   Merchant name
    zipcode    String   Zip code of the transaction location
    category   String   Category of the transaction
    city       String   City of the transaction location
    state      String   State of the transaction location
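
As a rough sanity check on the scenario above, the stated percentages can be turned into absolute numbers. This is a back-of-the-envelope sketch: the variable names are made up, and the upper bound of the 1-2% range is taken for the peak hour.

```python
SECONDS_PER_DAY = 86_400

total_cards = 10_000_000                       # 10MM active credit cards
daily_active = total_cards * 30 // 100         # 30% daily active cards
hourly_active_low = total_cards * 1 // 100     # 1% hourly active during peak
hourly_active_high = total_cards * 2 // 100    # 2% hourly active during peak
swipes_per_day = daily_active * 10             # 10 swipes per active card per day

# Average load implied by the scenario, to compare with the simulated 1000 TPS.
avg_tps = swipes_per_day / SECONDS_PER_DAY
simulated_tps = 1000
headroom = simulated_tps / avg_tps

print(f"daily active cards: {daily_active:,}")
print(f"hourly active cards: {hourly_active_low:,} to {hourly_active_high:,}")
print(f"average TPS: {avg_tps:.0f}, simulated/average: {headroom:.1f}x")
```

Under these assumptions the 200k unique cards per hour in the simulation matches the 2% peak figure, and the simulated 1000 transactions per second is a little under three times the day-long average rate.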
  11. Let’s compute: Average transaction amount from a twelve-month window till now, per user/credit card.

    We want the results emitted no later than one second after each transaction event is received.
  12. SELECT cc_num, AVG(amt) AS avg_amt
    FROM TABLE(
      HOP(TABLE transactions, DESCRIPTOR(unix_time),
          INTERVAL '1' SECONDS, INTERVAL '2' HOURS)
    )
    GROUP BY cc_num, window_start, window_end
  13. Windowing Table-Valued-Function (Hop/Sliding windows)

    SELECT cc_num, AVG(amt) AS avg_amt
    FROM TABLE(
      HOP(TABLE transactions, DESCRIPTOR(unix_time),
          INTERVAL '1' SECONDS, INTERVAL '2' HOURS)
    )
    GROUP BY cc_num, window_start, window_end
  14. Windowing Table-Valued-Function (Hop/Sliding windows)

    SELECT cc_num, AVG(amt) AS avg_amt
    FROM TABLE(
      HOP(TABLE transactions, DESCRIPTOR(unix_time),
          INTERVAL '1' SECONDS, INTERVAL '2' HOURS)
    )
    GROUP BY cc_num, window_start, window_end

    [Diagram] Windows shown per key: cc 1, cc 2, cc 3
  15. Drill into window firings

    [Diagram] Time axis with keys cc: 123, cc: 124, ……; sliding windows labeled Window 1, Window 2, Window 3
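
To make the window firings concrete, the sketch below enumerates which hopping windows a single event falls into, for a given window size and slide. It is a simplified model, not an engine implementation: `hop_windows` is a hypothetical helper, and it assumes integer second timestamps, window starts aligned to multiples of the slide, and end-exclusive windows.

```python
def hop_windows(event_ts, size, slide):
    """Return (start, end) for every hopping window that contains event_ts.

    Windows start at multiples of `slide` and cover [start, start + size).
    """
    last_start = event_ts - (event_ts % slide)
    # Walk backwards over window starts while the window still covers event_ts.
    starts = range(last_start, event_ts - size, -slide)
    return [(start, start + size) for start in starts]


TWO_HOURS = 2 * 60 * 60
event_ts = 1_700_000_000  # hypothetical event time in seconds

# With a 1-second slide, one event contributes to size / slide windows.
windows_1s = len(hop_windows(event_ts, TWO_HOURS, slide=1))    # 7200 windows
# A coarser 60-second slide cuts the number of windows per event.
windows_60s = len(hop_windows(event_ts, TWO_HOURS, slide=60))  # 120 windows
```

In this model every transaction updates 7200 overlapping two-hour windows when the slide is one second, and each of those windows eventually fires for every key that has data in it.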
  16. Drill into State Access

    [Diagram] Events such as { "cc": 123, "zip code": 94040 } … arrive over time; per-key state: cc: 123 (Ct: 4), cc: 124 (Ct: 3), ……
  17. Challenges Due to Skew

    [Diagram] Events such as { "cc": 123, "zip code": 94040 } … arrive over time; per-key state: cc: 123 (Ct: 1M), cc: 124 (Ct: 2), ……
  18. Over-aggregation window

    SELECT cc_num, AVG(amt) OVER w AS avg_amt
    FROM `transactions`
    WINDOW w AS (
      PARTITION BY cc_num
      ORDER BY unix_time
      RANGE BETWEEN INTERVAL '2' HOURS PRECEDING AND CURRENT ROW
    )
  19. Alternatives?

    SELECT cc_num, AVG(amt) OVER w AS avg_amt
    FROM `transactions`
    WINDOW w AS (
      PARTITION BY cc_num
      ORDER BY unix_time
      RANGE BETWEEN INTERVAL '2' HOURS PRECEDING AND CURRENT ROW
    )
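
The over-aggregation query emits one result per incoming row, computed over that key's trailing two hours. A minimal in-memory sketch of that behavior follows. It is not how any particular engine implements it: the class name is hypothetical, events are assumed to arrive in timestamp order per key, and timestamps are plain integer seconds.

```python
from collections import defaultdict, deque


class OverAggregationAvg:
    """Trailing-range average per key, emitted once per incoming event."""

    def __init__(self, range_seconds):
        self.range_seconds = range_seconds
        self.rows = defaultdict(deque)    # cc_num -> deque of (ts, amt)
        self.sums = defaultdict(float)    # cc_num -> running sum of amt

    def on_event(self, cc_num, ts, amt):
        rows = self.rows[cc_num]
        rows.append((ts, amt))
        self.sums[cc_num] += amt
        # Evict rows that fall before (ts - range); the bound itself is kept,
        # mirroring RANGE BETWEEN ... PRECEDING AND CURRENT ROW.
        while rows[0][0] < ts - self.range_seconds:
            _, old_amt = rows.popleft()
            self.sums[cc_num] -= old_amt
        return self.sums[cc_num] / len(rows)


agg = OverAggregationAvg(range_seconds=2 * 60 * 60)
first = agg.on_event("123", ts=0, amt=10.0)       # only one row in range
second = agg.on_event("123", ts=3600, amt=30.0)   # both rows in range
third = agg.on_event("123", ts=7201, amt=50.0)    # the row at ts=0 is evicted
other = agg.on_event("124", ts=7201, amt=5.0)     # keys are independent
```

Work in this model is proportional to the number of events rather than to the number of overlapping windows, and a key that receives no events produces no output.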
  20. Sliding Window vs. Over Aggregation

    Metric                          Sliding Window        Over Aggregation
    Performance
      95th latency (ms)             55986                 74.4
      Mean lag (events)             22770                 788
      Mean emissions per second     179242                1003
    Cost
      CPU cores                     224                   2
      CPU Utilization (%)           90.70%                0.49%
      Amortized cost                $8,435.91             $69.02
    Correctness
      Point-in-time accuracy        Reasonably Accurate   Worse when changes are slower
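
One way to read the comparison is as ratios between the two columns. The snippet below only re-expresses the figures reported on the slide. The dictionary keys are made-up labels.

```python
# Figures copied from the slide: sliding window vs. over aggregation.
sliding_window = {
    "p95_latency_ms": 55986,
    "emissions_per_second": 179242,
    "cpu_cores": 224,
    "amortized_cost_usd": 8435.91,
}
over_aggregation = {
    "p95_latency_ms": 74.4,
    "emissions_per_second": 1003,
    "cpu_cores": 2,
    "amortized_cost_usd": 69.02,
}

# Ratio of sliding window to over aggregation for each metric.
ratios = {
    metric: sliding_window[metric] / over_aggregation[metric]
    for metric in sliding_window
}

for metric, ratio in ratios.items():
    print(f"{metric}: {ratio:.1f}x")
```

On these numbers the sliding-window job uses 112 times the CPU cores and costs roughly 122 times as much as the over-aggregation job.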
  21. Sliding Window vs. Over Aggregation: a spectrum

    Metric                          Sliding Window        Over Aggregation
    Performance
      95th freshness (ms)           55986                 74.4
      Mean lag (events)             22770                 788
      Mean emissions per second     179242                1003
    Cost
      CPU cores                     224                   2
      CPU Utilization (%)           90.70%                0.49%
      Amortized cost                $8,435.91             $69.02
    Correctness
      Point-in-time accuracy        Reasonably Accurate   Worse when changes are slower within window bound

    The 95th freshness, amortized cost, and point-in-time accuracy rows are marked as a spectrum between the two approaches.
    As a user, what knobs do I have to make appropriate tradeoffs between computation freshness, cost, and accuracy?
  22. Sliding Window vs. Over Aggregation: combining both

    Metric                          Sliding Window        Over Aggregation
    Performance
      95th freshness (ms)           55986                 74.4
      Mean lag (events)             22770                 788
      Mean emissions per second     179242                1003
    Cost
      CPU cores                     224                   2
      CPU Utilization (%)           90.70%                0.49%
      Amortized cost                $8,435.91             $69.02
    Correctness
      Point-in-time accuracy        Accurate              Worse when changes are slower within window bound

    The 95th freshness, amortized cost, and point-in-time accuracy rows are marked as a spectrum between the two approaches.

    (
      SELECT cc_num, AVG(amt) OVER w AS avg_amt
      FROM `transactions`
      WINDOW w AS (
        PARTITION BY cc_num
        ORDER BY unix_time
        RANGE BETWEEN INTERVAL '2' HOURS PRECEDING AND CURRENT ROW)
    )
    UNION
    (
      SELECT cc_num, AVG(amt) AS avg_amt
      FROM TABLE(HOP(TABLE transactions, DESCRIPTOR(unix_time),
                     INTERVAL '60' SECONDS, INTERVAL '2' HOURS))
      GROUP BY cc_num, window_start, window_end
    );

    • Inaccuracy tolerance of 60 seconds
    • Results always fresh after a triggering event
  23. Python/SQL Interchangeable Declarations

    [Diagram] Labels: Workload, Compiler/Optimizer, Deployment, Relational Expression

    @transformation
    def transaction_count(tx: Transactions):
        return tx[tx.status == "failed"].groupby("account_id").rolling().count()
  24. Distributed Predicate Push Down/Up

    [Diagram] Workload Declaration → IR; operators: Filter, Scan, Scan, Union, Join, Filter; targets: Online store, Offline store, Unified Processing
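
Predicate push down can be illustrated on a toy plan tree. The sketch below is not the IR from the talk: the `Scan`, `Union`, and `Filter` node classes and the `push_down_filter` rule are hypothetical, and only the simplest case is covered, moving a filter that sits above a union down onto each input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    source: str


@dataclass(frozen=True)
class Union:
    left: object
    right: object


@dataclass(frozen=True)
class Filter:
    predicate: str
    child: object


def push_down_filter(node):
    """Rewrite Filter(Union(a, b)) into Union(Filter(a), Filter(b)), recursively."""
    if isinstance(node, Filter):
        child = push_down_filter(node.child)
        if isinstance(child, Union):
            # The same predicate applies to both inputs of a union.
            return Union(
                push_down_filter(Filter(node.predicate, child.left)),
                push_down_filter(Filter(node.predicate, child.right)),
            )
        return Filter(node.predicate, child)
    if isinstance(node, Union):
        return Union(push_down_filter(node.left), push_down_filter(node.right))
    return node


# Hypothetical plan: filter applied after unioning the online and offline stores.
plan = Filter("status = 'failed'", Union(Scan("online_store"), Scan("offline_store")))
optimized = push_down_filter(plan)
```

After the rewrite each store is filtered before the union, so less data has to be moved and merged. Deciding when such a rewrite pays off across separate storage and compute engines is the part a real optimizer has to handle.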
  25. Unified Streaming Data Lakehouse

    [Diagram] Workload Declaration → IR; operators: Filter, Scan, Scan, Union, Join, Filter; targets: Online store, Offline store, Unified Processing
  26. Composable & Unified Processing

    [Diagram] Workload Declaration → IR; operators: Filter, Scan, Scan, Union, Join, Filter; targets: Online store, Offline store, Unified Processing
  27. A modern data fabric for ML can benefit from an intelligent, distributed, yet intuitive optimization layer.

    https://zhenzhongxu.com/
  28. Point-in-time joins to generate training data

    Given a spine (entity keys + timestamp + label), join features to generate training data.

    train_df = pitc_join_features(
        spine_df,
        features=[
            "tx_max_1h",
            "user_unique_ip_30d",
        ],
    )

    spine_df
    inference_ts  tid   cc_num  user_id  is_fraud
    21:30         0122  2       1        0
    21:40         0298  4       1        0
    21:55         7539  6       3        1

    train_df
    inference_ts  tid   cc_num  user_id  is_fraud  tx_max_1h  user_unique_ip_30d
    21:30         0122  2       1        1         …          …
    21:40         0298  4       1        1         …          …
    21:55         7539  6       3        3         …          …

    cc_num_tx_max_1h
    ts     cc_num  tx_max_1h
    9:20   2       …
    10:24  2       …
    20:00  4       …

    user_unique_ip_30d
    ts     user_id  unique_ip_30d
    6:00   1        …
    6:00   3        …
    6:00   5        …
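
A point-in-time-correct join picks, for each spine row, the latest feature value whose timestamp is not after the row's inference timestamp, so no future information leaks into training data. The sketch below uses only the standard library and is not the `pitc_join_features` function from the slide: `pit_join` is a hypothetical helper, timestamps are written as minutes since midnight, and the feature values are invented because the slide elides them.

```python
from bisect import bisect_right


def pit_join(spine, feature_rows, key_col, feature_name):
    """Attach the latest feature value with feature ts <= inference_ts to each spine row."""
    # Build a per-key index of (sorted timestamps, values).
    index = {}
    for row in sorted(feature_rows, key=lambda r: r["ts"]):
        times, values = index.setdefault(row[key_col], ([], []))
        times.append(row["ts"])
        values.append(row[feature_name])

    joined = []
    for row in spine:
        times, values = index.get(row[key_col], ([], []))
        # Count of feature rows at or before the inference timestamp.
        pos = bisect_right(times, row["inference_ts"])
        joined.append({**row, feature_name: values[pos - 1] if pos else None})
    return joined


# Timestamps in minutes since midnight; feature values are made up.
spine = [
    {"inference_ts": 21 * 60 + 30, "tid": "0122", "cc_num": 2},
    {"inference_ts": 21 * 60 + 40, "tid": "0298", "cc_num": 4},
    {"inference_ts": 21 * 60 + 55, "tid": "7539", "cc_num": 6},
]
tx_max_1h = [
    {"ts": 9 * 60 + 20, "cc_num": 2, "tx_max_1h": 40.0},
    {"ts": 10 * 60 + 24, "cc_num": 2, "tx_max_1h": 75.0},
    {"ts": 20 * 60, "cc_num": 4, "tx_max_1h": 12.5},
    {"ts": 22 * 60, "cc_num": 2, "tx_max_1h": 999.0},  # later than every spine row
]

train_rows = pit_join(spine, tx_max_1h, key_col="cc_num", feature_name="tx_max_1h")
```

Here the card 2 row picks the 10:24 value rather than the later 22:00 one, and card 6, which has no feature rows, gets `None`.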