Slide 1

Invisible Interfaces: Considerations for Abstracting Complexities of a Real-time ML Platform
Zhenzhong Xu (@zhenzhongxu)
Current '22, Oct 2022

Slide 2

The discovery of something invisible
The ancient Greek name for amber: elektron (Thales of Miletus)

Slide 3

The endeavor to make it useful
● Ubiquitous
● Easy and responsive
● Just works!
the invisible interface

Slide 4

About Zhenzhong Xu
● Building a real-time ML platform @ Claypot
● Real-time data infrastructure @ Netflix
● Cloud infra @ Microsoft

Slide 5

"There's been an explosion of ML use cases that … don't make sense if they aren't in real time. More and more people are doing ML in production, and most cases have to be streamed."
Ali Ghodsi, Databricks CEO
Use cases: fraud prevention, personalization, customer support, dynamic pricing, trending products, risk assessment, robotics, ads, ETA, network analysis, sentiment analysis, object detection, …

Slide 6

AIIA survey (2022) - https://ai-infrastructure.org/ai-infrastructure-ecosystem-report-of-2022/

Slide 7

Layers: Data Science · Real-time ML Platform · Data Infrastructure
Concerns: Exploration & Research · Model Architecture & Tuning · Model Analysis & Selection · Ingestion & Transport · Security & Governance · Multi-tenancy Isolation · Data Sources · Storage · Query & Compute · Business Decision Optimization · Workflow Orchestration · Analytics / Visualization

Slide 8

[Diagram: data and model flows connecting Feature Materialization, Label Materialization, Model Training, Model Evaluation, Model Serving, Model Monitoring, and Data Monitoring, sitting between the product ecosystem and the analytics ecosystem]

Slide 9

Combine your data and model loops: why you need both to be fast

Data Loop | Model Loop | Challenge / Value
Slow | Slow | Low freshness, low quality. Out-of-date models; predictions and training use stale data; model drift results in low model accuracy.
Slow | Fast | Low freshness, low quality. Model training is bottlenecked by the availability of fresh data. Prediction latency is high, or predictions use stale data.
Fast | Slow | High freshness, low quality. Fresh data is available for predictions, training, and observability, but slow model iteration results in out-of-date models and lower accuracy.
Fast | Fast | High freshness, high quality. You want your ML ecosystem to be here.

Slide 10

Online Customer Service Use Case Example
● Suggest a diagnostic runbook
● Proactive in-the-moment remediation actions
● Fraud prevention vs. detection

Slide 11

Define model features (Data Scientists)
● Average transaction amount over the past 14 days
● Request channel encoding
● Text embedding similarity score
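These three features can be sketched as plain functions. This is a minimal illustration, not the talk's implementation; the helper names and the channel vocabulary are assumptions.

```python
from datetime import datetime, timedelta
import math

def avg_transaction_amount(transactions, now, days=14):
    """Average transaction amount over the trailing `days` window."""
    cutoff = now - timedelta(days=days)
    amounts = [amt for ts, amt in transactions if ts >= cutoff]
    return sum(amounts) / len(amounts) if amounts else 0.0

def encode_channel(channel, channels=("web", "mobile", "phone", "chat")):
    """One-hot encoding of the request channel (vocabulary is illustrative)."""
    return [1.0 if channel == c else 0.0 for c in channels]

def cosine_similarity(a, b):
    """Similarity score between two text embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

Note that the first feature is time-windowed, which is what makes its freshness (and backfill) interesting later in the talk.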

Slide 12

ML Platforms: what's preventing ubiquity?
What's the appropriate level of complexity for the ML platform to expose?

Slide 13

1. Offline batch prediction
● Flow: DWH (Snowflake / BigQuery / S3) → batch job to generate predictions (e.g. Airflow + Spark) → predictions → BI
● Use cases: churn prediction, user LTV, risk planning, etc.
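A minimal sketch of this offline path, with a toy churn "model" and in-memory rows standing in for the warehouse and the scheduled Airflow + Spark job. All names here are illustrative.

```python
def churn_model(features):
    # Placeholder model: churn probability grows with days since last login.
    return min(1.0, features["days_since_login"] / 30.0)

def run_batch_prediction(rows):
    """Score every row in one offline pass; a scheduler (e.g. Airflow)
    would run this periodically and write the output back for BI."""
    return [
        {"user_id": r["user_id"], "churn_score": churn_model(r)}
        for r in rows
    ]
```

The defining property is that predictions are precomputed for all entities, so freshness is bounded by the job's schedule.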

Slide 14

2. Online prediction with batch features
● Batch features: computed offline (e.g. product embeddings); a batch job writes them to the DWH (offline) and to a KV store (online, for low-latency access)
● At request time, the prediction service joins the batch features with the prediction request
● Use cases: recsys
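The request-time join can be sketched with a dict standing in for the KV store; the feature and field names are hypothetical.

```python
kv_store = {}  # stands in for a low-latency online store (Redis, DynamoDB, ...)

def write_batch_features(features_by_key):
    """The periodic batch job publishes precomputed features for online access."""
    kv_store.update(features_by_key)

def predict(request):
    """Join the incoming request with its batch features, then score.
    The scoring formula is a placeholder."""
    feats = kv_store.get(request["product_id"], {"embedding_norm": 0.0})
    return request["click_rate"] * feats["embedding_norm"]
```

The features are still only as fresh as the last batch run; only the prediction itself happens online.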

Slide 15

3. Online prediction with on-demand features
● On-demand features: queried from transactional stores (e.g. Postgres, Cassandra) at prediction time, e.g. # orders in the last 30 mins; batch features still come from the DWH and KV store
● The prediction service joins batch and on-demand features with the prediction request
● Use cases: recsys
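The "# orders in the last 30 mins" feature can be sketched against an in-memory SQLite table standing in for the transactional store; the schema is an assumption for illustration.

```python
import sqlite3

def orders_last_30_min(conn, user_id, now_epoch):
    """On-demand feature: count this user's orders in the trailing 30 minutes,
    computed by querying the transactional store at prediction time."""
    cur = conn.execute(
        "SELECT COUNT(*) FROM orders WHERE user_id = ? AND ts > ?",
        (user_id, now_epoch - 30 * 60),
    )
    return cur.fetchone()[0]
```

Freshness is now excellent, but the query cost lands inside the prediction request's latency budget.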

Slide 16

4. Online prediction with streaming features
● Streaming features: computed online from the real-time transport (logs), e.g. distance between two locations, count/percentile over the last 30 mins
● A stream feature extraction job writes features online (KV store) and offline (DWH); a feature service joins streaming and batch features for the prediction service
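The core of such a streaming feature is the window logic, maintained incrementally as events arrive. A minimal sliding-count sketch follows; a real extraction job would run in a stream processing engine, and the class below is illustrative only.

```python
from collections import deque

class SlidingCount:
    """Count of events over a trailing time window, updated per event."""

    def __init__(self, window_secs=30 * 60):
        self.window_secs = window_secs
        self.events = deque()  # timestamps, oldest first

    def add(self, ts):
        """Ingest one event timestamp and return the current window count."""
        self.events.append(ts)
        self._evict(ts)
        return len(self.events)

    def _evict(self, now):
        # Drop timestamps that have fallen out of the window.
        while self.events and self.events[0] <= now - self.window_secs:
            self.events.popleft()
```

Unlike the on-demand variant, the feature is already materialized when the prediction request arrives.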

Slide 17

Combining offline and online data
[Timeline: transaction behavior over the last 6 months; the stream covers roughly the last 7 days, while the DWH covers T-7 days back to T-6 months]

Slide 18

Combining offline and online data
[Timeline: transaction behavior over the last 6 months; the stream covers roughly the last 7 days, while the DWH covers T-7 days back to T-6 months]
The backfilling challenge: the same feature must be computed consistently across both sources.
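One naive way to stitch the two sources is a hard cutoff so no event is counted twice. A sketch, with illustrative event shapes; real backfill also has to reconcile the two sources' differing semantics at the seam.

```python
def combine(warehouse_events, stream_events, cutoff_ts):
    """Take warehouse rows strictly before the cutoff and stream rows
    from the cutoff onward, so overlapping events are not double counted."""
    old = [e for e in warehouse_events if e["ts"] < cutoff_ts]
    new = [e for e in stream_events if e["ts"] >= cutoff_ts]
    return old + new
```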

Slide 19

Backfill in Lambda Architecture
[Diagram: a data source feeds two paths: in-motion compute → online storage → online query (serving), and at-rest compute → offline storage → offline query (training); backfill requires a mixed query spanning both storages]

Slide 21

Backfill in Kappa Architecture
[Diagram: a data source feeds in-motion compute, which backfills by replaying the historical log; results land in materialized views serving both online query (serving) and offline query (training), with a single streaming transformation replacing the separate batch transformation]
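The Kappa premise (one transformation, with backfill as a replay of the historical log through the same code path) can be sketched as follows; the event shape and aggregation are illustrative.

```python
def transform(event):
    """The single transformation shared by the live and backfill paths."""
    return {"user": event["user"], "amount_cents": int(event["amount"] * 100)}

def process(log, view=None):
    """Replay any log (historical for backfill, or the live tail) into a
    materialized view; both paths run the identical transform."""
    view = {} if view is None else view
    for event in log:
        row = transform(event)
        view[row["user"]] = view.get(row["user"], 0) + row["amount_cents"]
    return view
```

Because backfill and serving share one code path, there is no batch/stream divergence to reconcile.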

Slide 23

Unified Backfill
[Diagram: in-motion compute intelligently backfills from dual sources (the stream and DWH-backed logs) into materialized views serving online query (serving) and offline query (training), under an orchestration & governance layer]

Slide 24

Abstracted Unified Backfill
[Diagram: in-motion compute intelligently backfills from dual sources (the stream and DWH-backed logs) into materialized views serving online query (serving) and offline query (training), under an orchestration & governance layer, all hidden behind the platform abstraction]

Slide 25

Build model features (Data Scientists)
● Should I declare features in SQL or Python?
● How do I join existing intent classification results to my new feature?
● What confidence can I get before checking in my code?

Slide 26

ML Platforms: what's preventing "easy and responsive"?
Does the ML platform speak the same language as its users?

Slide 27

Does the ML platform speak the same language as its users?
Questions for ML platforms:
● Can users express or declare what they need to control in a single coherent interface?
● Can the platform understand the intent and drive the underlying system?
● Can the user and the platform communicate interactively, in a timely fashion?
● Can users understand their options and tradeoffs without reading a 300-page manual?
● How much integration effort is needed to plug a model into existing data streams?

Slide 28

Online Prediction: Latency vs. Staleness
Prediction latency spans feature retrieval, prediction computation, and prediction retrieval; feature computation happens before or during the request, depending on the feature type.

Feature type | Staleness | Latency
RT feature | No staleness* | Low (10s of ms to 1s)
NRT feature | > secs | Lower (10s to 100s of ms)
Batch feature | > hours | Lower (10s to 100s of ms)

*Computation takes time, and latency includes the computation time. Feature performance depends on the source technology and shared traffic patterns.

Slide 29

What about tradeoffs?
● Three dimensions: correctness, low cost, low latency
● You can choose 2, and have to be flexible on the 3rd: 1. fast & correct, 2. cheap & correct, 3. fast & cheap
● Clean abstractions are needed for full freedom
Reference: Open Problems in Stream Processing: A Call to Action, Tyler Akidau (2019)

Slide 30

Python vs. SQL vs. (Scala)?

Slide 31

Python vs. SQL? ≈

Slide 32

Python vs. SQL? ≈
Both can lower to a shared intermediate representation (IR) that compute engines execute.
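The equivalence can be illustrated by defining one feature twice, in SQL (via SQLite here) and in Python, and checking that they agree on sample data. In a real platform, a shared IR would guarantee this by construction; the table and function names below are illustrative.

```python
import sqlite3

rows = [("u1", 10.0), ("u1", 30.0), ("u2", 5.0)]

def avg_amount_py(rows, user):
    """The feature declared in Python."""
    amounts = [amt for uid, amt in rows if uid == user]
    return sum(amounts) / len(amounts)

# The same feature declared in SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txns (user_id TEXT, amount REAL)")
conn.executemany("INSERT INTO txns VALUES (?, ?)", rows)

def avg_amount_sql(conn, user):
    cur = conn.execute(
        "SELECT AVG(amount) FROM txns WHERE user_id = ?", (user,))
    return cur.fetchone()[0]
```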

Slide 33

There is a catch! UDF…

Slide 34

Don't invent a new language/DSL! Evolve existing ones to make them better.

Slide 35

The connector ecosystem is maturing. Nice, but what about event schema and envelope standards?

Slide 36

Deploy model features (Data Scientists)
● Should I duplicate the feature results in a different table?
● Which team do I need to inform about the change?
● Do I need to worry about training/prediction skew?

Slide 37

ML Platforms: what doesn't "just work"?
What symptoms indicate that your platform is not trusted?

Slide 38

ML Platforms: what doesn't "just work"?
What symptoms indicate that your platform is not trusted?
● My freedom and your responsibility
● Producer and consumer tension
● Users are forced to choose between basic requirements

Slide 39

● Offline / online consistency
● Sharing and reusing
● Schema evolution
● SWE practices

Slide 40

You are part of the endeavor to make real-time data useful!
● Ubiquitous
● Easy and responsive
● Just works!
the invisible interface
https://zhenzhongxu.com/
[email protected]