BUILDING PREDICTION API AS A SERVICE
KYRYLO PEREVOZCHYKOV
WHO AM I
KYRYLO PEREVOZCHYKOV
‣ Easy-to-use machine learning service that helps people build models and run predictions on them
‣ My daily tasks split 70% development, 30% operations
‣ I call myself a software engineer
MACHINE LEARNING PIPELINE
MACHINE LEARNING AS A SERVICE PIPELINE
DATA INTAKE → DATA TRANSFORM → BUILD MODEL → DEPLOY MODEL → PREDICT
USE CASES
Inputs: products clicked
Application: product recommendation
Outputs: list of similar products
USE CASES
Inputs: item (house, airline ticket)
Application: price intelligence and forecasting
Outputs: price trend and likelihood to change
USE CASES
Inputs: transaction details
Application: fraud detection
Outputs: transaction approved/declined
USE CASES
Inputs: in-game activity
Application: game engagement
Outputs: change difficulty in the game
ML AS A SERVICE
CHALLENGES
▸ Time constraints: it has to run and respond fast enough
▸ Needs to be robust and "just work"
NOT ALL TIMES MATTER
▸ Time to respond
▸ Time to build and deploy
LATENCY FOR PREDICTIONS
WHAT IS FAST? (BABY DON’T HURT ME)
For what                 | Good enough
Recommendations          | <300 ms
Ads, click-through rates | <70 ms
Fraud detection          | <30 ms
MODEL.PREDICT(DATA)
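At its core the whole service wraps this one call. A minimal sketch of that core, where `DummyModel` is a hypothetical stand-in for a real trained model (e.g. a scikit-learn estimator):

```python
# Everything the service does ultimately reduces to model.predict(data).
class DummyModel:
    def predict(self, rows):
        # toy "model": score each row as the sum of its features
        return [sum(row) for row in rows]

def handle_request(model, payload):
    """The core of a prediction endpoint, once the model is in RAM."""
    return {"predictions": model.predict(payload["instances"])}

model = DummyModel()
print(handle_request(model, {"instances": [[1, 2], [3, 4]]}))
# {'predictions': [3, 7]}
```

Everything else in the talk is about getting that one call to happen fast and reliably.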
PREDICTION SERVER
CLIENT → HTTP (network IO) → SERVER (CPU, RAM) → DB / FILESYSTEM
PREDICTION API AS A SERVICE
BOTTLENECKS EVERYWHERE
▸ Distributed filesystems are slow and unreliable
▸ Deserializing a model and loading it into RAM is expensive
▸ model.predict(data) is a CPU-bound operation
▸ A model can be a heavy, 1 GB object
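Since deserialization is the expensive step, the usual first move is to keep models cached in-process after the first load. A minimal sketch, assuming models are stored as pickle files on a (possibly slow) filesystem:

```python
import functools
import pickle

# Pay the load + unpickle cost once per model, not once per request.
# maxsize bounds how many deserialized models stay resident in RAM.
@functools.lru_cache(maxsize=8)
def load_model(path):
    with open(path, "rb") as f:
        return pickle.load(f)
```

The second request for the same path returns the already-deserialized object, which is exactly what makes the per-process memory problem on the next slide bite.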
PREDICTION WORKERS
PYTHON. PAIN.
▸ One and the same 1 GB model takes 16 GB across 16 processes
▸ Ten different models take 10 GB × 16 = 160 GB
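The multiplication happens because each worker unpickles its own copy. One common mitigation on Linux (a sketch, not necessarily what this service does): load the model in the parent before forking, so workers share the pages copy-on-write — the idea behind gunicorn's `--preload` flag. CPython refcount updates still dirty some pages, so the saving is partial.

```python
import multiprocessing as mp

# Loaded once in the parent; forked children inherit this memory.
MODEL = {"weights": list(range(1000))}  # stands in for the 1 GB model

def predict(x):
    # The child reads the parent's pages; nothing is re-loaded or copied
    # over IPC as long as the model stays effectively read-only.
    return sum(MODEL["weights"]) + x

if __name__ == "__main__":
    ctx = mp.get_context("fork")   # fork, not spawn: share pages COW
    with ctx.Pool(4) as pool:
        print(pool.map(predict, [0, 1]))  # [499500, 499501]
```

With `spawn` (the default on some platforms) each worker re-imports and re-loads everything, and the 16× blow-up comes right back.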
CUSTOMER RULE
▸ A limit of 10 models at once is not acceptable to some customers
POSSIBLE SOLUTIONS
▸ Horizontal scaling with a hash ring
▸ Benefits
▸ Custom routing (good UX)
▸ Gaps
▸ Growing complexity (operations effort)
▸ Utilization problem in the opposite use case: you can't scale one model out to N servers
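The routing idea above can be sketched with consistent hashing: each model id deterministically maps to one server, so a model is loaded on a single box instead of on every worker in the fleet. Server names here are illustrative.

```python
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, replicas=100):
        # virtual nodes smooth out the key distribution across servers
        self._ring = sorted(
            (self._hash(f"{node}:{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, model_id):
        # first virtual node clockwise from the key's hash position
        i = bisect.bisect(self._keys, self._hash(model_id)) % len(self._ring)
        return self._ring[i][1]

ring = HashRing(["server-a", "server-b", "server-c"])
print(ring.node_for("model-42"))  # same server every time for this id
```

The "gap" on this slide falls out of the same property: because a model id maps to exactly one node, a single hot model cannot fan out across N servers without an extra layer.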
POSSIBLE SOLUTIONS
▸ Flatten your models
▸ Simplify the serialization format (use something other than pickle)
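One way to read "flatten your models": instead of pickling a whole object graph, dump just the parameters in a fixed binary layout. A sketch for a hypothetical linear model (coefficients plus intercept); the exact format the service uses is not stated in the talk.

```python
import array
import struct

def save_linear(path, coef, intercept):
    # header: coefficient count, then intercept + coefficients as raw doubles
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(coef)))
        array.array("d", [intercept, *coef]).tofile(f)

def load_linear(path):
    with open(path, "rb") as f:
        (n,) = struct.unpack("<I", f.read(4))
        buf = array.array("d")
        buf.fromfile(f, n + 1)   # intercept followed by n coefficients
    return list(buf[1:]), buf[0]
```

Loading raw doubles avoids running arbitrary constructors and is far cheaper (and safer) than unpickling, at the cost of writing a loader per model family.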
ZERO DOWNTIME DEPLOYMENT AND SCALING
▸ Blue-green deployment strategy with cache warming
▸ Pre-baked AMIs and autoscaling (for fast scale-up)
▸ K8s is not cost-effective
MONITORING PREDICTIONS
▸ Protect against model deterioration
▸ Alert on deterioration or functional violations (500s, high latency)
▸ Retire and replace an important model when it degrades
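The functional-violation side of this can be sketched as a sliding window over request outcomes that fires when the error rate or the p95 latency breaks the SLO. Thresholds below are illustrative, not the talk's numbers.

```python
from collections import deque

class PredictionMonitor:
    def __init__(self, window=1000, max_error_rate=0.01, max_p95_ms=300):
        self._samples = deque(maxlen=window)   # (latency_ms, ok) pairs
        self._max_error_rate = max_error_rate
        self._max_p95_ms = max_p95_ms

    def record(self, latency_ms, ok):
        self._samples.append((latency_ms, ok))

    def alerts(self):
        n = len(self._samples)
        if n == 0:
            return []
        errors = sum(1 for _, ok in self._samples if not ok)
        p95 = sorted(l for l, _ in self._samples)[int(0.95 * (n - 1))]
        fired = []
        if errors / n > self._max_error_rate:
            fired.append("error-rate")      # e.g. too many 500s
        if p95 > self._max_p95_ms:
            fired.append("p95-latency")     # responses too slow
        return fired
```

Model deterioration itself (predictions drifting from reality) needs ground-truth labels and is a separate, slower feedback loop; this window only catches the operational symptoms.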