
Fast, Reliable, Yet Catastrophically Failing!?! Safely Avoiding Incidents When Putting Machine Learning Into Production

finid
June 28, 2019


Safely releasing machine learning based services into production presents a host of challenges that even the most experienced SRE may not expect. Severe incidents despite stable infrastructure, invisible error rates, IMPROVING response times, yet the business failing catastrophically and losing millions of dollars? Absolutely!

As an operator of production systems, you are increasingly asked to release and manage machine learning based systems as well. Welcome to ML in production, where everything you know about running, deploying, and monitoring systems is harder and riskier.

We’ll outline some severe outages seen in the wild and their causes, then detail how emerging cutting-edge techniques from the DevOps and SRE world around “testing in prod”, progressive delivery, and deterministic simulation are the PERFECT solution for increasing safety, resilience, and confidence for SREs operating and managing ML based services at scale.

Transcript

  1. Big Data & AI Conference Dallas, Texas June 27–29, 2019 www.BigDataAIconference.com
  2. WHO? ▸ Software Engineer ▸ Working on Data Science teams as the fool ▸ Exposed to “proper science” ▸ Put this model/data product into prod @rmn
  3. IT’S SLOW IT’S DOWN IT’S INTERMITTENTLY AVAILABLE IT’S DOING SOMETHING WEIRD IT’S MAKING SOMETHING ELSE ACT WEIRD @rmn TRADITIONAL CHARACTERISTICS OF AN INCIDENT
  4. EASY TO REASON ABOUT NOT ALWAYS EASY TO DEBUG CAN BE INSTRUMENTED CONVENTIONALLY FOR REDUCING MTTD, MTTR @rmn
  5. IT’S FASTER IT’S AVAILABLE IT’S STABLE IT’S DOING SOMETHING WEIRD IT’S MAKING SOMETHING ELSE ACT WEIRD @rmn CHARACTERISTICS OF AN ML INCIDENT
  6. DIFFICULT TO REASON ABOUT DIFFICULT TO DEBUG NEEDS DIFFERENT APPROACH TO OBSERVABILITY & INSTRUMENTATION TO IMPROVE DETECTION AND REDUCE SIZE AND DURATION OF INCIDENTS MTTR MUCH HARDER @rmn
  7. WE CAN’T ALWAYS IDENTIFY THAT ANYTHING IS HAPPENING WHILE IT IS HAPPENING ONCE DETECTED WE CAN ONLY IDENTIFY WHAT HAPPENED AFTER MITIGATION @rmn
  8. BEHAVIORAL OUTAGES DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA REPLACES CODE @rmn
  9. IT’S STABLE PIPELINE JUNGLE STALE DATA WAS USED SO NOTHING CHANGED SERVING STALE, IRRELEVANT INFERENCES DIDN’T IMPROVE ANY KPI INCIDENT #1 @rmn
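A staleness check would have surfaced incident #1 before it hurt any KPI. Here is a minimal sketch (the function name, metadata source, and 24-hour budget are illustrative assumptions, not from the talk): refuse to serve or promote a model whose newest training data exceeds an agreed staleness budget.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Illustrative staleness budget; pick one that matches your retraining cadence.
MAX_STALENESS = timedelta(hours=24)

def data_is_fresh(last_ingest_time: datetime, now: Optional[datetime] = None) -> bool:
    """Return True if the newest training data is within the staleness budget.

    `last_ingest_time` would come from your pipeline's metadata store
    (hypothetical here); gate deploys and page on a False result.
    """
    now = now or datetime.now(timezone.utc)
    return (now - last_ingest_time) <= MAX_STALENESS
```

Wiring this into the deploy pipeline turns "nothing changed" from an invisible failure into an explicit, alertable condition.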
  10. IT’S FASTER TRAINED INCORRECTLY WITH UNSTABLE DATA DISTRIBUTION OF LABELS CHANGED MODEL IGNORED NEW INPUT AT INFERENCE TIME FASTER RESPONSE TIME HOORAY INCIDENT #2 @rmn
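The shifted label distribution in incident #2 is detectable with a simple drift statistic compared against training data. One possible sketch (total variation distance; the threshold is an illustrative assumption, not from the talk):

```python
from collections import Counter
from typing import Iterable

def label_drift(train_labels: Iterable, live_labels: Iterable) -> float:
    """Total variation distance between two label distributions.

    0.0 means identical distributions, 1.0 means fully disjoint.
    """
    def dist(labels):
        counts = Counter(labels)
        total = sum(counts.values())
        return {k: v / total for k, v in counts.items()}

    p, q = dist(train_labels), dist(live_labels)
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Illustrative alert threshold; tune on historical windows.
DRIFT_THRESHOLD = 0.2
```

Computing this on a rolling window of live traffic versus the training set gives an alarm for exactly the failure mode here: the model silently ignoring inputs it was never trained to handle.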
  11. IT’S STABLE NO AUTOMATION OR REPRODUCIBLE BUILD PIPELINE PRODUCTION ARTIFACT BUILT ON SCIENTIST’S MACHINE WRONG ARTIFACT BUNDLED WRONG ASSIGNMENT IN MARKETPLACE BONUS INCIDENT: WHAT HAPPENED WHEN SCIENTIST LEFT COMPANY? @rmn INCIDENT #3
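A "wrong artifact bundled" incident like #3 becomes detectable if the build records a manifest and the deploy verifies it. A minimal sketch (the manifest shape and metadata fields are illustrative assumptions):

```python
import hashlib

def artifact_manifest(model_bytes: bytes, training_meta: dict) -> dict:
    """Record what was actually built, so deploys can verify the bundle.

    `training_meta` is whatever your pipeline can attest to (hypothetical
    example: git SHA, dataset snapshot ID, training timestamp).
    """
    return {
        "sha256": hashlib.sha256(model_bytes).hexdigest(),
        "meta": training_meta,
    }

def verify_artifact(model_bytes: bytes, manifest: dict) -> bool:
    """Refuse to release if the deployed bytes don't match the manifest."""
    return hashlib.sha256(model_bytes).hexdigest() == manifest["sha256"]
```

Because the manifest lives with the pipeline rather than on a scientist's laptop, it also answers the bonus incident: the build stays reproducible after the scientist leaves.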
  12. IT’S FASTER EXPERIMENTAL CODE PATH INCORRECTLY IMPLEMENTED EVERYONE RECEIVED DEFAULT/FALLBACK DATA DEFAULT RECOMMENDATIONS FOR EVERYONE YAY! @rmn INCIDENT #4
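Incident #4 is invisible to latency and availability metrics but trivially visible to a fallback-rate counter. A possible sketch (class name and 5% threshold are illustrative assumptions):

```python
class FallbackMonitor:
    """Alarm when too many responses are served from the default/fallback path.

    Serving defaults is individually fine; serving them to *everyone* is an
    outage that system metrics alone will never show.
    """

    def __init__(self, threshold: float = 0.05):
        self.threshold = threshold
        self.total = 0
        self.fallbacks = 0

    def record(self, used_fallback: bool) -> None:
        self.total += 1
        self.fallbacks += int(used_fallback)

    def alarm(self) -> bool:
        if self.total == 0:
            return False
        return self.fallbacks / self.total > self.threshold
```

The design point is instrumenting the *behavior* (which code path produced the answer), not just the request's success or speed.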
  13. IT’S FASTER ENSEMBLE ONE “BAD” MODEL EXPECTING DATA OF SPECIFIC TYPE (FLOAT VS STRING) VERY HARD TO DEBUG SYSTEM SHOWED NO PROPERTIES OF INCORRECTNESS OR OUTAGE BASED ON SYSTEM PERFORMANCE METRICS THINGS WERE BAD!!!!! @rmn INCIDENT #5
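The float-vs-string mismatch in incident #5 is exactly what a per-model input schema check catches at the ensemble boundary. A minimal sketch (the schema format and feature names are illustrative assumptions):

```python
from typing import List

def validate_features(features: dict, schema: dict) -> List[str]:
    """Return a list of schema violations instead of silently coercing inputs.

    `schema` maps feature name -> expected Python type, e.g. {"price": float}.
    An empty list means the input is safe to hand to the model.
    """
    errors = []
    for name, expected in schema.items():
        if name not in features:
            errors.append(f"missing feature: {name}")
        elif not isinstance(features[name], expected):
            errors.append(
                f"{name}: expected {expected.__name__}, "
                f"got {type(features[name]).__name__}"
            )
    return errors
```

Rejecting (or flagging) bad inputs per sub-model gives you a signal of incorrectness even when every system performance metric looks healthy.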
  14. TEST IN PROD PROGRESSIVE DELIVERY ERROR BUDGETS @rmn 3 CONCEPTS FROM PRODUCTION ENGINEERING AND SRE
  15. TEST IN PROD ▸ Stop: Go read/watch anything by Charity Majors (@mipsytipsy) and be enlightened ▸ Single-handedly advanced this concept beyond a developer joke ▸ If you are small enough to clone production, stay simple; if you are big enough, attempting to clone production is foolish and a waste of cycles ▸ “Real users, real traffic, real scale, real unpredictabilities” @rmn
  16. FEATURE FLAGS @rmn SEPARATE DEPLOY AND RELEASE TARGET SPECIFIC USERS FOR NEW “FEATURES” ABILITY TO TOGGLE EXPOSURE ON/OFF
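The separation of deploy from release can be sketched as a small flag gate. This is a toy illustration (the flag store, flag name, and percentage are assumptions; real systems use a flag service such as LaunchDarkly or an in-house equivalent):

```python
import hashlib

# Hypothetical in-memory flag store; in practice this comes from a flag service.
FLAGS = {
    "new_ranker": {"enabled": True, "percent": 10},
}

def flag_on(flag: str, user_id: str) -> bool:
    """Decide whether this user sees the new code path.

    Hashing (flag, user) gives a stable bucket, so each user consistently
    sees the same variant, and `enabled` is the instant off switch.
    """
    cfg = FLAGS.get(flag)
    if not cfg or not cfg["enabled"]:
        return False
    bucket = int(hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest(), 16) % 100
    return bucket < cfg["percent"]
```

Because the deployed binary contains both paths, "release" becomes a config change you can revert in seconds rather than a rollback.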
  17. CANARY @rmn EXPOSE SOME % OF LIVE TRAFFIC TO A NEW SERVICE MONITOR KEY BUSINESS METRICS FOR THAT POPULATION A/B TEST OUTCOME OF NEW DEPLOYMENT WIDER RELEASE WHEN YOU ARE COMFORTABLE
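The "monitor key business metrics for that population" step can be sketched as comparing the metric between stable and canary cohorts. A simplified illustration (class name, mean comparison, and 5% tolerance are assumptions; production canary analysis usually applies proper statistical tests):

```python
import statistics

class CanaryAnalysis:
    """Compare a key business metric between stable and canary populations."""

    def __init__(self):
        self.metrics = {"stable": [], "canary": []}

    def record(self, population: str, value: float) -> None:
        """Record one observation (e.g. conversion, click-through) per request."""
        self.metrics[population].append(value)

    def regression(self, tolerance: float = 0.05) -> bool:
        """True if the canary's mean metric is worse than stable by > tolerance."""
        stable = statistics.mean(self.metrics["stable"])
        canary = statistics.mean(self.metrics["canary"])
        return canary < stable * (1 - tolerance)
```

The key point for ML services: the metric compared here is a *business* outcome, because (as the incidents above show) latency and error rate can look perfect while the model is wrong.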
  18. EXPERIMENT @rmn DELIBERATELY EXPLORE WEIRD BEHAVIOR TRY NEW THINGS INSIDE YOUR BUDGET ALLOW AN ACCIDENTAL “OVERAGE” OF SLA TO BE YOUR PLAYGROUND YOU HAVE HEADROOM TO TAKE RISKY CHANGES
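"Inside your budget" has a concrete arithmetic behind it: the error budget is the failure allowance implied by the SLO, and the unspent fraction is your headroom for risky experiments. A minimal sketch (function name is an assumption; the formula is the standard SRE error-budget calculation):

```python
def error_budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent in the current window.

    slo: availability target, e.g. 0.999 allows 0.1% of requests to fail.
    Returns 0.0 when the budget is exhausted (stop risky changes).
    """
    budget = (1 - slo) * total_requests  # allowed failures this window
    if budget <= 0:
        return 0.0
    return max(0.0, (budget - failed_requests) / budget)

# Example: a 99.9% SLO over 1,000,000 requests allows 1,000 failures;
# 250 observed failures leaves roughly three quarters of the budget as
# playground for experiments and risky releases.
```

When the returned fraction approaches zero, the playground closes: ship only reliability work until the window rolls over.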