Fast, Reliable, Yet Catastrophically Failing!?! Safely Avoiding Incidents When Putting Machine Learning Into Production

finid
June 28, 2019

Safely releasing machine learning based services into production presents a host of challenges that even the most experienced SRE may not expect. Severe incidents with stable infrastructure, invisible error rates, and IMPROVING response times, while the business fails catastrophically and loses millions of dollars? Absolutely!

If you operate production systems and are increasingly asked to release and manage machine learning based systems, welcome to ML in production, where everything you know about running, deploying, and monitoring systems is harder and riskier.

We’ll outline some severe outages seen in the wild and their causes, and detail how emerging cutting-edge techniques from the DevOps and SRE world, such as “testing in prod”, progressive delivery, and deterministic simulation, are the PERFECT solution for increasing safety, resilience, and confidence for SREs operating and managing ML based services at scale.

Transcript

  1. None
  2. Big Data & AI Conference, Dallas, Texas, June 27–29, 2019, www.BigDataAIconference.com
  3. SAFELY AVOIDING INCIDENTS WHEN PUTTING ML INTO PRODUCTION ‣ fast, reliable, catastrophically failing?
  4. WHO? ▸ Software Engineer ▸ Working on Data Science teams as the fool ▸ Exposed to “proper science” ▸ Put this model/data product into prod @rmn
  5. @rmn
  6. WHAT ARE WE TALKING ABOUT
  7. MODELS IN PRODUCTION ▸ SEVERE OUTAGES ▸ VOCABULARY FOR THINKING ABOUT OPERATION
  8. 2 THREATS TO AVAILABILITY ▸ YOU: SOFTWARE CHANGES ▸ THE ENVIRONMENT CHANGES
  9. TRADITIONAL CHARACTERISTICS OF AN INCIDENT ▸ IT’S SLOW ▸ IT’S DOWN ▸ IT’S INTERMITTENTLY AVAILABLE ▸ IT’S DOING SOMETHING WEIRD ▸ IT’S MAKING SOMETHING ELSE ACT WEIRD
  10. EASY TO REASON ABOUT ▸ NOT ALWAYS EASY TO DEBUG ▸ CAN BE INSTRUMENTED CONVENTIONALLY FOR REDUCING MTTD, MTTR
  11. WE LEARN ABOUT AND IDENTIFY WHAT IS HAPPENING WHILE IT IS HAPPENING
  12. CHARACTERISTICS OF ML INCIDENT ▸ IT’S FASTER ▸ IT’S AVAILABLE ▸ IT’S STABLE ▸ IT’S DOING SOMETHING WEIRD ▸ IT’S MAKING SOMETHING ELSE ACT WEIRD
  13. DIFFICULT TO REASON ABOUT ▸ DIFFICULT TO DEBUG ▸ NEEDS DIFFERENT APPROACH TO OBSERVABILITY & INSTRUMENTATION TO IMPROVE DETECTION AND REDUCE SIZE AND DURATION OF INCIDENTS ▸ MTTR MUCH HARDER
  14. WE CAN’T ALWAYS IDENTIFY ANYTHING IS HAPPENING WHILE IT IS HAPPENING ▸ ONCE DETECTED WE CAN ONLY IDENTIFY WHAT HAPPENED AFTER MITIGATION
  15. BEHAVIORAL OUTAGES ▸ DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA DATA REPLACES CODE
  16. DATA REPLACES CODE

  17. INCIDENT #1 ▸ IT’S STABLE ▸ PIPELINE JUNGLE ▸ STALE DATA WAS USED SO NOTHING CHANGED ▸ SERVING STALE, IRRELEVANT INFERENCES ▸ DIDN’T IMPROVE ANY KPI
  18. INCIDENT #2 ▸ IT’S FASTER ▸ TRAINED INCORRECTLY WITH UNSTABLE DATA ▸ DISTRIBUTION OF LABELS CHANGED ▸ MODEL IGNORED NEW INPUT AT INFERENCE TIME ▸ FASTER RESPONSE TIME, HOORAY
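One way to catch the label shift described in incident #2 is to compare the label distribution the model was trained on with the labels seen recently. What follows is a minimal sketch, not the speaker's method: the function names, the total variation distance metric, the 0.1 threshold, and the sample data are all assumptions chosen for illustration.

```python
from collections import Counter


def label_distribution(labels):
    """Return each label's share of the total as a dict of label -> fraction."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation_distance(reference, current):
    """Half the sum of absolute differences between two label distributions."""
    p = label_distribution(reference)
    q = label_distribution(current)
    all_labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(l, 0.0) - q.get(l, 0.0)) for l in all_labels)


def labels_have_shifted(reference, current, threshold=0.1):
    """Flag a shift when the distance exceeds a (hypothetical) threshold."""
    return total_variation_distance(reference, current) > threshold


# Illustrative data only: a 90/10 split at training time, 60/40 in recent traffic.
training_labels = ["no_click"] * 90 + ["click"] * 10
recent_labels = ["no_click"] * 60 + ["click"] * 40
print(labels_have_shifted(training_labels, recent_labels))
```

A check like this runs on data rather than on latency or error rates, which is why it can fire while every conventional dashboard still looks healthy.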
  19. INCIDENT #3 ▸ IT’S STABLE ▸ NO AUTOMATION OR REPRODUCIBLE BUILD PIPELINE ▸ PRODUCTION ARTIFACT BUILT ON SCIENTIST’S MACHINE ▸ WRONG ARTIFACT BUNDLED ▸ WRONG ASSIGNMENT IN MARKETPLACE ▸ BONUS INCIDENT: WHAT HAPPENED WHEN SCIENTIST LEFT COMPANY?
  20. INCIDENT #4 ▸ IT’S FASTER ▸ EXPERIMENTAL CODE PATH INCORRECTLY IMPLEMENTED ▸ EVERYONE RECEIVED DEFAULT/FALLBACK DATA ▸ DEFAULT RECOMMENDATIONS FOR EVERYONE, YAY!
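Incident #4 stayed invisible because serving a default is fast and never raises an error. A possible guard, sketched here under assumptions, is to record where each response came from and alert when the share of fallbacks gets too high. The `source` field, the `"fallback"` value, and the 20% limit are hypothetical.

```python
def fallback_rate(responses):
    """Fraction of responses that were served from the default/fallback path."""
    if not responses:
        return 0.0
    fallbacks = sum(1 for r in responses if r["source"] == "fallback")
    return fallbacks / len(responses)


def fallback_alert(responses, max_rate=0.2):
    """True when the fallback share exceeds a (hypothetical) acceptable limit."""
    return fallback_rate(responses) > max_rate


# Illustrative window of recent responses: everyone got the default.
recent = [{"user": f"u{i}", "source": "fallback"} for i in range(50)]
print(fallback_rate(recent), fallback_alert(recent))
```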
  21. INCIDENT #5 ▸ IT’S FASTER ▸ ENSEMBLE ▸ ONE “BAD” MODEL EXPECTING DATA OF SPECIFIC TYPE (FLOAT VS STRING) ▸ VERY HARD TO DEBUG ▸ SYSTEM SHOWED NO PROPERTIES OF INCORRECTNESS OR OUTAGE BASED ON SYSTEM PERFORMANCE METRICS ▸ THINGS WERE BAD!!!!!
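A float-versus-string mismatch like the one in incident #5 can be surfaced by validating feature types at the boundary of the ensemble and counting violations as a metric of their own. This is only a sketch: the schema, the feature names, and the `Counter` standing in for a metrics client are assumptions, not details from the talk.

```python
from collections import Counter

# Hypothetical schema: feature name -> required Python type.
EXPECTED_TYPES = {"price": float, "quantity": int, "category": str}

# Stand-in for a real metrics client; counts violations per feature.
type_violations = Counter()


def validate_features(features):
    """Return the names of features that are missing or have the wrong type."""
    bad = []
    for name, expected in EXPECTED_TYPES.items():
        value = features.get(name)
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            bad.append(name)
            type_violations[name] += 1
    return bad


# The price arrives as a string, the kind of input one model silently mishandles.
print(validate_features({"price": "19.99", "quantity": 2, "category": "books"}))
# prints ['price']
```

Exporting the violation counts next to latency and error rate gives the "system showed no properties of incorrectness" case at least one signal that moves.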
  22. SO NOW WHAT?
  23. MINDSET SHIFT: FROM CORRECTNESS TO SAFETY
  24. 3 CONCEPTS FROM PRODUCTION ENGINEERING AND SRE ▸ TEST IN PROD ▸ PROGRESSIVE DELIVERY ▸ ERROR BUDGETS
  25. TEST IN PROD DOESN’T MEAN RELEASE WITHOUT TESTING
  26. TESTING IN PROD MEANS EXTENDING THE SOFTWARE DEVELOPMENT LIFECYCLE BEYOND RELEASE
  27. TEST IN PROD ▸ Stop: Go read/watch anything by Charity Majors (@mipsytipsy) and be enlightened ▸ Single-handedly advanced this concept beyond a developer joke ▸ Attempting to clone production is foolish ▸ If you are small enough to clone, stay simple; if you are big enough, attempting to clone production is foolish and a waste of cycles ▸ “Real users, real traffic, real scale, real unpredictabilities”
  28. PROGRESSIVE
  29. “PROGRESSIVE DELIVERY IS CONTINUOUS DELIVERY WITH FINE-GRAINED CONTROL OVER THE BLAST RADIUS” James Governor, RedMonk (@monkchips)
  30. FEATURE FLAGS ▸ SEPARATE DEPLOY AND RELEASE ▸ TARGET SPECIFIC USERS FOR NEW “FEATURES” ▸ ABILITY TO TOGGLE EXPOSURE ON/OFF
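The three properties on the feature flag slide (deploy separately from release, target specific users, toggle exposure) can be sketched in a few lines. This is an illustration under assumptions, not a real flag service: the `FeatureFlag` class, its fields, and the hash-based bucketing are hypothetical choices.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class FeatureFlag:
    name: str
    enabled: bool = True          # global on/off toggle (kill switch)
    rollout_percent: int = 0      # share of all users exposed, 0 to 100
    allowed_users: set = field(default_factory=set)  # explicitly targeted users

    def is_on_for(self, user_id: str) -> bool:
        """Decide whether this user sees the new behavior."""
        if not self.enabled:
            return False
        if user_id in self.allowed_users:
            return True
        # Hash flag name and user id so each user lands in a stable bucket 0..99.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100 < self.rollout_percent


# The new model is deployed, but released only to one internal tester.
flag = FeatureFlag("new_ranking_model", allowed_users={"internal-tester"})
for user in ("internal-tester", "customer-42"):
    model = "new" if flag.is_on_for(user) else "current"
    print(user, "->", model)
```

Because the bucket is derived from a hash rather than a random draw, a user stays in the same group across requests, and setting `enabled = False` withdraws the release without a redeploy.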
  31. CANARY ▸ EXPOSE SOME % OF LIVE TRAFFIC TO A NEW SERVICE ▸ MONITOR KEY BUSINESS METRICS FOR THAT POPULATION ▸ A/B TEST OUTCOME OF NEW DEPLOYMENT ▸ WIDER RELEASE WHEN YOU ARE COMFORTABLE
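The canary steps above can be sketched as a comparison of one business metric between the canary population and everyone else. The `Cohort` type, the conversion metric, the minimum sample size, and the 5% tolerance are assumptions for illustration; a real rollout would likely use a proper statistical test rather than a fixed threshold.

```python
from dataclasses import dataclass


@dataclass
class Cohort:
    requests: int
    conversions: int

    @property
    def rate(self) -> float:
        """Conversion rate, the business metric watched in this sketch."""
        return self.conversions / self.requests if self.requests else 0.0


def canary_decision(control, canary, min_requests=1000, max_relative_drop=0.05):
    """Return 'hold', 'rollback', or 'promote' for the canary population."""
    # Not enough canary traffic (or no baseline) to judge yet.
    if canary.requests < min_requests or control.rate == 0:
        return "hold"
    relative_change = (canary.rate - control.rate) / control.rate
    if relative_change < -max_relative_drop:
        return "rollback"
    return "promote"


control = Cohort(requests=20000, conversions=1000)   # 5% conversion
canary = Cohort(requests=2000, conversions=60)       # 3% conversion
print(canary_decision(control, canary))  # prints rollback
```

The decision is driven by a business metric, not by latency or error rate, which matches the incidents above where the system metrics looked fine throughout.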
  32. ERROR BUDGETS
  33. YOUR JOB ISN’T TO OPERATE INFINITELY RELIABLE SOFTWARE ▸ GO ON…TELL YOUR BOSS
  34. YOU MIGHT HAVE SOME 9S TO PLAY WITH
  35. EXPERIMENT ▸ DELIBERATELY EXPLORE WEIRD BEHAVIOR ▸ TRY NEW THINGS INSIDE YOUR BUDGET ▸ ALLOW AN ACCIDENTAL “OVERAGE” OF SLA TO BE YOUR PLAYGROUND ▸ YOU HAVE HEADROOM TO TAKE RISKY CHANGES
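To make "some 9s to play with" concrete: an availability target of 99.9% over a 30-day window leaves roughly 43.2 minutes of allowed unavailability. The sketch below does that arithmetic; the function names, the 30-day window, and the example figures are assumptions, not numbers from the talk.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of allowed unavailability for an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo, downtime_minutes, window_days=30):
    """Budget left after the downtime already spent in this window."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


def can_run_experiment(slo, downtime_minutes, expected_cost_minutes, window_days=30):
    """True if a risky change is expected to fit inside the remaining budget."""
    return budget_remaining(slo, downtime_minutes, window_days) >= expected_cost_minutes


# 99.9% over 30 days, with 10 minutes already spent on an earlier incident.
print(round(error_budget_minutes(0.999), 1))              # prints 43.2
print(can_run_experiment(0.999, 10, expected_cost_minutes=15))
```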
  36. ACCEPT: SOLUTION TO COMPLEXITY IS NOT SIMPLICITY
  37. MINDSET SHIFT: FROM CORRECTNESS TO SAFETY