Safely releasing machine learning based services into production presents a host of challenges that even the most experienced SRE may not expect. Severe incidents with stable infrastructure, invisible errors rates, IMPROVING response times, but the business failing catastrophically losing millions of dollars? Absolutely!
As an operator of production systems now being increasingly asked to release and manage machine learning based systems, Welcome to ML in production, where everything you know about running, deploying, and monitoring systems is harder and riskier.
We’ll outline some severe outages seen in the wild, their causes, and detail how emergent cutting edge techniques from the DevOps and SRE world around “testing in prod”, progressive delivery, and deterministic simulation are the PERFECT solution for increasing safety, resilience, and confidence for SREs operating and managing ML based services at scale.