Reducing Pager Fatigue with a Serverless ML Bot

Being woken up at 3 am by the pager is never fun, but seeing an incident resolve before you’ve even left the bed is maddening. The next day you sleepily tune the alert for a better night’s sleep, yet more untuned alerts sing to you in your sleep. After a few rounds of alert-tuning whack-a-mole you wonder: could I predict whether an incident will resolve itself?

This is the story of how a weary engineer used a Cloud ML model with Cloud Functions to reduce pager noise. Recounting some of the challenges faced, we’ll explore training a model with a limited data set and continual training in a serverless environment. We’ll also explore the implications of using a bot as the first responder to a pager.

Mike Fowler

October 01, 2019

Transcript

  1. October 1st 2019 @mlfowler_ @Claranet Reducing Pager Fatigue using a Serverless ML Bot
    Mike Fowler - Senior Site Reliability Engineer - Public Cloud Practice
  2. London PostgreSQL Meetup January 24th 2019 About Me

  3. I Like to Think I Know Data
    Source: https://peakcare.wordpress.com/2011/10/05/heads-in-the-sand/ https://i.pinimg.com/originals/cb/32/5f/cb325f9c268bf2135125f512d95
  4. Scene: Our Engineer Rests Peacefully
    Source: https://peakcare.wordpress.com/2011/10/05/heads-in-the-sand/ https://i.pinimg.com/originals/cb/32/5f/cb325f9c268bf2135125f512d95
  5. Scene: Red Alert!
    Source: https://peakcare.wordpress.com/2011/10/05/heads-in-the-sand/ https://vignette.wikia.nocookie.net/memoryalpha/images/6/6b/RedAlert.jpg/revision/latest?cb=20100117050244&path-prefix=en
  6. Scene: Peace
    Source: https://peakcare.wordpress.com/2011/10/05/heads-in-the-sand/ https://www.lakelouiseinn.com/wp-content/uploads/2019/01/LakeLouise2-1.jpg

  7. Identify a Problem to Solve
  8. The Problem
    Many PagerDuty incidents resolve before I respond, disrupting my sleep needlessly
  9. First Approach: Tuning
    https://i.ytimg.com/vi/5ZfaNpKvg5w/maxresdefault.jpg
  10. Second Approach: Machine Learning
    https://irishtechnews.ie/wp-content/uploads/2017/07/chi-inc-robots-doing-more-office-work-bsi-hub-20150617.jpg

  11. The Shape of Data
    23 features: 2 timestamps, 6 numeric, 15 text
    20678 samples
  12. The Shape of Usable Data
    5 features: 1 timestamp, 1 numeric, 3 text
    19354 samples
  13. Losing Hope
    https://io9.gizmodo.com/the-exact-moment-when-battlestar-galactica-won-our-hear-1670313315
  14. Feature Engineering
    • Worked examples starting from simple number manipulations to complex processes such as principal component analysis (PCA)
    • Lots of Python code and decent explanations
    • Primarily scikit-learn
    • Decent bibliography per chapter
  15. The Shape of Usable Data
    47 features
    4370 samples: 2185 positive class, 2185 negative class
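
The slides do not say how the 19354 usable samples became a balanced set of 4370. One plausible route, assumed here rather than confirmed by the talk, is random undersampling of the majority class. A minimal standard-library sketch, in which the `undersample` helper and the toy labels are hypothetical:

```python
import random
from collections import defaultdict


def undersample(samples, labels, seed=0):
    """Randomly drop samples so every class is as small as the rarest one."""
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    target = min(len(group) for group in by_class.values())
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible

    balanced = []
    for label, group in by_class.items():
        for sample in rng.sample(group, target):
            balanced.append((sample, label))
    rng.shuffle(balanced)  # avoid leaving the classes in contiguous blocks
    return balanced


# Toy data: 10 incidents, only 3 of which resolved themselves (label True).
samples = list(range(10))
labels = [True] * 3 + [False] * 7
balanced = undersample(samples, labels)
```

Undersampling discards data, which is affordable only when the minority class is still large enough to learn from; with 2185 examples per class that may well have been the trade-off made here.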
  16. Choosing a Model: Random Forest
    https://miro.medium.com/max/2612/0*f_qQPFpdofWGLQqc.png
  17. Validating the Model: Cross Validation
    https://towardsdatascience.com/cross-validation-explained-evaluating-estimator-performance-e51e5430ff85
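
Cross validation scores a model on several train/test splits instead of a single one. The sketch below shows only the splitting step, using the standard library. The fold count of 5 and the helper name `k_fold_indices` are assumptions, not details from the talk; in practice scikit-learn's `cross_val_score` wraps the whole loop.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# 4370 is the balanced sample count from the slides; k=5 is an assumption.
folds = list(k_fold_indices(4370, k=5))
```

The indices are split in order, so the data should be shuffled beforehand if it is sorted by class or by time.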
  18. The Model
    Accuracy of 79%
  19. The Model is only a Keystone
    https://miro.medium.com/max/3036/1*SQg9Buf5w-rR2T8vCIVy3g.jpeg
  20. Going Serverless
    https://datacenterfrontier.com/wp-content/uploads/2017/11/equinix-dc12-data-hall.jpg

  21. A Basic Serverless Architecture
  22. Introducing Mr Data
  23. Introducing Mr Data
  24. Introducing Mr Data
  25. Introducing Mr Data
  26. AI Platform
    • Hosted Jupyter notebooks
    • Distributable training with automatic resource provisioning
    • Supports CPUs, GPUs and TPUs
    • Run across many nodes and multiple experiments
    • Automated hyperparameter tuning with HyperTune
    • Exportable models
    • Model hosting for online prediction
  27. Exporting a scikit-learn Model
    from sklearn.ensemble import RandomForestClassifier
    # scikit-learn 0.23 and later dropped sklearn.externals; use `import joblib` there
    from sklearn.externals import joblib

    model = RandomForestClassifier(n_estimators=n)
    ...
    # Route predict to predict_proba so the saved model returns class probabilities
    model.predict = model.predict_proba
    joblib.dump(model, 'model.joblib')
  28. Deploying a Model: Upload
    $ gsutil cp ./model.joblib gs://your-bucket/model.joblib
    NB: The directory containing the model must be 250MB or less
  29. Deploying a Model: Create a Model
    $ gcloud ai-platform models create mrdata
  30. Deploying a Model: Create a Version
    $ gcloud ai-platform versions create v1 \
        --model mrdata \
        --origin gs://your-bucket/ \
        --runtime-version=1.14 \
        --framework SCIKIT_LEARN \
        --python-version=3.5
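
Once a version is deployed, online prediction takes a JSON body with an `instances` list and returns a `predictions` list. Because the exported model's `predict` was replaced with `predict_proba`, each prediction should be a list of class probabilities. The sketch below only builds and interprets those bodies and makes no network call. The class order (negative first, positive second), the 0.5 threshold, the feature values and both helper names are assumptions, not details from the talk.

```python
import json


def build_request(feature_rows):
    """Serialize feature vectors into an online prediction request body."""
    return json.dumps({"instances": feature_rows})


def interpret(response_body, threshold=0.5):
    """Turn class probabilities into (will_self_resolve, confidence) pairs."""
    results = []
    for probabilities in json.loads(response_body)["predictions"]:
        # Assumed order: [P(needs a human), P(resolves itself)].
        positive = probabilities[1]
        prediction = positive >= threshold
        confidence = positive if prediction else 1.0 - positive
        results.append((prediction, confidence))
    return results


body = build_request([[0.0, 1.0, 3.0]])
# Hand-written stand-in for a service response, not real output.
fake_response = '{"predictions": [[0.2322751, 0.7677249]]}'
decisions = interpret(fake_response)
```

Reporting the probability of the chosen class as the confidence keeps the value between 0.5 and 1.0, which makes it easier to choose a cut-off below which a human is still paged.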
  31. Cloud Functions
    • Function-as-a-Service supporting:
      - Node.js 6, 8 & 10
      - Python 3.7.1
      - Go 1.11.6
    • Triggerable from:
      - HTTP
      - Cloud Storage
      - Pub/Sub
      - Cloud Scheduler
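  32. A Go Cloud Function
    // The package name is illustrative; the slide omits the package clause
    package mrdata

    import (
        "net/http"
    )

    // PagerDuty is the HTTP entry point named at deployment
    func PagerDuty(w http.ResponseWriter, r *http.Request) {
        //awesome code
    }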
  33. Function Deployment Preliminaries
    $ gcloud iam service-accounts create mrdata \
        --display-name "Mr Data's Service Account"
    $ gcloud beta projects add-iam-policy-binding myproject \
        --member serviceAccount:mrdata@myproject.iam.gserviceaccount.com \
        --role roles/ml.developer
  34. Function Deployment
    $ gcloud beta functions deploy mrdata \
        --entry-point PagerDuty \
        --runtime go111 \
        --trigger-http \
        --service-account mrdata@myproject.iam.gserviceaccount.com
  35. Needs Improvement
    http://www.bellabeachproperties.com/wp-content/uploads/2014/04/house-falling-apart-1.jpg
  36. Needs Improvement
  37. Improved Serverless Architecture

  38. Cloud Pub/Sub
    • Publish/Subscribe messaging service
    • At-least-once delivery
    • Seek & Replay
      - A subscription only sees messages published after it was created
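
At-least-once delivery means a subscriber can receive the same message more than once, so handlers are usually written to be idempotent. A minimal in-memory sketch of that idea follows. The `InferenceStore` class and its keying by incident ID are hypothetical stand-ins, not the talk's code.

```python
class InferenceStore:
    """Keeps one record per incident, so redelivered messages are harmless."""

    def __init__(self):
        self._records = {}

    def record(self, incident_id, prediction, confidence):
        """Store the inference; return False if this incident was already seen."""
        if incident_id in self._records:
            return False  # duplicate delivery: acknowledge and do nothing
        self._records[incident_id] = (prediction, confidence)
        return True

    def __len__(self):
        return len(self._records)


store = InferenceStore()
first = store.record("PRX7NJU", True, 0.7677249)
second = store.record("PRX7NJU", True, 0.7677249)  # simulated redelivery
```

Writing to a document keyed by the incident ID gives a document store the same property, since a repeated write overwrites the record rather than duplicating it.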
  39. Topic & Subscription Creation
    $ gcloud pubsub topics create pd-notify
    $ gcloud pubsub subscriptions create --topic pd-notify pd-notify-model
    $ gcloud pubsub subscriptions create --topic pd-notify pd-notify-firestore
  40. Cloud Firestore
    • Serverless NoSQL document database
    • ACID transactions
    • Automatic scaling & indexing
    • Multi-region replication
    • Client libraries provide live and offline synchronization
  41. A Simple struct for Recording Inferences
    type Prediction struct {
        Incident   string  `firestore:"incident"`
        Prediction bool    `firestore:"prediction"`
        Confidence float64 `firestore:"confidence"`
    }
  42. Adding a Document to a Collection
    pred := Prediction{
        Incident:   "PRX7NJU",
        Prediction: true,
        Confidence: 0.7677249,
    }
    // Key the document by incident ID; a trailing dot lets the call chain span lines
    _, err := client.Collection("predictions").
        Doc("PRX7NJU").
        Set(ctx, pred)
  43. Reporting
  44. Towards Continual Training & Deployment
  45. Diagnostics
  46. What Next?
    https://nerdist.com/article/star-trek-picard-data-where-he-is-now/
  47. Fin

  48. Questions?
    Mike Fowler
    mlfowler gh-mlfowler @mlfowler_ www.mlfowler.com mike.fowler@claranet.uk