Reducing Pager Fatigue with a Serverless ML Bot

Being woken at 3 am by the pager is never fun, but seeing an incident resolve before you’ve even left the bed is maddening. The next day you sleepily tune the alert for a better night’s sleep, yet more untuned alerts sing to you in your sleep. After a few rounds of alert-tuning whack-a-mole you wonder: could I predict whether an incident will resolve itself?

This is the story of how a weary engineer used a Cloud ML model with Cloud Functions to reduce pager noise. Recounting some of the challenges faced, we’ll explore training a model with a limited data set and continual training in a serverless environment. We’ll also explore the implications of using a bot as a first responder to a pager.


Mike Fowler

October 01, 2019


  The Problem: Many PagerDuty incidents resolve before I respond, disrupting my sleep needlessly
    2 timestamps, 6 numeric and 15 text features; 20678 samples
    1 timestamp, 1 numeric and 3 text features; 19354 samples
    • Worked examples, from simple number manipulations to complex processes such as principal component analysis (PCA)
    • Lots of Python code and decent explanations
    • Primarily scikit-learn
    • Decent bibliography per chapter
    2185 positive class, 2185 negative class; 47 features; 4370 samples
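The slides do not say how the balanced set of 4370 samples was produced. One common way to get equal class sizes is random undersampling of the majority class, sketched below. Everything here is an assumption for illustration: the function name `undersample`, the choice of label 1 for the smaller class, and the synthetic data, which assumes one class had 2185 rows and the other had the remaining 17169 of the 19354 cleaned samples.

```python
import random


def undersample(samples, labels, seed=0):
    """Randomly drop majority-class rows until both classes are the same size."""
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    n = min(len(positives), len(negatives))
    # Keep n rows from each class, then shuffle so the classes are interleaved.
    keep = rng.sample(positives, n) + rng.sample(negatives, n)
    rng.shuffle(keep)
    return [samples[i] for i in keep], [labels[i] for i in keep]


# Synthetic stand-in (assumed split): 2185 rows of one class, 17169 of the other.
labels = [1] * 2185 + [0] * (19354 - 2185)
samples = list(range(len(labels)))  # placeholder feature rows (row indices)
balanced_x, balanced_y = undersample(samples, labels)
print(len(balanced_y), sum(balanced_y))  # 4370 2185
```

Undersampling throws data away, which matters with a data set this small; it is shown only because it reproduces the counts on the slide.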
    • Hosted Jupyter notebooks
    • Distributable training with automatic resource provisioning
    • Supports CPUs, GPUs and TPUs
    • Run across many nodes and multiple experiments
    • Automated hyperparameter tuning with HyperTune
    • Exportable models
    • Model hosting for online prediction
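    from sklearn.ensemble import RandomForestClassifier
    # sklearn.externals.joblib was removed in scikit-learn 0.23; use `import joblib` there
    from sklearn.externals import joblib

    model = RandomForestClassifier(n_estimators=n)
    ...
    # Make calls to predict return class probabilities instead of labels
    model.predict = model.predict_proba
    joblib.dump(model, 'model.joblib')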
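With `predict` swapped for `predict_proba`, each prediction comes back as a pair of class probabilities rather than a label. The sketch below shows one way a bot might turn such a pair into a yes/no prediction plus a confidence value, and then decide whether to act on it. The names `interpret` and `bot_may_respond`, the meaning of class 1 ("the incident will resolve itself") and the 0.75 cut-off are all assumptions, not taken from the talk.

```python
def interpret(probabilities):
    """Turn one [P(class 0), P(class 1)] row into (prediction, confidence)."""
    p_negative, p_positive = probabilities
    prediction = p_positive >= 0.5
    # Confidence is the probability of whichever class was predicted.
    confidence = max(p_negative, p_positive)
    return prediction, confidence


def bot_may_respond(prediction, confidence, min_confidence=0.75):
    """Let the bot act only on confident 'will self-resolve' predictions."""
    return prediction and confidence >= min_confidence


prediction, confidence = interpret([0.2322751, 0.7677249])
print(prediction, confidence, bot_may_respond(prediction, confidence))
# True 0.7677249 True
```

Keeping the confidence next to the prediction lets the cut-off be tuned later without retraining the model.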
    $ gsutil cp ./model.joblib gs://your-bucket/model.joblib
    NB: The directory containing the model must be 250MB or less
    $ gcloud ai-platform models create mrdata
    $ gcloud ai-platform versions create v1 \
        --model mrdata \
        --origin gs://your-bucket/model.joblib \
        --runtime-version=1.14 \
        --framework SCIKIT_LEARN \
        --python-version=3.5
    - Node.js 6, 8 & 10
    - Python 3.7.1
    - Go 1.11.6
    • Triggerable from:
      - HTTP
      - Cloud Storage
      - Pub/Sub
      - Cloud Scheduler
    import (
        "net/http"
    )

    func PagerDuty(w http.ResponseWriter, r *http.Request) {
        // awesome code
    }
    $ gcloud iam service-accounts create mrdata \
        --display-name "Mr Data's Service Account"
    $ gcloud beta projects add-iam-policy-binding myproject \
        --member serviceAccount:mrdata@myproject.iam.gserviceaccount.com \
        --role roles/ml.developer
    $ gcloud beta functions deploy mrdata \
        --entry-point PagerDuty \
        --runtime go111 \
        --trigger-http \
        --service-account mrdata@myproject.iam.gserviceaccount.com
    • Publish/Subscribe messaging service
    • At-least-once delivery
    • Seek & Replay
      - A subscription only sees messages published after it was created
    $ gcloud pubsub topics create pd-notify
    $ gcloud pubsub subscriptions create --topic pd-notify pd-notify-model
    $ gcloud pubsub subscriptions create --topic pd-notify pd-notify-firestore
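Two subscriptions on one topic mean each consumer gets its own copy of every notification, but only of messages published after that subscription exists. The toy model below illustrates both points in memory. It is not the Pub/Sub client library: the `Topic` class and the message strings are hypothetical, and only the subscription names come from the commands above.

```python
class Topic:
    """Toy in-memory topic: every subscription gets its own copy of each message."""

    def __init__(self):
        self.subscriptions = {}

    def subscribe(self, name):
        # A new subscription starts empty; it has no view of earlier messages.
        self.subscriptions[name] = []
        return self.subscriptions[name]

    def publish(self, message):
        for queue in self.subscriptions.values():
            queue.append(message)


topic = Topic()
model_queue = topic.subscribe("pd-notify-model")
topic.publish("trigger PRX7NJU")
# Created late on purpose: this subscription misses the first message.
firestore_queue = topic.subscribe("pd-notify-firestore")
topic.publish("resolve PRX7NJU")

print(model_queue)      # ['trigger PRX7NJU', 'resolve PRX7NJU']
print(firestore_queue)  # ['resolve PRX7NJU']
```

This is why it pays to create every subscription before the first notification is published, as the commands above do.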
    • Serverless NoSQL document database
    • ACID transactions
    • Automatic scaling & indexing
    • Multi-region replication
    • Client libraries provide live and offline synchronization
    type Prediction struct {
        Incident   string  `firestore:"incident"`
        Prediction bool    `firestore:"prediction"`
        Confidence float64 `firestore:"confidence"`
    }
    pred := Prediction{
        Incident:   "PRX7NJU",
        Prediction: true,
        Confidence: 0.7677249,
    }
    // The document ID is the incident ID, so each incident has exactly one document
    _, err := client.Collection("predictions").Doc("PRX7NJU").Set(ctx, pred)
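Because Pub/Sub delivery is at-least-once, the same notification can reach a subscriber more than once. Writing the prediction to a document whose ID is the incident ID makes a repeat harmless, since the second write overwrites the first with identical data. The sketch below mimics that with a dictionary. `PredictionStore` and `handle` are hypothetical stand-ins, not the Firestore client, and the message shape is assumed.

```python
class PredictionStore:
    """In-memory stand-in for a document collection keyed by incident ID."""

    def __init__(self):
        self.docs = {}

    def set(self, incident, prediction, confidence):
        # Overwrites any existing document with the same key.
        self.docs[incident] = {
            "incident": incident,
            "prediction": prediction,
            "confidence": confidence,
        }


def handle(store, message):
    """Process one delivered message; safe to call again on redelivery."""
    store.set(message["incident"], message["prediction"], message["confidence"])


store = PredictionStore()
message = {"incident": "PRX7NJU", "prediction": True, "confidence": 0.7677249}
handle(store, message)
handle(store, message)  # simulated redelivery of the same message
print(len(store.docs))  # 1
```

Appending to a list or generating a fresh document ID per message would record the redelivery as a second prediction instead.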
    Mike Fowler mlfowler gh-mlfowler @mlfowler_ www.mlfowler.com mike.fowler@claranet.uk
