Slide 1

Slide 1 text

Understanding and Extending Prometheus AlertManager Lee Calcote calcotestudios.com/talks

Slide 2

Slide 2 text

Lee Calcote linkedin.com/in/leecalcote @lcalcote blog.gingergeek.com [email protected] clouds, containers, infrastructure, applications and their management calcotestudios.com/talks

Slide 3

Slide 3 text

Show of Hands

Slide 4

Slide 4 text

AlertManager Prometheus

Slide 5

Slide 5 text

Purpose

Alertmanager is an alert...
- ingester
- grouper
- de-duplicator
- silencer
- throttler
- notifier

Slide 6

Slide 6 text

\ˈnō-mən-ˌklā-chər\ (nomenclature): a brief Prometheus AlertManager construct review
- Routes: match alerts to their receiver and how often to notify.
- Receivers: where and how to send alerts.

Slide 7

Slide 7 text

\ˈnō-mən-ˌklā-chər\ (nomenclature): a brief Prometheus AlertManager construct review
- Silencers (muting): match alerts with specific labels and prevent them from being included in notifications.
- Inhibitors (suppressing): suppress specific notifications when other specific alerts are already firing.
- Grouping (correlating): categorizes alerts of similar nature into a single notification.

Grouping settings:

    group_by: ['alertname', 'cluster']
    group_wait: 30s
    group_interval: 5m
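To make the grouping idea concrete, here is a minimal Go sketch of bucketing alerts by the labels named in group_by. The `Alert` type and the `groupKey` and `groupAlerts` helpers are hypothetical names for illustration only; this is not AlertManager's dispatcher code, and it ignores the timing settings (group_wait, group_interval).

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Alert is a hypothetical, simplified alert: just a set of labels.
type Alert struct {
	Labels map[string]string
}

// groupKey builds a key from the values of the group_by labels.
func groupKey(a Alert, groupBy []string) string {
	parts := make([]string, 0, len(groupBy))
	for _, name := range groupBy {
		parts = append(parts, name+"="+a.Labels[name])
	}
	return strings.Join(parts, ",")
}

// groupAlerts buckets alerts that share the same group_by label values.
func groupAlerts(alerts []Alert, groupBy []string) map[string][]Alert {
	groups := make(map[string][]Alert)
	for _, a := range alerts {
		k := groupKey(a, groupBy)
		groups[k] = append(groups[k], a)
	}
	return groups
}

func main() {
	alerts := []Alert{
		{Labels: map[string]string{"alertname": "instance_down", "cluster": "a", "instance": "host1"}},
		{Labels: map[string]string{"alertname": "instance_down", "cluster": "a", "instance": "host2"}},
		{Labels: map[string]string{"alertname": "instance_down", "cluster": "b", "instance": "host3"}},
	}
	groups := groupAlerts(alerts, []string{"alertname", "cluster"})

	// Sort keys so the output order is stable.
	keys := make([]string, 0, len(groups))
	for k := range groups {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s: %d alert(s)\n", k, len(groups[k]))
	}
}
```

Each bucket would correspond to one notification, which is why two hosts down in the same cluster produce a single message rather than two.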

Slide 8

Slide 8 text

Multiple approaches to suppression
- Inhibition: global
- Silences: via UI / API
- repeat_interval: per route
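As an illustration of inhibition, the following Go sketch checks whether a target alert should be suppressed because a matching source alert is already firing. The field names mirror the `source_match`, `target_match` and `equal` keys of AlertManager's inhibit_rules configuration, but the types and the `Inhibits` method are hypothetical simplifications, not AlertManager's implementation.

```go
package main

import "fmt"

// Labels is a simplified label set.
type Labels map[string]string

// InhibitRule is a hypothetical, simplified inhibition rule.
type InhibitRule struct {
	SourceMatch Labels   // labels a firing alert must carry to act as a source
	TargetMatch Labels   // labels an alert must carry to be a suppression candidate
	Equal       []string // labels whose values must agree between source and target
}

// matches reports whether l contains every key/value pair in want.
func matches(l, want Labels) bool {
	for k, v := range want {
		if l[k] != v {
			return false
		}
	}
	return true
}

// Inhibits reports whether target is suppressed by any currently firing alert.
func (r InhibitRule) Inhibits(target Labels, firing []Labels) bool {
	if !matches(target, r.TargetMatch) {
		return false
	}
	for _, src := range firing {
		if !matches(src, r.SourceMatch) {
			continue
		}
		same := true
		for _, name := range r.Equal {
			if src[name] != target[name] {
				same = false
				break
			}
		}
		if same {
			return true
		}
	}
	return false
}

func main() {
	rule := InhibitRule{
		SourceMatch: Labels{"severity": "critical"},
		TargetMatch: Labels{"severity": "warning"},
		Equal:       []string{"cluster"},
	}
	firing := []Labels{{"alertname": "cluster_down", "severity": "critical", "cluster": "a"}}

	sameCluster := Labels{"alertname": "instance_down", "severity": "warning", "cluster": "a"}
	otherCluster := Labels{"alertname": "instance_down", "severity": "warning", "cluster": "b"}

	fmt.Println("cluster a warning inhibited:", rule.Inhibits(sameCluster, firing))
	fmt.Println("cluster b warning inhibited:", rule.Inhibits(otherCluster, firing))
}
```

The `Equal` list is what keeps a critical alert in one cluster from muting warnings in an unrelated cluster.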

Slide 9

Slide 9 text

Alerts: a shared construct

    ALERT IF FOR LABELS { ... } ANNOTATIONS { ... }

Prometheus
- Alert states: inactive, pending, firing

AlertManager
- Supports clients other than Prometheus
- Is notified when alerts transition state
- Alert states: inactive, firing
- State transitions lead to notifications

Slide 10

Slide 10 text

Notification Integrations

Slide 11

Slide 11 text

Notifying to Multiple Destinations

Use a receiver that goes to both destinations:

    route:
      receiver: email_webhook

    receivers:
    - name: email_webhook
      email_configs:
      - to: '[email protected]'
      webhook_configs:
      - url:

or use continue to advance to the next receiver:

    route:
      receiver: ops-team-all # default
      routes:
      - match:
          severity: page
        receiver: ops-team-b
        continue: true
      - match:
          severity: critical
        receiver: ops-team-a

    receivers:
    - name: ops-team-all
      email_configs:
      - to: [email protected]
    - name: ops-team-a
      email_configs:
      - to: [email protected]
    - name: ops-team-b
      email_configs:
      - to: [email protected]
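A minimal Go sketch of how continue affects sibling-route matching follows. The `Route` type and `Resolve` function are hypothetical simplifications of a routing tree, not AlertManager's code. The demo tree reuses the receiver names from the slide, but the second route matches on an assumed `cluster` label so that a single alert can satisfy both siblings.

```go
package main

import "fmt"

// Route is a hypothetical, simplified routing-tree node.
type Route struct {
	Match    map[string]string
	Receiver string
	Continue bool
	Routes   []Route
}

// matches reports whether labels satisfy every matcher on the route.
func (r Route) matches(labels map[string]string) bool {
	for k, v := range r.Match {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// Resolve returns the receivers selected for labels. Children are tried in
// order; matching stops at the first matching child unless it sets Continue.
// If no child matches, the node's own receiver is used.
func Resolve(root Route, labels map[string]string) []string {
	var out []string
	for _, child := range root.Routes {
		if !child.matches(labels) {
			continue
		}
		out = append(out, Resolve(child, labels)...)
		if !child.Continue {
			break
		}
	}
	if len(out) == 0 {
		out = []string{root.Receiver}
	}
	return out
}

// demoTree builds the example tree; the "cluster" matcher is an assumption.
func demoTree() Route {
	return Route{
		Receiver: "ops-team-all",
		Routes: []Route{
			{Match: map[string]string{"severity": "page"}, Receiver: "ops-team-b", Continue: true},
			{Match: map[string]string{"cluster": "prod"}, Receiver: "ops-team-a"},
		},
	}
}

func main() {
	tree := demoTree()
	fmt.Println(Resolve(tree, map[string]string{"severity": "page", "cluster": "prod"}))
	fmt.Println(Resolve(tree, map[string]string{"severity": "page", "cluster": "dev"}))
	fmt.Println(Resolve(tree, map[string]string{"severity": "warning", "cluster": "dev"}))
}
```

Note that continue only has a visible effect when a later sibling can also match the same alert; two siblings that match different values of the same label can never both be selected.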

Slide 12

Slide 12 text

Non-HA AlertManager Architecture

[Architecture diagram. Components shown: API, UI, Alert Provider, Silence Provider, store, Dispatcher, Router, Inhibitor, Silencer, de-duplication, notification pipeline, Notify Provider, Retry, Maintenance Script. Flows labelled: alerts, subscribe, batched alerts.]

- Dispatcher: sorts incoming alerts into aggregation groups and assigns the correct notifiers to each.
- Notify Provider: checks for previously sent notifications.

Slide 13

Slide 13 text

High Availability
- Being introduced in 0.5.
- Uses gossip protocols; built atop Weave Mesh.
- With HA, you no longer have to monitor the monitor.
- Designed for an alert to be sent to all instances in the cluster.
- All Prometheus instances send alerts to all Alertmanager instances.
- Guarantees notifications to be sent at least once.

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

AlertManager UI

Slide 16

Slide 16 text

@lcalcote

Slide 17

Slide 17 text

Prologue: Alert troubleshooting is improved when operators can see not only what is firing, what has recently fired, and what is normal, but can also go back in time and see what fired an hour ago. Understanding firing order assists in root cause analysis and in identifying problem areas.

Story: As an Operator, I would like to see not only a list of firing alerts, but also a list of all transpired alerts, so that I may have additional context as to the thresholding behavior for a given defined alert.

Limitations:
1. AlertManager database (SQLite) is not intended to provide long-term storage.

Acceptance Criteria:
1. Once fired, whether actively firing or not, alerts will be displayed on the History page.
2. Optionally, fired alerts will be notified to a Slack channel.

Stretch:
- Include pagination
- Add a date range picker
- Add a host filter

Slide 18

Slide 18 text

Environment test setup

Slide 19

Slide 19 text

Random Sample Targets

Fetch and compile the client library code example.

    $ git clone https://github.com/prometheus/client_golang.git
    $ cd client_golang/examples/random
    $ go get -d
    $ go build

Start example targets in separate terminals.

    $ ./random -listen-address=:8080
    $ ./random -listen-address=:8081
    $ ./random -listen-address=:8082

Be sure to create and run the random sample targets, and point Prometheus at your soon-to-be AlertManager.

Slide 20

Slide 20 text

Prometheus and Alert Rules Setup

Follow the getting started instructions to download, configure and run Prometheus.

    $ ./prometheus -config.file=prometheus.yml -alertmanager.url=http://localhost:9093

/prometheus.yml

    ...
    # Load and evaluate rules in this file every 'evaluation_interval' seconds.
    rule_files:
      - "alert.rules"
    ...

/alert.rules

    ALERT instance_down
      IF up == 0
      FOR 5s
      LABELS {severity="page"}
      ANNOTATIONS {
        DESCRIPTION="{{$labels.instance}} of job {{$labels.job}} has been down for more than 5 seconds.",
        SUMMARY="Instance {{$labels.instance}} down"}

A simple alert rule that will fire when any given target is unreachable for longer than 5 seconds.

Slide 21

Slide 21 text

Environment development setup

Slide 22

Slide 22 text

Grab Repos

Clone AlertManager repo

    $ git clone https://github.com/prometheus/alertmanager.git

Given that our user story includes making front-end changes to AlertManager, ensure that you install go-bindata, a small utility to generate Go code from any file.

Get, build and copy go-bindata into any directory on your PATH

    $ go get -u github.com/jteeuwen/go-bindata/...
    $ cd $GOPATH/src/github.com/jteeuwen/go-bindata/go-bindata
    $ go build

Slide 23

Slide 23 text

Notification Integration

Create an alert notification receiver. Of the supported AlertManager receivers, let’s opt for integrating Slack.

    route:
      group_by: [cluster]
      # If an alert isn't caught by a route, send it to slack.
      receiver: slack_general
      routes:
      # Send severity=page alerts to slack.
      - match:
          severity: page
        receiver: slack_general

    receivers:
    - name: slack_general
      slack_configs:
      - api_url: ''
        channel: '#'
        send_resolved: true

Slide 24

Slide 24 text

The visual editor can assist in building routing trees.

Slide 25

Slide 25 text

Build, Run, Test

Verify you have a functional development environment by building and running the project:

    $ make assets                            # invokes go-bindata to inject static web files
    $ go build                               # compiles go code
    $ ./alertmanager -config.file=slack.yml  # runs alertmanager with the specified configuration

Reload Prometheus or AlertManager configs

    $ curl -X POST http://localhost:9090/-/reload
    $ kill -HUP `pgrep alertmanager`

Validate Prometheus config and alert rules

    $ ./promtool check-config
    $ ./promtool check-rules

Slide 26

Slide 26 text

Test

If you chose to set up a Slack channel, you should now see new alerts firing as your random targets go up and down.

Slide 27

Slide 27 text

Changelog
- /ui/app/js/app.js (Angular)
- /ui/app/partials/history.html (HTML)
- /api.go (Go)
- /provider/provider.go, /provider/sqlite/sqlite.go, /provider/boltmem/boltmem.go (Go & SQL)

Slide 28

Slide 28 text

All UI functionality should be addressable via API. Let’s register a new /history API endpoint:

/api.go

    r.Get("/history", ihf("history", api.listAllAlerts))

With /api/v1/history now an addressable API endpoint, we’ll need a function to handle inbound HTTP requests made to it: api.listAllAlerts.

    func (api *API) listAllAlerts(w http.ResponseWriter, r *http.Request) {
        alerts := api.alerts.GetAll()
        defer alerts.Close()
        ...
    }

Slide 29

Slide 29 text

/provider

With the API endpoint in place, let’s turn our attention to the backend for collecting the right recordset from our data provider.

1. Add a new AlertIterator (e.g. GetAll() AlertIterator) to /provider/provider.go
2. Add a new AlertProvider and SQL query to /provider/sqlite/sqlite.go
3. Add a new AlertIterator and AlertProvider to /provider/boltmem/boltmem.go
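The provider steps above can be sketched, under stated assumptions, as an interface with a GetAll method backed by an in-memory store. All types here (`Alert`, `AlertIterator`, `Alerts`, `memProvider`) are simplified, hypothetical versions written for this example; the real interfaces in /provider/provider.go carry more methods and richer alert types.

```go
package main

import "fmt"

// Alert is a hypothetical, simplified alert record.
type Alert struct {
	Name     string
	Resolved bool
}

// AlertIterator is a simplified iterator over alerts.
type AlertIterator interface {
	Next() <-chan *Alert // yields alerts until the channel is closed
	Close()              // releases the iterator; call exactly once
}

// Alerts is a simplified provider interface with the new GetAll method.
type Alerts interface {
	GetPending() AlertIterator // unresolved alerts only
	GetAll() AlertIterator     // every stored alert, resolved or not
}

type memIterator struct {
	ch   chan *Alert
	done chan struct{}
}

func (it *memIterator) Next() <-chan *Alert { return it.ch }
func (it *memIterator) Close()              { close(it.done) }

// memProvider is an in-memory Alerts implementation.
type memProvider struct {
	alerts []*Alert
}

// iterate streams the alerts accepted by keep until done or Close is called.
func (p *memProvider) iterate(keep func(*Alert) bool) AlertIterator {
	it := &memIterator{ch: make(chan *Alert), done: make(chan struct{})}
	go func() {
		defer close(it.ch)
		for _, a := range p.alerts {
			if !keep(a) {
				continue
			}
			select {
			case it.ch <- a:
			case <-it.done:
				return
			}
		}
	}()
	return it
}

func (p *memProvider) GetPending() AlertIterator {
	return p.iterate(func(a *Alert) bool { return !a.Resolved })
}

func (p *memProvider) GetAll() AlertIterator {
	return p.iterate(func(*Alert) bool { return true })
}

func main() {
	var provider Alerts = &memProvider{alerts: []*Alert{
		{Name: "instance_down_8080"},
		{Name: "instance_down_8081", Resolved: true},
	}}

	all := provider.GetAll()
	defer all.Close()
	for a := range all.Next() {
		fmt.Printf("%s resolved=%v\n", a.Name, a.Resolved)
	}
}
```

Keeping GetAll next to the existing pending-only accessor means each backend only has to drop its resolved-alert filter to serve the history view.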

Slide 30

Slide 30 text

/ui/app/js/app.js

NavCtrl for the History menu item:

    angular.module('am.controllers').controller('NavCtrl',
      function($scope, $location) {
        $scope.items = [{
          name: 'History',
          url: 'history'
        },
        ...

as well as a new History service:

    angular.module('am.services').factory('History',
      function($resource) {
        return $resource('', {}, {
          'query': {
            method: 'GET',
            url: 'api/v1/history'
          }
        });
      }
    );

and a new History controller:

    angular.module('am.controllers').controller('HistoryCtrl',
      function($scope, History) {
        $scope.refresh = function () {
          History.query({},
            function(data) {
              $scope.groups = data.data;
              console.log($scope.groups);
            },
            function(data) {
              console.log(data.data);
            })
        }
        $scope.refresh();
      }
    );

Insert a new History directive:

    angular.module('am.directives').directive('history',
      function() {
        return {
          restrict: 'E',
          scope: {
            alert: '=',
            group: '='
          },
          templateUrl: 'app/partials/history.html'
        };
      }
    );

Slide 31

Slide 31 text

/ui/app/partials/history.html

Finally, we’ll need a page in which to view the transpired alerts. So, create a new file, history.html, under /ui/app/partials. History.html will simply format and display a tabular recordset. A new recordset will be retrieved from our data provider.

Slide 32

Slide 32 text

Summary
- This example enhancement provides a view of transient history: that of the period that the SQLite database holds.
- AlertManager is not currently intended to provide long-term storage.
- Contributing is easier than you may think.

Reference
- Alert History fork
- Alert History tutorial

Slide 33

Slide 33 text

Resources
- IRC: #prometheus on irc.freenode.net
- Mailing lists:
  - prometheus-users: discussing Prometheus usage and community support
  - prometheus-developers: contributing to Prometheus development
- Twitter: @PrometheusIO
- Prometheus repositories: to file bugs and feature requests

Slide 34

Slide 34 text

Lee Calcote Thank you. Questions? clouds, containers, infrastructure, applications and their management linkedin.com/in/leecalcote @lcalcote blog.gingergeek.com [email protected] calcotestudios.com/talks yes, we're hiring