Slide 1

Anomaly Detection at Scale
Or, how Uber monitors the health of its city operations
September 22, 2016, Velocity NYC

Slide 2

Akshay Shah [email protected] @akshayjshah
Michael Hamrah [email protected] @mhamrah

Slide 3

Observability @ Uber
Trips
Anomaly detection
Roll-out & evaluation
Lessons learned

Slide 4

425+ cities 1500+ services

Slide 5

425+ cities, which means 425+ configurations. 1500+ services, which means 1500+ sources of failure.

Slide 6

Nobody’s just browsing.

Slide 7

Observability @ Uber
Whitebox monitoring lets us check individual nodes. It’s granular, simple, nuanced, and cheap. Uber started off with Graphite and Nagios; now we use an in-house system backed by Cassandra and Elasticsearch. We’re also developing an in-memory TSDB we plan to open source.
Blackbox monitoring lets us check whether the overall system is working. It’s holistic, complex, and binary. We use a custom-built application running in the cloud to check a handful of critical scenarios.

Slide 8

Trips in the North Pole
Imagine that Uber operated in the North Pole. The city team there wants to make sure that uberPOOL is working, and this chart shows the number of completed pool trips. How do we monitor this?

Slide 9

Trips in the North Pole: Whitebox
Hard-coded thresholds don’t work well for business metrics: peak times are too different from off-peak.

Slide 10

Trips in the North Pole: Blackbox
uberPOOL is one of dozens of products, in hundreds of cities, all of which are enabled and rolled out with city-, driver-, and rider-scoped configuration. How many blackbox checks can we write and maintain? Many business-critical functions don’t have external APIs.

Slide 11

Anomaly detection answers a simple question: is the current measurement weird?

Slide 12

github.com/etsy/skyline

Slide 13

Generic solutions are hard
github.com/etsy/skyline
github.com/ankane/anomaly
github.com/numenta/nupic
github.com/linkedin/luminol
github.com/yahoo/egads
github.com/twitter/AnomalyDetection
“Numenta Platform for Intelligent Computing: a brain-inspired machine intelligence platform, and biologically accurate neural network based on cortical learning algorithms.” ¯\_(ツ)_/¯

Slide 14

Generic solutions are hard: slow, complex, and imperfect
Generic anomaly detection is an active research area, with approaches based on statistics, run-of-the-mill machine learning, and deep learning. Papers don’t always work well in production, especially when you only care about your own data. Plus, we don’t want cutting-edge machine learning algorithms in our monitoring stack (yet). So we kept it simple and took shortcuts.

Slide 15

Shortcuts
Uber’s business has a strong weekly cycle: weekends are busy, and weekday afternoons are quiet. Changes in rider demand and car supply happen over minutes and hours, not seconds. Virtually all important business metrics and user interactions have the same weekly cycle and slow change. We hard-code those parameters and hand-configure any exceptions.
[Chart: the current week’s trips overlaid on the same metric at -1w through -5w]
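To make the weekly-cycle shortcut concrete, here is a minimal Go sketch (ours, not Uber’s code; the season constant and the historyPoints helper are illustrative names) of how a forecaster can look up the same moment in each of the previous five weeks:

    package forecast

    import "time"

    // season is the hard-coded cycle the slides describe: one week, with any
    // exceptions configured by hand rather than detected automatically.
    const season = 7 * 24 * time.Hour

    // historyPoints returns the timestamps of the same moment in each of the
    // previous `weeks` weeks (now-1w, now-2w, ... now-5w on the chart).
    func historyPoints(now time.Time, weeks int) []time.Time {
        points := make([]time.Time, 0, weeks)
        for i := 1; i <= weeks; i++ {
            points = append(points, now.Add(-time.Duration(i)*season))
        }
        return points
    }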

Slide 16

Simplicity
Every half-hour, we make a forecast about the next half-hour: what are the upper and lower bounds of normal? Save the forecast. Every minute, check the most recent reading against the latest forecast and page someone if we’re outside the normal range. (This is basically teaching a computer to set Nagios thresholds.)
[Diagram: 5w of raw data → forecasting (every 30m) → thresholds; thresholds + current data → evaluation (every 1m) → alert]
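A hedged Go sketch of those two loops, assuming stand-in hooks (Thresholds, forecastNextWindow, latestReading, and page are our names, not Uber’s API): a slow loop recomputes and saves the bounds of normal every 30 minutes, and a fast loop checks the latest reading against them every minute.

    package forecast

    import (
        "sync"
        "time"
    )

    // Thresholds are the upper and lower bounds of "normal" for the next
    // half-hour window.
    type Thresholds struct {
        Lower, Upper float64
    }

    var (
        mu     sync.RWMutex
        latest Thresholds
    )

    // forecastLoop recomputes and saves thresholds every 30 minutes.
    func forecastLoop(forecastNextWindow func() Thresholds) {
        for range time.Tick(30 * time.Minute) {
            t := forecastNextWindow()
            mu.Lock()
            latest = t
            mu.Unlock()
        }
    }

    // evaluateLoop checks the most recent reading against the saved
    // thresholds every minute and pages if it falls outside the normal range.
    func evaluateLoop(latestReading func() float64, page func(float64)) {
        for range time.Tick(time.Minute) {
            v := latestReading()
            mu.RLock()
            t := latest
            mu.RUnlock()
            if v < t.Lower || v > t.Upper {
                page(v)
            }
        }
    }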

Slide 17

Forecasting
Say it’s 4pm on a Thursday: focus on the 4-5pm hour from the past five Thursdays. For each week, calculate the ratio between the 5pm measurement and the 4pm measurement. Multiply the median ratio by the current number. That’s the forecast. To generate thresholds, estimate how spiky this metric is. Multiply that by a hard-coded constant, and then add a little fudge factor. Add that to your forecast to generate an upper threshold, and subtract it to generate the lower threshold. Use increasing constants to produce thresholds for increasing degrees of weirdness. (We use a 9-point scale.)
[Diagram: the 4pm→5pm ratio for each of the past five weeks (1.1x, 1.4x, 1.1x, 1.3x, 1.5x); the median ratio, 1.3x, times the current 4pm reading gives the forecast]
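In Go, the forecast and threshold math described above fits in a few lines. This is only a sketch based on the slide’s description; spikiness, k, and fudge stand in for the spikiness estimate, the hard-coded constant, and the fudge factor:

    package forecast

    import "sort"

    // forecastBounds projects the next hour from the current reading plus the
    // corresponding hour-pairs from the past five weeks, then widens the
    // forecast into lower and upper thresholds.
    func forecastBounds(current float64, pastThisHour, pastNextHour []float64,
        spikiness, k, fudge float64) (forecast, lower, upper float64) {
        // Ratio of next-hour to this-hour trips for each past week,
        // e.g. 1.1, 1.4, 1.1, 1.3, 1.5 on the slide.
        ratios := make([]float64, len(pastThisHour))
        for i := range pastThisHour {
            ratios[i] = pastNextHour[i] / pastThisHour[i]
        }
        sort.Float64s(ratios)
        median := ratios[len(ratios)/2] // 1.3 in the slide's example

        forecast = current * median
        band := spikiness*k + fudge // larger k => wider band
        return forecast, forecast - band, forecast + band
    }

Calling forecastBounds with increasing values of k produces the nested bands behind the 9-point weirdness scale.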

Slide 18

Trips in the North Pole
Armed with our forecasting algorithm, monitoring uberPOOL in the North Pole is easy.*

Slide 19

A Real Anomaly

Slide 20

Complexity
Of course, it’s not quite that simple. We window and summarize the input data to make our forecasts more stable, and we ended up needing other approaches to detect very gradual and very abrupt outages. We also use a few different techniques to prevent outages from polluting subsequent forecasts. Even with all that, this algorithm is implemented in 1300 lines of Go and doesn’t use anything beyond the standard library. Creating a forecast takes tens of milliseconds.
99% of outages detected
65% of pages accurate
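The slides don’t say exactly how the input is windowed and summarized, but one simple technique in the same spirit is to collapse each window of raw per-minute readings into its median before forecasting, so a brief spike or a few dropped points can’t move the forecast; a sketch:

    package forecast

    import "sort"

    // summarize collapses a non-empty window of raw readings into its median,
    // which resists short spikes and dropped points better than the mean.
    func summarize(window []float64) float64 {
        vals := append([]float64(nil), window...)
        sort.Float64s(vals)
        return vals[len(vals)/2]
    }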

Slide 21

Implementation
A dedicated service polls for data periodically, calculates the forecasts, and saves them. On demand, the core query engine fetches and applies forecasts to convert raw metrics into anomaly levels (on a 9-point scale). Our monitoring system’s query language has a user-defined function for anomaly detection; to users, this is no different from any other function. We’re currently using anomaly detection to track thousands of metrics.
[Architecture diagram: Metrics Ingest and Storage, Forecast Engine, Forecast Storage, Query Engine, Dashboards, Alerting]
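At query time, the user-defined function boils down to mapping a raw reading onto the 9-point scale using the saved, increasingly wide threshold bands. A hedged Go sketch (Band and anomalyLevel are our illustrative names, not the production API):

    package forecast

    // Band is one saved lower/upper threshold pair; bands[0] is the tightest
    // and bands[8] the widest.
    type Band struct {
        Lower, Upper float64
    }

    // anomalyLevel returns 0 for a normal reading and 1-9 for increasingly
    // weird readings: the level is the widest band the value falls outside.
    func anomalyLevel(value float64, bands [9]Band) int {
        level := 0
        for i, b := range bands {
            if value < b.Lower || value > b.Upper {
                level = i + 1
            }
        }
        return level
    }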

Slide 22

Challenges: Adoption
Threshold-based alerting is familiar, so machine-calculated thresholds were a comfortable next step. More exotic techniques would have taken longer to be trusted. Visualizing the thresholds helped a lot. Usability really matters, even for an audience of developers. Adding a dedicated UI increased adoption greatly, even among engineers comfortable with the monitoring query language.

Slide 23

Challenges: Tracking Effectiveness
It’s difficult for us to measure real-world accuracy. Our outages database wasn’t designed for training algorithms, so there isn’t a 100%-complete source of truth; the best option is to have users rate each page. Misconfiguration is common, and sometimes metrics change over time. We now auto-detect when forecasting is applied to an unsuitable timeseries and message the user. Improving these algorithms is hard: it requires batch access to lots of monitoring data. We haven’t solved this one yet.

Slide 24

Lessons Learned
● Hard-coding our typical weekly period made the algorithm much simpler.
● The core forecasting idea was familiar to users.
● The polling forecast engine can be used with other data sources.
● The architecture keeps experimental forecasting code separate from the mission-critical query engine.
● Go was a good language choice, despite the lack of libraries.
● Measure real-world accuracy (vs. using a set of test data).
● Vet initial users carefully, and plan for dirty input data.
● Build a complete outage-reporting system sooner.
● Get monitoring data into a batch-processing system.

Slide 25

Of course, the two of us didn’t build all this ourselves. Dozens of engineers and data scientists, both here and in SF, built these systems and continue to improve them.

Slide 26

Akshay Shah [email protected] @akshayjshah
Michael Hamrah [email protected] @mhamrah
Thanks! http://eng.uber.com/