Uber: Anomaly Detection At Scale

Anomaly Detection at Scale
Or, how Uber monitors the health of its city operations
September 22, 2016, Velocity NYC

Akshay Shah (@akshayjshah), Michael Hamrah (@mhamrah)

• Observability @ Uber
• Trips
• Anomaly detection
• Roll-out & evaluation
• Lessons learned

425+ cities 1500+ services

425+ cities: 425+ configurations. 1500+ services: 1500+ sources of failure.

Nobody’s just browsing.

Observability @ Uber
Whitebox monitoring lets us check individual nodes. It’s granular, simple, nuanced, and cheap. Uber started off with Graphite and Nagios; now we use an in-house system backed by Cassandra and Elasticsearch. We’re also developing an in-memory TSDB we plan to open source.
Blackbox monitoring lets us check whether the overall system is working. It’s holistic, complex, and binary. We use a custom-built application running in the cloud to check a handful of critical scenarios.

Trips in the North Pole
Imagine that Uber operated in the North Pole. The city team there wants to make sure that uberPOOL is working, and this chart shows the number of completed pool trips. How do we monitor this?

Trips in the North Pole: Whitebox
Hard-coded thresholds don’t work well for business metrics: peak times are too different from off-peak.
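
To make the problem concrete, here is a minimal sketch of a fixed-threshold check. All numbers are made up for illustration and the function name `staticAlert` is hypothetical; it is not taken from Uber's monitoring stack.

```go
package main

import "fmt"

// staticAlert reports whether a reading falls outside fixed bounds,
// the way a hard-coded, Nagios-style threshold would.
func staticAlert(reading, low, high float64) bool {
	return reading < low || reading > high
}

func main() {
	// Hypothetical trips-per-minute readings; not real data.
	const offPeakNormal = 20.0
	const peakNormal = 400.0
	const peakOutage = 150.0 // peak-hour reading during a partial outage

	// Bounds loose enough to cover both normal readings miss the outage.
	fmt.Println(staticAlert(offPeakNormal, 10, 500)) // false
	fmt.Println(staticAlert(peakNormal, 10, 500))    // false
	fmt.Println(staticAlert(peakOutage, 10, 500))    // false

	// Raising the lower bound to catch the outage pages at off-peak instead.
	fmt.Println(staticAlert(peakOutage, 200, 500))    // true
	fmt.Println(staticAlert(offPeakNormal, 200, 500)) // true
}
```

No single pair of bounds separates the peak-hour outage from a normal off-peak reading, which is why the thresholds have to move with the time of day and week.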

Trips in the North Pole: Blackbox
uberPOOL is one of dozens of products, in hundreds of cities, all of which are enabled and rolled out with city-, driver-, and rider-scoped configuration. How many blackbox checks can we write and maintain? Many business-critical functions don’t have external APIs.

Anomaly detection answers a simple question: is the current measurement weird?

github.com/etsy/skyline

Generic solutions are hard
• github.com/etsy/skyline
• github.com/ankane/anomaly
• github.com/numenta/nupic
• github.com/linkedin/luminol
• github.com/yahoo/egads
• github.com/twitter/AnomalyDetection
“Numenta Platform for Intelligent Computing: a brain-inspired machine intelligence platform, and biologically accurate neural network based on cortical learning algorithms.” ¯\_(ツ)_/¯

Generic solutions are hard: slow, complex, and imperfect
Generic anomaly detection is an active research area, with approaches based on statistics, run-of-the-mill machine learning, and deep learning. Papers don’t always work well in production, especially if you only care about your data. Plus, we don’t want cutting-edge machine learning algorithms in our monitoring stack (yet). So we kept it simple and took shortcuts.

Shortcuts
Uber’s business has a strong weekly cycle - weekends are busy, and weekday afternoons are quiet. Changes in rider demand and car supply happen over minutes and hours, not seconds. Virtually all important business metrics and user interactions have the same weekly cycle and slow change. We hard-code those parameters and hand-configure any exceptions.
[Figure: the current time compared with the same time 1 to 5 weeks earlier]
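
The hard-coded weekly period can be sketched as a lookup of the same instant in each of the previous five weeks. This is an illustrative sketch only: the helper `weeklyLookbacks` and the example timestamp are assumptions, not Uber's code.

```go
package main

import (
	"fmt"
	"time"
)

// weeklyLookbacks returns the instants exactly 1 to n weeks before now,
// most recent first.
func weeklyLookbacks(now time.Time, n int) []time.Time {
	out := make([]time.Time, 0, n)
	for i := 1; i <= n; i++ {
		out = append(out, now.AddDate(0, 0, -7*i))
	}
	return out
}

// thursday4pm is an arbitrary example instant (a Thursday, 4pm UTC).
var thursday4pm = time.Date(2016, 9, 22, 16, 0, 0, 0, time.UTC)

func main() {
	for _, t := range weeklyLookbacks(thursday4pm, 5) {
		fmt.Println(t.Format("Mon 2006-01-02 15:04"))
	}
}
```

Using `AddDate` rather than subtracting 168 hours keeps the wall-clock time stable across daylight-saving changes when the timestamps are in a local time zone.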

Simplicity
Every half-hour, we make a forecast about the next half-hour. What are the upper and lower bounds of normal? Save the forecast. Every minute, check the most recent reading against the latest forecast and page someone if we’re outside the normal range. (This is basically teaching a computer to set Nagios thresholds.)
[Diagram: 5w raw data -> forecasting (every 30m) -> thresholds; current data + thresholds -> evaluation (every 1m) -> alert]
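
The two cadences (forecast every 30 minutes, evaluate every minute) might look roughly like this. The `Bounds` and `Store` types, and the choice to key saved forecasts by the start of their half-hour window, are assumptions made for this sketch; the talk does not describe the storage layout.

```go
package main

import (
	"fmt"
	"time"
)

// Bounds is the normal range saved by one forecasting run.
type Bounds struct{ Lower, Upper float64 }

// Store maps the start of each half-hour window to its saved forecast.
// Keying by window start is an assumption made for this sketch.
type Store map[time.Time]Bounds

// Check looks up the forecast covering t and reports whether the reading
// is outside the normal range. ok is false if no forecast has been saved.
func (s Store) Check(t time.Time, reading float64) (page, ok bool) {
	b, ok := s[t.UTC().Truncate(30*time.Minute)]
	if !ok {
		return false, false
	}
	return reading < b.Lower || reading > b.Upper, true
}

// windowStart is an arbitrary example half-hour boundary.
var windowStart = time.Date(2016, 9, 22, 16, 0, 0, 0, time.UTC)

func main() {
	s := Store{windowStart: {Lower: 80, Upper: 140}} // hypothetical forecast

	// Simulate the once-a-minute evaluation with made-up readings.
	for i, reading := range []float64{100, 120, 60} {
		t := windowStart.Add(time.Duration(i) * time.Minute)
		if page, ok := s.Check(t, reading); ok && page {
			fmt.Printf("%s: reading %.0f outside [80, 140], page on-call\n",
				t.Format("15:04"), reading)
		}
	}
}
```

Because the expensive step runs only twice an hour, the per-minute check reduces to a map lookup and two comparisons.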

Forecasting
Say it’s 4pm on a Thursday. We focus on the 4-5pm hour from the past five Thursdays. For each week, calculate the ratio between the 5pm measurement and the 4pm measurement. Multiply the median ratio by the current number. That’s the forecast.
To generate thresholds, estimate how spiky this metric is. Multiply that by a hard-coded constant, and then add a little fudge factor. Add that to your forecast to generate an upper threshold, and subtract to generate the lower threshold. Use increasing constants to produce thresholds for increasing degrees of weirdness. (We use a 9-point scale.)
[Figure: 5pm/4pm ratios for the past five weeks (1.1x, 1.4x, 1.1x, 1.3x, 1.5x); forecast = 1.3x current]
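
The forecast and threshold steps above can be sketched in a few lines of standard-library Go. The ratios are the ones shown on the slide; everything else is an assumption. In particular, the talk does not say how spikiness is estimated, so this sketch substitutes the median absolute deviation of the ratios, and the constant `k` and the fudge factor are placeholder values.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// median returns the median of xs without modifying it.
func median(xs []float64) float64 {
	if len(xs) == 0 {
		return math.NaN()
	}
	s := append([]float64(nil), xs...)
	sort.Float64s(s)
	n := len(s)
	if n%2 == 1 {
		return s[n/2]
	}
	return (s[n/2-1] + s[n/2]) / 2
}

// forecast scales the current reading by the median week-over-week ratio.
func forecast(current float64, ratios []float64) float64 {
	return current * median(ratios)
}

// spikiness is a stand-in estimate: the median absolute deviation of the
// ratios, scaled to the current reading. This is an assumption; the talk
// does not describe the real estimator.
func spikiness(current float64, ratios []float64) float64 {
	m := median(ratios)
	devs := make([]float64, len(ratios))
	for i, r := range ratios {
		devs[i] = math.Abs(r - m)
	}
	return current * median(devs)
}

// thresholds returns the lower and upper bounds around a forecast.
// k and fudge are hypothetical tuning constants.
func thresholds(fc, spiky, k, fudge float64) (lower, upper float64) {
	margin := spiky*k + fudge
	return fc - margin, fc + margin
}

func main() {
	ratios := []float64{1.1, 1.4, 1.1, 1.3, 1.5} // ratios from the slide
	current := 100.0                              // hypothetical 4pm reading

	fc := forecast(current, ratios)
	lo, hi := thresholds(fc, spikiness(current, ratios), 2, 5)
	fmt.Printf("forecast %.1f, normal range [%.1f, %.1f]\n", fc, lo, hi)
}
```

The median makes the forecast robust to one or two unusual weeks: a single outlier ratio among the five does not move the result.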

Trips in the North Pole
Armed with our forecasting algorithm, monitoring uberPOOL in the North Pole is easy.*

A Real Anomaly

Complexity
Of course, it’s not quite that simple. We window and summarize the input data to make our forecasts more stable, and we ended up needing other approaches to detect very gradual and very abrupt outages. We also use a few different techniques to prevent outages from polluting subsequent forecasts. Even with all that, this algorithm is implemented in 1300 lines of Go and doesn’t use anything beyond the standard library. Creating a forecast takes tens of milliseconds.
Results: 99% of outages detected; 65% of pages accurate.

Implementation
A dedicated service polls for data periodically, calculates the forecasts, and saves them. On demand, the core query engine fetches and applies forecasts to convert raw metrics into anomaly levels (on a 9-point scale). Our monitoring system’s query language has a user-defined function for anomaly detection. To users, this is no different from any other function. We’re currently using anomaly detection to track thousands of metrics.
[Diagram: Metrics Ingest and Storage, Forecast Engine, Forecast Storage, Query Engine, Dashboards, Alerting]

Challenges: Adoption
• Threshold-based alerting is familiar, so machine-calculated thresholds were a comfortable next step. More exotic techniques would have taken longer to be trusted.
• Visualizing the thresholds helped a lot.
• Usability really matters, even for an audience of developers. Adding a dedicated UI increased adoption greatly, even among engineers comfortable with the monitoring query language.

Challenges: Tracking Effectiveness
• It’s difficult for us to measure real-world accuracy. Our outages database wasn’t designed for training algorithms, so there isn’t a 100%-complete source of truth. The best option is to have users rate each page.
• Misconfiguration is common, and sometimes metrics change over time. We now auto-detect when forecasting is applied to an unsuitable timeseries and message the user.
• Improving these algorithms is hard - it requires batch access to lots of monitoring data. We haven’t solved this one yet.
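
If users rate each page, the share of accurate pages can be computed directly from those ratings. The sketch below is illustrative: the `Rating` type and its values are hypothetical, and skipping unrated pages is an assumption rather than a described practice.

```go
package main

import "fmt"

// Rating is a user's verdict on a single page (hypothetical type).
type Rating int

const (
	Unrated Rating = iota
	Accurate
	FalseAlarm
)

// pagePrecision returns the fraction of rated pages marked accurate and
// the number of rated pages. Unrated pages are ignored.
func pagePrecision(ratings []Rating) (precision float64, rated int) {
	accurate := 0
	for _, r := range ratings {
		switch r {
		case Accurate:
			accurate++
			rated++
		case FalseAlarm:
			rated++
		}
	}
	if rated == 0 {
		return 0, 0
	}
	return float64(accurate) / float64(rated), rated
}

func main() {
	// Made-up ratings for illustration.
	ratings := []Rating{Accurate, FalseAlarm, Unrated, Accurate, FalseAlarm}
	p, n := pagePrecision(ratings)
	fmt.Printf("%.0f%% of %d rated pages were accurate\n", p*100, n)
}
```

This measures only whether pages were justified. Measuring how many outages were caught still needs a complete record of outages, which is the gap noted above.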

Lessons Learned
• Hard-coding our typical weekly period made the algorithm much simpler.
• The core forecasting idea was familiar to users.
• Polling forecast engine can be used with other data sources.
• Architecture keeps experimental forecasting code separate from mission-critical query engine.
• Go was a good language choice, despite lack of libraries.
• Measure real-world accuracy (vs. using a set of test data).
• Vet initial users carefully, plan for dirty input data.
• Build a complete outage-reporting system sooner.
• Get monitoring data into a batch-processing system.

Of course, the two of us didn’t build all this ourselves. Dozens of engineers and data scientists, both here and in SF, built these systems and continue to improve them.

Thanks!
Akshay Shah (@akshayjshah), Michael Hamrah (@mhamrah)
http://eng.uber.com/
