Building a Statistical Anomaly Detector in Elasticsearch

Zachary Tong

April 19, 2016

Transcript

  1. 1
    Building a Statistical Anomaly Detector
    Zachary Tong

  2. 2
    The Problem
    • 45m data points
    • 75,000 time-series
    • 8 large-scale, simulated “disruptions”

  3. 3
    Some Random Disruptions

  4. 4
    eBay’s Atlas Monitoring System
    • Atlas was designed to monitor eBay search results in real time
    • Built in-house, but they published a paper
    • I wanted to re-implement it in Elasticsearch
    • Goldberg, David, and Yinan Shan. "The importance of features for statistical anomaly detection." 7th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 15). 2015.

  5. 5
    Queries and Metrics
    Query: “thinkpad laptop”
    Metrics:
    • Number of results
    • Average price
    • Min price
    • Max price
    • Average age
    • Distinct sellers
    • etc.

  9. 9
    Can be any type of metric!
    Netflow
    Click traffic
    Server stats
    Cohort analysis
    IoT Sensor readings
    Marketing campaigns

    Just needs timestamp + numeric value
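
Because each series only needs a timestamp and a numeric value, the input can be modelled as a flat list of points grouped by series. The following is a minimal sketch under that assumption; the names `DataPoint` and `group_by_series` are hypothetical and do not come from the talk.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class DataPoint:
    """One observation (hypothetical record shape, not from the talk)."""
    series: str          # e.g. a query/metric pair such as "thinkpad laptop/avg_price"
    timestamp: datetime  # when the value was observed
    value: float         # the numeric measurement


def group_by_series(points):
    """Group points by series name; each group is sorted by timestamp."""
    grouped = defaultdict(list)
    for point in points:
        grouped[point.series].append(point)
    for series_points in grouped.values():
        series_points.sort(key=lambda p: p.timestamp)
    return dict(grouped)
```

Any of the sources listed on the slide (netflow, click traffic, sensor readings) could be mapped onto this shape before the detector runs.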

  10. 10
    Finding “surprising” series

  11. 11
    Finding “surprising” series
    Calculate series average

  12. 12
    Finding “surprising” series
    Find largest “surprise”
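
The two steps so far (calculate the series average, then find the largest "surprise") can be sketched for a single series as below. This is an illustration that assumes "surprise" means the absolute deviation of a value from the plain series mean; the function name `largest_surprise` is hypothetical. The pipeline slide later lists a moving average, which a real implementation would likely use in place of the plain mean.

```python
from statistics import mean


def largest_surprise(values):
    """Return the largest absolute deviation from the series average.

    Assumption: 'surprise' = |value - average|; the talk does not
    spell out the exact formula on this slide.
    """
    if not values:
        raise ValueError("series must contain at least one value")
    average = mean(values)
    return max(abs(value - average) for value in values)
```

For example, `largest_surprise([10, 12, 11, 40, 9])` is about 23.6, driven by the single spike at 40.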

  13. 13
    Finding “surprising” series
    Repeat for all series
    Largest surprise per series: 25, 74, 15, 3, 19, 82

  14. 14
    Finding “surprising” series
    Sort the surprise
    Unsorted: 25, 74, 15, 3, 19, 82
    Sorted: 82, 74, 25, 19, 15, 3

  15. 15
    Finding “surprising” series
    Calculate 95th percentile
    Unsorted: 25, 74, 15, 3, 19, 82
    Sorted: 82, 74, 25, 19, 15, 3
    95th percentile: 68
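
The sort and percentile steps can be sketched as follows, using the six surprise values shown on the slide. The `percentile` helper is hypothetical and uses linear interpolation between closest ranks, which is only one of several common definitions.

```python
import math


def percentile(values, pct):
    """Percentile via linear interpolation between closest ranks.

    Assumption: this interpolation rule is chosen for illustration;
    the talk does not say which percentile definition it uses.
    """
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= pct <= 100:
        raise ValueError("pct must be between 0 and 100")
    ordered = sorted(values)
    rank = (pct / 100) * (len(ordered) - 1)
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


surprises = [25, 74, 15, 3, 19, 82]        # largest surprise per series (from the slide)
ranked = sorted(surprises, reverse=True)   # most surprising series first
p95 = percentile(surprises, 95)            # one summary value for this time step
```

With this rule the result is about 80 rather than the 68 shown on the slide; percentile estimates over so few values depend heavily on the interpolation rule, so the two need not agree.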

  16. 16
    Finding “surprising” series
    Plot value, wait n minutes
    Unsorted: 25, 74, 15, 3, 19, 82
    Sorted: 82, 74, 25, 19, 15, 3
    95th percentile: 68

  17. 17
    Finding “surprising” series
    Repeat entire procedure

  18. 18
    Finding “surprising” series
    Repeat entire procedure
    Largest surprise per series: 28, 62, 23, 25, 4, 19

  19. 19
    Finding “surprising” series
    Repeat entire procedure
    Unsorted: 28, 62, 23, 25, 4, 19
    Sorted: 62, 28, 25, 23, 19, 4

  20. 20
    Finding “surprising” series
    Repeat entire procedure
    Unsorted: 28, 62, 23, 25, 4, 19
    Sorted: 62, 28, 25, 23, 19, 4
    95th percentile: 61

  22. 22
    Flagging Anomalies
    [Plot: top 95th percentile surprise (y-axis: Surprise, x-axis: Time)]

  23. 23
    Flagging Anomalies
    [Plot: top 95th percentile surprise with a 3 standard deviation threshold (y-axis: Surprise, x-axis: Time)]

  24. 24
    Flagging Anomalies
    [Plot: top 95th percentile surprise with a 3 standard deviation threshold and a point marked “Anomaly!” (y-axis: Surprise, x-axis: Time)]
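
The thresholding idea on these slides can be sketched as below: a point in the 95th percentile surprise series is flagged when it exceeds a baseline plus three standard deviations. The slides only name the "3 standard deviation threshold"; the trailing window, its default length of 24, and the function name `flag_anomalies` are assumptions made for illustration.

```python
from statistics import mean, stdev


def flag_anomalies(series, window=24, sigmas=3.0):
    """Return indices whose value exceeds trailing mean + sigmas * stdev.

    Assumption: the baseline is computed over the `window` points
    immediately preceding each value (not stated in the talk).
    """
    if window < 2:
        raise ValueError("window must be at least 2 to compute a stdev")
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        threshold = mean(history) + sigmas * stdev(history)
        if series[i] > threshold:
            anomalies.append(i)
    return anomalies
```

With `flag_anomalies([10, 11, 9, 10, 12, 10, 9, 11, 50, 10], window=5)`, only the spike at index 8 is flagged. Note that the spike then inflates the baseline for the points that follow it, which is one reason the window length matters.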

  25. 25
    Turns meaningless data … into discrete alerts

  26. 26
    Elasticsearch Pipeline Aggregations
    Generates the raw data
    • Terms
    • Terms
    • Date Histogram
    • Avg
    • Moving Avg
    • Bucket Script
    • Max Bucket
    • Percentiles Bucket
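
One way the aggregations listed on this slide might nest is sketched below as a Python dict that serializes to a search request body. This is a guess at the structure, not the query from the talk: the field names (`metric`, `query`, `timestamp`, `value`), the aggregation names, the nesting order of the two terms aggregations, and the 24-bucket window are all assumptions. The parameter names target roughly Elasticsearch 5.x to 7.x; later versions remove `moving_avg` in favour of `moving_fn` and replace `interval` with `calendar_interval` or `fixed_interval`.

```python
import json


def build_surprise_request(window=24, percentile=95.0):
    """Build a search body: per-metric percentile of each query's largest surprise.

    All field and aggregation names here are hypothetical placeholders.
    """
    series_aggs = {
        "avg": {"avg": {"field": "value"}},
        "movavg": {
            "moving_avg": {"buckets_path": "avg", "window": window, "model": "simple"}
        },
        # surprise = |bucket average - moving average|
        "surprise": {
            "bucket_script": {
                "buckets_path": {"avg": "avg", "movavg": "movavg"},
                "script": "Math.abs(params.avg - params.movavg)",
            }
        },
    }
    query_aggs = {
        "series": {
            "date_histogram": {"field": "timestamp", "interval": "hour"},
            "aggs": series_aggs,
        },
        # largest surprise within one query's time series
        "largest_surprise": {"max_bucket": {"buckets_path": "series>surprise"}},
    }
    metric_aggs = {
        "queries": {"terms": {"field": "query"}, "aggs": query_aggs},
        # percentile of the largest surprises across all queries
        "percentile_surprise": {
            "percentiles_bucket": {
                "buckets_path": "queries>largest_surprise",
                "percents": [percentile],
            }
        },
    }
    return {
        "size": 0,
        "aggs": {"metrics": {"terms": {"field": "metric"}, "aggs": metric_aggs}},
    }


request_json = json.dumps(build_surprise_request(), indent=2)
```

Building the body in code keeps the window and percentile adjustable; the resulting JSON would be sent to the `_search` endpoint of the index holding the data.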

  27. 27
    Watcher
    Executes data collection & anomaly detector aggs
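
A scheduled watch that runs a search on an interval could look roughly like the sketch below, again written as a Python dict that serializes to the watch body. The talk does not show the watch itself, so the helper name `build_watch`, the one-hour interval, the `data-*` index pattern, and the logging action are all placeholders; a real setup would presumably store the aggregation results, for example with an index action.

```python
def build_watch(search_body, every="1h", indices=("data-*",)):
    """Build a watch that periodically runs `search_body`.

    Assumptions: interval, index pattern, and the logging action are
    illustrative placeholders, not taken from the talk.
    """
    return {
        # run on a fixed schedule
        "trigger": {"schedule": {"interval": every}},
        # execute the search (e.g. the anomaly detector aggregations)
        "input": {
            "search": {
                "request": {"indices": list(indices), "body": search_body}
            }
        },
        # always fire the actions after the search completes
        "condition": {"always": {}},
        "actions": {
            "log_result": {
                "logging": {"text": "anomaly detector aggregations executed"}
            }
        },
    }
```

Passing in the aggregation request as `search_body` keeps the schedule and the query definition separate, so either can change without touching the other.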

  28. 28
    Kibana’s Timelion
    Flexible ad-hoc charting

  29. 29
    Resources
    • eBay’s original article
    http://www.ebaytechblog.com/2015/08/19/statistical-anomaly-detection/

    • “Implementing a Statistical Anomaly Detector in Elasticsearch”
    https://www.elastic.co/blog/implementing-a-statistical-anomaly-detector-part-1
    https://www.elastic.co/blog/implementing-a-statistical-anomaly-detector-part-2
    https://www.elastic.co/blog/implementing-a-statistical-anomaly-detector-part-3

  30. 30
    Questions?

  31. 31
    Please attribute Elastic with a link to elastic.co
    Except where otherwise noted, this work is licensed under
    http://creativecommons.org/licenses/by-nd/4.0/
    Creative Commons and the double C in a circle are
    registered trademarks of Creative Commons in the United States and other countries.
    Third party marks and brands are the property of their respective holders.