Unsupervised Anomaly Detection And Forecasting For Enterprise Time Series

Time series are an omnipresent type of data. How can one predict the future and detect anomalies in an online setting on a large set of time series? Talk held at the data science meetup Kempten.

Joachim Rosskopf

May 16, 2018

Transcript

  1. UNSUPERVISED ANOMALY DETECTION AND FORECASTING FOR ENTERPRISE TIME SERIES

    Time series are an omnipresent type of data. How can one predict the future and detect anomalies in an online setting on a large set of time series? Wolfertschwenden, 16.05.18. Joachim Rosskopf, Dr. Simon Müller
  2. zoi.de WHO AM I?

    Joachim Rosskopf. Joachim is currently struggling to finish his PhD in theoretical physics, mainly doing data analysis and optimization algorithms there. He leads the DataTeam at Zoi GmbH, where he tries to combine his experience in software architecture and technology with business analytics and data science to build interesting, valuable data solutions.
  3. RIGHT NOW, THERE IS A LOT OF BUZZ AROUND MACHINE LEARNING. IS THAT JUSTIFIED?
  4. RIGHT NOW, THERE IS A LOT OF BUZZ AROUND MACHINE LEARNING. IS THAT JUSTIFIED?

    ▪ Example: Facebook image annotation
    ▪ Computer vision (CV)
    ▪ Millions of images per day, mostly invisible to the user.
    ▪ A huge source of information about the user, and for advertising.
    Figure: A photo from the “Mercedes Benz Deutschland” Facebook page
  5. RIGHT NOW, THERE IS A LOT OF BUZZ AROUND MACHINE LEARNING. IS THAT JUSTIFIED?

    ▪ Example: YouTube auto-generated subtitles
    ▪ Speech recognition / speech-to-text
    ▪ Millions of hours of speech per day.
    ▪ Cross-pollination with the Android ecosystem and other services.
    ▪ Big players have an ML strategy.
    Figure: A Barack Obama speech on YouTube with auto-generated subtitles
  6. RIGHT NOW, THERE IS A LOT OF BUZZ AROUND MACHINE LEARNING. IS THAT JUSTIFIED?

    ▪ Example: YouTube auto-generated subtitles
    ▪ Speech recognition / speech-to-text
    ▪ Millions of hours of speech per day.
    ▪ Cross-pollination with the Android ecosystem and other services.
    ▪ Big players have an ML strategy.
    Figure: A Barack Obama speech on YouTube with auto-generated subtitles
    AI WILL UNDERWHELM IN THE SHORT TERM, BUT BE TRANSFORMATIVE IN THE LONG TERM.
  7. WHAT IF GERMAN MANUFACTURING DOES NOT RELY ON SPEECH & TEXT, IMAGES OR VIDEOS?

    ▪ For manufacturing companies, many of the blockbuster advancements don’t apply directly.
    ▪ There is a lot of uncertainty about where investments will yield business value in the future.
    ▪ Companies should not just follow what's hot in the news and works for Google or Facebook.
  8. WHAT IF GERMAN MANUFACTURING DOES NOT RELY ON SPEECH & TEXT, IMAGES OR VIDEOS?

    ▪ For manufacturing companies, many of the blockbuster advancements don’t apply directly.
    ▪ There is a lot of uncertainty about where investments will yield business value in the future.
    ▪ Companies should not just follow what's hot in the news and works for Google or Facebook.
    OUR COMPANIES ALSO COLLECT A LOT OF DATA. THIS DATA IS MAINLY TIME SERIES DATA.
  9. AGENDA

    ▪ Introduction to time series.
    ▪ Forecasting and anomaly detection with different methods on smart meter / IoT data:
      ◦ The machine learning way
      ◦ With autoregressive models
      ◦ With recurrent neural networks (RNN)
    ▪ Time-to-event/failure prediction on JetEngine data:
      ◦ Survival statistics as the basis
      ◦ A nice fusion of RNN and the Weibull distribution
    ▪ Practical realization with streaming, open source and the cloud.
  10. COMPANIES PRODUCE LOTS OF TIME SERIES TIGHTLY CONNECTED TO BUSINESS PROCESSES

    ▪ A time series is a sequence of data points indexed by a time dimension.
    ▪ In most cases the sequence is sampled discretely at equally spaced points in time.
    ▪ Most commonly a time series is a real-valued univariate dataset, but multivariate series and series of categorical data also occur.
    Figure: The household MAC002321 from the London Smart Meter Data Set (Cluster 3)
  11. COMPANIES PRODUCE LOTS OF TIME SERIES TIGHTLY CONNECTED TO BUSINESS PROCESSES

    ▪ A time series is a sequence of data points indexed by a time dimension.
    ▪ In most cases the sequence is sampled discretely at equally spaced points in time.
    ▪ Most commonly a time series is a real-valued univariate dataset, but multivariate series and series of categorical data also occur.
    Figure: The households MAC002321 and MAC000034 from the London Smart Meter Data Set (Cluster 3 and 2)
  12. WHAT IS THE DATA PROBLEM?

    ▪ Due to combinatorics, line-of-business (LoB) and IoT systems produce a lot of time series.
    ▪ People spend a lot of time monitoring, interpreting and predicting time series.
    ▪ But doing that for a large number of series in a timely fashion is laborious and error-prone.
    ▪ It gets even more challenging if one wants to base business models or product functionality on these capabilities (e.g. dynamic pricing with the London smart meters).
  13. WHERE DO THE SERIES STEM FROM?

    Diagram: measures (ECommerce Conversions, Sales, IoT Interaction, Inventory) unfolded along dimensions (Ad Type, Medium, Campaign, Device Category, Channel, Country, Region, Device Type, Action, Location, Customer, Replenish. Time, Class., Order Point, Material, Turnover).
    Unfolding the dimensions in a traditional data warehouse leads to a multitude of time series of the respective measures. Predicting them is of great value!
  14. OVERVIEW OF OUR ALGORITHMS

    No single algorithm works equally well on all series. A challenge is to do the right preprocessing and algorithm selection.
    Diagram (five numbered model families along an axis labelled “More General Model”):
    ▪ Neural Networks (e.g. Autoencoder, LSTM, GRU)
    ▪ Generalized Autoregressive Conditional Heteroscedastic (GARCH)
    ▪ Autoregressive Model with Integrated Moving Average (ARIMA)
    ▪ Exponential Smoothing (ETS)
    ▪ Regression Models (e.g. Decision Tree Regression)
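To make the simplest family on this slide concrete, here is a minimal sketch of simple exponential smoothing, the most basic member of the ETS family. It is not taken from the talk's notebooks; the function name and the choice to initialise the level with the first observation are assumptions made for illustration.

```python
def simple_exponential_smoothing(series, alpha):
    """Return (one-step-ahead forecasts, forecast for the next unseen point).

    forecasts[t] is the forecast for series[t] made before seeing it.
    Hypothetical helper for illustration, not the notebooks' code.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]          # assumption: start the level at the first value
    forecasts = [level]
    for value in series[1:]:
        forecasts.append(level)                          # forecast before the update
        level = alpha * value + (1.0 - alpha) * level    # blend new value into the level
    return forecasts, level


forecasts, next_forecast = simple_exponential_smoothing([10.0, 12.0, 14.0], alpha=0.5)
print(forecasts, next_forecast)
```

A large residual between a value and its one-step-ahead forecast is one simple signal that could be used to flag an anomaly.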
  15. FORECASTING & ANOMALY DETECTION NOTEBOOKS

    ▪ Exploration - Smart Meter London: Explain and explore the dataset.
      https://mybinder.org/v2/gh/anofox/m3_konferenz/master?filepath=notebooks%2F01_Smart%20Meter%20London%20-%20Exploration.ipynb
    ▪ Quantile Random Forest - Smart Meter London: Use Random Forest Regression for time series prediction.
      https://mybinder.org/v2/gh/anofox/m3_konferenz/master?filepath=notebooks%2F02_Smart%20Meter%20London%20-%20Quantile%20Random%20Forest.ipynb
    ▪ ARIMA, ETS, and GARCH - Time Series Prediction: Use ARIMA, ETS, and GARCH models for time series prediction.
      https://mybinder.org/v2/gh/anofox/m3_konferenz/master?filepath=notebooks%2F03_Smart%20Meter%20London%20-%20ARIMA%2C%20ETS%2C%20and%20GARCH.ipynb
    ▪ Deep Learning - Time Series Prediction: Train and predict with a simple RNN.
      https://mybinder.org/v2/gh/anofox/m3_konferenz/master?filepath=notebooks%2F04_Smart%20Meter%20London%20-%20LSTM.ipynb
    ▪ Outlier Detection - Smart Meter London: Use an autoencoder together with extreme value theory to mark unlikely events as anomalies.
      https://mybinder.org/v2/gh/anofox/m3_konferenz/master?filepath=notebooks%2F05_Smart%20Meter%20London%20-%20Outlier_Detection.ipynb
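Using a regression model such as a random forest for forecasting requires turning the series into a supervised table first. A common way to do this is a sliding window of lagged values; the sketch below shows that step only. The helper name `make_lag_features` is hypothetical, and the notebooks may prepare their features differently.

```python
def make_lag_features(series, n_lags):
    """Turn a series into (features, targets) for a regressor.

    Each feature row holds the n_lags values that precede its target.
    Hypothetical helper for illustration.
    """
    if n_lags < 1 or n_lags >= len(series):
        raise ValueError("n_lags must be between 1 and len(series) - 1")
    features, targets = [], []
    for t in range(n_lags, len(series)):
        features.append(list(series[t - n_lags:t]))  # window before time t
        targets.append(series[t])                    # value to predict at time t
    return features, targets


X, y = make_lag_features([1, 2, 3, 4, 5], n_lags=2)
print(X, y)
```

The resulting rows can be passed to any regressor with a fit/predict interface; calendar features such as hour of day could be appended to each row in the same way.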
  16. TIME TO EVENT, EVENT TIME ANALYSIS AND CENSORED DATA

    ▪ The target variable in many enterprise time series is not continuous, but rather an event.
    ▪ This is a typical setting in event time analysis, where we want to predict the remaining lifetime or death of an individual.
    ▪ Depending on the domain, this point in time depends on different features, such as usage, blood pressure, oil temperature, etc.
  17. TIME TO EVENT, EVENT TIME ANALYSIS AND CENSORED DATA

    ▪ In event time analysis, typically not all events are observed.
    ▪ We know how old we are, but not when we will die. Age is a right-censored data point.
      ◦ Events are known up to a certain point in time.
      ◦ After this point, we have not yet observed a new event, but we still gather data on the features. During this time the event is censored.
  18. NASA JETENGINE DATASET

    ▪ Fleet of 100 aircraft engines of the same model.
    ▪ Each engine starts with a different, unknown degree of initial wear and manufacturing variation.
    ▪ Each engine degrades over time until a predefined, unknown failure threshold is reached.
    ▪ 24 features, 1 event = failure at the end of each time series.
    ▪ Goal: predict, from any point in time, the time until maintenance.
  19. NASA JETENGINE DATASET

    ▪ Fleet of 100 aircraft engines of the same model.
    ▪ Each engine starts with a different, unknown degree of initial wear and manufacturing variation.
    ▪ Each engine degrades over time until a predefined, unknown failure threshold is reached.
    ▪ 24 features, 1 event = failure at the end of each time series.
    ▪ Goal: predict, from any point in time, the time until maintenance.
    Figure: Kaplan-Meier fit on the training data of the 100 engines, median (199 time units)
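The Kaplan-Meier fit mentioned in the figure caption can be sketched in a few lines. This is a plain-Python illustration of the estimator on toy data, not the code or data behind the slide; the function name and the input layout (one duration and one observed flag per unit) are assumptions.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: time until failure, or until observation stopped.
    observed:  True if the failure was seen, False if right-censored.
    """
    event_times = sorted({d for d, o in zip(durations, observed) if o})
    survival = 1.0
    curve = []
    for t in event_times:
        # Units still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        # Failures that happen exactly at time t.
        events = sum(1 for d, o in zip(durations, observed) if o and d == t)
        survival *= 1.0 - events / at_risk
        curve.append((t, survival))
    return curve


# Toy data: four units, the third one is censored at time 3.
for time, prob in kaplan_meier([2, 3, 3, 5], [True, True, False, True]):
    print(time, round(prob, 3))
```

The censored unit never counts as a failure, but it does count as "at risk" up to the time it leaves observation. The median lifetime is the first time at which the curve drops to 0.5 or below.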
  20. IN EVENT TIME ANALYSIS THE WEIBULL DISTRIBUTION IS WELL KNOWN.

    ▪ Characteristics and benefits of the Weibull distribution: continuous or discrete closed form.
    ▪ Occurs in nature, e.g. in event time analysis, reliability engineering and failure analysis, and industrial engineering to represent manufacturing and delivery times.
    ▪ There exists literature with practical examples, e.g. for regularization.
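For reference, the closed forms the slide alludes to can be written out as follows. The symbols are an assumption, since the slide does not fix a notation: $\alpha > 0$ is the scale and $\beta > 0$ the shape.

```latex
\begin{align*}
% Continuous case, t >= 0
f(t) &= \frac{\beta}{\alpha}\left(\frac{t}{\alpha}\right)^{\beta-1}
        \exp\!\left[-\left(\frac{t}{\alpha}\right)^{\beta}\right]
        && \text{(density)} \\
S(t) &= \Pr(T > t) = \exp\!\left[-\left(\frac{t}{\alpha}\right)^{\beta}\right]
        && \text{(survival function)} \\
\Lambda(t) &= -\log S(t) = \left(\frac{t}{\alpha}\right)^{\beta}
        && \text{(cumulative hazard)} \\
% Discrete case, t = 0, 1, 2, ...
\Pr(T = t) &= \exp\!\left[-\left(\frac{t}{\alpha}\right)^{\beta}\right]
            - \exp\!\left[-\left(\frac{t+1}{\alpha}\right)^{\beta}\right]
        && \text{(discrete probability mass)}
\end{align*}
```

With $\beta = 1$ the hazard is constant and the continuous case reduces to the exponential distribution; $\beta > 1$ gives a hazard that increases with time, which matches wear-out failures.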
  21. THE WEIBULL TIME TO EVENT RECURRENT NEURAL NETWORK EFFICIENTLY ENABLES PREDICTION OF FUTURE EVENT TIMES.

    https://ragulpr.github.io/2016/12/22/WTTE-RNN-Hackless-churn-modeling/
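The core idea behind this kind of model is that the network outputs the two Weibull parameters at each time step and is trained on a likelihood that also accepts censored samples. The sketch below shows only that per-sample log-likelihood for the continuous case, in plain Python. The function name and signature are assumptions for illustration, not the API of any WTTE-RNN library.

```python
import math


def weibull_log_likelihood(t, observed, alpha, beta):
    """Log-likelihood of one continuous Weibull time-to-event sample.

    t:        time to the event, or time to censoring (t > 0).
    observed: True if the event was seen, False if right-censored.
    alpha:    scale (> 0); beta: shape (> 0).
    """
    cumulative_hazard = (t / alpha) ** beta
    if not observed:
        # Censored: we only know the event lies beyond t, so use log S(t).
        return -cumulative_hazard
    # Observed: log f(t) = log hazard(t) + log S(t).
    log_hazard = (math.log(beta) - math.log(alpha)
                  + (beta - 1.0) * (math.log(t) - math.log(alpha)))
    return log_hazard - cumulative_hazard


print(weibull_log_likelihood(2.0, True, alpha=2.0, beta=1.0))
print(weibull_log_likelihood(2.0, False, alpha=2.0, beta=1.0))
```

A training loss would be the negative sum of these terms over all time steps and sequences, with `alpha` and `beta` produced by the network instead of being passed in as constants.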
  22. TTE DEMO NOTEBOOKS

    ▪ Exploration - NASA JetEngine Failure Analysis: Explain and explore the dataset.
      https://mybinder.org/v2/gh/anofox/m3_konferenz/master?filepath=notebooks%2F10_JetEngine%20Failures%20-%20Exploration%20%26%20Basics.ipynb
    ▪ Time to event prediction - NASA JetEngine Failure Analysis: Train and evaluate the RNN with the adapted Weibull likelihood.
      https://mybinder.org/v2/gh/anofox/m3_konferenz/master?filepath=notebooks%2F11_JetEngine%20Failures%20-%20WTTE-RNN.ipynb
  23. THE FUTURE OF DATA ANALYTICS

    Today: Data Batches, Storage, Business Intelligence, Machine Learning
    Challenges
    ▪ Talent shortage, not automated
    ▪ Reaction is slow
    ▪ Concept drift (model obsolescence)
    Tomorrow: Data Streams, Online Models, Actions & Events
    Advantages
    ▪ Automated model creation
    ▪ Continuous learning
    ▪ Real-time
    ▪ Basis of higher-level analytics, e.g. prediction
  24. TIME SERIES ANALYSIS AS ONLINE LEARNING PROBLEM

    ▪ Fault tolerant
    ▪ Exactly once
    ▪ Event time based
    ▪ Stateful
    ▪ Distributed
    ▪ Parallel
    Diagram: parallel pipelines from Source to Sink, each running Algo. fit and Algo. predict (or Algo. transform) against its own State; labels: Processing Time, Event Time.
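The fit/predict/state pattern from the diagram can be illustrated with a very small stateful detector. The sketch below keeps a running mean and variance (Welford's update) per series and scores each new value before learning from it. The class name, the z-score rule, and the default threshold are illustrative assumptions; the talk's actual algorithms are the ones listed in the algorithm overview.

```python
import math


class OnlineZScoreDetector:
    """Per-series state plus predict/fit steps, updated one event at a time."""

    def __init__(self, threshold=3.0, warmup=10):
        if warmup < 2:
            raise ValueError("warmup must be at least 2")
        self.threshold = threshold
        self.warmup = warmup
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def predict(self, value):
        """Return True if value looks anomalous given the state so far."""
        if self.count < self.warmup:
            return False  # not enough history to judge yet
        std = math.sqrt(self.m2 / (self.count - 1))
        if std == 0.0:
            return value != self.mean
        return abs(value - self.mean) / std > self.threshold

    def fit(self, value):
        """Update the running mean and variance with one observation."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    def process(self, value):
        """Score first, then learn, as a streaming operator would."""
        is_anomaly = self.predict(value)
        self.fit(value)
        return is_anomaly


detector = OnlineZScoreDetector(threshold=3.0, warmup=5)
print([detector.process(v) for v in [10, 11, 9, 10, 11, 9, 10, 50]])
```

Because the whole state is three numbers, it is cheap to checkpoint per key, which is what makes fault-tolerant, distributed execution of many such detectors practical. Whether flagged values should still update the state is a design choice; this sketch always updates.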
  25. ZOI: OUR DNA? DIGITAL.

    ▪ We are growing with experienced minds at our locations in Stuttgart and Berlin.
    ▪ We combine new technologies, tools and methods with our strong implementation competence to address the challenges of our customers.
    ▪ We are computer scientists, electrical engineers, mathematicians, physicists, biologists and business economists.
    ▪ Our technological drive is unbroken: we use part of our working time to try out new technologies.
    ▪ Zoi is a 100% digital subsidiary of Kaercher.
    ZOI IS THE ABBREVIATION FOR ZERO ONE INFINITY: OUR DIGITAL DNA
  26. Meet the AnoFox

    ▪ Scalable, unsupervised, online anomaly detection and time series prediction on business and IoT data.
    ▪ Quickly deployable as a building block in the virtual private clouds of customers. We come to where your data and processing are!
    ▪ We rely on cloud services, open source software and modern data science methods. At its core we rely on battle-tested data analysis. For higher-level intelligence we utilize state-of-the-art machine learning research.
    Use cases in the wild:
    ▪ Predict usage behavior of simple IoT devices, e.g. when the user will next use or activate a function in a product.
    ▪ Predict inventory or parts demand. Focus on high granularity, meaning erratic demand (e.g. spare parts).
  27. CONCLUSIONS

    ▪ Time series are an omnipresent type of data, which is especially interesting for business and IoT applications.
    ▪ There exist powerful algorithms to detect anomalies or predict future data points in an unsupervised setting.
    ▪ We demonstrated on two different datasets how continuous time series and events can be treated.
    ▪ Spark Streaming, open source and the cloud are a decent environment for building streaming anomaly detection and prediction applications.
    ▪ If you want details or examples, or want to see some math or code, have a look at the notebooks or feel free to reach us after the talk or via email/Twitter.
  28. THANK YOU FOR THE OPPORTUNITY TO PRESENT OUR IDEAS!

    Unsupervised Anomaly Detection And Forecasting For Enterprise Time Series
    Joachim's email: [email protected]
    Joachim's Twitter: @jrosskopf
    Simon's email: [email protected]
    Simon's Twitter: @datamue
    WE ARE HIRING! [email protected]