
Semi-Supervised Anomaly Detection

Presentation by Eric Kramer, data scientist @dataiku, from the data science track at #StrataHadoop London 2016

Data Science London

June 03, 2016

Transcript

  1. Me, by the numbers
    > df = data_scientists %>% inner_join(r_users) %>% filter(speaker_at == "meetup")
    > df$name
    [1] "Eric Kramer"
  2. Agenda 30 minutes
    Introduction: Definition, Use Cases, Formalization
    Building an anomaly detector in R: Theory, R code, Result } Iterate 3 times, increasing complexity each time
    Summary: Improvements, Concerns, Questions
  3. What is an anomaly? Anomalies are "unexpected" observations. Global anomalies are unexpected based on
    the entire dataset; local anomalies are unexpected given their context.
  4. Formalization An observation is an anomaly if P(y | x) < ρ, where y represents the
    observation, x represents the context, and ρ is some arbitrary threshold.
  5. Use Case: Anomalous Weather P(y | x) < ρ
    Current weather: • Temperature • Humidity • Pressure
    Current location: • Lat, Long • Altitude • Date • Time
  6. Use case: Bank Fraud P(y | x) < ρ
    Financial transaction: • Origin • Destination • Amount • Medium
    Account history: • Past transactions • Account address • Account balance • Account flux
  7. Use case: EEG P(y | x) < ρ
    EEG reading
    Patient history: • Diagnoses • Interventions / Surgeries • Current medications
  8. Questions P(y | x) < ρ
    How do I choose ρ? Find your tolerance for false positives, e.g. 100 false positives is equivalent to stopping one fraud.
    How do I calculate P(y | x)? Density estimation.
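The "tolerance for false positives" idea can be sketched in base R: set ρ to the quantile of historical density scores that matches your acceptable false-positive rate. The scores here are simulated for illustration; `fp_rate` and `scores` are not names from the deck.

```r
# Sketch: choose the threshold rho from a false-positive tolerance.
# `scores` stands in for density scores on historical, mostly-normal data.
set.seed(42)
scores <- c(rnorm(990, mean = 10, sd = 1),  # typical observations
            runif(10, min = 0, max = 2))    # a handful of genuine outliers

fp_rate <- 0.01                              # tolerate ~1% false positives
rho <- quantile(scores, probs = fp_rate)     # flag the lowest 1% of densities

flagged <- scores < rho
cat("rho:", round(rho, 3), "- flagged", sum(flagged), "of", length(scores), "\n")
```

With 1,000 historical scores and a 1% tolerance, this flags the 10 lowest-density observations.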
  9. Density Estimation We’re going to use Gaussian Mixture Models. Alternatives:

    • Kernel density estimators • Histograms • K-means • Bayesian methods
  10. Goals Weather in Paris, London and NYC for 2010-2015
    Can we find days with anomalous weather? Can we control for the location of the measurement? Can we control for the date?
  11. Theory Gaussian Mixture Models The probability density is the sum
    of a small number of Gaussian distributions.
  12. Theory Gaussian Mixture Models The probability density is the sum
    of a small number of Gaussian distributions:
    P(y) = (1/n) Σᵢ₌₁ⁿ N(μᵢ, σᵢ²)
    where n is the number of Gaussian distributions to use, μᵢ is the mean of the ith Gaussian distribution, and σᵢ² is its variance.
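This mixture density can be evaluated directly in base R with `dnorm`; the toy sketch below (component means, standard deviations, and the evaluation point are made up for illustration) just averages equally weighted Gaussian components:

```r
# Sketch: evaluate P(y) = (1/n) * sum_i N(y; mu_i, sigma_i^2) for a toy mixture.
mu    <- c(-2, 0, 3)    # means of the n = 3 components (illustrative values)
sigma <- c(1, 0.5, 2)   # standard deviations of each component

gmm_density <- function(y) {
  mean(dnorm(y, mean = mu, sd = sigma))  # equal 1/n weights, as in the formula
}

gmm_density(0)   # ~0.306 for these toy parameters
```

An anomaly detector would then flag points y where `gmm_density(y)` falls below the chosen threshold ρ.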
  13. Questions Gaussian Mixture Models P(y) = (1/n) Σᵢ₌₁ⁿ N(μᵢ, σᵢ²)
    How do I find μᵢ and σᵢ²? Maximum likelihood optimization.
    How do I choose n? Fit models for several n and choose the one with the best BIC.
  14. Fitting a GMM Using the mclust package
    library(mclust)                    # Load package and data
    load("./data/weather_data.Rdata")
    gmm = Mclust(df[c("temperature", "humidity")],  # Just two dimensions for now
                 G = seq(1, 6))                     # Try anywhere from 1 to 6 Gaussians in the mixture
    plot(gmm, what = "classification")
  15. Getting a density from a GMM Using the mclust package:
    replace Mclust(…) with densityMclust(…)
  16. What are the most anomalous days? Using the mclust package
    library(dplyr)
    df %>% mutate(score = gmm$density) %>% arrange(score) %>% head(3)

    City      Temperature  Humidity
    New York    5           20
    New York  -12           39
    New York   29           27
  17. P(temperature, humidity | city)
    Option 1: Train one GMM for each city
    Option 2: Regress temperature and humidity on city; build a GMM on the residuals of the model
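Option 2 can be sketched as below. The regression step is plain base R; the final density fit would use densityMclust as on the earlier slides. The data frame here is simulated, and column names simply mirror the slides.

```r
# Sketch of Option 2: regress out the city effect, then model the residuals.
set.seed(1)
df <- data.frame(
  city        = rep(c("Paris", "London", "NYC"), each = 100),
  temperature = rnorm(300, mean = rep(c(12, 10, 14), each = 100), sd = 5),
  humidity    = rnorm(300, mean = 70, sd = 10)
)

# Regress temperature and humidity on city...
fit_t <- lm(temperature ~ city, data = df)
fit_h <- lm(humidity    ~ city, data = df)

# ...and keep the residuals: the variation not explained by location.
resids <- data.frame(temperature = resid(fit_t), humidity = resid(fit_h))

# Then, as on the earlier slide: densityMclust(resids, G = seq(1, 6))  (needs mclust)
```

Because the city means are regressed out, a cold day in Paris and an equally-unusual-for-NYC cold day get comparable residuals, so one GMM can score both.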
  18. Fitting multiple GMMs
    # Wrap training in a function
    train_gmm = function(df){
      densityMclust(df[c("temperature", "humidity")], G = seq(1, 6))
    }
    # Train one GMM per city
    gmms = df %>% nest(-city) %>% mutate(gmm = map(data, train_gmm))
    map(gmms$gmm, plot, what = "density")
  19. What are the most anomalous days? Controlling for location
    gmms %>% mutate(score = map(gmm, "density")) %>% select(-gmm) %>% unnest() %>% arrange(score) %>% head(3)

    City    Temperature  Humidity
    Paris    -6           51
    Paris    -6           51
    London   29           40
  20. Fitting multiple GMMs
    train_gmm = function(df){
      # Include four columns this time
      densityMclust(df[c("temperature", "humidity", "visibility", "wind_speed")], G = seq(1, 6))
    }
    gmms = df %>% nest(-city) %>% mutate(gmm = map(data, train_gmm))
    map(gmms$gmm, plot, what = "density")
  21. What are the most anomalous days? Controlling for location
    gmms %>% mutate(score = map(gmm, "density")) %>% select(-gmm) %>% unnest() %>% arrange(score) %>% head(3)

    City      Temperature  Humidity  Wind Speed  Pressure
    New York   23           49        159         1015     ← Hurricane winds
    London     29           40         16         1012     ← Insanely hot for London
    New York   14           80         30          980     ← Crazy low pressure
  22. Summary
    • Semi-supervised learning attempts to learn the distribution of the data; anomalies are then low-probability observations
    • GMMs provide a quick-and-easy way to estimate probability densities
    • Mclust is an awesome GMM package for R
    • Anomalies are highly dependent on context (i.e. city) and on the variables included in the detector (i.e. temperature, humidity, wind speed, pressure)
  23. Moving to Production
    • Train GMM on "normal" data, updated weekly
    • REST API scores incoming data based on last week's data
    • Threshold chosen based on false-positive tolerance
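The scoring step behind that REST API can be sketched in base R (the API layer and weekly retrain are out of scope here; the function and variable names are made up, and the scores stand in for densities from last week's fitted model):

```r
# Sketch: flag incoming observations whose density score falls below rho.
flag_anomalies <- function(new_scores, rho) {
  data.frame(score = new_scores, anomaly = new_scores < rho)
}

rho <- 0.05   # chosen earlier from the false-positive tolerance
flag_anomalies(c(0.20, 0.01, 0.30), rho)
#   score anomaly
# 1  0.20   FALSE
# 2  0.01    TRUE
# 3  0.30   FALSE
```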