Slide 1

Slide 1 text

Semi-Supervised Anomaly Detection Use cases, theory and hands-on

Slide 2

Slide 2 text

Dataiku: The Numbers
[infographic; extracted figures: x 50, x 1, 48]

Slide 3

Slide 3 text

Me: The Numbers
[infographic: 1/2 + 1/2 = !]

    > df = data_scientists %>% inner_join(r_users) %>% filter(speaker_at == "meetup")
    > df$name
    [1] "Eric Kramer"

Slide 4

Slide 4 text

Agenda (30 minutes)
• Introduction: Definition, Use Cases, Formalization
• Building an anomaly detector in R: Theory, R code, Result (iterate 3 times, increasing complexity each time)
• Summary: Improvements, Concerns
• Questions

Slide 5

Slide 5 text

Code
Everything is on GitHub: https://github.com/erickramer/anomaly_detection_demo

Slide 6

Slide 6 text

Introduction

Slide 7

Slide 7 text

What is an anomaly?
Anomalies are “unexpected” observations.
• Global anomalies are unexpected based on the entire dataset
• Local anomalies are unexpected given their context

Slide 8

Slide 8 text

Use Case: Bank Fraud

Slide 9

Slide 9 text

Use Case: EEG Is this an anomaly?

Slide 10

Slide 10 text

Key Principles
• Anomalies are rare
• Anomalies have little in common with each other

Slide 11

Slide 11 text

Formalization
An observation is an anomaly if $P(y \mid x) < \rho$, where
• $y$ represents the observation
• $x$ represents the context
• $\rho$ is some arbitrary threshold
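The rule above can be sketched in a few lines of base R. This is only a toy illustration: it ignores the context x, uses a single Gaussian as a stand-in density model, and the data and the threshold value `rho` are made up for the example.

```r
# Toy data: 200 typical values plus one extreme value appended at the end.
set.seed(1)
y <- c(rnorm(200, mean = 15, sd = 5), 60)

# Stand-in density model: one Gaussian fitted to all of the data (an assumption;
# the talk uses Gaussian Mixture Models instead).
density_y <- dnorm(y, mean = mean(y), sd = sd(y))

# Arbitrary threshold, chosen by hand for this illustration.
rho <- 1e-4

# An observation is flagged when its estimated density falls below rho.
is_anomaly <- density_y < rho
which(is_anomaly)
```

The extreme value sits far in the tail of the fitted Gaussian, so its density is far below `rho` and it is flagged.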

Slide 12

Slide 12 text

Formalization
$P(y \mid x) < \rho$
$y$: observation, $x$: context

Slide 13

Slide 13 text

Use Case: Anomalous Weather
$P(y \mid x) < \rho$
$y$, current weather:
• Temperature
• Humidity
• Pressure
$x$, current location:
• Lat, Long
• Altitude
• Date
• Time

Slide 14

Slide 14 text

Use Case: Bank Fraud
$P(y \mid x) < \rho$
$y$, financial transaction:
• Origin
• Destination
• Amount
• Medium
$x$, account history:
• Past transactions
• Account address
• Account balance
• Account flux

Slide 15

Slide 15 text

Use Case: EEG
$P(y \mid x) < \rho$
$y$: EEG reading
$x$, patient history:
• Diagnoses
• Interventions / Surgeries
• Current medications

Slide 16

Slide 16 text

Questions
• How do I choose $\rho$? Find your tolerance for false positives, e.g. 100 false positives is equivalent to stopping one fraud.
• How do I calculate $P(y \mid x)$? Density estimation.
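One possible way to turn a false-positive tolerance into a threshold is to take a low quantile of the density scores of data believed to be normal. The sketch below uses base R only; the scores are simulated and the 1% tolerance `fp_rate` is an assumed value for illustration.

```r
set.seed(42)

# Stand-in for density scores of "normal" observations (simulated here).
normal_scores <- dnorm(rnorm(10000))

# Tolerated false-positive rate (an assumption; pick it from your own cost trade-off).
fp_rate <- 0.01

# Threshold: the score below which roughly fp_rate of the normal data falls.
rho <- quantile(normal_scores, probs = fp_rate, names = FALSE)

# Share of normal observations that would be flagged with this threshold.
flagged_share <- mean(normal_scores < rho)
flagged_share
```

The flagged share should land close to `fp_rate`, which is the point of choosing the threshold this way.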

Slide 17

Slide 17 text

Density Estimation
We’re going to use Gaussian Mixture Models.
Alternatives:
• Kernel density estimators
• Histograms
• K-means
• Bayesian methods

Slide 18

Slide 18 text

Our data Weather in Paris, London and NYC for 2010-2015

Slide 19

Slide 19 text

Goals
Weather in Paris, London and NYC for 2010-2015
• Can we find days with anomalous weather?
• Can we control for the location of the measurement?
• Can we control for the date?

Slide 20

Slide 20 text

Gaussian Mixture Models

Slide 21

Slide 21 text

Theory: Gaussian Mixture Models
The probability density is the sum of a small number of Gaussian distributions.
[figure: mixture density = Gaussian + Gaussian]

Slide 22

Slide 22 text

Theory: Gaussian Mixture Models
The probability density is the sum of a small number of Gaussian distributions:
$$P(y) = \frac{1}{n} \sum_{i=1}^{n} \mathcal{N}(\mu_i, \sigma_i^2)$$
where
• $n$ is the number of Gaussian distributions to use
• $\mathcal{N}$ is the Gaussian distribution
• $\mu_i$ is the mean of the $i$th Gaussian distribution
• $\sigma_i^2$ is the variance of the $i$th Gaussian distribution
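The formula above can be evaluated directly in base R. The sketch below is one-dimensional and uses equal weights 1/n as on the slide; the function name `gmm_density` and the parameter values are made up for illustration. Note that mclust estimates the mixing proportions from the data rather than fixing them at 1/n.

```r
# Equal-weight Gaussian mixture density: the average of n Gaussian densities.
# gmm_density is a hypothetical helper, not part of any package.
gmm_density <- function(y, mu, sigma2) {
  # One column per mixture component, one row per value of y.
  comps <- sapply(seq_along(mu), function(i) {
    dnorm(y, mean = mu[i], sd = sqrt(sigma2[i]))
  })
  rowMeans(matrix(comps, nrow = length(y)))
}

# Two components with made-up means and variances.
mu <- c(5, 20)
sigma2 <- c(4, 9)

# Density at three points: at each mean and in the gap between them.
gmm_density(c(5, 12, 20), mu, sigma2)

# Sanity check: a density should integrate to (approximately) 1.
total <- integrate(gmm_density, lower = -50, upper = 80,
                   mu = mu, sigma2 = sigma2)$value
total
```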

Slide 23

Slide 23 text

Questions: Gaussian Mixture Models
$$P(y) = \frac{1}{n} \sum_{i=1}^{n} \mathcal{N}(\mu_i, \sigma_i^2)$$
• How do I find $\mu_i$ and $\sigma_i^2$? Maximum likelihood optimization.
• How do I choose $n$? Fit models for several $n$ and choose the one with the best BIC.
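As a rough sketch of BIC-based selection, mclust can fit several candidate sizes in one call and keep the best one. The example uses R's built-in `faithful` dataset as a stand-in, since the weather data is not reproduced here. Note that mclust defines BIC so that larger values are better.

```r
library(mclust)

# Fit mixtures with 1 to 6 components; Mclust keeps the model with the best BIC.
fit <- Mclust(faithful, G = 1:6)

fit$G    # number of components in the selected model
fit$bic  # BIC of the selected model
fit$BIC  # BIC of every candidate (rows: number of components, columns: covariance models)
```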

Slide 24

Slide 24 text

Our first model

Slide 25

Slide 25 text

Fitting a GMM
Using the mclust package

    # Load package and data
    library(mclust)
    load("./data/weather_data.Rdata")

    # Just two dimensions for now; try anywhere from 1 to 6 Gaussians in the mixture
    gmm = Mclust(df[c("temperature", "humidity")], G=seq(1,6))
    plot(gmm, what="classification")

Slide 26

Slide 26 text

Fitting a GMM Using the mclust package

Slide 27

Slide 27 text

Fitting a GMM Using the mclust package

Slide 28

Slide 28 text

Getting a density from a GMM Using the mclust package Mclust(…) => densityMclust(…)
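A minimal sketch of the swap, again using the built-in `faithful` dataset as a stand-in for the weather data: `densityMclust` is called like `Mclust`, and the fitted object carries the estimated density of each training row in `$density`.

```r
library(mclust)

# Same call shape as Mclust(), but the result is a density estimate.
dens <- densityMclust(faithful, G = 1:6)

# One density value per training observation; low values are candidate anomalies.
head(dens$density)
length(dens$density) == nrow(faithful)
```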

Slide 29

Slide 29 text

What are the most anomalous days?
Using the mclust package

    library(dplyr)

    df %>%
      mutate(score = gmm$density) %>%
      arrange(score) %>%
      head(3)

City      Temperature  Humidity
New York            5        20
New York          -12        39
New York           29        27

Slide 30

Slide 30 text

Choosing a Threshold Using the mclust package

Slide 31

Slide 31 text

What about controlling for the location of measurement?

Slide 32

Slide 32 text

Controlling for City

Slide 33

Slide 33 text

$P(\text{temperature}, \text{humidity} \mid \text{city})$
• Option 1: Train one GMM for each city
• Option 2: Regress temperature and humidity on city, then build a GMM on the residuals of the model
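The talk goes on to use Option 1. For comparison, Option 2 might look roughly like the sketch below. The data frame is synthetic (the city means and spreads are invented), and using `lm` with a two-column response is just one way to do the regression step.

```r
library(mclust)
set.seed(1)

# Synthetic stand-in for the weather data (all values are made up).
df <- data.frame(
  city = rep(c("London", "NYC", "Paris"), each = 200),
  temperature = rnorm(600, mean = rep(c(11, 13, 12), each = 200), sd = 6),
  humidity = rnorm(600, mean = rep(c(75, 62, 72), each = 200), sd = 10)
)

# Step 1: regress temperature and humidity on city (removes each city's mean).
fit <- lm(cbind(temperature, humidity) ~ city, data = df)
res <- residuals(fit)  # matrix with one row per observation, one column per response

# Step 2: build a single GMM density on the residuals.
gmm_res <- densityMclust(res, G = 1:6)

# Lowest-density residuals are the most anomalous days after controlling for city.
df$score <- gmm_res$density
head(df[order(df$score), ], 3)
```

A regression on city alone only shifts each city's mean; differences in variability between cities are left for the GMM to absorb.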

Slide 34

Slide 34 text

Fitting multiple GMMs

    library(mclust)
    library(dplyr)
    library(tidyr)
    library(purrr)

    # Wrap training in a function
    train_gmm = function(df){
      densityMclust(df[c("temperature", "humidity")], G=seq(1,6))
    }

    # Train one GMM per city
    gmms = df %>%
      nest(data = -city) %>%
      mutate(gmm = map(data, train_gmm))

    map(gmms$gmm, plot, what="density")

Slide 35

Slide 35 text

Raw Data
Weather in NYC is much more variable
[figure panels: London, NYC, Paris]

Slide 36

Slide 36 text

Multiple GMMs
Weather in NYC is much more variable
[figure panels: London, NYC, Paris]

Slide 37

Slide 37 text

What are the most anomalous days?
Controlling for location

    gmms %>%
      mutate(score = map(gmm, "density")) %>%
      select(-gmm) %>%
      unnest(c(data, score)) %>%
      arrange(score) %>%
      head(3)

City      Temperature  Humidity
Paris              -6        51
Paris              -6        51
London             29        40

Slide 38

Slide 38 text

Choosing a Threshold Using the mclust package

Slide 39

Slide 39 text

Increasing the Dimensionality

Slide 40

Slide 40 text

Fitting multiple GMMs

    # Include four columns this time
    train_gmm = function(df){
      densityMclust(df[c("temperature", "humidity", "visibility", "wind_speed")], G=seq(1,6))
    }

    gmms = df %>%
      nest(data = -city) %>%
      mutate(gmm = map(data, train_gmm))

    map(gmms$gmm, plot, what="density")

Slide 41

Slide 41 text

High-dimensional densities Visualizing more than 2 dimensions

Slide 42

Slide 42 text

High-dimensional densities
Visualizing more than 2 dimensions
[figure panels: London, NYC, Paris]

Slide 43

Slide 43 text

High-dimensional densities Paris

Slide 44

Slide 44 text

High-dimensional densities London

Slide 45

Slide 45 text

High-dimensional densities New York

Slide 46

Slide 46 text

What are the most anomalous days?
Controlling for location

    gmms %>%
      mutate(score = map(gmm, "density")) %>%
      select(-gmm) %>%
      unnest(c(data, score)) %>%
      arrange(score) %>%
      head(3)

City      Temperature  Humidity  Wind Speed  Pressure  Note
New York           23        49         159      1015  Hurricane winds
London             29        40          16      1012  Insanely hot for London
New York           14        80          30       980  Crazy low pressure

Slide 47

Slide 47 text

Choosing a Threshold Using the mclust package

Slide 48

Slide 48 text

Summary

Slide 49

Slide 49 text

Summary: Semi-supervised learning
• Semi-supervised learning attempts to learn the distribution of the data. Anomalies are then low-probability observations
• GMMs provide a quick-and-easy way to estimate probability densities
• mclust is an awesome GMM package for R
• Anomalies are highly dependent on context (e.g. city) and on the variables included in the detector (e.g. temperature, humidity, wind speed, pressure)

Slide 50

Slide 50 text

Moving to Production: Semi-supervised learning
• Train GMM on “normal” data, updated weekly
• REST API scores incoming data based on last week’s data
• Threshold chosen based on false positive tolerance
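The scoring step behind such an API could be sketched as follows. Everything here is illustrative: the training data is synthetic, `score_incoming` is a hypothetical helper that an API handler might call, and the 1% tolerance is an assumed value.

```r
library(mclust)
set.seed(7)

# Stand-in for last week's "normal" observations (synthetic values).
train <- data.frame(
  temperature = rnorm(500, mean = 15, sd = 5),
  humidity = rnorm(500, mean = 70, sd = 10)
)

# Weekly training step: fit the density model on normal data.
model <- densityMclust(train, G = 1:6)

# Threshold from an assumed 1% false-positive tolerance on the training scores.
rho <- quantile(model$density, probs = 0.01, names = FALSE)

# Hypothetical scoring function: density under the trained model, flagged if below rho.
score_incoming <- function(model, newdata, rho) {
  dens <- predict(model, newdata = newdata)
  data.frame(newdata, score = dens, anomaly = dens < rho)
}

# Two incoming observations: one typical, one far outside the training range.
incoming <- data.frame(temperature = c(16, 45), humidity = c(68, 5))
result <- score_incoming(model, incoming, rho)
result
```

Keeping training and scoring separate means the API only needs the saved model and the threshold, both of which can be refreshed by the weekly job.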