Real-time anomaly detection with Cassandra, Spark ML and Akka by Natalino Busa at Big Data Spain 2015

Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
Natalino Busa, Data Platform Architect at ING

Distributed computing, Machine Learning, Statistics, Big/Fast Data, Streaming Computing
@natalinobusa | linkedin.com/in/natalinobusa

@natbusa | linkedin: Natalino Busa ING group http://www.ing.com/About-us/Purpose-Strategy.htm

ING group
Empowering people to stay a step ahead in life and in business.
http://www.ing.com/About-us/Purpose-Strategy.htm

ING group
Clear and Easy | Anytime, Anywhere | Empower | Keep getting better
http://www.ing.com/About-us/Purpose-Strategy.htm

ING group: Data Principles
- Apply advanced, predictive analytics on live data
- Event-Driven and exposed via APIs
- Lean Architecture, Easy to integrate
- Available, Consistent, Streaming, Real-time Data
- Resilient, Distributed, Scalable, Maintainable
Clear and Easy | Anytime, Anywhere | Empower | Keep getting better

Big Data and Fast Data
[Figure: population (events, transactions, sessions, customers, etc.) against time, with ticks from 10 yrs to 1 min; regions labelled historical big data, recent data, event streams]

Why Fast Data?
1. Relevant up-to-date information.
2. Delivers actionable events.

Why Big Data?
1. Analyze and model
2. Learn, cluster, categorize, organize facts

[Architecture diagram: Distributed Data Store, Real Time APIs, Streaming Data, Data Sources, Files, DB extracts, Batched Data, API for mobile and web. Caption: Training, Scoring and Exposing models]

[Architecture diagram: Distributed Data Store, Fast Analytics, Real Time APIs, Streaming Data, Data Modeling, Data Sources, Files, DB extracts, Batched Data, API for mobile and web. Caption: Training, Scoring and Exposing models. Labels: read the data, write the model]

[Architecture diagram: Analytics, Event Processing, Real Time APIs, Streaming Data, Data Modeling, Data Sources, Files, DB extracts, Batched Data, Alerts and Notifications, API for mobile and web. Caption: Training, Scoring and Exposing models. Labels: read the model, read the data, write the model]

Cassandra+Akka+Spark: Machine Learning
C*: Fast writes; 2D Data Structure; Replicated; Tunable consistency; Multi-Data centers
Akka: Very Fast processing; Distributed, Scalable computing; Actor-based Pipelines; Actor state can be persisted; Supervision strategies
Spark: Ad-Hoc Queries; Joins, Aggregate; User Defined Functions; Machine Learning, Advanced Stats and Analytics

Akka-Cassandra-Spark Stack
[Stack diagram. Components: Cassandra; Cassandra-Spark Connector; Spark (Streaming, SQL, MLlib, GraphX); Akka. Labels: Extract Data; Create Models, Enrich, Transform; Fetch from other Sources: Kafka; Fetch from other Sources: DB's, Files; Analytics, Statistics, Data Science, Model Training; Access Model; Persist Actors' State]

Cassandra-Spark Connector
Cassandra: Store all the data
Spark: Analyze all the data
DC1: replication factor 3
DC2: replication factor 3
DC3: replication factor 3 + Spark Executors
Storage! Analytics! Data

Cassandra-Spark Stack
Cassandra: Store all the data
Spark: Distributed Data Processing, Executors and Workers
Cassandra-Spark Connector: Data locality, Reduce Shuffling, RDD's to Cassandra Partitions
DC3: replication factor 3 + Spark Executors

Data Science: Anomaly Detection
"An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism." (Hawkins, 1980)

Data Science: Anomaly Detection (1)
- Parametric Based
- Gaussian Model Based
- Histogram, nonparametric
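As an illustration of the Gaussian model based approach named on this slide, the sketch below fits a mean and standard deviation to a sample and flags values that lie too many standard deviations away. The object name, the method names and the default threshold of 3 are assumptions made for this example; they do not come from the talk.

```scala
// Minimal sketch of a parametric (Gaussian) outlier test.
// All names and the default threshold are illustrative assumptions.
object GaussianOutliers {
  /** Mean and population standard deviation of a non-empty sample. */
  def fit(xs: Seq[Double]): (Double, Double) = {
    require(xs.nonEmpty, "need at least one observation")
    val mean = xs.sum / xs.size
    val variance = xs.map(x => (x - mean) * (x - mean)).sum / xs.size
    (mean, math.sqrt(variance))
  }

  /** True when x lies more than `threshold` standard deviations from the mean. */
  def isOutlier(x: Double, mean: Double, sd: Double, threshold: Double = 3.0): Boolean =
    sd > 0 && math.abs(x - mean) / sd > threshold
}
```

A fixed threshold only makes sense when the data is roughly bell-shaped; for multimodal data the histogram or clustering approaches on the following slides are a better fit.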

Data Science: Anomaly Detection (2)
- Distance Based
- Density Based

Example: Analyze Gowalla check-ins
Check-ins dataset
 year | month | day | time | uid  | lat      | lon       | ts                       | vid
------+-------+-----+------+------+----------+-----------+--------------------------+--------
 2010 |     9 |  14 |   91 |  853 | 40.73474 | -73.87434 | 2010-09-14 00:01:31+0000 | 917955
 2010 |     9 |  14 |  328 | 4516 | 40.72585 | -73.99289 | 2010-09-14 00:05:28+0000 |  37160
 2010 |     9 |  14 |  344 | 2964 | 40.67621 | -73.98405 | 2010-09-14 00:05:44+0000 | 956870
Venues dataset
 vid     | name                       | lat      | long
---------+----------------------------+----------+-----------
  754108 | My Suit NY                 | 40.73474 | -73.87434
  249755 | UA Court Street Stadium 12 | 40.72585 | -73.99289
 6919688 | Sky Asian Bistro           | 40.67621 | -73.98405

Data Science: clustering venues
Intuition: Weekly visitors patterns!
- Madison Square, Apple Store, Radio City Music Hall: Thursdays, Fridays, Saturdays are busy
- Statue of Liberty, Jacob K. Javits Convention Center, Whole Foods Market (Columbus Circle): not popular midweek

Data Science: clustering with k-means
Intuition:
- Histogram components as dimensions
- Similar histograms would occupy similar places in the feature space
How do I compare histograms?
- EMD (earth mover's distance)
- Chi-squared distance
- Space transformation (DCT)
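Of the three comparison options listed, the chi-squared distance is the simplest to write down. The sketch below uses one common form, 0.5 * sum((a_i - b_i)^2 / (a_i + b_i)), on histograms with the same number of bins. The object and method names are assumptions for this example, and the talk does not say which variant of the distance was used.

```scala
// Minimal sketch of the chi-squared distance between two histograms.
// Names are illustrative; other variants of this distance exist.
object HistogramDistance {
  /** 0.5 * sum((a_i - b_i)^2 / (a_i + b_i)), skipping bins that are empty in both. */
  def chiSquared(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "histograms must have the same number of bins")
    0.5 * a.zip(b).collect {
      case (x, y) if x + y > 0 => (x - y) * (x - y) / (x + y)
    }.sum
  }
}
```

Identical histograms have distance 0, and two normalized histograms with no overlapping bins have distance 1.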

K-Means: Featurize data + cluster
    import org.apache.spark.mllib.clustering.KMeans
    // PairRDDs, weekly patterns per venue
    val weekly_visits = checkins_venues.select("vid", "ts")
      .map(row => (row.getLong("vid"), vectorize_time(row.getTimestamp("ts"))))
      .reduceByKey(_ + _)
      .mapValues(h => featurize_histogram(h))
    // cluster similar weekly patterns
    val numClusters = 15
    val numIterations = 100
    // KMeans.train expects an RDD[Vector], so only the feature vectors are passed
    val clusters = KMeans.train(weekly_visits.values, numClusters, numIterations)
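The slide does not show `vectorize_time` or `featurize_histogram`. The sketch below is one possible shape for such helpers, under the assumption that a weekly pattern is a 7-bin histogram with one bin per weekday; the talk may well have used finer bins (for example hourly). It is plain Scala with no Spark dependency, and every name in it is hypothetical.

```scala
import java.time.{Instant, ZoneOffset}

// Hypothetical helpers: a 7-bin weekly histogram per venue (assumed granularity).
object WeeklyHistogram {
  /** One-hot vector over weekdays (index 0 = Monday) for one check-in, in UTC. */
  def vectorizeTime(ts: Instant): Array[Double] = {
    val v = Array.fill(7)(0.0)
    val day = ts.atZone(ZoneOffset.UTC).getDayOfWeek.getValue // 1 = Monday .. 7 = Sunday
    v(day - 1) = 1.0
    v
  }

  /** Element-wise sum, the role played by `_ + _` in reduceByKey. */
  def add(a: Array[Double], b: Array[Double]): Array[Double] =
    a.zip(b).map { case (x, y) => x + y }

  /** Scales counts to fractions of the total so busy and quiet venues are comparable. */
  def featurize(h: Array[Double]): Array[Double] = {
    val total = h.sum
    if (total == 0.0) h else h.map(_ / total)
  }
}
```

Normalizing before clustering means k-means groups venues by the shape of their week rather than by their overall popularity.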

Assigning venues to clusters
    // score and assign each venue to a cluster
    val venues_clustered = checkins_venues.select("vid", "ts").where("ts > dateof(now())")
      .map(row => (row.getLong("vid"), vectorize_time(row.getTimestamp("ts"))))
      .reduceByKey(_ + _)
      .mapValues(h => clusters.predict(featurize_time(h)))
    // Store the cluster centers in Cassandra
    venues_clustered.saveToCassandra("lbsn", "venues_cl")

How to use it
1) Classification: Classify venues to given groups
2) Anomaly Detection: Detect shift in the clustering assignment for a given venue for a given week. Keep monitoring weekly change in patterns; when it happens, trigger a signal.
[Figure: week 26, week 27, alert!]
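The week-over-week check described in point 2 can be sketched as a comparison of two assignment maps. The code below assumes each week's assignments are available as a map from venue id to cluster id; the object and method names are hypothetical and not taken from the talk.

```scala
// Minimal sketch: report venues whose cluster id changed between two weeks.
// The Map[vid -> clusterId] representation is an assumption for this example.
object ClusterShift {
  /** Venues present in both weeks whose cluster changed: vid -> (old, new). */
  def shifted(prev: Map[Long, Int], curr: Map[Long, Int]): Map[Long, (Int, Int)] =
    curr.toSeq.flatMap { case (vid, newCluster) =>
      prev.get(vid)
        .filter(_ != newCluster)
        .map(oldCluster => (vid, (oldCluster, newCluster)))
    }.toMap
}
```

Venues that appear only in the current week are ignored here, since there is no earlier assignment to compare against; whether those should raise an alert is a separate design choice.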

Data Science: clustering users’ venues
Intuition:
- Users tend to stick to the same places
- People have habits
- By clustering the places together we can identify anomalous locations
- Size of the cluster matters: more points means less anomalous
- Mini-clusters and single anomalies are treated in similar ways
...

Data Science: clustering with DBSCAN
- DBSCAN finds clusters based on neighbouring density
- Does not require the number of clusters k beforehand
- Clusters need not be spherical
It’s a graph!

    // collect each user's check-in coordinates, then cluster them per user
    val locs = checkins_venues.select("uid", "lat", "lon")
      .map(s => (s.getLong(0), Seq((s.getDouble(1), s.getDouble(2)))))
      .reduceByKey(_ ++ _)
      .mapValues(dbscan cluster _)
Have a look at: scalanlp/nak
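To make the per-user step concrete, here is a small standalone DBSCAN sketch in plain Scala. It is not the scalanlp/nak implementation and does not reproduce its API; the object name, the parameters `eps` and `minPts`, and the use of plain Euclidean distance on (lat, lon) pairs are all assumptions for this example. It uses a brute-force neighbour search, which is only reasonable for the limited number of points a single user has.

```scala
import scala.collection.mutable

// Standalone DBSCAN sketch (not the nak API); O(n^2) neighbour search.
object SimpleDbscan {
  type Point = (Double, Double)
  val Noise = -1
  private val Unvisited = -2

  private def dist(a: Point, b: Point): Double =
    math.hypot(a._1 - b._1, a._2 - b._2)

  /** Returns one label per point: a cluster id >= 0, or Noise (-1). */
  def cluster(points: IndexedSeq[Point], eps: Double, minPts: Int): Array[Int] = {
    val labels = Array.fill(points.length)(Unvisited)

    // Indices of all points within eps of point i (including i itself).
    def neighbours(i: Int): Seq[Int] =
      points.indices.filter(j => dist(points(i), points(j)) <= eps)

    var clusterId = -1
    for (i <- points.indices) {
      if (labels(i) == Unvisited) {
        val seeds = neighbours(i)
        if (seeds.size < minPts) {
          labels(i) = Noise // may later be relabelled as a border point
        } else {
          clusterId += 1
          labels(i) = clusterId
          val queue = mutable.Queue(seeds: _*)
          while (queue.nonEmpty) {
            val j = queue.dequeue()
            if (labels(j) == Noise) {
              labels(j) = clusterId // border point: joins but does not expand
            } else if (labels(j) == Unvisited) {
              labels(j) = clusterId
              val next = neighbours(j)
              if (next.size >= minPts) queue ++= next // core point: keep expanding
            }
          }
        }
      }
    }
    labels
  }
}
```

Points labelled -1 belong to no dense region and are the candidates for anomalous locations in the sense of the intuition slide.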

Data Science: Two ways to find anomalies with clustering
- Cluster a big amount of data with k-means and histograms
- Apply clustering independently to millions of users, to identify each user's patterns with the DBSCAN algorithm

MLlib vs PairRDDs
MLlib: KMeans.train(FeaturesRDD, numClusters, numIterations)
- Truly distributed algorithm
- Classify venues to given groups
- Millions of datapoints
- Limited amount of clusters
PairRDDs: UserFeaturesPairRDD.groupByKey().mapValues( dbscan cluster _ )
- RDDs map functions
- Parallelism easy to exploit
- The function runs locally for each Key
- Pick your fav machine learning algorithms
- Limited nr of points
- Running in parallel for millions of Keys

Training vs Scoring: Time budgets
• Akka: millisecond response
• Spark: in-memory (big)-data models
[Figure: Model Training against Model Scoring, each axis from slow (minutes) to fast (millisecs). Regions: Train: Spark, Score: Spark; Train: Spark, Score: Akka; Train: Akka, Score: Akka]

Which processing? .. and the granularity of data?
What is the latency and the throughput?

It’s all about latency!
- Map Reduce: Big Data, Batch based
- RDDs: Big Data, Micro-Batch Based
- CRDT’s + Monoids: Fast Data, Event Based
- MillWheel: Fast + Big Data, Event & Window

Coral: Web API for dynamic data flows
[Diagram: Akka, Mixed Load, Cassandra Cluster]

Coral: Streaming data via Web APIs
Data events
    POST http://coral/api/actors/23/in
    { "amount": 23.45, "user": 76232, "city": "Berlin" }

Coral: Streaming data via Web APIs
Trigger, Emit, Params, State
    GET http://coral/api/actors/23
    {
      "actors": {
        "def": { "type": "stats", "params": { "field": "amount" } },
        "state": { "count": 134, "avg": 39.84, "min": 1.99, "max": 204.19, "sd": 38.01 }
      }
    }
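The `state` object above holds a count, average, minimum, maximum and standard deviation for the `amount` field. The sketch below shows one way such a state could be maintained one event at a time, using Welford's online update. It is not Coral's implementation: the class name and fields are assumptions for this example, and whether Coral reports a population or a sample standard deviation is not stated in the slides (the population form is used here).

```scala
// Sketch of an incrementally updated statistics state (not Coral's code).
final case class RunningStats(
    count: Long = 0L,
    mean: Double = 0.0,
    m2: Double = 0.0, // running sum of squared deviations from the mean
    min: Double = Double.PositiveInfinity,
    max: Double = Double.NegativeInfinity) {

  /** Returns the state after observing x (Welford's online update). */
  def update(x: Double): RunningStats = {
    val n = count + 1
    val delta = x - mean
    val newMean = mean + delta / n
    RunningStats(n, newMean, m2 + delta * (x - newMean), math.min(min, x), math.max(max, x))
  }

  /** Population standard deviation; 0.0 before any event has arrived. */
  def sd: Double = if (count == 0) 0.0 else math.sqrt(m2 / count)
}
```

Because each update returns a new immutable value, the state fits naturally inside an actor that replaces its state on every incoming event, for example `events.foldLeft(RunningStats())(_ update _)` over a batch.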

Coral: Web API for dynamic data flows
• a web api to define/manage/run streaming data-flows
• open source and community managed
• event processing as a service
• connect to Cassandra to access models
• connect to kafka to consume and produce events
Steven Raemaekers, Jasper van Zandbeek, Ger van Rossum, Hoda Alemi, Koen Verschuren

Summary:
[Architecture diagram: Analytics, Event Processing, Real Time APIs, Streaming Data, Data Modeling, Data Sources, Files, DB extracts, Batched Data, Alerts and Notifications, API for mobile and web. Labels: read the model, read the data, write the model]

Feedback to the community:
More Algorithms for machine learning!
- DBSCAN, OPTICS, PAM
- More metrics, non-euclidean spaces, etc
- Non distributed algorithms: more scalanlp integration?
Streaming all the way:
- Unify batch (Spark) and event streaming (Akka) computing

Thanks!
- Vision and strategy on an event-driven bank
- ING CIO management team and awesome colleagues
Spark, Cassandra, Akka communities!

Resources
Coral: event processing webapi
https://github.com/coral-streaming/coral
Spark + Cassandra: Clustering Events
http://www.natalinobusa.com/2015/07/clustering-check-ins-with-spark-and.html
Spark: Machine Learning, SQL frames
https://spark.apache.org/docs/latest/mllib-guide.html
https://spark.apache.org/docs/latest/sql-programming-guide.html
Datastax: Analytics and Spark connector
http://www.slideshare.net/doanduyhai/spark-cassandra-connector-api-best-practices-and-usecases
http://docs.datastax.com/en/datastax_enterprise/4.7/datastax_enterprise/anaHome/anaHome.html
Anomaly Detection
Chandola, V.; Banerjee, A.; Kumar, V. (2009). "Anomaly detection: A survey". ACM Computing Surveys 41 (3): 1. doi:10.1145/1541880.1541882.

Resources
Datasets
https://snap.stanford.edu/data/loc-gowalla.html
E. Cho, S. A. Myers, J. Leskovec. Friendship and Mobility: User Movement in Location-Based Social Networks. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2011
https://code.google.com/p/locrec/downloads/detail?name=gowalla-dataset.zip
The project is being developed in the context of the SInteliGIS project financed by the Portuguese Foundation for Science and Technology (FCT) through project grant PTDC/EIA-EIA/109840/2009.
Pictures:
"DBSCAN-density-data" by Chire - Own work. Licensed under CC BY-SA 3.0 via Commons - https://commons.wikimedia.org/wiki/File:DBSCAN-density-data.svg#/media/File:DBSCAN-density-data.svg
"DBSCAN-Illustration" by Chire - Own work. Licensed under CC BY-SA 3.0 via Commons - https://commons.wikimedia.org/wiki/File:DBSCAN-Illustration.svg#/media/File:DBSCAN-Illustration.svg
"Multimodal" by Visnut - Own work. Licensed under CC BY-SA 4.0 via Commons - https://commons.wikimedia.org/wiki/File:Multimodal.png#/media/File:Multimodal.png
"Standard deviation diagram" by Mwtoews - Own work, based (in concept) on figure by Jeremy Kemp, on 2005-02-09. Licensed under CC BY 2.5 via Commons - https://commons.wikimedia.org/wiki/File:Standard_deviation_diagram.svg#/media/File:Standard_deviation_diagram.svg
"Michelsonmorley-boxplot" by User:Schutz - Own work. Licensed under Public Domain via Commons - https://commons.wikimedia.org/wiki/File:Michelsonmorley-boxplot.svg#/media/File:Michelsonmorley-boxplot.svg
