Real-time anomaly detection with Cassandra, Spark ML and Akka by Natalino Busa at Big Data Spain 2015


Banks are innovating. The purpose of this innovation is to transform bank services into meaningful and frictionless customer experiences. A key element in achieving that ambitious goal is providing well-tailored, reactive APIs as the building blocks for greater and smoother customer journeys and experiences. For these APIs to work, internal processes have to evolve as well, from batch processing to real-time event processing.

Session presented at Big Data Spain 2015 Conference
16th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/program/fri/slot-35.html


Big Data Spain

October 21, 2015

Transcript

  1. None
  2. Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra

    Natalino Busa, Data Platform Architect at ING
  3. Distributed computing, Machine Learning, Statistics, Big/Fast Data, Streaming Computing

    @natalinobusa | linkedin.com/in/natalinobusa
  4. @natbusa | linkedin: Natalino Busa ING group http://www.ing.com/About-us/Purpose-Strategy.htm

  5. ING group

    Empowering people to stay a step ahead in life and in business. http://www.ing.com/About-us/Purpose-Strategy.htm
  6. ING group

    Clear and Easy. Anytime, Anywhere. Empower. Keep getting better. http://www.ing.com/About-us/Purpose-Strategy.htm
  7. ING group: Data Principles

    - Apply advanced, predictive analytics on live data
    - Event-driven and exposed via APIs
    - Lean architecture, easy to integrate
    - Available, consistent, streaming, real-time data
    - Resilient, distributed, scalable, maintainable

    Clear and Easy. Anytime, Anywhere. Empower. Keep getting better.
  8. Big Data and Fast Data

    [Figure: population (events, transactions, sessions, customers, etc.) plotted against time, with ticks at 10 yrs, 5 yrs, 1 yr, 1 month, 1 day, 1 hour and 1 m; regions labelled event streams, recent data and historical big data]
  9. Why Fast Data?

    1. Relevant up-to-date information.
    2. Delivers actionable events.
  10. Why Big Data?

    1. Analyze and model
    2. Learn, cluster, categorize, organize facts
  11. Training, Scoring and Exposing models

    [Diagram: Distributed Data Store; Real Time APIs; Streaming Data; Data Sources, Files, DB extracts; Batched Data; API for mobile and web]
  12. Training, Scoring and Exposing models

    [Diagram: Distributed Data Store; Fast Analytics; Real Time APIs; Streaming Data; Data Modeling; Data Sources, Files, DB extracts; Batched Data; API for mobile and web. Arrows: read the data, write the model]
  13. Training, Scoring and Exposing models

    [Diagram: Distributed Data Store; Fast Analytics; Event Processing; Real Time APIs; Streaming Data; Data Modeling; Data Sources, Files, DB extracts; Batched Data; Alerts and Notifications; API for mobile and web. Arrows: read the model, read the data, write the model]
  14. Cassandra+Akka+Spark: Machine Learning

    - C*: fast writes; 2D data structure; replicated; tunable consistency; multi-data centers
    - Akka: very fast processing; distributed, scalable computing; actor-based pipelines; actor state can be persisted; supervision strategies
    - Spark: ad-hoc queries; joins, aggregate; user defined functions; machine learning, advanced stats and analytics
  15. Akka-Cassandra-Spark Stack

    [Diagram: Cassandra; Cassandra-Spark Connector; Spark (Streaming, SQL, MLlib, GraphX); Akka. Labels: Extract Data; Create Models, Enrich, Transform; Fetch from other Sources: Kafka; Fetch from other Sources: DBs, Files; Analytics, Statistics, Data Science, Model Training; Access Model; Persist Actors’ State]
  16. Cassandra-Spark Connector

    Cassandra: Store all the data. Spark: Analyze all the data.

    [Diagram: DC1: replication factor 3; DC2: replication factor 3; DC3: replication factor 3 + Spark Executors. Labels: Storage! Analytics! Data]
  17. Cassandra-Spark Stack

    - Cassandra: Store all the data
    - Spark: Distributed data processing, executors and workers
    - Cassandra-Spark Connector: data locality, reduce shuffling, RDDs to Cassandra partitions

    [Diagram: DC3: replication factor 3 + Spark Executors]
  18. Data Science: Anomaly Detection

    "An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism." Hawkins, 1980
  19. Data Science: Anomaly Detection (1)

    Parametric Based Gaussian Model Based Histogram, nonparametric
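For the parametric, Gaussian case named on this slide, a minimal sketch in plain Scala could look as follows. It is not taken from the talk: `GaussianDetector`, its method names and the 3-sigma default threshold are assumptions made for illustration, and it covers only a single numeric feature.

```scala
// Parametric (Gaussian) anomaly check: flag values that lie far from the mean.
object GaussianDetector {
  // Fitted parameters of a one-dimensional Gaussian model.
  final case class Model(mean: Double, stdDev: Double)

  // Estimate the mean and the population standard deviation from training data.
  def fit(xs: Seq[Double]): Model = {
    require(xs.nonEmpty, "need at least one observation")
    val mean = xs.sum / xs.size
    val variance = xs.map(x => (x - mean) * (x - mean)).sum / xs.size
    Model(mean, math.sqrt(variance))
  }

  // Distance between x and the mean, measured in standard deviations.
  def zScore(m: Model, x: Double): Double =
    if (m.stdDev == 0.0) {
      if (x == m.mean) 0.0 else Double.PositiveInfinity
    } else math.abs(x - m.mean) / m.stdDev

  // `threshold` is an assumed cut-off; 3 standard deviations is a common default.
  def isAnomaly(m: Model, x: Double, threshold: Double = 3.0): Boolean =
    zScore(m, x) > threshold
}
```

A model fitted on historical amounts, for example, would be kept in memory and `isAnomaly` called for each new observation; the threshold is the knob that trades false alarms against missed outliers.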
  20. Data Science: Anomaly Detection (2)

    Distance Based. Density Based.
  21. Example: Analyze gowalla check-ins

    Check-ins dataset

    year | month | day | time | uid  | lat      | lon       | ts                       | vid
    -----+-------+-----+------+------+----------+-----------+--------------------------+--------
    2010 | 9     | 14  | 91   | 853  | 40.73474 | -73.87434 | 2010-09-14 00:01:31+0000 | 917955
    2010 | 9     | 14  | 328  | 4516 | 40.72585 | -73.99289 | 2010-09-14 00:05:28+0000 | 37160
    2010 | 9     | 14  | 344  | 2964 | 40.67621 | -73.98405 | 2010-09-14 00:05:44+0000 | 956870

    Venues dataset

    vid     | name                       | lat      | long
    --------+----------------------------+----------+-----------
    754108  | My Suit NY                 | 40.73474 | -73.87434
    249755  | UA Court Street Stadium 12 | 40.72585 | -73.99289
    6919688 | Sky Asian Bistro           | 40.67621 | -73.98405
  22. Data Science: clustering venues

    Intuition: weekly visitor patterns!

    - Madison Square, Apple Store, Radio City Music Hall: Thursdays, Fridays and Saturdays are busy
    - Statue of Liberty, Jacob K. Javits Convention Center, Whole Foods Market (Columbus Circle): not popular midweek
  23. Data Science: clustering with k-means

    Intuition: histogram components as dimensions. Similar histograms occupy similar places in the feature space.

    How do I compare histograms?
    - EMD (earth mover's distance)
    - Chi-squared distance
    - Space transformation (DCT)
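Two of the comparison measures listed on this slide can be sketched in a few lines of plain Scala. The object and method names are illustrative assumptions, both histograms are assumed to have the same bins, and the earth mover's distance shown is the simple one-dimensional, non-circular form, so it treats the first and last bin (say Monday and Sunday) as far apart.

```scala
// Distances between two histograms with identical binning.
object HistogramDistance {
  // Scale a histogram so that its bins sum to 1; an all-zero histogram is returned unchanged.
  def normalize(h: Seq[Double]): Seq[Double] = {
    val total = h.sum
    if (total == 0.0) h else h.map(_ / total)
  }

  // Chi-squared distance: 0.5 * sum((p_i - q_i)^2 / (p_i + q_i)), skipping bins empty in both.
  def chiSquared(p: Seq[Double], q: Seq[Double]): Double = {
    require(p.size == q.size, "histograms must have the same number of bins")
    0.5 * p.zip(q).collect {
      case (a, b) if a + b > 0.0 => (a - b) * (a - b) / (a + b)
    }.sum
  }

  // 1-D earth mover's distance for equal-mass histograms:
  // the sum of absolute differences between the two cumulative sums.
  def emd1d(p: Seq[Double], q: Seq[Double]): Double = {
    require(p.size == q.size, "histograms must have the same number of bins")
    val cumulativeDiff = p.zip(q).scanLeft(0.0) { case (acc, (a, b)) => acc + a - b }
    cumulativeDiff.map(d => math.abs(d)).sum
  }
}
```

Chi-squared compares bins independently, while the earth mover's distance also accounts for how far mass has to move between bins, which matters when a busy day shifts by one position.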
  24. K-Means: Featurize data + cluster

    import org.apache.spark.mllib.clustering.KMeans

    // PairRDD: weekly visit pattern per venue
    // (vectorize_time and featurize_histogram are helpers not shown on the slide)
    val weekly_visits = checkins_venues.select("vid", "ts")
      .map(row => (row.getLong("vid"), vectorize_time(row.getTimestamp("ts"))))
      .reduceByKey(_ + _)
      .mapValues(featurize_histogram)

    // cluster similar weekly patterns; KMeans.train expects the feature vectors only
    val numClusters = 15
    val numIterations = 100
    val clusters = KMeans.train(weekly_visits.values, numClusters, numIterations)
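The slide calls `vectorize_time` and `featurize_histogram` without showing them. One possible reading, sketched here in plain Scala without Spark, is a one-hot vector per check-in over the seven days of the week, summed per venue and then normalized. The daily bin size, the UTC time zone and all names in `WeeklyHistogram` are assumptions for illustration, not the author's implementation.

```scala
import java.time.{Instant, ZoneOffset}

// Hypothetical featurization: weekly visit histogram per venue.
object WeeklyHistogram {
  // One-hot vector over the 7 days of the week (index 0 = Monday), using UTC.
  def vectorizeTime(ts: Instant): Vector[Double] = {
    val day = ts.atZone(ZoneOffset.UTC).getDayOfWeek.getValue - 1
    Vector.tabulate(7)(i => if (i == day) 1.0 else 0.0)
  }

  // Element-wise sum, the role played by `_ + _` in the reduceByKey step.
  def add(a: Vector[Double], b: Vector[Double]): Vector[Double] =
    a.zip(b).map { case (x, y) => x + y }

  // Turn raw counts into relative frequencies so busy and quiet venues are comparable.
  def featurizeHistogram(counts: Vector[Double]): Vector[Double] = {
    val total = counts.sum
    if (total == 0.0) counts else counts.map(_ / total)
  }

  // Local stand-in for map + reduceByKey + mapValues on (venue id, timestamp) pairs.
  def weeklyVisits(checkins: Seq[(Long, Instant)]): Map[Long, Vector[Double]] =
    checkins.groupBy(_._1).map { case (vid, rows) =>
      val counts = rows.map(r => vectorizeTime(r._2)).reduce((a, b) => add(a, b))
      (vid, featurizeHistogram(counts))
    }
}
```

Finer bins (for example hour of the week) would follow the same pattern with a longer vector; the choice affects how many check-ins a venue needs before its histogram becomes stable.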
  25. Assigning venues to clusters

    val venues_clustered = checkins_venues.select("vid", "ts").where("ts > dateof(now())")
      .map(row => (row.getLong("vid"), vectorize_time(row.getTimestamp("ts"))))
      .reduceByKey(_ + _)
      .mapValues(v => clusters.predict(featurize_histogram(v)))

    venues_clustered.saveToCassandra("lbsn", "venues_cl")

    Score and assign each venue to a cluster. Store the cluster centers in Cassandra.
  26. How to use it

    1) Classification: classify venues to given groups
    2) Anomaly Detection: detect a shift in the clustering assignment for a given venue for a given week

    Keep monitoring weekly changes in patterns; when one happens, trigger a signal.

    [Figure: week 26, week 27, alert!]
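The week-over-week check described on this slide can be sketched as a comparison of two assignment tables. This is a minimal illustration in plain Scala; `ClusterShift`, the `Map[Long, Int]` representation (venue id to cluster id) and the decision to ignore venues that appear in only one of the two weeks are assumptions.

```scala
// Detect venues whose cluster assignment changed between two weeks.
object ClusterShift {
  // Returns venue id -> (previous cluster, current cluster) for every venue that moved.
  // Venues missing from either week are ignored.
  def shifted(previous: Map[Long, Int], current: Map[Long, Int]): Map[Long, (Int, Int)] =
    current.collect {
      case (vid, now) if previous.get(vid).exists(_ != now) =>
        (vid, (previous(vid), now))
    }
}
```

Each entry in the result would correspond to one alert, for example for a venue that moved from a weekend-heavy cluster in week 26 to a midweek-heavy one in week 27.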
  27. Data Science: clustering users’ venues

    Intuition:
    - Users tend to stick to the same places
    - People have habits
    - By clustering the places together we can identify anomalous locations
    - Size of the cluster matters: more points means less anomalous
    - Mini-clusters and single anomalies are treated in similar ways
    - ...
  28. Data Science: clustering users’ venues

    Intuition:
    - Users tend to stick to the same places
    - People have habits
    - By clustering the places together we can identify anomalous locations
    - Size of the cluster matters: more points means less anomalous
    - Mini-clusters and single anomalies are treated in similar ways
    - ...
  29. Data Science: clustering with DBSCAN

    - DBSCAN finds clusters based on neighbouring density
    - Does not require the number of clusters k beforehand
    - Clusters need not be spherical
  30. Data Science: clustering with DBSCAN

    - DBSCAN finds clusters based on neighbouring density
    - Does not require the number of clusters k beforehand
    - Clusters need not be spherical
    - It’s a graph!
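To make the density idea concrete, here is a compact DBSCAN sketch in plain Scala. It is an illustration rather than the implementation used in the talk: `Dbscan`, `eps` and `minPts` are the usual textbook names, distances are plain Euclidean on the raw coordinates (an assumption that is only approximate for latitude and longitude), and the neighbour search is quadratic, which suits the small per-user point sets discussed here.

```scala
import scala.collection.mutable

// Density-based clustering: points with at least minPts neighbours within eps are core
// points, clusters grow by following core points, everything else is noise.
object Dbscan {
  type Point = (Double, Double)
  val Noise = -1

  private def dist(a: Point, b: Point): Double =
    math.hypot(a._1 - b._1, a._2 - b._2)

  // Returns one label per input point: a cluster id (0, 1, ...) or Noise.
  def cluster(points: IndexedSeq[Point], eps: Double, minPts: Int): Vector[Int] = {
    val unvisited = -2
    val labels = Array.fill(points.size)(unvisited)

    // Indices of all points within eps of point i (the point itself included).
    def neighbours(i: Int): Seq[Int] =
      points.indices.filter(j => dist(points(i), points(j)) <= eps)

    var clusterId = 0
    for (i <- points.indices) {
      if (labels(i) == unvisited) {
        val seeds = neighbours(i)
        if (seeds.size < minPts) labels(i) = Noise
        else {
          labels(i) = clusterId
          val queue = mutable.Queue.empty[Int]
          queue ++= seeds
          while (queue.nonEmpty) {
            val j = queue.dequeue()
            if (labels(j) == Noise) labels(j) = clusterId // border point joins the cluster
            else if (labels(j) == unvisited) {
              labels(j) = clusterId
              val next = neighbours(j)
              if (next.size >= minPts) queue ++= next // core point: keep expanding
            }
          }
          clusterId += 1
        }
      }
    }
    labels.toVector
  }
}
```

Points labelled `Noise`, or belonging to very small clusters, are the candidates for anomalous locations in the sense of the preceding slides.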
  31. Data Science: clustering users’ venues

    // collect each user's (lat, lon) points, then cluster them locally per user
    val locs = checkins_venues.select("uid", "lat", "lon")
      .map(s => (s.getLong(0), Seq((s.getDouble(1), s.getDouble(2)))))
      .reduceByKey(_ ++ _)
      .mapValues( dbscan cluster _ )

    Have a look at: scalanlp/nak
  32. Data Science: Two ways to find anomalies with clustering

    - Cluster a large amount of data with k-means and histograms
    - Apply clustering independently to millions of users, identifying each user's patterns with the DBSCAN algorithm
  33. MLlib vs PairRDDs

    MLlib: KMeans.train(FeaturesRDD, numClusters, numIterations)
    - Truly distributed algorithm
    - Classify venues to given groups
    - Millions of datapoints
    - Limited number of clusters

    PairRDDs: UserFeaturesPairRDD.groupByKey().mapValues( dbscan cluster _ )
    - RDD map functions
    - Parallelism easy to exploit
    - The function runs locally for each key
    - Pick your favourite machine learning algorithms
    - Limited number of points
    - Running in parallel for millions of keys
  34. Training, Scoring and Exposing models

    [Diagram: Distributed Data Store; Fast Analytics; Event Processing; Real Time APIs; Streaming Data; Data Modeling; Data Sources, Files, DB extracts; Batched Data; Alerts and Notifications; API for mobile and web. Arrows: read the model, read the data, write the model]
  35. Training vs Scoring: Time budgets

    • Akka: millisecond response
    • Spark: in-memory (big)-data models

    [Figure: grid of Model Training against Model Scoring, each axis running from slow (minutes) to fast (millisecs), with cells Train: Spark, Score: Spark; Train: Spark, Score: Akka; Train: Akka, Score: Akka]
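For the "Train: Spark, Score: Akka" combination, scoring a k-means model comes down to a nearest-centre lookup, which is typically cheap enough to run per event. The sketch below shows only that arithmetic in plain Scala; it assumes the cluster centres have already been exported from the trained model as plain sequences of doubles, and `CentroidScorer` and `predict` are hypothetical names. No Akka code is shown.

```scala
// Score a feature vector against pre-trained cluster centres.
object CentroidScorer {
  private def squaredDistance(a: Seq[Double], b: Seq[Double]): Double = {
    require(a.size == b.size, "vectors must have the same dimension")
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
  }

  // Returns the index of the closest centre together with the squared distance to it.
  def predict(centers: IndexedSeq[Seq[Double]], features: Seq[Double]): (Int, Double) = {
    require(centers.nonEmpty, "need at least one cluster centre")
    centers.indices
      .map(i => (i, squaredDistance(centers(i), features)))
      .minBy(_._2)
  }
}
```

An actor holding `centers` as immutable state could call `predict` for every incoming event and reload the centres whenever a new model is written; the returned distance can double as a rough anomaly score.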
  36. Which processing?

    ... and the granularity of data? What is the latency and the throughput?
  37. It’s all about latency!

    - Map Reduce: Big Data, batch based
    - RDDs: Big Data, micro-batch based
    - CRDTs + Monoids: Fast Data, event based
    - MillWheel: Fast + Big Data, event & window
  38. Coral: Web API for dynamic data flows

    [Diagram: Akka; Mixed Load Cassandra Cluster]
  39. Coral: Streaming data via Web APIs

    Data events

    POST http://coral/api/actors/23/in
    {
      "amount": 23.45,
      "user": 76232,
      "city": "Berlin"
    }
  40. Coral: Streaming data via Web APIs

    GET http://coral/api/actors/23
    {
      "actors": {
        "def": {
          "type": "stats",
          "params": { "field": "amount" }
        },
        "state": {
          "count": 134,
          "avg": 39.84,
          "min": 1.99,
          "max": 204.19,
          "sd": 38.01
        }
      }
    }

    [Diagram labels: Trigger, Emit, Params, State]
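The `state` fields returned by the stats actor (count, avg, min, max, sd) can all be maintained incrementally, one event at a time. The following is a minimal sketch in plain Scala of such a summary, not Coral's actual code: `RunningStats` and its members are hypothetical names, `sd` is computed as the population standard deviation (whether Coral uses the population or the sample form is not stated on the slide), and the merge uses the standard pairwise update for mean and sum of squared deviations.

```scala
// Incremental, mergeable summary statistics for a numeric field.
object RunningStats {
  // m2 is the running sum of squared deviations from the mean.
  final case class Stats(count: Long, mean: Double, m2: Double, min: Double, max: Double) {
    // Population standard deviation; 0 for an empty summary.
    def sd: Double = if (count == 0) 0.0 else math.sqrt(m2 / count)
  }

  // Identity element for merge.
  val empty: Stats = Stats(0L, 0.0, 0.0, Double.PositiveInfinity, Double.NegativeInfinity)

  def single(x: Double): Stats = Stats(1L, x, 0.0, x, x)

  // Combine two summaries; associative up to floating-point rounding.
  def merge(a: Stats, b: Stats): Stats =
    if (a.count == 0) b
    else if (b.count == 0) a
    else {
      val n = a.count + b.count
      val delta = b.mean - a.mean
      Stats(
        n,
        a.mean + delta * b.count / n,
        a.m2 + b.m2 + delta * delta * a.count * b.count / n,
        math.min(a.min, b.min),
        math.max(a.max, b.max)
      )
    }

  // Fold one more observation into the summary.
  def update(s: Stats, x: Double): Stats = merge(s, single(x))
}
```

Because `merge` has an identity and is associative, the same structure works both for per-event updates inside an actor and for combining partial summaries computed in parallel.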
  41. Coral: Web API for dynamic data flows

    • a web API to define/manage/run streaming data-flows
    • open source and community managed
    • event processing as a service
    • connect to Cassandra to access models
    • connect to Kafka to consume and produce events

    Steven Raemaekers, Jasper van Zandbeek, Ger van Rossum, Hoda Alemi, Koen Verschuren
  42. Summary

    [Diagram: Distributed Data Store; Fast Analytics; Event Processing; Real Time APIs; Streaming Data; Data Modeling; Data Sources, Files, DB extracts; Batched Data; Alerts and Notifications; API for mobile and web. Arrows: read the model, read the data, write the model]
  43. Feedback to the community

    More algorithms for machine learning!
    - DBSCAN, OPTICS, PAM
    - More metrics, non-Euclidean spaces, etc.
    - Non-distributed algorithms: more scalanlp integration?

    Streaming all the way: unify batch (Spark) and event streaming (Akka) computing
  44. Thanks!

    - Vision and strategy on an event-driven bank
    - ING CIO management team and awesome colleagues

    Spark, Cassandra, Akka communities!
  45. Resources

    Coral: event processing web API
    https://github.com/coral-streaming/coral

    Spark + Cassandra: Clustering Events
    http://www.natalinobusa.com/2015/07/clustering-check-ins-with-spark-and.html

    Spark: Machine Learning, SQL frames
    https://spark.apache.org/docs/latest/mllib-guide.html
    https://spark.apache.org/docs/latest/sql-programming-guide.html

    Datastax: Analytics and Spark connector
    http://www.slideshare.net/doanduyhai/spark-cassandra-connector-api-best-practices-and-usecases
    http://docs.datastax.com/en/datastax_enterprise/4.7/datastax_enterprise/anaHome/anaHome.html

    Anomaly Detection
    Chandola, V.; Banerjee, A.; Kumar, V. (2009). "Anomaly detection: A survey" (PDF). ACM Computing Surveys 41 (3): 1. doi:10.1145/1541880.1541882.
  46. Resources

    Datasets
    https://snap.stanford.edu/data/loc-gowalla.html
    E. Cho, S. A. Myers, J. Leskovec. Friendship and Mobility: User Movement in Location-Based Social Networks. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2011.
    https://code.google.com/p/locrec/downloads/detail?name=gowalla-dataset.zip
    The project is being developed in the context of the SInteliGIS project financed by the Portuguese Foundation for Science and Technology (FCT) through project grant PTDC/EIA-EIA/109840/2009.

    Pictures
    "DBSCAN-density-data" by Chire - Own work. Licensed under CC BY-SA 3.0 via Commons - https://commons.wikimedia.org/wiki/File:DBSCAN-density-data.svg#/media/File:DBSCAN-density-data.svg
    "DBSCAN-Illustration" by Chire - Own work. Licensed under CC BY-SA 3.0 via Commons - https://commons.wikimedia.org/wiki/File:DBSCAN-Illustration.svg#/media/File:DBSCAN-Illustration.svg
    "Multimodal" by Visnut - Own work. Licensed under CC BY-SA 4.0 via Commons - https://commons.wikimedia.org/wiki/File:Multimodal.png#/media/File:Multimodal.png
    "Standard deviation diagram" by Mwtoews - Own work, based (in concept) on figure by Jeremy Kemp, on 2005-02-09. Licensed under CC BY 2.5 via Commons - https://commons.wikimedia.org/wiki/File:Standard_deviation_diagram.svg#/media/File:Standard_deviation_diagram.svg
    "Michelsonmorley-boxplot" by User:Schutz - Own work. Licensed under Public Domain via Commons - https://commons.wikimedia.org/wiki/File:Michelsonmorley-boxplot.svg#/media/File:Michelsonmorley-boxplot.svg