Real-time Analytics and Anomalies Detection using Elasticsearch, Hadoop and Storm

Finding relevant information fast has always been a challenge, even more so in today’s growing “oceans” of data. This talk explores real-time analytics and anomaly detection (in particular credit card fraud) using Apache Hadoop as a data platform, Apache Storm for real-time computation, data ingestion and orchestration, and Elasticsearch for advanced real-time search. The session focuses on the architectural challenges of bridging batch and real-time systems and how to overcome them, keeping a close eye on performance and scalability. It covers architectural topics such as partition strategies, data locality, integration patterns and multi-tenancy.

Presented by Costin Leau at Hadoop Summit North America 2014

Elasticsearch Inc

June 03, 2014

Transcript

  1. Real-time Analytics & Anomaly detection using Hadoop, Elasticsearch and Storm
     Costin Leau @costinl
     Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited.
  3. Interesting != Common
     Datasets tend to have hot / common entities
     ‣ Monopolize the data set
     ‣ Create too much noise
     ‣ Cannot be easily avoided
     Common = frequent
     Interesting = frequently different
  4. Finding the uncommon
     Background vs foreground == things that stand out
     Example:
     Background: “flu”
     ‣ “H5N1” appears in 5 / 10M docs
     [Diagram labels: H5N1, flu]
  5. Finding the uncommon
     Background vs foreground == things that stand out
     Example:
     Background: “flu”
     ‣ “H5N1” appears in 5 / 10M docs
     Foreground: “bird flu”
     ‣ “H5N1” appears in 4 / 100 docs
     [Diagram labels: H5N1, flu; H5N1, bird flu]
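The arithmetic behind this example can be worked through directly. The following is a minimal sketch using the slide's numbers; the helper name `rate` is illustrative and not part of any library.

```python
def rate(hits, total):
    """Fraction of documents in a set that contain the term."""
    return hits / total

# Numbers from the slide: "H5N1" in the background ("flu")
# and in the foreground ("bird flu").
background_rate = rate(5, 10_000_000)  # 5 in 10 million docs
foreground_rate = rate(4, 100)         # 4 in 100 docs

# How many times more frequent the term is in the foreground.
lift = foreground_rate / background_rate

print(f"background: {background_rate:.7%}")
print(f"foreground: {foreground_rate:.2%}")
print(f"lift: about {lift:,.0f}x")
```

A term that is rare overall but concentrated in the query results is what "stands out" here.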
  6. Finding the uncommon - Challenges
     Deal with big data sets
     • Hadoop
     Perform the analysis
     • Elasticsearch
     Keep the data fresh
     • Storm
  7. Hadoop
     De-facto platform for big data
     HDFS - Used for storing and performing ETL at scale
     Map/Reduce - Excellent for iterating, thorough analysis
     YARN - Job scheduling and resource management
  8. Elasticsearch
     Open-source real-time search and analytics engine
     (* Features we see as differentiators)
     • Fully-featured search
       Relevance-ranked text search
       Scalable search
       High-performance geo, temporal, range and key lookup
       Highlighting
       Support for complex / nested document types *
       Spelling suggestions
       Powerful query DSL *
       “Standing” queries *
       Real-time results *
       Extensible via plugins *
     • Powerful faceting/analysis
       Summarize large sets by any combinations of time, geo, category and more. *
       “Kibana” visualization tool *
     • Management
       Simple and robust deployments *
       REST APIs for handling all aspects of administration/monitoring *
       “Marvel” console for monitoring and administering clusters *
       Special features to manage the life cycle of content *
     • Integration
       Hadoop (Map/Red, Hive, Pig, Cascading..) *
       Client libraries (Python, Java, Ruby, javascript…)
       Data connectors (Twitter, JMS…)
       Logstash ETL framework *
     • Support
       Development and Production support with tiered levels
       Support staff are the core developers of the product *
  9. Elasticsearch
     Open-source real-time search and analytics engine
  10. Elasticsearch Hadoop
      Use Elasticsearch natively in Hadoop
      ‣ Map/Reduce - Input/OutputFormat
      ‣ Apache Pig - Storage
      ‣ Apache Hive - External Table
      ‣ Cascading - Tap/Sink
      ‣ Storm (in development) - Spout / Bolt
      All operations (reads/writes) are parallelized (Map/Reduce)
  11. Storm
      Distributed, fault-tolerant, real-time computation system
      Perform on-the-fly queries
      React to live data
      Prevention
      Routing
  12. Discovering the relevant
  13. Inverted index
      Inverting Shakespeare
      ‣ Take all the plays and break them down word by word
      ‣ For each word, store the ids of the documents that contain it
      ‣ Sort all tokens (words)

      token      doc freq.  postings (doc ids)
      Anthony    2          1, 2
      Brutus     1          5
      Caesar     2          2, 3
      Calpurnia  2          4, 5
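The three steps on this slide can be sketched in a few lines. This is a toy illustration, not how Lucene stores its index; the five document texts are made up so that the result reproduces the postings in the table, and tokenization is a plain whitespace split.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each token to the sorted list of ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():       # break the text down word by word
            postings[token].add(doc_id)  # remember which doc contains the word
    # Sort the tokens, and sort each postings list by document id.
    return {token: sorted(ids) for token, ids in sorted(postings.items())}

# Hypothetical documents, chosen to match the slide's table.
docs = {
    1: "Anthony",
    2: "Anthony Caesar",
    3: "Caesar",
    4: "Calpurnia",
    5: "Brutus Calpurnia",
}

index = build_inverted_index(docs)
for token, ids in index.items():
    print(token, len(ids), ids)  # token, doc freq., postings
```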
  14. Relevancy
      How well does a document match a query?

      step                query          d1                                        d2
      The text            brown fox      The quick brown fox likes brown nuts      The red fox
      The terms           (brown, fox)   (brown, brown, fox, likes, nuts, quick)   (red, fox)
      A frequency vector  (1, 1)         (2, 1)                                    (0, 1)
      Relevancy           -              2?                                        1?
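The frequency vectors in this table can be reproduced with a short sketch. It assumes lowercasing and whitespace tokenization, which is much simpler than a real analyzer; `tf_vector` is an illustrative name.

```python
def tf_vector(text, query_terms):
    """Count how often each query term occurs in the text."""
    tokens = text.lower().split()
    return tuple(tokens.count(term) for term in query_terms)

query_terms = ("brown", "fox")

q = tf_vector("brown fox", query_terms)
d1 = tf_vector("The quick brown fox likes brown nuts", query_terms)
d2 = tf_vector("The red fox", query_terms)

print(q, d1, d2)
```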
  15. Relevancy - Vector Space Model
      • How well does q match d1 and d2?
      ‣ The coordinates in the vector represent weights per term
      ‣ The simple (1, 0) vector we discussed defines these weights based on the frequency of each term
      ‣ But to generalize:
      [Plot: axes tf: brown and tf: fox; vectors q: (brown, fox), d1: (brown, brown, fox), d2: (fox)]
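In a vector space model, one common way to compare a query vector with a document vector is cosine similarity, the cosine of the angle between them. The slide does not name the measure, so treat the choice here as an assumption; the vectors are the term-frequency vectors (brown, fox) for q, d1 and d2.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector matches nothing
    return dot / (norm_a * norm_b)

q = (1, 1)   # brown, fox
d1 = (2, 1)  # brown, brown, fox
d2 = (0, 1)  # fox

sim_d1 = cosine(q, d1)
sim_d2 = cosine(q, d2)
print(f"q vs d1: {sim_d1:.3f}")
print(f"q vs d2: {sim_d2:.3f}")
```

Under this measure d1 points in a direction closer to q than d2 does, so it ranks higher.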
  16. Relevancy TF-IDF
      Term frequency / Inverse Document Frequency
      TF = the more a token appears in a doc, the more important it is
      IDF = the more documents containing the term, the less important it is
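A textbook variant of this weighting multiplies the raw term count by the logarithm of (number of docs / docs containing the term). The sketch below uses that variant, which is an assumption and not Lucene's exact formula; the three-document corpus is made up for illustration.

```python
import math

def tf_idf(term, doc_tokens, corpus):
    """Textbook TF-IDF: raw term count times log(N / document frequency)."""
    tf = doc_tokens.count(term)
    df = sum(1 for tokens in corpus if term in tokens)
    if df == 0:
        return 0.0  # term never occurs in the corpus
    return tf * math.log(len(corpus) / df)

# Hypothetical corpus of already-tokenized documents.
corpus = [
    ["the", "quick", "brown", "fox", "likes", "brown", "nuts"],
    ["the", "red", "fox"],
    ["the", "lazy", "dog"],
]

for term in ("the", "fox", "brown"):
    print(term, round(tf_idf(term, corpus[0], corpus), 3))
```

"the" occurs in every document, so its weight drops to zero, while "brown" is both frequent in the first document and rare elsewhere.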
  17. Ranking Formula
      Called Lucene Similarity
      [Formula image, with these annotations:]
      ‣ Can be ignored (was an attempt to make query scores comparable across indices, it’s there for backward compatibility)
      ‣ Core TF/IDF weight
      ‣ Score of a document for a given query
      ‣ Normalized doc length, shorter docs are more likely to be relevant than longer docs
      ‣ Boost of query term t
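The formula itself did not survive in the transcript. For reference, the practical scoring function documented for Lucene's classic TF/IDF similarity has the shape below. It is likely what the slide showed, but that is an assumption, as is the notation `boost(t)` for the query-term boost.

```latex
\mathrm{score}(q,d) \;=\; \mathrm{coord}(q,d)\,\cdot\,\mathrm{queryNorm}(q)\,\cdot
\sum_{t \in q} \Big( \mathrm{tf}(t \in d)\,\cdot\,\mathrm{idf}(t)^{2}\,\cdot\,\mathrm{boost}(t)\,\cdot\,\mathrm{norm}(t,d) \Big)
```

On that reading, score(q,d) is the score of a document for a given query, queryNorm(q) is the factor that can be ignored, tf · idf² is the core TF/IDF weight, norm(t,d) carries the document-length normalization, and boost(t) is the boost of query term t.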
  18. Discovering the interesting
  19. Frequency differentiator
      TF-IDF by itself is not enough
      ‣ need to compare the DF in foreground vs background
      Precision vs Recall balance
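One simple way to compare document frequency in the foreground against the background is to multiply the absolute change in frequency by the relative change. The sketch below uses that heuristic as an assumption; it is an illustration, not necessarily the scoring Elasticsearch applies. Apart from the "h5n1" numbers from the earlier slide, the counts are made up.

```python
def significance(fg_hits, fg_total, bg_hits, bg_total):
    """Score a term by absolute change times relative change in frequency."""
    fg = fg_hits / fg_total
    bg = bg_hits / bg_total
    if bg == 0 or fg <= bg:
        return 0.0  # not more frequent in the foreground: not interesting
    return (fg - bg) * (fg / bg)

# term -> (foreground hits, foreground docs, background hits, background docs)
counts = {
    "h5n1": (4, 100, 5, 10_000_000),             # rare overall, concentrated here
    "flu": (100, 100, 10_000_000, 10_000_000),   # common everywhere
    "virus": (30, 100, 2_000_000, 10_000_000),   # somewhat more frequent here
}

scores = {term: significance(*c) for term, c in counts.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Common terms score zero however frequent they are, which is the "Interesting != Common" point from earlier in the talk.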
  20. Single-set analysis
      [Diagram: sets “A C F H I K” and “A B C D E … X Y Z W”; labels: Query results, Dataset]
  21. Single-set analysis example
      [Diagram labels: crimes, bicycle theft (British Police Force); crimes, bicycle theft (British Transport Police)]
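In Elasticsearch, this kind of analysis is exposed through the `significant_terms` aggregation. The sketch below only builds a search request body as a Python dictionary and sends nothing to a cluster. The field names `force` and `crime_type` and the aggregation name `significant_values` are assumptions about a hypothetical crime-report index.

```python
import json

def significant_terms_request(filter_field, filter_value, analysis_field):
    """Build a search body: the query selects the foreground set and the
    aggregation asks which terms are unusually frequent in it."""
    return {
        "query": {"terms": {filter_field: [filter_value]}},
        "aggregations": {
            "significant_values": {
                "significant_terms": {"field": analysis_field}
            }
        },
    }

body = significant_terms_request("force", "British Transport Police", "crime_type")
print(json.dumps(body, indent=2))
```

The query defines the foreground (reports from one force), and the index as a whole acts as the background.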
  22. Multi-set analysis
      [Diagram: sets “A B C D E … X Y Z W”, “A C F H I K M Q R …” and “A B C D .. J L M N O .. U”; labels: Query results, Dataset, Aggregate]
  23. Aggregation (geo-aggregation)
  24. Aggregation + Analysis
  25. Hadoop
      Off-line / slow learning
      ‣ In-depth analysis
      ‣ Break down data into hot spots
      ‣ Eliminate noise
      ‣ Build multiple models
  26. Elasticsearch
      Search features
      ‣ Scoring, TF-IDF
      ‣ Significant terms (multi-set analysis)
      Aggregations
      ‣ Buckets & Metrics
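Conceptually, a bucket groups documents that share a criterion, and a metric is a value computed over the documents in a bucket. The following plain-Python sketch shows the idea on made-up transaction records. It is not how Elasticsearch computes aggregations, and the field names are hypothetical.

```python
from collections import defaultdict

def bucket_avg(records, bucket_field, metric_field):
    """Group records by one field (buckets), then average another (metric)."""
    buckets = defaultdict(list)
    for record in records:
        buckets[record[bucket_field]].append(record[metric_field])
    return {
        key: {"doc_count": len(values), "avg": sum(values) / len(values)}
        for key, values in buckets.items()
    }

# Hypothetical card transactions.
transactions = [
    {"country": "US", "amount": 20.0},
    {"country": "US", "amount": 40.0},
    {"country": "RO", "amount": 15.0},
]

result = bucket_avg(transactions, "country", "amount")
print(result)
```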
  27. Reacting to data
  28. Reacting to data
      Prevent
      ‣ execute queries as data flows in → build a model
      Route
      ‣ place suspicious data into a dedicated pipeline
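The routing idea can be sketched without any Storm API: each incoming record is checked against a rule and placed either on the normal path or in a dedicated review queue. Everything below is an illustrative assumption (the rule, the threshold, the field names); in a real topology this logic would live inside a bolt.

```python
from collections import deque

def make_threshold_rule(max_amount):
    """Return a rule that flags transactions above a fixed amount."""
    def is_suspicious(tx):
        return tx["amount"] > max_amount
    return is_suspicious

def route(stream, is_suspicious):
    """Split a stream of transactions into a normal and a review queue."""
    normal, review = deque(), deque()
    for tx in stream:
        (review if is_suspicious(tx) else normal).append(tx)
    return normal, review

# Hypothetical transactions flowing in.
stream = [
    {"card": "A", "amount": 25.0},
    {"card": "B", "amount": 9_500.0},
    {"card": "A", "amount": 12.0},
]

normal, review = route(stream, make_threshold_rule(1_000.0))
print(len(normal), "normal,", len(review), "for review")
```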
  29. Reacting to data
      [Diagram: topology of one spout and nine bolts]
  30. Live loop
      Data keeps changing
      ‣ Adapt the set of rules
      Improves reaction time
      ‣ Build a model for fast decision making
      Keeps the prevention rate high
      ‣ Categorize data on the fly
      [Diagram label: bolt]
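One way to picture a model that adapts as data flows in is a running mean and standard deviation that is updated per record and used to flag outliers. The sketch below uses Welford's online algorithm. The three-standard-deviation rule, the minimum sample count and the amounts are assumptions for illustration, not something the talk prescribes.

```python
import math

class RunningStats:
    """Running mean and sample standard deviation (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

def is_anomalous(stats, x, k=3.0, min_samples=5):
    """Flag x when it lies more than k standard deviations from the mean."""
    if stats.n < min_samples:
        return False  # not enough history to decide
    return abs(x - stats.mean) > k * stats.std

stats = RunningStats()
flags = []
for amount in [20.0, 22.0, 19.0, 21.0, 20.0, 23.0, 500.0, 21.0]:
    flagged = is_anomalous(stats, amount)
    flags.append(flagged)
    if not flagged:
        stats.update(amount)  # only normal-looking data adapts the model

print(flags)
```

Checking before updating keeps a flagged value from shifting the baseline it is judged against.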
  31. Putting it all together
  32. The Big Picture
      [Diagram labels: HDFS; Slow, in-depth learning; Fast, real-time learning; ETL]
  33. Usages
      Recommendation
      ‣ Find similar movies based on user feedback
      ‣ Use Storm to optimize the returned results
      Card Fraud
      ‣ Use Storm to prevent suspicious transactions from executing
      ‣ Route possible frauds to a dedicated analysis queue
  35. Q&A
      Thank you! @costinl