
Building Entity Centric Indexes

This talk was presented at the inaugural Elastic{ON} conference, http://elasticon.com

Session Abstract:

Sometimes we need to step back and take a look at the bigger picture - not just counting huge piles of individual log records, but reasoning about the behaviors of the people who are ultimately generating this firehose of data. While your DevOps folks care deeply about log records from a machine utilization perspective, marketing wants to know what these records tell us about the customers' needs.

Elasticsearch Aggregations are a great feature but are not a panacea. We can happily use them to summarise complex things like the number of web requests per day broken down by geography and browser type on a busy website, but we would quickly run out of memory if we tried to calculate something as simple as a single number for the average duration of visitor web sessions when using the very same dataset.

Why does this occur? A web session duration is an example of a behavioural attribute not held on any one log record; it has to be derived by finding the first and last records for each session in our weblogs, requiring some complex query expressions and a lot of memory to connect all the data points.
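To make the cost concrete, here is a minimal sketch in plain Python (not Elasticsearch code) of what deriving session durations involves. The event shape, a `(session_id, timestamp)` pair, and the function names are assumptions made for illustration. The point is that one piece of state must be held per session until every event has been seen, so memory grows with the number of sessions rather than with the size of the answer.

```python
from datetime import datetime


def session_durations(events):
    """Return {session_id: seconds between its first and last event}."""
    first_last = {}  # one entry per session: this is the memory cost
    for session_id, ts in events:
        if session_id not in first_last:
            first_last[session_id] = [ts, ts]
        else:
            span = first_last[session_id]
            span[0] = min(span[0], ts)
            span[1] = max(span[1], ts)
    return {
        sid: (last - first).total_seconds()
        for sid, (first, last) in first_last.items()
    }


def average_duration(events):
    """Average session duration in seconds (0.0 when there are no events)."""
    durations = session_durations(events)
    return sum(durations.values()) / len(durations) if durations else 0.0


# Hypothetical log events, deliberately interleaved across sessions.
events = [
    ("s1", datetime(2015, 3, 10, 9, 0, 0)),
    ("s2", datetime(2015, 3, 10, 9, 0, 30)),
    ("s1", datetime(2015, 3, 10, 9, 5, 0)),
    ("s2", datetime(2015, 3, 10, 9, 1, 30)),
]
print(average_duration(events))
```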

We can maintain a more useful joined-up picture if we run an ongoing background process to fuse related events from one index into “entity-centric” summaries in another index, e.g.:

Web log events summarised into “web session” entities
Road-worthiness test results summarised into “car” entities
Reviews in a marketplace summarised into a “reviewer” entity

Using real data, this session will demonstrate how to incrementally build entity-centric indexes alongside event-centric indexes by using simple scripts to uncover interesting behaviours that accumulate over time. We'll explore:

* Which cars are driven long distances after failing roadworthiness tests?
* Which website visitors look to be behaving like “bots”?
* Which seller in my marketplace has employed an army of “shills” to boost his feedback rating?

Attendees will leave this session with all the tools required to begin building entity-centric indexes and using that data to derive richer business insights across every department in their organization.

Elastic Co

March 10, 2015

Transcript

  1. CC-BY-ND 4.0 Problem: some aggregations are expensive

    We need to join all event-level data together at query-time. Using web server log data, answer the question: “how long on average do customers spend on my site?”
  2. How to cripple Elasticsearch with a bucket explosion

    1. Ask a question about values that need to be derived from multiple documents (e.g. deriving a web session’s duration)
    2. Make the joining key a high cardinality field, e.g. something like “IP address”
    3. Extra points if you use no routing of your documents so that related content is spray-gunned across multiple shards
  3. Solution: an “entity-centric” model

    Diagram labels: usual stream of events; time-based event indexes; entity-based summary indexes; periodic extracts sorted by entity ID and time
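The “periodic extracts sorted by entity ID and time” step can be sketched in plain Python as follows. This is an illustration only: the field names `entity_id` and `timestamp` and the function name are assumptions, and the in-memory `sorted()` call stands in for the sorted extract that would come from the event index.

```python
from itertools import groupby
from operator import itemgetter


def bundle_by_entity(events, id_field="entity_id", time_field="timestamp"):
    """Yield (entity_id, [events in time order]) for each entity."""
    # Sorting by entity ID then time makes each entity's events contiguous,
    # so they can be bundled in one streaming pass.
    ordered = sorted(events, key=itemgetter(id_field, time_field))
    for entity_id, group in groupby(ordered, key=itemgetter(id_field)):
        yield entity_id, list(group)


# Hypothetical events arriving in no particular order.
events = [
    {"entity_id": "b", "timestamp": 2, "page": "/x"},
    {"entity_id": "a", "timestamp": 3, "page": "/y"},
    {"entity_id": "a", "timestamp": 1, "page": "/z"},
]
bundles = list(bundle_by_entity(events))
for entity_id, bundle in bundles:
    print(entity_id, [e["timestamp"] for e in bundle])
```

Because the input is sorted, only one entity's bundle needs to be held at a time, which is what keeps the extract cheap compared with joining at query-time.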
  4. Entity-centric queries: analysis of behaviours over time

    • WebSessions
      • “How long on average do my customers spend on my site?”
      • “Which users behave like bots?”
      • “What is the most common exit page?”
    • Bank Accounts
      • “Does this new payment match the typical spending behaviour of bank account X?”
  5. Entity-centric queries: analysis of behaviours over time

    • Buyers
      • “What do the users who bought product X also buy?”
      • “Which buyers behave like ‘shills’ and who are they promoting?”
    • Cars
      • “Which cars drove long distances after failing a road worthiness test?”
  6. Use case: GFORCES

    • Analyses website traffic for retailers and manufacturers in the automotive industry
    • Summarising many behaviours over time e.g.
      • unique numbers of visitors per month
      • engagement: average session durations
    • Faced scaling issues producing some results from raw events
  7. Results of moving to entity-centric indexing

    • Data store contains 150m events generated by 26m user sessions
    • Event-centric aggregations were taking ~25 seconds
    • Equivalent entity-centric aggregations take <50ms
    • Simplified queries for common entry pages, common exit pages etc.
  8. Worked example

    Amazon marketplace reviews - building profiles for reviewers
    Play along! Code + data here: bit.ly/entcent
  9. An “entity-centric” model

    AmazonReviews (an event-centric index): reviews.csv, loadEvents.sh
    Review event fields: rating, seller, reviewer, date

    AmazonReviewers (an entity-centric index): buildEntities.sh
    • Drops and creates reviewers index.
    • Uses Python client to query and scroll list of reviews sorted by reviewerId and time
    • Python pushes _update requests to ~400k “Reviewer” documents, each containing bundles of their recent reviews, using bulk indexing API
    • Shard-side Groovy script collapses the multiple reviews into a single reviewer JSON document summarising behaviour
    Reviewer entity fields: positivity, num sellers reviewed, last 50 reviews, profile (“newbie”, “fanboy” etc.)
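As a rough illustration of the “Python pushes _update requests ... using bulk indexing API” step, the sketch below builds a newline-delimited bulk payload by hand. It is not the talk's buildEntities.sh code. The request body follows the scripted-upsert shape used by recent Elasticsearch versions (`script.id`, `script.params`, `scripted_upsert`), which is an assumption here and differs from the Groovy-era syntax the talk would have used; the index name, script id and parameter name are hypothetical.

```python
import json


def bulk_update_lines(index, script_id, bundles):
    """Build an NDJSON bulk body with one scripted update per reviewer.

    bundles: iterable of (reviewer_id, [review dicts]) pairs.
    """
    lines = []
    for reviewer_id, reviews in bundles:
        # Action line: which document to update.
        action = {"update": {"_index": index, "_id": reviewer_id}}
        # Body line: run the stored script even when the document is new,
        # passing the bundle of recent events as a script parameter.
        body = {
            "scripted_upsert": True,
            "script": {"id": script_id, "params": {"events": reviews}},
            "upsert": {},
        }
        lines.append(json.dumps(action))
        lines.append(json.dumps(body))
    # The bulk format requires a trailing newline.
    return "\n".join(lines) + "\n"


# Hypothetical bundle for one reviewer.
payload = bulk_update_lines(
    "reviewers",
    "reviewer_profile_updater",
    [("r1", [{"rating": 5, "seller": "187", "date": "2015-01-01"}])],
)
print(payload)
```

Sending all of a reviewer's recent events in one update keeps the number of document rewrites proportional to the number of entities touched, not the number of events.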
  10. Anatomy of an entity indexing Groovy script

    Callouts on the script:
    • Initialize if new document
    • Loop to consolidate latest events
    • Re-run risk profile logic
    • Load stored state
    Store the script in ES_HOME/config/scripts/foo.groovy
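The stages called out on this slide can be mimicked in plain Python to show the shape of the logic. This is a sketch, not the Groovy script from the talk: the field names, the 4-star “positive” cut-off, the five-review threshold and the “regular” label are all hypothetical, chosen only to echo the reviewer fields listed earlier (positivity, num sellers reviewed, last 50 reviews, profile).

```python
MAX_REVIEWS = 50  # the slides mention keeping the last 50 reviews


def update_reviewer(doc, events):
    """Fold a bundle of new review events into a reviewer summary document."""
    # Initialize if new document.
    if not doc:
        doc = {"reviews": []}
    # Load stored state.
    reviews = list(doc["reviews"])
    # Loop to consolidate latest events.
    for event in events:
        reviews.append(
            {"seller": event["seller"], "rating": event["rating"], "date": event["date"]}
        )
    reviews = reviews[-MAX_REVIEWS:]  # forget everything but the newest reviews
    # Re-run profile logic over the consolidated state (thresholds are made up).
    sellers = {r["seller"] for r in reviews}
    positive = sum(1 for r in reviews if r["rating"] >= 4)
    positivity = positive / len(reviews) if reviews else 0.0
    if len(reviews) < 5:
        profile = "newbie"
    elif len(sellers) == 1 and positivity == 1.0:
        profile = "fanboy"
    else:
        profile = "regular"
    return {
        "reviews": reviews,
        "num_sellers_reviewed": len(sellers),
        "positivity": positivity,
        "profile": profile,
    }


# Hypothetical events: five glowing reviews for a single seller.
five_star = [
    {"seller": "187", "rating": 5, "date": "2015-01-0%d" % day} for day in range(1, 6)
]
print(update_reviewer(None, five_star)["profile"])
```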
  11. Insight: which sellers have a lot of fanboys?

    Seller #187 has more than his fair share of “fanboy” reviewers …
  12. Drilling down into seller #187’s fanboys

    Suspiciously synchronised behaviour
  13. Example background: MOT dataset

    • In the UK all vehicles must pass an annual roadworthiness test, called an MOT (named after the Ministry of Transport)
    • It is illegal to drive a car that has failed an MOT (unless driving home from a test or to a repair centre)
    • Taxis and other forms of public transport have to be tested more frequently - every 6 months.
    • All data is freely available from data.gov.uk but with anonymised vehicle ID and inexact test locations.
  14. Example background

    MOTs: mots.csv, loadMOTs.sh
    • Drops and creates mots index.
    • Uses Python client to bulk load all 37m road worthiness test results for 2013 (data source http://data.gov.uk/ dataset/
    MOT event fields: result (pass/fail), vehicle ID, make + model + age, mileage, test date, test location

    Cars: buildEntities.sh
    • Drops and creates cars index.
    • Registers CarProfileUpdater.groovy as a stored script
    • Uses Python client to query and scroll list of MOT test results sorted by vehicle ID and time
    • Python pushes _update requests to ~27m “Car” documents, each containing bundles of related MOT test results, using bulk indexing API
    • Shard-side Groovy script collapses the multiple tests into a single summary JSON document for a car, deriving summaries, e.g. any mileometer-reading discrepancies
    Car entity fields: make + model + age; last test result, date, location; miles driven while failed; days between fail and fix; complete test history; suspected bad mileometer readings
  15. Data fusion logic

    Car attributes derived from 3 test result documents (1, 2, 3)
    Figure labels: Test date, Mile-o-meter reading, daysForFix, badReading?, milesDrivenAfterFailure, mile-o-meterRewind
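One plausible reading of the fusion logic on this slide, sketched in plain Python. The talk's CarProfileUpdater.groovy is not reproduced here, so the rules below are assumptions: days-for-fix is taken as the gap between a fail and the next pass, miles driven after failure as the mileage gained over that gap, and a bad reading as any mileage lower than the previous test's. The snake_case keys loosely mirror the slide's labels.

```python
from datetime import date


def summarise_car(tests):
    """Derive car-level attributes from a list of test result dicts."""
    tests = sorted(tests, key=lambda t: t["date"])
    summary = {"days_for_fix": None, "miles_driven_after_failure": 0, "bad_readings": 0}
    open_fail = None  # earliest failed test not yet followed by a pass
    previous = None
    for test in tests:
        # A reading lower than the previous one suggests a rewind or a typo.
        if previous is not None and test["mileage"] < previous["mileage"]:
            summary["bad_readings"] += 1
        if test["result"] == "fail":
            if open_fail is None:
                open_fail = test
        elif open_fail is not None:
            # First pass after a fail closes the "failed" period.
            summary["days_for_fix"] = (test["date"] - open_fail["date"]).days
            summary["miles_driven_after_failure"] += max(
                0, test["mileage"] - open_fail["mileage"]
            )
            open_fail = None
        previous = test
    return summary


# Hypothetical history of three tests for one vehicle.
history = [
    {"date": date(2013, 1, 10), "mileage": 50000, "result": "fail"},
    {"date": date(2013, 1, 24), "mileage": 50420, "result": "pass"},
    {"date": date(2013, 7, 1), "mileage": 49000, "result": "pass"},
]
print(summarise_car(history))
```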
  16. Insight: who is driving failed vehicles?

    Q: Why is there an unexpected peak in milesDrivenWithFailure around 6 months?
    A: Taxis
  17. Example background: Movielens data

    • A public dataset* of 10m movie ratings made by 71k users
    • One Elasticsearch document per user with a list of their movie ratings
    * http://files.grouplens.org/datasets/movielens/ml-10m-README.html
  18. Entity centric indexing: advantages

    • Efficient and simple queries
    • Advanced analytics/insights
    • Can provide a cheaper data retention policy (daily->weekly->monthly roll-ups)
    • Can reuse existing Elasticsearch APIs or build entity documents using external technologies
  19. Entity centric indexing: tips

    • Avoid “fat entities”
      • Use forgetful collections: priority queues, circular buffers, HyperLogLog
    • Avoid pointless updates
      • Use ctx.op="none" to avoid writes of insignificant changes
    • Consider options for reducing event volumes:
      • Use of aggregations in gathering events
      • Reduce related events in event-gathering script that issues updates
    • Parallelise the pull of event information
    • Consider use of nested docs if you have “cross-matching” problems
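Two of these tips, a forgetful collection and skipping pointless writes, can be illustrated together in plain Python. This is a sketch under assumptions: the `recent` field name and the function are hypothetical, and the returned strings merely imitate the `ctx.op` values an update script would set (`"none"` comes from the slide; `"index"` is assumed to be the normal write operation).

```python
from collections import deque


def apply_events(entity, events, max_items=50):
    """Append events to a bounded buffer; report whether a write is needed."""
    # deque(maxlen=...) acts as a circular buffer: old items fall off the front.
    recent = deque(entity.get("recent", []), maxlen=max_items)
    before = list(recent)
    for event in events:
        recent.append(event)
    after = list(recent)
    if after == before:
        # Nothing meaningful changed, so skip the write entirely.
        return entity, "none"
    return dict(entity, recent=after), "index"


entity = {"id": "r1", "recent": [1, 2, 3]}
print(apply_events(entity, [], max_items=3))
print(apply_events(entity, [4, 5], max_items=3))
```

Bounding the collection keeps each entity document at a fixed maximum size, and the no-op check avoids reindexing a document whose summary has not changed.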
  20. Entity centric indexing: synchronisation tips

    • Incremental entity updates can be achieved by querying all events since the timestamp of the last run
    • Data integrity - implement policies for:
      • handling any failures in performing entity updates
      • retiring old entities (use of TTL?)
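The incremental pull described on this slide might be expressed as a search body like the one built below. It is a sketch rather than the talk's code: it uses the standard `range` query and `sort` clauses of the Elasticsearch query DSL, while the function names, the default field names (`reviewerId` and `date`, borrowed from the reviews example) and the checkpoint helper are assumptions.

```python
def incremental_query(last_run, id_field="reviewerId", time_field="date"):
    """Search body selecting events newer than the last run, sorted for bundling."""
    return {
        # Only events strictly newer than the previous run's checkpoint.
        "query": {"range": {time_field: {"gt": last_run}}},
        # Entity ID first, then time, so each entity's events arrive together.
        "sort": [{id_field: "asc"}, {time_field: "asc"}],
    }


def next_checkpoint(events, last_run, time_field="date"):
    """Newest event timestamp seen, or the old checkpoint if nothing arrived."""
    # ISO-8601 timestamps in one format compare correctly as strings.
    return max([e[time_field] for e in events] + [last_run])


body = incremental_query("2015-03-09T00:00:00Z")
print(body)
print(next_checkpoint([{"date": "2015-03-10T08:00:00Z"}], "2015-03-09T00:00:00Z"))
```

Persisting the checkpoint only after the entity updates succeed is one way to cover the failure-handling policy the slide asks for, since a failed run is then simply repeated from the old timestamp.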
  21. This work is licensed under the Creative Commons Attribution-NoDerivatives 4.0 International License.

    To view a copy of this license, visit: http://creativecommons.org/licenses/by-nd/4.0/ or send a letter to: Creative Commons, PO Box 1866, Mountain View, CA 94042, USA