Building Entity Centric Indexes

Entity-centric indexing
Mark Harwood, core developer

CC-BY-ND 4.0

Entity-centric indexes (or “when aggregations don’t cut it”)

A typical “event-centric” deployment
[Diagram labels: event stream; time-based event indexes]

Problem: some aggregations are expensive
We need to join all event-level data together at query time.
Example: using web server log data, answer the question “how long on average do customers spend on my site?”

How to cripple elasticsearch with a bucket explosion:
1. Ask a question about values that need to be derived from multiple documents (e.g. deriving a web session’s duration)
2. Make the joining key a high-cardinality field, e.g. something like “IP address”
3. Extra points if you use no routing of your documents, so that related content is spray-gunned across multiple shards
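As a rough illustration of the recipe above, the sketch below builds the kind of request body that creates one bucket per joining key and derives the first and last event time inside each bucket. The field names `clientip` and `@timestamp` and the bucket limit are assumptions, not taken from the slides; `terms`, `min` and `max` are standard aggregation types. The session duration (last minus first) and its average would still have to be computed from the returned buckets.

```python
import json


def session_duration_request(key_field="clientip", time_field="@timestamp",
                             max_buckets=10000):
    """Build an aggregation body with one bucket per key value.

    Each bucket carries the earliest and latest event time for that key.
    With a high-cardinality key this means a very large number of buckets.
    """
    return {
        "size": 0,  # no hits wanted, only aggregation results
        "aggs": {
            "sessions": {
                "terms": {"field": key_field, "size": max_buckets},
                "aggs": {
                    "first_seen": {"min": {"field": time_field}},
                    "last_seen": {"max": {"field": time_field}},
                },
            }
        },
    }


body = session_duration_request()
print(json.dumps(body, indent=2))
```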

Solution
A “pay-as-you-go” model for the costs of fusing data

Solution: an “entity-centric” model
[Diagram labels: usual stream of events; time-based event indexes; periodic extracts sorted by entity ID and time; entity-based summary indexes]
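A minimal sketch of what a “periodic extract sorted by entity ID and time” makes possible: once events arrive in that order, each entity's events can be bundled in a single pass. The event shape (`entity_id`, `timestamp`, `page`) is hypothetical, and the sort is done in memory here only to keep the example self-contained; in the deployment described, the sorted order would come from the query itself.

```python
from itertools import groupby
from operator import itemgetter


def bundle_events(events):
    """Yield (entity_id, [events...]) pairs, each bundle ordered by time."""
    ordered = sorted(events, key=itemgetter("entity_id", "timestamp"))
    for entity_id, group in groupby(ordered, key=itemgetter("entity_id")):
        yield entity_id, list(group)


events = [
    {"entity_id": "b", "timestamp": 2, "page": "/cart"},
    {"entity_id": "a", "timestamp": 1, "page": "/home"},
    {"entity_id": "a", "timestamp": 3, "page": "/exit"},
]
bundles = dict(bundle_events(events))
for entity_id, bundle in bundles.items():
    print(entity_id, [e["page"] for e in bundle])
```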

Entity-centric queries
Analysis of behaviours over time
• WebSessions
  • “How long on average do my customers spend on my site?”
  • “Which users behave like bots?”
  • “What is the most common exit page?”
• Bank Accounts
  • “Does this new payment match the typical spending behaviour of bank account X?”

Entity-centric queries
Analysis of behaviours over time
• Buyers
  • “What do the users who bought product X also buy?”
  • “Which buyers behave like ‘shills’ and who are they promoting?”
• Cars
  • “Which cars drove long distances after failing a roadworthiness test?”

Use case: Web log analytics

Use case: GFORCES
• Analyses website traffic for retailers and manufacturers in the automotive industry
• Summarising many behaviours over time, e.g.
  • unique numbers of visitors per month
  • engagement: average session durations
• Faced scaling issues producing some results from raw events

Results of moving to entity-centric indexing
• Data store contains 150m events generated by 26m user sessions
• Event-centric aggregations were taking ~25 seconds
• Equivalent entity-centric aggregations take <50ms
• Simplified queries for common entry pages, common exit pages etc.

Worked example: Amazon marketplace reviews - building profiles for reviewers
Play along! Code + data here: bit.ly/entcent

An “entity-centric” model

AmazonReviews (an event-centric index)
Loaded from reviews.csv by loadEvents.sh
Review event fields
• rating
• seller
• reviewer
• date

AmazonReviewers (an entity-centric index)
Built by buildEntities.sh
• Drops and creates the reviewers index.
• Uses the Python client to query and scroll the list of reviews, sorted by reviewerId and time.
• Python uses the bulk indexing API to push _update requests to ~400k “Reviewer” documents, each request containing a bundle of that reviewer’s recent reviews.
• A shard-side Groovy script collapses the multiple reviews into a single reviewer JSON document summarising behaviour.
Reviewer entity fields
• positivity
• num sellers reviewed
• last 50 reviews
• profile (“newbie”, “fanboy” etc.)
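The “push _update requests using the bulk API” step can be sketched as plain serialisation, with no client library involved. Everything named here is an assumption for illustration: the document type, the script name `reviewer_profile_updater`, the `events` parameter, and the exact keys of the update body, which differ between elasticsearch versions. The only fixed idea is the bulk format itself: an action line followed by a body line, newline-delimited, with a trailing newline.

```python
import json


def bulk_update_payload(bundles, index="reviewers", doc_type="reviewer",
                        script="reviewer_profile_updater"):
    """Serialise one scripted update per entity as newline-delimited JSON."""
    lines = []
    for entity_id, events in bundles.items():
        # Action line: which document to update.
        lines.append(json.dumps(
            {"update": {"_index": index, "_type": doc_type, "_id": entity_id}}))
        # Body line: hypothetical script name plus the bundle of new events.
        lines.append(json.dumps({
            "script": script,
            "params": {"events": events},
            "upsert": {},             # start from an empty entity if none exists
            "scripted_upsert": True,  # run the script for new entities too
        }))
    return "\n".join(lines) + "\n"


bundles = {
    "reviewer-1": [{"seller": 187, "rating": 5, "date": "2014-01-03"}],
    "reviewer-2": [{"seller": 42, "rating": 2, "date": "2014-01-04"}],
}
payload = bulk_update_payload(bundles)
print(payload)
```

Sending a bundle of events per entity, rather than one update per event, keeps the number of read-modify-write cycles on each entity document low.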

Anatomy of an entity indexing groovy script
[Annotated code listing; callouts:]
• Initialize if new document
• Loop to consolidate latest events
• Re-run risk profile logic
• Load stored state
Store the script in ES_HOME/config/scripts/foo.groovy
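The script itself is not reproduced in the transcript. The Python function below only mimics the annotated steps on a plain dict so the flow is easier to follow; it is not the author's Groovy code. The field names, the rating threshold and the rules that assign “newbie” or “fanboy” are invented for illustration.

```python
def update_reviewer(source, events, max_reviews=50):
    """Fold a bundle of new review events into a reviewer summary document."""
    # Load stored state, initialising it if this is a new document.
    reviews = source.setdefault("reviews", [])

    # Loop to consolidate the latest events, keeping only the most recent ones.
    for event in events:
        reviews.append({"seller": event["seller"], "rating": event["rating"]})
    del reviews[:-max_reviews]

    # Derived summary fields (computed over the retained reviews only).
    sellers = {r["seller"] for r in reviews}
    source["num_sellers_reviewed"] = len(sellers)
    source["positivity"] = sum(r["rating"] >= 4 for r in reviews) / len(reviews)

    # Re-run the profile logic. These rules are hypothetical.
    if len(reviews) < 3:
        source["profile"] = "newbie"
    elif len(sellers) == 1 and source["positivity"] == 1.0:
        source["profile"] = "fanboy"
    else:
        source["profile"] = "regular"
    return source


doc = update_reviewer({}, [{"seller": 187, "rating": 5}] * 2)
print(doc["profile"])
doc = update_reviewer(doc, [{"seller": 187, "rating": 5}])
print(doc["profile"])
```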

Insight: which sellers have a lot of fanboys?
Seller #187 has more than his fair share of “fanboy” reviewers …

Drilling down into seller #187’s fanboys
Suspiciously synchronised behaviour

Worked example: UK 2013 car roadworthiness tests

Example background: MOT dataset
• In the UK all vehicles must pass an annual roadworthiness test, called an MOT (named after the Ministry of Transport)
• It is illegal to drive a car that has failed an MOT (unless driving home from a test or to a repair centre)
• Taxis and other forms of public transport have to be tested more frequently: every 6 months
• All data is freely available from data.gov.uk, but with anonymised vehicle IDs and inexact test locations

Example background

MOTs (event-centric index)
Loaded from mots.csv by loadMOTs.sh
• Drops and creates the mots index.
• Uses the Python client to bulk load all 37m roadworthiness test results for 2013 (data source: http://data.gov.uk/dataset/)
MOT event fields
• result (pass/fail)
• vehicle ID
• make + model + age
• mileage
• test date
• test location

Cars (entity-centric index)
Built by buildEntities.sh
• Drops and creates the cars index.
• Registers CarProfileUpdater.groovy as a stored script.
• Uses the Python client to query and scroll the list of MOT test results, sorted by vehicle ID and time.
• Python uses the bulk indexing API to push _update requests to ~27m “Car” documents, each request containing a bundle of related MOT test results.
• A shard-side Groovy script collapses the multiple tests into a single summary JSON document for a car, deriving summaries such as any mileometer-reading discrepancies.
Car entity fields
• make + model + age
• last test result, date, location
• miles driven while failed
• days between fail and fix
• complete test history
• suspected bad mileometer readings

Data fusion logic
[Diagram: car attributes derived from 3 test result documents. Labels: test date; mile-o-meter reading; daysForFix; badReading?; milesDrivenAfterFailure; mile-o-meterRewind]
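The diagram's derived attributes could be computed along the following lines. The slides do not spell out the definitions, so these are assumptions: a reading lower than the previous one counts as a bad reading (a rewind), miles driven after a failure are the miles accumulated between a failed test and the next test, and the days for a fix are the days from a failure to the next pass. The input record shape and the `badReadings` counter name are also hypothetical.

```python
from datetime import date


def fuse_tests(tests):
    """Summarise one car's test results, which must be sorted by date."""
    car = {"daysForFix": None, "milesDrivenAfterFailure": 0, "badReadings": 0}

    # Compare each test with the one before it.
    for prev, cur in zip(tests, tests[1:]):
        delta = cur["mileage"] - prev["mileage"]
        if delta < 0:
            car["badReadings"] += 1  # the mileometer went backwards
        elif not prev["passed"]:
            car["milesDrivenAfterFailure"] += delta

    # Days between a failure and the pass that follows it (last such pair wins).
    fail_date = None
    for test in tests:
        if not test["passed"] and fail_date is None:
            fail_date = test["date"]
        elif test["passed"] and fail_date is not None:
            car["daysForFix"] = (test["date"] - fail_date).days
            fail_date = None
    return car


tests = [
    {"date": date(2013, 1, 10), "passed": False, "mileage": 50000},
    {"date": date(2013, 1, 20), "passed": True, "mileage": 50300},
    {"date": date(2013, 7, 1), "passed": True, "mileage": 49000},
]
car = fuse_tests(tests)
print(car)
```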

Insight: who is driving failed vehicles?
Q: Why is there an unexpected peak in milesDrivenWithFailure around 6 months?
A: Taxis

Insight: Taxis keep on trucking after failures...

Recycling user behaviours
A user-centric index as a recommendation engine

Example background: Movielens data
• A public dataset* of 10m movie ratings made by 71k users
• One elasticsearch document per user with a list of their movie ratings
* http://files.grouplens.org/datasets/movielens/ml-10m-README.html

“Uncommonly common” user behaviours
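The slide content is not in the transcript. “Uncommonly common” is the phrase usually used to describe the `significant_terms` aggregation, so the sketch below assumes that is the technique meant. It builds one document per user and a request asking which movies are unusually frequent among users who liked a given movie. The `liked` field, the rating threshold and the tuple layout of the ratings are assumptions.

```python
from collections import defaultdict


def build_user_docs(ratings, min_rating=4):
    """Collapse (user_id, movie_id, rating) rows into one document per user."""
    docs = defaultdict(lambda: {"liked": []})
    for user_id, movie_id, rating in ratings:
        if rating >= min_rating:
            docs[user_id]["liked"].append(movie_id)
    return dict(docs)


def recommendation_request(movie_id, field="liked"):
    """Ask which movies are uncommonly common among fans of movie_id."""
    return {
        "size": 0,
        "query": {"term": {field: movie_id}},
        "aggs": {"recommended": {"significant_terms": {"field": field}}},
    }


ratings = [(1, 10, 5), (1, 20, 4), (2, 10, 5), (2, 30, 2)]
docs = build_user_docs(ratings)
request = recommendation_request(10)
print(docs)
print(request)
```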

Conclusions

Entity-centric indexing: advantages
• Efficient and simple queries
• Advanced analytics/insights
• Can provide a cheaper data retention policy (daily -> weekly -> monthly roll-ups)
• Can reuse existing elasticsearch APIs or build entity documents using external technologies

Entity-centric indexing: tips
• Avoid “fat entities”
  • Use forgetful collections: priority queues, circular buffers, HyperLogLog
• Avoid pointless updates
  • Use ctx.op = "none" to avoid writes of insignificant changes
• Consider options for reducing event volumes:
  • Use of aggregations in gathering events
  • Reduce related events in the event-gathering script that issues updates
• Parallelise the pull of event information
• Consider use of nested docs if you have “cross-matching” problems
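Two of the tips above, forgetful collections and skipping pointless updates, can be illustrated together. This is a sketch in Python rather than in an update script: a bounded `deque` acts as the circular buffer, and the returned flag marks the case where nothing changed, which is where a script would set ctx.op = "none". The function name and the rule that duplicates are ignored are assumptions.

```python
from collections import deque


def apply_events(recent, events, max_size=50):
    """Merge new events into a bounded buffer and report whether it changed."""
    buffer = deque(recent, maxlen=max_size)  # oldest items fall off the front
    before = list(buffer)
    for event in events:
        if event not in buffer:  # ignore events that are already recorded
            buffer.append(event)
    after = list(buffer)
    return after, after != before


print(apply_events([1, 2, 3], [3]))
print(apply_events([1, 2, 3], [4], max_size=3))
```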

Entity-centric indexing: synchronisation tips
• Incremental entity updates can be achieved by querying all events since the timestamp of the last run
• Data integrity: implement policies for
  • handling any failures in performing entity updates
  • retiring old entities (use of TTL?)
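The incremental approach amounts to a range filter on the event timestamp, combined with the entity-then-time sort used for the extracts. A minimal sketch of such a request body follows; the field names `entity_id` and `timestamp` and the idea of keeping the last-run time as an ISO string are assumptions.

```python
def incremental_events_request(last_run, entity_field="entity_id",
                               time_field="timestamp"):
    """Select events newer than the last run, ordered for bundling by entity."""
    return {
        "query": {"range": {time_field: {"gt": last_run}}},
        "sort": [{entity_field: "asc"}, {time_field: "asc"}],
    }


request = incremental_events_request("2014-06-01T00:00:00Z")
print(request)
```

Filtering on the event's own timestamp would miss events that are indexed late with an older timestamp, which is one reason a failure-handling policy is still needed.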

Questions? @elasticmark

This work is licensed under the Creative Commons Attribution-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nd/4.0/ or send a letter to: Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.
