Slide 1

Slide 1 text

Getting your data Graph-ready
Mark Harwood, developer, Elastic (@elasticmark)
March 8th 2017

Slide 2

Slide 2 text

Topics
1 The data model
2 What makes a useful choice of field to Graph?
3 Mining “collective intelligence”
4 Fraud and forensics
5 What is Graph?
6 Where next?

Slide 3

Slide 3 text

What is Graph?
• Part of X-Pack
• An API
• A Kibana extension

Slide 4

Slide 4 text

The elastic Graph data model

Slide 5

Slide 5 text

It’s different..

Documents:
{ "question": "Changing from port 9200?", "answered_by": "@elasticmark", "tags": ["#elasticsearch", "#kibana"] }
(Diagram: vertices @elasticmark, #elasticsearch, #kibana and 9200.)

Edge records:
NODE1 | RELATIONSHIP | NODE2
@elasticmark | knows | #elasticsearch
…

“Pure” graph databases
• Each relationship must be explicitly defined as an “edge” record
• The database converts node IDs to pointers at write-time
• Any relationship weights must be provided by clients

Elastic Graph
• All relationships are inferred from original business documents
• Documents are indexed without any additional cost
• Relationship weights are inferred from multiple documents

Slide 6

Slide 6 text

Single field, multiple values
{ "myField": [1, 2, 3] }
(Diagram: vertices 1, 2 and 3.)

Slide 7

Slide 7 text

Multiple fields
{ "field1": 1, "field2": [1, 2] }
(Diagram: vertex 1 from field1; vertices 1 and 2 from field2.)

Slide 8

Slide 8 text

Multiple docs
{ "field1": 3, "field2": [2, 1] }
{ "field1": 1, "field2": [1, 2] }
(Diagram: the vertices and links contributed by both documents.)

Slide 9

Slide 9 text

Data problem: same ID, different fields
{ "sender": 3, "recipients": [2, 1] }
{ "sender": 1, "recipients": [1, 2] }
(Diagram: ID 1 appears twice, once as a sender vertex and once as a recipients vertex.)
What if we wanted to think of these as the same vertex?

Slide 10

Slide 10 text

Solution: “copy_to”
{ "sender": 3, "recipients": [2, 1] }
{ "sender": 1, "recipients": [1, 2] }
(Diagram: vertices 1, 2 and 3.)
Use “copy_to”:
"mappings": {
  "communication": {
    "properties": {
      "sender": { "type": "keyword", "copy_to": "participants" },
      "recipients": { "type": "keyword", "copy_to": "participants" },
      "participants": { "type": "keyword" }
    }
  }
}
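
The effect of the mapping above can be sketched locally. This is a minimal simulation in plain Python, not Elasticsearch itself: the `COPY_TO` table and `indexed_terms` helper are hypothetical names that mirror the slide's `copy_to` settings, and vertices are modelled as (field, value) pairs.

```python
from collections import defaultdict

# Hypothetical mirror of the slide's mapping: source field -> copy_to target.
COPY_TO = {"sender": "participants", "recipients": "participants"}


def indexed_terms(doc, copy_to):
    """Return {field: set of values} as the index would see them,
    including values copied into copy_to target fields."""
    terms = defaultdict(set)
    for field, value in doc.items():
        values = value if isinstance(value, list) else [value]
        for v in values:
            terms[field].add(v)
            target = copy_to.get(field)
            if target:
                terms[target].add(v)
    return dict(terms)


docs = [
    {"sender": 3, "recipients": [2, 1]},
    {"sender": 1, "recipients": [1, 2]},
]

# Without copy_to, ID 1 is two distinct vertices: (sender, 1) and (recipients, 1).
role_vertices = {
    (field, v)
    for d in docs
    for field, vs in indexed_terms(d, {}).items()
    for v in vs
}

# With copy_to, each ID is a single vertex in the role-free "participants" field.
participant_vertices = set()
for d in docs:
    participant_vertices |= indexed_terms(d, COPY_TO)["participants"]
```

Graphing on `participants` rather than `sender` or `recipients` is what lets the two roles of ID 1 collapse into one vertex.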

Slide 11

Slide 11 text

What sorts of field produce useful graphs?

Slide 12

Slide 12 text

Useful characteristics of graph fields:
• HIGH CARDINALITY
  Why? Low cardinality fields produce too many links
  Avoid this: gender: male
  Use this: email: "[email protected]"
• HUMAN READABLE
  Why? Long numerics are hard for people to deal with
  Avoid this: eventID: 84539564639464
  Use this: tag: “#Elasticon2017”
• UNAMBIGUOUS
  Why? Ambiguous terms = unreliable links
  Avoid this: movie: “Cape Fear”
  Use this: movie: “[3423] Cape Fear”
Combining IDs with labels is a useful compromise

Slide 13

Slide 13 text

Graph types Support for different perspectives

Slide 14

Slide 14 text

Graph types
• PURPOSE
  Wisdom of crowds: Summarise mass behaviours
  Forensics: Scrutinize individuals
• EACH DOCUMENT REPRESENTS…
  Wisdom of crowds: an unreliable opinion
  Forensics: a fact
• FILTERING OF VERTICES AND LINKS IS…
  Wisdom of crowds: necessary
  Forensics: treasonable
• API SETTINGS
  Wisdom of crowds: use_significance: true, min_doc_count: 3+ (default settings)
  Forensics: use_significance: false, min_doc_count: 1
• EXAMPLE DATASETS
  Wisdom of crowds: MovieLens, LastFm, BestBuy
  Forensics: Crime records, Panama papers
• SAME DATASET, DIFFERENT PERSPECTIVE
  Wisdom of crowds: StackOverflow data could be used to discover the hashtags that define the #elasticsearch ecosystem
  Forensics: StackOverflow data could be used to investigate how users 4543 and 9583 are acting as sock puppets for product X.
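
The two groups of API settings above could be expressed as Graph explore request bodies along these lines. This is a sketch only: `explore_request` is a hypothetical helper, the field names `tags` and `user_id` and the seed queries are assumptions, and the body would still need to be sent to the Graph explore endpoint of your Elasticsearch version.

```python
import json


def explore_request(seed_query, field, forensic=False):
    """Build a Graph explore request body for a single vertex field.

    forensic=False keeps the "wisdom of crowds" defaults from the slide;
    forensic=True switches to use_significance: false, min_doc_count: 1.
    """
    min_doc_count = 1 if forensic else 3
    vertex = {"field": field, "min_doc_count": min_doc_count}
    return {
        "query": seed_query,
        "controls": {"use_significance": not forensic},
        "vertices": [vertex],
        "connections": {"vertices": [dict(vertex)]},
    }


# Wisdom of crowds: which hashtags define the #elasticsearch ecosystem?
crowd = explore_request({"match": {"tags": "#elasticsearch"}}, field="tags")

# Forensics: every link counts when scrutinising users 4543 and 9583.
forensic = explore_request(
    {"terms": {"user_id": ["4543", "9583"]}}, field="user_id", forensic=True
)

print(json.dumps(forensic, indent=2))
```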

Slide 15

Slide 15 text

WISDOM OF CROWDS * BREXIT notwithstanding…

Slide 16

Slide 16 text

The story of the blind men and the elephant
The Blind Men And The Elephant - Poem by John Godfrey Saxe

It was six men of Indostan, to learning much inclined, who went to see the elephant (Though all of them were blind), that each by observation, might satisfy his mind.

The first approached the elephant, and, happening to fall, against his broad and sturdy side, at once began to bawl: 'God bless me! but the elephant, is nothing but a wall!'

The second feeling of the tusk, cried: 'Ho! what have we here, so very round and smooth and sharp? To me 'tis mighty clear, this wonder of an elephant, is very like a spear!'

The third approached the animal, and, happening to take, the squirming trunk within his hands, 'I see,' quoth he, 'the elephant is very like a snake!'

The fourth reached out his eager hand, and felt about the knee: 'What most this wondrous beast is like, is mighty plain,' quoth he; ''Tis clear enough the elephant is very like a tree.'

The fifth, who chanced to touch the ear, Said; 'E'en the blindest man can tell what this resembles most; Deny the fact who can, This marvel of an elephant, is very like a fan!'

The sixth no sooner had begun, about the beast to grope, than, seizing on the swinging tail, that fell within his scope, 'I see,' quoth he, 'the elephant is very like a rope!'

And so these men of Indostan, disputed loud and long, each in his own opinion, exceeding stiff and strong, Though each was partly in the right, and all were in the wrong!

So, oft in theologic wars, the disputants, I ween, tread on in utter ignorance, of what each other mean, and prate about the elephant, not one of them has seen!

Slide 17

Slide 17 text

Your business could have the light switch…?

Slide 18

Slide 18 text

What “elephant” are your users examining?
In the following example let’s assume it is a theoretical map of every single musical artist and how they are musically linked.

Slide 19

Slide 19 text

Each person’s experiences represent glimpses of the elephant
{ "user": 1, "likes": ["mozart", "nirvana", "chopin", …] }
{ "user": 2, "likes": ["foo fighters", "nirvana", "prince", …] }
• One document per user
• Likes are `keyword` terms

Slide 20

Slide 20 text

Their collective experiences form a weighted graph
(Diagram: weighted links between foo fighters, chopin, nirvana and mozart.)
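
How per-user documents add up to a weighted graph can be illustrated with a simple co-occurrence count. This is only a sketch: it uses raw pair counts as the link weight, whereas Graph can also weigh links by significance, and the third user document is hypothetical.

```python
from collections import Counter
from itertools import combinations

docs = [
    {"user": 1, "likes": ["mozart", "nirvana", "chopin"]},
    {"user": 2, "likes": ["foo fighters", "nirvana", "prince"]},
    {"user": 3, "likes": ["mozart", "chopin"]},  # hypothetical extra user
]


def weighted_edges(documents, field="likes"):
    """Count how many documents mention each pair of terms together."""
    weights = Counter()
    for doc in documents:
        # Sort so that each pair has one canonical (a, b) ordering.
        for a, b in combinations(sorted(set(doc[field])), 2):
            weights[(a, b)] += 1
    return weights


edges = weighted_edges(docs)
for (a, b), weight in edges.most_common(3):
    print(f"{a} -- {b}: {weight}")
```

Each additional document that repeats a pairing strengthens that link; pairs seen only once stay weak.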

Slide 21

Slide 21 text

Simple single-item recommendations (no Graph API required)

Slide 22

Slide 22 text

Enablers (Elasticsearch feature: benefit)
• Relevance ranked queries: Finds "people like me"
• Sampler aggregation: Only considers the best-matching people
• Diversified sampler aggregation: Ensures samples are not biased
• Significant terms aggregation: Trims suggestions to statistically significant connections
• Graph API: Uses all of the above in exploration
Demo talk: http://bit.ly/sigsamples
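
The first four enablers can be combined in a single search request without the Graph API. The sketch below only builds the request body; `recommendation_request` is a hypothetical helper, and the `likes` field, the sample size of 200 and the `country` diversification field are assumptions.

```python
def recommendation_request(liked, field="likes", sample_size=200, diversify_on=None):
    """Build a search body that suggests terms significantly associated
    with the best-matching "people like me"."""
    sampler_type = "diversified_sampler" if diversify_on else "sampler"
    sampler = {"shard_size": sample_size}
    if diversify_on:
        # Diversify the sample so one value of this field cannot dominate it.
        sampler["field"] = diversify_on
    return {
        "size": 0,
        # Relevance-ranked query: people sharing more likes score higher.
        "query": {"bool": {"should": [{"term": {field: item}} for item in liked]}},
        "aggs": {
            "best_matches": {
                # Only the top-scoring documents per shard are analysed.
                sampler_type: sampler,
                "aggs": {
                    "suggestions": {
                        # Skip what the person already likes.
                        "significant_terms": {"field": field, "exclude": list(liked)}
                    }
                },
            }
        },
    }


plain = recommendation_request(["nirvana", "mozart"])
diverse = recommendation_request(["nirvana"], diversify_on="country")
```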

Slide 23

Slide 23 text

Advanced recommendations using Graph: personas
• Each person has multiple “personas”
• Different tastes/perspectives
• Important to segment these tastes and market to them separately… *
* (I don’t know what tech Spotify use to do this but it’s based on the same principle)

Slide 24

Slide 24 text

Let’s rethink a single person’s “likes” as a graph:
{ "user": 2142, "likes": ["fugazi", "portishead", "bad brains", "dj shadow"] }
(Diagram: vertices fugazi, dj shadow, bad brains and portishead.)

Slide 25

Slide 25 text

The person’s “likes” suggest the bands are related
(Diagram: fugazi, dj shadow, bad brains and portishead, all linked.)

Slide 26

Slide 26 text

Adding related “likes” of others makes a weighted graph…
(Diagram: fugazi, dj shadow, bad brains and portishead.)

Slide 27

Slide 27 text

..we can measure the real strength of the relationships
(Diagram: fugazi, dj shadow, bad brains and portishead.)

Slide 28

Slide 28 text

And if we remove the weak links…
(Diagram: fugazi, dj shadow, bad brains and portishead.)

Slide 29

Slide 29 text

..we reveal the person’s different personas
(Diagram: fugazi, dj shadow, bad brains and portishead.)
Example code: http://bit.ly/persona_graph
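
The persona steps from the last few slides (weigh the links, remove the weak ones, see what groups remain) can be sketched in plain Python. This is not the linked example code: the link weights and the 0.5 threshold are invented for illustration, and `personas` is a hypothetical helper that finds connected components.

```python
from collections import defaultdict

# Hypothetical link strengths learned from other people's likes.
weights = {
    ("bad brains", "fugazi"): 0.9,
    ("dj shadow", "portishead"): 0.8,
    ("fugazi", "portishead"): 0.1,
    ("dj shadow", "fugazi"): 0.1,
    ("bad brains", "dj shadow"): 0.05,
    ("bad brains", "portishead"): 0.05,
}


def personas(likes, weights, min_weight):
    """Drop links weaker than min_weight, then return the connected groups."""
    neighbours = defaultdict(set)
    for (a, b), weight in weights.items():
        if weight >= min_weight:
            neighbours[a].add(b)
            neighbours[b].add(a)
    seen, groups = set(), []
    for start in likes:
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:  # depth-first walk over the surviving links
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbours[node] - group)
        seen |= group
        groups.append(group)
    return groups


likes = ["fugazi", "portishead", "bad brains", "dj shadow"]
result = personas(likes, weights, min_weight=0.5)
print(result)
```

With the weak links kept (`min_weight=0`) all four bands form one group; with them removed, two separate personas appear.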

Slide 30

Slide 30 text

Using collective intelligence for classification

Slide 31

Slide 31 text

Classification: a translation problem
• Unstructured: Informality, Freedom of expression, Ambiguity
• Structured: Formality, Conformity, Consistency

Slide 32

Slide 32 text

What happens whenever structured meets unstructured?

Slide 33

Slide 33 text

We learn.
{ "user_search": "mixer", "clicked_SKU": 59202823 }
Unstructured: user_search. Structured: clicked_SKU.

Slide 34

Slide 34 text

We learn.
{ "applicant_job_title": "warehouse manager", "assigned_job_code": 59202823 }
Unstructured: applicant_job_title. Structured: assigned_job_code.

Slide 35

Slide 35 text

We learn.
{ "tweet": "Dans ce webinar @jpountz vous …", "user_nationality": "fr" }
Unstructured: tweet. Structured: user_nationality.

Slide 36

Slide 36 text

Each pairing doc adds weight to the connections

Slide 37

Slide 37 text

Ambiguity is a recognisable shape…
(Diagram label: Ambiguous word or phrase!)

Slide 38

Slide 38 text

Ambiguity is a recognisable shape…

Manual detection using the UI
Use case: HR consultancy advises companies if they are paying people industry-rates. First they take client’s job titles and search against standard taxonomy - graph visualisations help consultants disambiguate.
Example ambiguity: People with “warehouse manager” in their job title might be data warehouse specialists or people who manage physical warehouses.

Automated detection using the API
Use case: Most popular user queries on an e-commerce site are fed as queries to the Graph API. The response is analysed programmatically using networkx. Any ambiguous queries generate “did you mean?” suggestions required for user interventions.
Example ambiguity: People who search for “mixer” might be searching for DJ mixers or food mixers.
See http://bit.ly/esDidYouMean
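
One way to recognise the "ambiguous shape" programmatically is to remove the query term from the graph and count how many disconnected groups its neighbours fall into. The sketch below is a dependency-free stand-in for the networkx analysis mentioned above, not the linked example; the edge list, the `query:` and `sku:` labels and the `count_senses` helper are all hypothetical.

```python
from collections import defaultdict


def count_senses(term, edges):
    """Count the disconnected groups among a term's neighbours
    once the term itself is taken out of the graph."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    seen, senses = set(), 0
    for start in set(graph[term]):
        if start in seen:
            continue
        senses += 1  # a neighbour not reachable from earlier ones: a new sense
        stack = [start]
        while stack:
            node = stack.pop()
            if node in seen or node == term:
                continue
            seen.add(node)
            stack.extend(graph[node])
    return senses


# Hypothetical connections, as might be assembled from Graph API responses.
edges = [
    ("query:mixer", "sku:dj-mixer-1"),
    ("query:mixer", "sku:dj-mixer-2"),
    ("query:mixer", "sku:food-mixer-1"),
    ("sku:dj-mixer-1", "sku:dj-mixer-2"),
    ("query:blender", "sku:food-mixer-1"),
    ("query:blender", "sku:food-mixer-2"),
    ("sku:food-mixer-1", "sku:food-mixer-2"),
]

for query in ("query:mixer", "query:blender"):
    senses = count_senses(query, edges)
    print(query, "is ambiguous" if senses > 1 else "is unambiguous")
```

A term whose neighbours split into two or more groups is a candidate for a “did you mean?” prompt.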

Slide 39

Slide 39 text

Options for unstructured data as nodes
• `keyword` type is suited for small strings
  • Suitable for short strings like search terms
  • Works with docvalues
  • See new “normalizer” option
• `text` type breaks strings into tokens. Suited to longer free-text but has pros and cons:
  • Con: can’t use docvalues so need to enable fielddata=true which can be expensive.
  • Pro: can use ngrams to retain more of the meaning in word combos e.g. “data warehouse”
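
The two options above might translate into index settings like the following. Treat this as a sketch under several assumptions: the field names are borrowed from earlier slides, the normalizer and analyzer names are invented, word-level n-grams are implemented here with a `shingle` token filter, and the mapping is written in the typeless form used by recent Elasticsearch versions.

```python
import json

index_body = {
    "settings": {
        "analysis": {
            # keyword option: normalise case so "Mixer" and "mixer" are one vertex.
            "normalizer": {
                "lowercase_normalizer": {"type": "custom", "filter": ["lowercase"]}
            },
            # text option: emit two-word combinations such as "data warehouse".
            "filter": {
                "word_pairs": {
                    "type": "shingle",
                    "min_shingle_size": 2,
                    "max_shingle_size": 2,
                }
            },
            "analyzer": {
                "shingled": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "word_pairs"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            # Short strings: keyword with doc values (the default) and a normalizer.
            "user_search": {"type": "keyword", "normalizer": "lowercase_normalizer"},
            # Longer free text: fielddata must be enabled, which can be expensive.
            "applicant_job_title": {
                "type": "text",
                "analyzer": "shingled",
                "fielddata": True,
            },
        }
    },
}

print(json.dumps(index_body, indent=2))
```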

Slide 40

Slide 40 text

FRAUD AND FORENSICS

Slide 41

Slide 41 text

Components of a fraud detection stack
• Ingest: Cleansing, enriching, normalisation
• Linking: Entity resolution, filtering
• Risk-scoring: Graph exploration, anomaly detection, scoring
• Investigation: Task lists, case management, visualisation
Outcomes

Slide 42

Slide 42 text

What terms help “join-the-dots”?
• Strong IDs
  • Email addresses
  • Bank account numbers
• Noisy data (treated)
  • Expansions e.g. libpostal for addresses
  • Normalizations e.g. synonyms, Basis
• Strong keys from weak components
  • IP address + yyyyMMdd + HH
  • Zip code + DOB + surname
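
The "strong keys from weak components" idea can be sketched as a pair of small functions applied at ingest time. The `|` separator, the normalisation choices and the function names are assumptions made for illustration.

```python
from datetime import date, datetime


def ip_hour_key(ip, when):
    """IP address + yyyyMMdd + HH: the same address within the same hour."""
    return f"{ip}|{when.strftime('%Y%m%d')}|{when.strftime('%H')}"


def person_key(zip_code, dob, surname):
    """Zip code + DOB + surname, lightly normalised so variants still match."""
    zip_part = zip_code.replace(" ", "").upper()
    return f"{zip_part}|{dob.strftime('%Y%m%d')}|{surname.strip().lower()}"


print(ip_hour_key("10.0.0.1", datetime(2017, 3, 8, 14, 30)))
print(person_key("ab1 2cd", date(1980, 1, 2), " Smith "))
```

Each component alone is shared by many unrelated records; concatenated, they produce a high-cardinality term that links only records with a strong chance of being related.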

Slide 43

Slide 43 text

Fraudsters make strange shapes
It is hard for identity manipulators to avoid reusing resources (telephones, addresses, vehicles, accounts, time). Fraudsters generate too many “coincidences”.
Use the Graph API to gather related data then raise alerts on anomalies.
See example: http://bit.ly/es_fraud
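
Counting "coincidences" can be illustrated on a handful of records. This sketch works on a local list rather than on Graph API results, and the claim records, field names and alert threshold are all invented; it is not the linked example.

```python
from collections import defaultdict

# Hypothetical records that reuse telephones and addresses.
claims = [
    {"claim_id": "c1", "phone": "555-0100", "address": "1 high st"},
    {"claim_id": "c2", "phone": "555-0100", "address": "9 low rd"},
    {"claim_id": "c3", "phone": "555-0199", "address": "9 low rd"},
    {"claim_id": "c4", "phone": "555-0142", "address": "7 mid ln"},
]


def shared_resources(records, id_field, resource_fields):
    """Map each (field, value) resource to the records using it,
    keeping only resources used by more than one record."""
    users = defaultdict(set)
    for rec in records:
        for field in resource_fields:
            users[(field, rec[field])].add(rec[id_field])
    return {res: ids for res, ids in users.items() if len(ids) > 1}


def alerts(records, id_field, resource_fields, threshold):
    """Flag records linked to at least `threshold` other records."""
    linked = defaultdict(set)
    for ids in shared_resources(records, id_field, resource_fields).values():
        for record_id in ids:
            linked[record_id] |= ids - {record_id}
    return sorted(r for r, others in linked.items() if len(others) >= threshold)


flagged = alerts(claims, "claim_id", ["phone", "address"], threshold=2)
print(flagged)
```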

Slide 44

Slide 44 text

Responding to alerts
Kibana with the Graph plugin allows investigators to examine details behind alerts.
See example: http://bit.ly/es_fraud

Slide 45

Slide 45 text

Recording investigation outcomes
Sign-posting risk in a road network: Hazards should be advertised in their surrounding areas. Traffic can flow more freely in all of the other unmarked areas.
Sign-posting risk in a social network:
{ "risk_id": 545, "customer_id": 12 }
{ "risk_id": 545, "risk_level": 4, "risk_type": "identity manipulation", "case_owner": "investigator 16" }
{ "risk_id": 545, "customer_id": 99 }
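
The three documents above share a `risk_id`, so a risk marker can be reached from any customer attached to it. A minimal local sketch of that lookup follows; the helper names are hypothetical, and in practice the join would be a query or Graph exploration rather than a list scan.

```python
docs = [
    {"risk_id": 545, "customer_id": 12},
    {"risk_id": 545, "risk_level": 4, "risk_type": "identity manipulation",
     "case_owner": "investigator 16"},
    {"risk_id": 545, "customer_id": 99},
]


def risks_for_customer(customer_id, documents):
    """Return the risk marker documents sign-posted against a customer."""
    risk_ids = {d["risk_id"] for d in documents if d.get("customer_id") == customer_id}
    return [d for d in documents if d["risk_id"] in risk_ids and "risk_level" in d]


def customers_sharing_risk(customer_id, documents):
    """Return the other customers attached to the same risk markers."""
    risk_ids = {d["risk_id"] for d in documents if d.get("customer_id") == customer_id}
    return {
        d["customer_id"]
        for d in documents
        if d["risk_id"] in risk_ids and "customer_id" in d
    } - {customer_id}


print(risks_for_customer(12, docs))
print(customers_sharing_risk(12, docs))
```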

Slide 46

Slide 46 text

Where next?

Slide 47

Slide 47 text

More details behind connections, more perspectives
adjacency_matrix aggregation = graphs over time, visualizations with nested aggregations…
(Diagram: a graph whose links are labelled with document counts.)
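
A request combining `adjacency_matrix` with a nested aggregation might look like the sketch below, which reads "graphs over time" as nesting a date histogram under each connection. That reading, the `participants` and `timestamp` field names and the `adjacency_request` helper are assumptions; `calendar_interval` is the parameter name in recent Elasticsearch versions and older releases used `interval`.

```python
def adjacency_request(field, values, date_field="timestamp", interval="month"):
    """Build a search body that counts documents for each vertex and each
    pair of vertices, broken down over time."""
    return {
        "size": 0,
        "aggs": {
            "interactions": {
                "adjacency_matrix": {
                    # One named filter per vertex; buckets are returned for each
                    # filter and for each intersecting pair of filters.
                    "filters": {str(v): {"term": {field: v}} for v in values}
                },
                "aggs": {
                    "over_time": {
                        "date_histogram": {
                            "field": date_field,
                            "calendar_interval": interval,
                        }
                    }
                },
            }
        },
    }


request = adjacency_request("participants", [1, 2, 3])
```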

Slide 48

Slide 48 text

SUMMARY

Slide 49

Slide 49 text

Tips review
• Use “role-free” fields e.g. prefer “account_number” to “payer” and “payee”
• Choose fields that are high cardinality, unambiguous and human-readable (combining IDs with their labels helps)
• Make strong keys by concatenating weak components
• Use the right graph query settings: the default min_doc_count and use_significance settings are for wisdom of crowds rather than forensic type graphs.
• Diversify samples on useful terms e.g. source+destination IP address pairs
• Annotate your graphs by associating risk markers
• Use existing graph libraries e.g. networkx to spot interesting patterns in results of Graph API calls.
• To get added details behind connections use the new adjacency_matrix aggregation.
• Use drill-down URLs in the Graph UI to provide more details on selected vertices (maps, timelines, heatmaps…)

Slide 50

Slide 50 text

More Questions? Visit us at the AMA

Slide 51

Slide 51 text

www.elastic.co

Slide 52

Slide 52 text

Except where otherwise noted, this work is licensed under http://creativecommons.org/licenses/by-nd/4.0/
Creative Commons and the double C in a circle are registered trademarks of Creative Commons in the United States and other countries. Third party marks and brands are the property of their respective holders.
Please attribute Elastic with a link to elastic.co