Getting Your Data Graph-Ready

Knowing what sort of data makes sense to put in Graph and how to prepare it is often a challenge for new users. This session will walk through examples of how to model your data in order to start exploring the interesting connections it contains. Learn about models for “wisdom of crowd” style applications and configurations to support “forensic” style investigations.

Mark Harwood | Graph Creator | Elastic

Elastic Co

March 09, 2017

Transcript

  1. Topics
    1 The data model
    2 What makes a useful choice of field to Graph?
    3 Mining “collective intelligence”
    4 Fraud and forensics
    5 What is Graph?
    6 Where next?
  2. What is Graph?
    • Part of X-Pack
    • An API
    • A Kibana extension
  3. It’s different..
    Documents: { "question": "Changing from port 9200?", "answered_by": "@elasticmark", "tags": ["#elasticsearch", "#kibana"] }
    Elastic Graph (the diagram shows the terms #elasticsearch, @elasticmark, #kibana and 9200 as vertices):
    • All relationships are inferred from original business documents
    • Documents are indexed without any additional cost
    • Relationship weights are inferred from multiple documents
    “Pure” graph databases (edge records of the form NODE1, RELATIONSHIP, NODE2, e.g. @elasticmark knows #elasticsearch …):
    • Each relationship must be explicitly defined as an “edge” record
    • The database converts node IDs to pointers at write-time
    • Any relationship weights must be provided by clients
  4. Multiple docs
    { "field1": 3, "field2": [2, 1] }
    { "field1": 1, "field2": [1, 2] }
    [Diagram: the vertices and links produced by these two documents]
  5. Data problem: same ID, different fields
    { "sender": 3, "recipients": [2, 1] }
    { "sender": 1, "recipients": [1, 2] }
    [Diagram: ID 1 appears as two separate vertices, one per field]
    What if we wanted to think of these as the same vertex?
  6. Solution: “copy_to”
    { "sender": 3, "recipients": [2, 1] }
    { "sender": 1, "recipients": [1, 2] }
    [Diagram: vertices 2, 1 and 3]
    Use “copy_to”:
    "mappings": { "communication": { "properties": { "sender": { "type": "keyword", "copy_to": "participants" }, "recipients": { "type": "keyword", "copy_to": "participants" }, "participants": { "type": "keyword" } } } }
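
The effect of the mapping above can be imitated in plain Python. This is only a sketch of the idea, not Elasticsearch code: the function name `apply_copy_to` and the `COPY_RULES` dict are hypothetical, and the real copying happens inside Elasticsearch at index time.

```python
def apply_copy_to(doc, copy_rules):
    """Return the values each indexed field would hold after copy_to.

    copy_rules maps a source field name to its copy_to target field.
    This only imitates the indexing behaviour for illustration.
    """
    indexed = {}
    for field, value in doc.items():
        values = value if isinstance(value, list) else [value]
        indexed.setdefault(field, set()).update(values)
        target = copy_rules.get(field)
        if target is not None:
            # The same values also land in the shared target field.
            indexed.setdefault(target, set()).update(values)
    return indexed


# Mirrors the slide: both fields are copied into "participants".
COPY_RULES = {"sender": "participants", "recipients": "participants"}

docs = [
    {"sender": 3, "recipients": [2, 1]},
    {"sender": 1, "recipients": [1, 2]},
]
indexed_docs = [apply_copy_to(doc, COPY_RULES) for doc in docs]
```

Graphing the `participants` field then treats ID 1 as one vertex whether it appeared as a sender or as a recipient.
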
  7. Useful characteristics of graph fields:
    • HIGH CARDINALITY
      Why? Low cardinality fields produce too many links
      Avoid this: gender: male
      Use this: email: "[email protected]"
    • HUMAN READABLE
      Why? Long numerics are hard for people to deal with
      Avoid this: eventID: 84539564639464
      Use this: tag: “#Elasticon2017”
    • UNAMBIGUOUS
      Why? Ambiguous terms = unreliable links
      Avoid this: movie: “Cape Fear”
      Use this: movie: “[3423] Cape Fear”
    Combining IDs with labels is a useful compromise
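
A minimal sketch of the “ID plus label” compromise, assuming the `[id] label` layout shown in the movie example. The helper names `make_vertex_term` and `split_vertex_term` are hypothetical.

```python
import re


def make_vertex_term(item_id, label):
    """Combine an unambiguous ID with a human-readable label."""
    return f"[{item_id}] {label}"


# Matches terms of the form "[<id>] <label>".
_TERM = re.compile(r"^\[(?P<id>[^\]]+)\]\s(?P<label>.*)$")


def split_vertex_term(term):
    """Recover the ID and label from a combined term."""
    match = _TERM.match(term)
    if match is None:
        raise ValueError(f"not an '[id] label' term: {term!r}")
    return match.group("id"), match.group("label")


term = make_vertex_term(3423, "Cape Fear")
```

The combined string is what would be indexed in the `keyword` field, so two different films with the same title stay distinct vertices while remaining readable in the UI.
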
  8. Graph types
    WISDOM OF CROWDS
    • Purpose: summarise mass behaviours
    • Each document represents: an unreliable opinion
    • Filtering of vertices and links is: necessary
    • API settings: use_significance: true, min_doc_count: 3+ (default settings)
    • Example datasets: MovieLens, LastFm, BestBuy
    • Same dataset, different perspective: StackOverflow data could be used to discover the hashtags that define the #elasticsearch ecosystem
    FORENSICS
    • Purpose: scrutinize individuals
    • Each document represents: a fact
    • Filtering of vertices and links is: treasonable
    • API settings: use_significance: false, min_doc_count: 1
    • Example datasets: Crime records, Panama papers
    • Same dataset, different perspective: StackOverflow data could be used to investigate how users 4543 and 9583 are acting as sock puppets for product X
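
The two sets of API settings can be sketched as Graph explore request bodies built in Python. The helper `explore_request`, the field name `tags` and the seed queries are assumptions for illustration; the body would be sent to the Graph explore endpoint of an index, whose exact path depends on the Elasticsearch version.

```python
def explore_request(seed_query, field, forensic=False, size=10):
    """Build a Graph explore request body for one field.

    forensic=False keeps the "wisdom of crowds" defaults
    (use_significance: true, min_doc_count: 3).
    forensic=True treats every document as a fact
    (use_significance: false, min_doc_count: 1).
    """
    vertex = {
        "field": field,
        "size": size,
        "min_doc_count": 1 if forensic else 3,
    }
    return {
        "query": seed_query,
        "controls": {"use_significance": not forensic},
        "vertices": [vertex],
        # One hop out from the seed vertices, on the same field.
        "connections": {"vertices": [dict(vertex)]},
    }


# Hypothetical seed queries against a StackOverflow-style index.
crowd_body = explore_request({"match": {"tags": "#elasticsearch"}}, "tags")
forensic_body = explore_request(
    {"terms": {"user_id": ["4543", "9583"]}}, "tags", forensic=True
)
```
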
  9. The story of the blind men and the elephant
    The Blind Men And The Elephant - Poem by John Godfrey Saxe
    It was six men of Indostan, to learning much inclined, who went to see the elephant (Though all of them were blind), that each by observation, might satisfy his mind.
    The first approached the elephant, and, happening to fall, against his broad and sturdy side, at once began to bawl: 'God bless me! but the elephant, is nothing but a wall!'
    The second feeling of the tusk, cried: 'Ho! what have we here, so very round and smooth and sharp? To me tis mighty clear, this wonder of an elephant, is very like a spear!'
    The third approached the animal, and, happening to take, the squirming trunk within his hands, 'I see,' quoth he, 'the elephant is very like a snake!'
    The fourth reached out his eager hand, and felt about the knee: 'What most this wondrous beast is like, is mighty plain,' quoth he; 'Tis clear enough the elephant is very like a tree.'
    The fifth, who chanced to touch the ear, said: 'E'en the blindest man can tell what this resembles most; Deny the fact who can, This marvel of an elephant, is very like a fan!'
    The sixth no sooner had begun, about the beast to grope, than, seizing on the swinging tail, that fell within his scope, 'I see,' quoth he, 'the elephant is very like a rope!'
    And so these men of Indostan, disputed loud and long, each in his own opinion, exceeding stiff and strong, Though each was partly in the right, and all were in the wrong!
    So, oft in theologic wars, the disputants, I ween, tread on in utter ignorance, of what each other mean, and prate about the elephant, not one of them has seen!
  10. What “elephant” are your users examining?
    In the following example let’s assume it is a theoretical map of every single musical artist and how they are musically linked.
  11. Each person’s experiences represent glimpses of the elephant
    { "user": 1, "likes": ["mozart", "nirvana", "chopin", …] }
    { "user": 2, "likes": ["foo fighters", "nirvana", "prince", …] }
    • One document per user
    • Likes are `keyword` terms
  12. Enablers
    ELASTICSEARCH FEATURE: BENEFIT
    • Relevance ranked queries: finds "people like me"
    • Sampler aggregation: only considers the best-matching people
    • Diversified sampler aggregation: ensures samples are not biased
    • Significant terms aggregation: trims suggestions to statistically significant connections
    • Graph API: uses all of the above in exploration
    Demo talk: http://bit.ly/sigsamples
  13. Advanced recommendations using Graph: personas
    • Each person has multiple “personas”
    • Different tastes/perspectives
    • Important to segment these tastes and market to them separately… *
    * (I don’t know what tech Spotify use to do this but it’s based on the same principle)
  14. Let’s rethink a single person’s “likes” as a graph:
    { "user": 2142, "likes": ["fugazi", "portishead", "bad brains", "dj shadow"] }
    [Diagram: vertices fugazi, dj shadow, bad brains, portishead]
  15. ..we can measure the real strength of the relationships
    [Diagram: vertices fugazi, dj shadow, bad brains, portishead]
  16. ..we reveal the person’s different personas
    [Diagram: vertices fugazi, dj shadow, bad brains, portishead]
    Example code: http://bit.ly/persona_graph
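
The persona idea in the last three slides can be sketched with the standard library alone. This is not the code behind the link above: it substitutes raw co-occurrence counts for the Graph API's significance scoring, the `crowd` documents are invented sample data, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def link_strengths(likes, crowd):
    """Count how many crowd documents contain each pair of the user's likes."""
    wanted = set(likes)
    counts = Counter()
    for other_likes in crowd:
        shared = sorted(wanted & set(other_likes))
        counts.update(combinations(shared, 2))
    return counts


def personas(likes, crowd, min_doc_count=2):
    """Group likes into clusters joined by sufficiently strong links."""
    neighbours = {artist: set() for artist in likes}
    for (a, b), count in link_strengths(likes, crowd).items():
        if count >= min_doc_count:
            neighbours[a].add(b)
            neighbours[b].add(a)
    groups, seen = [], set()
    for artist in likes:
        if artist in seen:
            continue
        # Depth-first walk collects one connected cluster.
        stack, group = [artist], set()
        while stack:
            node = stack.pop()
            if node not in group:
                group.add(node)
                stack.extend(neighbours[node] - group)
        seen |= group
        groups.append(group)
    return groups


likes = ["fugazi", "portishead", "bad brains", "dj shadow"]
# Invented sample of other users' likes, for illustration only.
crowd = [
    ["fugazi", "bad brains", "minor threat"],
    ["bad brains", "fugazi"],
    ["portishead", "dj shadow", "massive attack"],
    ["dj shadow", "portishead"],
    ["fugazi", "portishead"],
]
result = personas(likes, crowd)
```

With this sample the single fugazi/portishead co-occurrence falls below `min_doc_count`, so the likes split into two clusters, each standing for one persona.
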
  17. We learn.
    { "tweet": "Dans ce webinar @jpountz vous …", "user_nationality": "fr" }
    • tweet: unstructured
    • user_nationality: structured
  18. Ambiguity is a recognisable shape…
    Manual detection using the UI
    • Use case: an HR consultancy advises companies on whether they are paying people industry rates. First they take the client’s job titles and search against a standard taxonomy; graph visualisations help consultants disambiguate.
    • Example ambiguity: people with “warehouse manager” in their job title might be data warehouse specialists or people who manage physical warehouses.
    Automated detection using the API
    • Use case: the most popular user queries on an e-commerce site are fed as queries to the Graph API. The response is analysed programmatically using networkx. Ambiguous queries generate “did you mean?” suggestions that ask the user to intervene.
    • Example ambiguity: people who search for “mixer” might be searching for DJ mixers or food mixers.
    See http://bit.ly/esDidYouMean
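
A rough sketch of the automated check, assuming ambiguity shows up as more than one disconnected cluster in a Graph API response. The slide uses networkx; this version uses a small union-find from scratch so it runs with the standard library only. The sample response is invented, and the rule "more than one cluster means ambiguous" is a simplifying assumption rather than the method behind the link above.

```python
def clusters(response):
    """Group vertex terms into connected clusters.

    Expects the Graph API response layout: a "vertices" list and a
    "connections" list whose "source" and "target" are vertex indexes.
    """
    vertices = response["vertices"]
    parent = list(range(len(vertices)))

    def find(i):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for link in response["connections"]:
        root_a, root_b = find(link["source"]), find(link["target"])
        if root_a != root_b:
            parent[root_b] = root_a

    grouped = {}
    for index, vertex in enumerate(vertices):
        grouped.setdefault(find(index), set()).add(vertex["term"])
    return list(grouped.values())


def is_ambiguous(response):
    """Flag a query whose related terms fall into separate clusters."""
    return len(clusters(response)) > 1


# Invented response for a "mixer" search, for illustration only.
sample = {
    "vertices": [
        {"field": "query", "term": "dj"},
        {"field": "query", "term": "turntable"},
        {"field": "query", "term": "food"},
        {"field": "query", "term": "blender"},
    ],
    "connections": [
        {"source": 0, "target": 1},
        {"source": 2, "target": 3},
    ],
}
found = clusters(sample)
```

Each cluster could then seed one “did you mean?” suggestion. A production version would probably ignore single-vertex clusters, which this sketch counts like any other.
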
  19. Options for unstructured data as nodes
    • `keyword` type is suited to short strings such as search terms
      • Works with doc values
      • See the new “normalizer” option
    • `text` type breaks strings into tokens. Suited to longer free text but has pros and cons:
      • Con: can’t use doc values, so fielddata=true must be enabled, which can be expensive
      • Pro: can use ngrams to retain more of the meaning in word combinations, e.g. “data warehouse”
  20. Components of a fraud detection stack
    • Ingest: cleansing, enriching, normalisation
    • Linking: entity resolution, filtering
    • Risk-scoring: graph exploration, anomaly detection, scoring
    • Investigation: task lists, case management, visualisation
    • Outcomes
  21. What terms help “join-the-dots”?
    • Strong IDs
      • Email addresses
      • Bank account numbers
    • Noisy data (treated)
      • Expansions e.g. libpostal for addresses
      • Normalizations e.g. synonyms, Basis
    • Strong keys from weak components
      • IP address + yyyyMMdd + HH
      • Zip code + DOB + surname
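
The “strong keys from weak components” bullets can be sketched as two small helpers. The `|` separator, the clean-up rules and the function names are assumptions chosen for illustration; only the component lists come from the slide.

```python
from datetime import date, datetime


def ip_hour_key(ip_address, when):
    """IP address + yyyyMMdd + HH: who shared an address in the same hour."""
    return f"{ip_address}|{when:%Y%m%d}|{when:%H}"


def person_key(zip_code, date_of_birth, surname):
    """Zip code + DOB + surname, lightly normalised before joining."""
    clean_zip = zip_code.replace(" ", "").upper()
    clean_surname = surname.strip().lower()
    return f"{clean_zip}|{date_of_birth:%Y%m%d}|{clean_surname}"


session_key = ip_hour_key("10.1.2.3", datetime(2017, 3, 9, 14, 30))
identity_key = person_key("sw1a 1aa", date(1980, 1, 2), " Smith ")
```

Each key would be indexed as a `keyword` field on the document, so that records sharing all of the weak components end up linked through a single high-cardinality vertex.
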
  22. Fraudsters make strange shapes
    It is hard for identity manipulators to avoid reusing resources (telephones, addresses, vehicles, accounts, time). Fraudsters generate too many “coincidences”. Use the Graph API to gather related data, then raise alerts on anomalies.
    See example: http://bit.ly/es_fraud
  23. Responding to alerts
    Kibana with the Graph plugin allows investigators to examine details behind alerts.
    See example: http://bit.ly/es_fraud
  24. Recording investigation outcomes
    Sign-posting risk in a road network: hazards should be advertised in their surrounding areas. Traffic can flow more freely in all of the other unmarked areas.
    Sign-posting risk in a social network:
    { "risk_id": 545, "customer_id": 12 }
    { "risk_id": 545, "risk_level": 4, "risk_type": "identity manipulation", "case_owner": "investigator 16" }
    { "risk_id": 545, "customer_id": 99 }
  25. More details behind connections, more perspectives
    • adjacency_matrix aggregation: graphs over time
    • Visualizations with nested aggregations…
    [Figure: example graph with document counts on its links]
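
A sketch of a search body that nests an `adjacency_matrix` aggregation under a `date_histogram`, which is one way to get the “graphs over time” view mentioned above. The helper name, field names and group contents are hypothetical. The `interval` parameter matches Elasticsearch releases from the time of this talk; later versions use `calendar_interval` or `fixed_interval` instead.

```python
def adjacency_over_time(field, groups, date_field, interval="day"):
    """Build a search body counting documents per group and per group pair.

    groups maps a filter name to the list of terms that define it.
    Each time bucket gets its own adjacency matrix.
    """
    filters = {
        name: {"terms": {field: terms}} for name, terms in groups.items()
    }
    return {
        "size": 0,  # only aggregation results are needed
        "aggs": {
            "over_time": {
                "date_histogram": {"field": date_field, "interval": interval},
                "aggs": {
                    "interactions": {
                        "adjacency_matrix": {"filters": filters}
                    }
                },
            }
        },
    }


# Hypothetical vertices selected in a graph, grouped for comparison.
body = adjacency_over_time(
    "participants",
    {"grpA": ["1", "2"], "grpB": ["3"]},
    date_field="sent_at",
)
```

In the response each time bucket holds entries such as `grpA`, `grpB` and `grpA&grpB`, the last being the count of documents behind the connection between the two groups.
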
  26. Tips review
    • Use “role-free” fields e.g. prefer “account_number” to “payer” and “payee”
    • Choose fields that are high cardinality, unambiguous and human-readable (combining IDs with their labels helps)
    • Make strong keys by concatenating weak components
    • Use the right graph query settings: the default min_doc_count and use_significance settings are for wisdom of crowds rather than forensic type graphs
    • Diversify samples on useful terms e.g. source+destination IP address pairs
    • Annotate your graphs by associating risk markers
    • Use existing graph libraries e.g. networkx to spot interesting patterns in the results of Graph API calls
    • To get added details behind connections use the new adjacency_matrix aggregation
    • Use drill-down URLs in the Graph UI to provide more details on selected vertices (maps, timelines, heatmaps…)
  27. Except where otherwise noted, this work is licensed under http://creativecommons.org/licenses/by-nd/4.0/
    Creative Commons and the double C in a circle are registered trademarks of Creative Commons in the United States and other countries. Third party marks and brands are the property of their respective holders.
    Please attribute Elastic with a link to elastic.co