Neo4J Tales from the Trenches: A Recommendation Engine Case Study

Slides accompanying a talk given with Michal Bachman at SkillsMatter. Video is available here:
Part 1: https://skillsmatter.com/skillscasts/3206-neo4j-tales
Part 2: https://skillsmatter.com/skillscasts/3341-neo4j-tales-pt2

Nicki Watt

April 25, 2012

Transcript

  1. Neo4J – Tales from the Trenches: A Recommendation Engine Case Study
    Michal Bachman & Nicki Watt (@bachmanm & @techiewatt)
  2. Who we are …
    [Diagram: a small graph of the speakers. Nodes: Nicki Watt, Michal Bachman, OpenCredo, Neo4J, Opigram. Relationship labels: works for, colleague of, partner of, uses, works on. Property shown on both speakers: role = “consultant”.]
  3. Opigram
    [Diagram: Panelists provide Opinions about Things and about themselves; Opigram generates Recommendations / Interesting Insights.]
    •  People who like … also tend to like …
    •  People who like … tend to support …
    •  People who like … describe themselves as …
  4. Problem Scale
    •  ~150k panelists (a.k.a. users)
    •  ~100k “things” (movies, books, …)
    •  ~8M relationships
    •  All of that for the UK only
    •  Soon to be launched in the US
  5. Neo4J
    •  Graph Database
    •  Schema-less (NoSQL)
    •  Vertices and Edges
    •  a.k.a. Nodes and Relationships
    •  Traversals
  6. Neo4J
    [Diagram: the graph from slide 2 repeated. Nodes: Nicki Watt, Michal Bachman, OpenCredo, Neo4J, Opigram. Relationship labels: works for, colleague of, partner of, uses, works on. Property shown on both speakers: role = “consultant”.]
  7. Opigram + Neo4J
    •  Taxonomy of “things”
    •  Opinions on “things”
    •  Recommendations
    •  Offline “Crunching”
  8. Lessons Learned
    •  Everyone loves Neo4J! Find praise online
    •  “Trenches Talk”
        – insight into some real challenges
        – approaches to solutions
    •  We have 5 practical lessons for you
        – Tips
        – Tricks
        – Troubles
  9. Lessons Learned
    •  Lesson 1: Graph “Schema”
    •  Lesson 2: Neo Node IDs
    •  Lesson 3: Graph-wide Operations
    •  Lesson 4: Extracting Randomised Data
    •  Lesson 5: Multi-threading
  10. [Diagram: Michal has a review relationship to Pulp Fiction (type Movie). The relationship carries text = “…” and descriptors = Cool, Funny. Descriptor nodes: Boring, Cool, Funny, Romantic. Two described as relationships, each with votes = 1.]
  11. [Diagram: the same data with the review promoted to a node. Michal created a Review node (text = “…”) that is a review of Pulp Fiction (type Movie). Descriptor nodes: Boring, Cool, Funny, Romantic. Two described as relationships.]
  12. Lesson 1: Conclusion
    •  Think about your graph structure
    •  Tailor for questions asked
    •  Important things => nodes
    •  Evolve your graph
    •  More about data migration in Lesson 3
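
To make "tailor for questions asked" concrete, here is a minimal sketch in plain Python (not Neo4J code) of the two models from slides 10 and 11. The dictionary layout and the function names are assumptions made for illustration; only the names Michal, Pulp Fiction and the four descriptors come from the slides.

```python
# Model A (slide 10): descriptors stored as a property on the review relationship.
reviews = [
    {"from": "Michal", "to": "Pulp Fiction", "text": "…",
     "descriptors": ["Cool", "Funny"]},
]


def movies_described_as_property(descriptor):
    """Answering the question means scanning every review relationship."""
    return sorted({r["to"] for r in reviews if descriptor in r["descriptors"]})


# Model B (slide 11): each descriptor is a node with its own relationships,
# so the same question starts at the descriptor and follows its edges.
described_as = {
    "Boring": set(),
    "Cool": {"Pulp Fiction"},
    "Funny": {"Pulp Fiction"},
    "Romantic": set(),
}


def movies_described_as_node(descriptor):
    """Answering the question is a lookup plus a walk over adjacent nodes."""
    return sorted(described_as.get(descriptor, set()))


print(movies_described_as_property("Cool"))  # prints ['Pulp Fiction']
print(movies_described_as_node("Cool"))      # prints ['Pulp Fiction']
```

Both models return the same answer; the difference is how much of the data has to be visited to get it, which is the reason for turning important things into nodes.
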
  13. Neo Node IDs
    •  What are they?
    •  How can I use them in my application?
        – You should not!
    •  Why not?
        – Not stable
        – IDs are recycled over time, only guaranteed to be unique during a specific time span
        – Internal Neo implementation detail
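  14. [Diagram: panelists Michal, Nicki and Jim (type Panelist) and descriptors Boring, Funny and Romantic (type Descriptor), with Neo node IDs 1 to 8 shown on the nodes. A MySQL table maps users to Neo node IDs.]

    Before:
    USER_ID  NEO_NODE_ID  ACTIVE
    101      1            Y
    102      2            Y
    103      3            Y

    After user 103 becomes inactive and a new Cool descriptor node receives ID 3:
    USER_ID  NEO_NODE_ID  ACTIVE
    101      1            Y
    102      2            Y
    103      3            N

    Jim has now become Cool!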
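
The failure on slide 14 can be reproduced with a toy model. The sketch below is plain Python, not Neo4J code: `ToyNodeStore` and `external_table` are hypothetical names, and the only behaviour assumed is the one the slides describe, namely that a freed node ID may be handed to a node created later.

```python
class ToyNodeStore:
    """Hypothetical in-memory store that recycles the IDs of deleted nodes."""

    def __init__(self):
        self._nodes = {}
        self._free_ids = []
        self._next_id = 1

    def create(self, **props):
        # Prefer a recycled ID; otherwise allocate a fresh one.
        if self._free_ids:
            node_id = self._free_ids.pop()
        else:
            node_id = self._next_id
            self._next_id += 1
        self._nodes[node_id] = props
        return node_id

    def delete(self, node_id):
        del self._nodes[node_id]
        self._free_ids.append(node_id)

    def get(self, node_id):
        return self._nodes[node_id]


store = ToyNodeStore()
michal = store.create(type="Panelist", name="Michal")
nicki = store.create(type="Panelist", name="Nicki")
jim = store.create(type="Panelist", name="Jim")

# External table (MySQL on the slide) that remembers internal node IDs.
external_table = {101: michal, 102: nicki, 103: jim}

store.delete(jim)  # user 103 is deactivated and the node removed
cool = store.create(type="Descriptor", name="Cool")  # reuses Jim's old ID

stale = store.get(external_table[103])
print(stale)  # prints {'type': 'Descriptor', 'name': 'Cool'}
```

The external table still points at ID 3, so a lookup for user 103 silently returns the Cool descriptor instead of failing.
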
  15. Alternate ID Strategies
    •  Challenge: Find a starting point in graph
    •  Client provided IDs
        – Add as a standard property on the node
        – Add to index (or use auto indexer)
    •  Natural vs. Synthetic IDs
    •  Auto generate your own IDs
        – Hook into Neo4J Transaction Kernel
        – Use auto indexer
  16. Auto generate your own IDs
    1)  Implement TransactionEventHandler
    2)  Register TransactionEventHandler with graphDatabaseService
    3)  Turn auto indexing on for seamless generation
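
The three steps above can be sketched without the Neo4J API. This is a language-neutral illustration in Python under stated assumptions: `ToyGraph`, `register_handler`, `commit`, `find_by_uid` and the `uid` property are all hypothetical stand-ins for the transaction hook and the auto index, and a random UUID is just one possible synthetic ID scheme.

```python
import uuid


def assign_uid(props):
    """Before-commit hook: give each new node a synthetic ID it keeps for life."""
    props.setdefault("uid", str(uuid.uuid4()))


class ToyGraph:
    """Hypothetical store with before-commit hooks and an index on 'uid'."""

    def __init__(self):
        self._nodes = {}            # internal ID -> properties
        self._uid_index = {}        # uid -> internal ID (stands in for the auto index)
        self._before_commit = []
        self._next_internal_id = 1

    def register_handler(self, handler):
        self._before_commit.append(handler)

    def commit(self, created):
        """Run every hook on each new node, then store and index the node."""
        for props in created:
            for handler in self._before_commit:
                handler(props)
            internal_id = self._next_internal_id
            self._next_internal_id += 1
            self._nodes[internal_id] = props
            if "uid" in props:
                self._uid_index[props["uid"]] = internal_id

    def find_by_uid(self, uid):
        internal_id = self._uid_index.get(uid)
        return None if internal_id is None else self._nodes[internal_id]


graph = ToyGraph()
graph.register_handler(assign_uid)

jim = {"type": "Panelist", "name": "Jim"}
graph.commit([jim])

# Callers keep the uid, never the internal ID.
found = graph.find_by_uid(jim["uid"])
```

Because the hook runs inside every commit, application code never has to remember to set the ID, and lookups go through the index rather than through internal node IDs.
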
  17. Lesson 2: Conclusion
    Neo4J Node IDs: Don’t expose!
    •  If used inappropriately may result in inaccurate data & unexpected behaviour
    •  Internal Neo implementation detail – subject to change
    Index is your friend
    •  Use an index to look up specific nodes
  18. Lesson 3: Graph-wide Operations
    •  Batch Updates
    •  Delete relationships only from one side
    •  GlobalGraphOperations since 1.6
    •  No need for TX when reading
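
The slide only names "Batch Updates". One common reading, assumed here, is that a graph-wide update is committed in fixed-size chunks so that no single transaction has to hold every change. The sketch below shows that pattern in plain Python; `in_batches`, `update_all`, `apply_update` and `commit` are hypothetical names, and the default batch size is arbitrary.

```python
from itertools import islice


def in_batches(items, batch_size):
    """Yield successive lists of at most batch_size items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def update_all(node_ids, apply_update, commit, batch_size=10_000):
    """Apply an update to every node, committing once per batch.

    Returns the number of commits performed.
    """
    commits = 0
    for batch in in_batches(node_ids, batch_size):
        for node_id in batch:
            apply_update(node_id)
        commit()  # one transaction per batch instead of one for the whole graph
        commits += 1
    return commits


# Usage with stand-in callbacks that just record what happened.
touched = []
commit_points = []
count = update_all(range(25), touched.append,
                   lambda: commit_points.append(len(touched)), batch_size=10)
print(count, commit_points)  # prints 3 [10, 20, 25]
```
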
  24. Extracting Randomised Data
    •  Use Cases
        – Provide random suggestions to users
        – Use for statistical analysis, a.k.a. “Random Sampling”
    •  Problem
        – No built-in Neo4J support
        – Not Neo4J’s sweet spot
        – May result in very bad performance
  25. Options
    •  Randomisation Strategies
        – “Load, Shuffle, Pick”
        – “Hit and Miss”
        – Custom Relationship Expander/Evaluator
        – Reservoir Sampling
    •  Performance Helpers
        – Indexes
        – Front with a cache if need be
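
Two of the strategies listed above are easy to show in isolation. The sketch below is plain Python over any iterable rather than a Neo4J traversal; the function names are illustrative, `reservoir_sample` is the standard single-pass reservoir algorithm (Algorithm R), and `load_shuffle_pick` is one straightforward reading of "Load, Shuffle, Pick".

```python
import random


def load_shuffle_pick(items, k, rng=random):
    """Load everything into memory, shuffle it, and keep the first k items."""
    loaded = list(items)
    rng.shuffle(loaded)
    return loaded[:k]


def reservoir_sample(items, k, rng=random):
    """Pick k items uniformly at random in one pass, holding only k in memory."""
    reservoir = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Usage: 25 items out of a stream of 160000, as in the benchmark slide.
sample = reservoir_sample(range(160000), 25, random.Random(42))
```

Reservoir sampling still has to visit every candidate once, which is why cold parts of the graph dominate the cost; its advantage over "Load, Shuffle, Pick" is that memory use stays at k items regardless of how many candidates are streamed.
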
  26. Traversals vs. Index
    [Chart: time to extract 25 random nodes from [Sample Size] using the “Reservoir Sampling” algorithm. X-axis: sample size (5000 to 160000). Y-axis: time in milliseconds (0 to 45000). Series: 1.4.2, 1.5 and 1.6.2 traversal, each for pass 1 (cold) and pass 2 (warmish), plus 1.5 index.]
    Use of indexes reduced time to approximately 300 to 1000 ms from cold
  27. Conclusion
    •  Most options are not “truly random”, more “randomish”
    •  Primarily has bad performance when hitting cold parts of graph **
    •  Performance can be improved with
        – Indexes
        – Caching
    ** This is generally true of any persistence technology which needs to perform random scattered disk access
  28. Lesson 5: Multi-threading
    •  Shortcoming in Neo4J
    •  Fixed in version 1.7
    •  Avoid relationship properties in multi-threaded pre-1.7 apps