The Ubiquitous Graph: Two Use Cases from the Real World

Data Science London Meetup

Tareq Abedrabbo

December 19, 2013

Transcript

  1. The Ubiquitous Graph: Two Use Cases from the Real World

    Tareq Abedrabbo - Data Science London, December 2013
  2. About me • CTO at OpenCredo • Working with Neo4j for (almost) 3 years on a number of different projects • Co-author of Neo4j in Action (Manning)
  3. “If I'm to believe Twitter, half of the earth's population are importing Wikipedia into Neo4j, for very obscure reasons.”
  4. Agenda • Graph applications • Use cases • Best practices

  5. What types of applications can be built with a graph database?
  6. Domain-centric applications

  7. • Well-defined data model • Data changes through user interactions • Flexible but predictable data structure(s) • Recommendation engines, social networks, etc. • Top-down design
  8. Data-centric applications

  9. • Complex connected data that typically models real-world networks • Integrated from a variety of different sources • Data can be unpredictable • Telco networks, utility networks, etc. • Bottom-up design
  10. Typically, applications fall somewhere between these two types

  11. How can I use the information available in my graph?

  12. • Search and pattern-matching: find a recommendation based on behaviour • Graph algorithms: shortest path, disconnected components • Optimisation: maximise oil flow while minimising water
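The two graph algorithms named on this slide can be sketched in plain Python. This is only an illustration on a small made-up adjacency map (`GRAPH` and its node names are assumptions, not data from the talk); a graph database would normally provide these operations itself.

```python
from collections import deque

# Hypothetical undirected graph as an adjacency map (illustrative data only).
GRAPH = {
    "a": {"b", "c"},
    "b": {"a", "d"},
    "c": {"a", "d"},
    "d": {"b", "c"},
    "e": {"f"},
    "f": {"e"},
}


def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path as a list, or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the predecessor links back to the start.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in sorted(graph[node]):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


def connected_components(graph):
    """Group nodes into components; more than one means a disconnected graph."""
    seen, components = set(), []
    for node in sorted(graph):
        if node in seen:
            continue
        component, stack = set(), [node]
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(graph[current] - component)
        seen |= component
        components.append(component)
    return components
```

Here `shortest_path(GRAPH, "a", "e")` returns `None` because the two nodes sit in different components, which is exactly what `connected_components` reports.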
  13. Graphs are naturally data-driven

  14. Use case 1: Network Impact Analysis

  15. Domain: a telco network. Millions of connected network components, services and customers
  17. Requirement: Identify the impact of failing components
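One way to read this requirement is as a downstream reachability question: starting from the failed component, follow "is depended on by" links until nothing new turns up. The sketch below assumes a tiny hypothetical dependency map (`DEPENDENTS` and all names in it are invented), and it deliberately ignores redundancy, so it reports everything that could be affected rather than what definitely is.

```python
# Hypothetical "is depended on by" edges: component -> things that rely on it.
DEPENDENTS = {
    "router-1": ["switch-1", "switch-2"],
    "switch-1": ["broadband"],
    "switch-2": ["voip"],
    "broadband": ["customer-a", "customer-b"],
    "voip": ["customer-b"],
}


def impacted_by(dependents, failed):
    """Return every node reachable downstream of the failed component."""
    impacted, stack = set(), [failed]
    while stack:
        node = stack.pop()
        for dependent in dependents.get(node, []):
            if dependent not in impacted:
                impacted.add(dependent)
                stack.append(dependent)
    return impacted
```

In this toy data, a failure of `switch-2` reaches the `voip` service and `customer-b`, while a failure of `router-1` reaches everything below it.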

  20. Requirement: Identify interesting patterns, such as single points of failure
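A single point of failure can be approximated as a node whose removal alone cuts every path between two endpoints. The brute-force sketch below assumes a small hypothetical topology (`LINKS` and its node names are invented) and simply retries reachability with each node removed; it is meant to show the idea, not to scale to millions of components.

```python
# Hypothetical undirected network links (illustrative only).
LINKS = [
    ("core", "agg-1"), ("core", "agg-2"),
    ("agg-1", "access-1"), ("agg-2", "access-1"),
    ("access-1", "customer-a"),
]


def build_graph(links):
    """Turn a list of undirected links into an adjacency map."""
    graph = {}
    for left, right in links:
        graph.setdefault(left, set()).add(right)
        graph.setdefault(right, set()).add(left)
    return graph


def reachable(graph, start, removed):
    """Nodes reachable from start when the `removed` node is taken out."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node in seen or node == removed:
            continue
        seen.add(node)
        stack.extend(graph[node])
    return seen


def single_points_of_failure(graph, source, target):
    """Nodes whose removal alone cuts every path from source to target."""
    return sorted(
        node for node in graph
        if node not in (source, target)
        and target not in reachable(graph, source, removed=node)
    )
```

With the redundant aggregation pair above, only `access-1` is reported for the path from `core` to `customer-a`.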

  22. The network is “semi-structured”

  23. Labelled property graph is a natural fit for the model

  24. Additional “dimensions” can be added to capture abstract concepts: network redundancy, load-balancing
  25. Cypher queries are a natural way to deliver the different requirements
  26. Other requirements: • Multiple starting points • Impact on quality of service • Abstraction of repeatable patterns
  27. Use case 2: Oil flow optimisation

  28. Domain: an oil extraction network. Hundreds of connected components with complex configuration options
  30. Requirement: Identify candidate configurations to maximise flow

  31. Interlude: Genetic Algorithms

  32. “Search heuristic that mimics the process of natural selection” - Wikipedia
  33. 1. Start from an initial population of candidate solutions
      2. Assess each solution using a fitness function
      3. Apply genetic operators to derive a new and potentially fitter generation
      4. Rinse and repeat!
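The four steps can be put together in a short loop. This is a minimal sketch under stated assumptions: candidates are bit strings, the population size, generation count and mutation rate are arbitrary defaults, and the fittest candidate is copied unchanged into each new generation. None of these choices come from the talk.

```python
import random


def evolve(fitness, genome_length, population_size=30, generations=40, seed=0):
    """Toy genetic algorithm over bit strings; returns the best candidate found."""
    rng = random.Random(seed)
    # 1. Initial population of random candidate solutions.
    population = [
        [rng.randint(0, 1) for _ in range(genome_length)]
        for _ in range(population_size)
    ]
    for _ in range(generations):
        # 2. Assess each candidate with the fitness function.
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: population_size // 2]
        # 3. Derive a new generation with genetic operators.
        children = [ranked[0]]  # assumption: keep the current best unchanged
        while len(children) < population_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, genome_length)
            child = mother[:cut] + father[cut:]  # cross-breeding
            if rng.random() < 0.2:  # occasional mutation (assumed rate)
                position = rng.randrange(genome_length)
                child[position] = 1 - child[position]
            children.append(child)
        # 4. Rinse and repeat with the new generation.
        population = children
    return max(population, key=fitness)
```

As a toy usage, `evolve(sum, 12)` searches for the 12-bit string with the most ones; any function that scores a candidate can be passed in place of `sum`.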
  35. In more detail…

  36. • Start from an initial population of candidate solutions (individuals or phenotypes), ideally random and large
      • Attribute a score to each solution using a fitness function
        • The only place with specific business knowledge
      • Apply genetic operators to create a new generation
        • Cross-breeding to retain the best characteristics from each parent
        • Mutation to maintain diversity and to avoid converging to a local optimum too quickly
  37. Fitness function
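As an illustration of what a fitness function might look like for the oil flow case, the sketch below scores a valve configuration as total oil minus penalised water. Everything in it is hypothetical: the `WELLS` figures, the `WATER_PENALTY` weight, and the idea that a configuration is a list of open or closed valves. A real fitness function would encode the actual network model.

```python
# Hypothetical per-well output when its valve is open: (oil, water) per day.
WELLS = [(120.0, 30.0), (80.0, 90.0), (200.0, 60.0), (50.0, 70.0)]
WATER_PENALTY = 1.0  # assumed weight; a real value would come from the domain


def fitness(configuration):
    """Score a candidate: total oil minus penalised water for the open valves."""
    oil = sum(well[0] for well, is_open in zip(WELLS, configuration) if is_open)
    water = sum(well[1] for well, is_open in zip(WELLS, configuration) if is_open)
    return oil - WATER_PENALTY * water
```

With these made-up numbers, opening every valve scores lower than opening only the first and third wells, because the other two produce more water than oil.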

  39. Crossbreeding
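A common form of cross-breeding is single-point crossover, sketched below for list-based candidates. The choice of single-point crossover and the `crossover` name are assumptions for illustration; the talk does not say which variant was used.

```python
def crossover(mother, father, cut):
    """Single-point crossover: swap the tails of two parents at index `cut`."""
    if len(mother) != len(father):
        raise ValueError("parents must have the same length")
    return mother[:cut] + father[cut:], father[:cut] + mother[cut:]
```

Each call yields two children, so both halves of each parent survive into the next generation.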

  41. Mutation
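For bit-string candidates, mutation can be sketched as flipping each bit with a small probability. The bit-string encoding, the `mutate` name and the `rate` parameter are assumptions made for this example.

```python
import random


def mutate(genome, rate, rng):
    """Return a copy of the genome with each bit flipped with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in genome]
```

A call such as `mutate(candidate, 0.05, random.Random())` changes only a few bits on average, which keeps some diversity in the population without discarding what has already been learned.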

  43. There are other genetic operators: • Copy n fittest solutions unchanged • Carry over n unfit candidates • Carry over n randomly chosen candidates
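The three carry-over operators listed here could be combined in one selection helper, as in the sketch below. The function name, its parameters, and the decision to draw the random candidates from the middle of the ranking (so nothing is picked twice) are all assumptions.

```python
import random


def carry_over(population, fitness, n_fittest, n_unfit, n_random, rng):
    """Pick survivors: the n fittest, the n least fit, and n random others."""
    if n_fittest + n_unfit > len(population):
        raise ValueError("cannot carry over more candidates than exist")
    ranked = sorted(population, key=fitness, reverse=True)
    fittest = ranked[:n_fittest]
    unfit = ranked[len(ranked) - n_unfit:]
    # Draw the random survivors from whatever was not already selected.
    middle = ranked[n_fittest:len(ranked) - n_unfit]
    lucky = rng.sample(middle, min(n_random, len(middle)))
    return fittest + unfit + lucky
```

The survivors would then be topped up with children from cross-breeding and mutation to restore the population size.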
  44. Pros:
      • All domain knowledge is encapsulated in one place
      • Generate interesting solutions, including counterintuitive ones
      • Stop when you want!
      Cons:
      • Fitness function can become really complex
      • Generated solutions are not guaranteed to be practical or pretty
  45. Simply connected graph with complex components

  46. Is this even a use case for Neo4j?

  47. Persist and share calculated solutions

  48. Inspect intermediary steps

  49. Use Cypher queries to interrogate solutions

  50. Other requirements: • Identify the most practical and valuable adjustments to the network
  51. Distilled Best Practices

  52. • Know your domain • Test non-functional aspects • Write code that can handle semi-structured data
  53. Links • Twitter: @tareq_abedrabbo • Blog: http://www.terminalstate.net • OpenCredo: http://www.opencredo.com

    Thank you! Questions?