
The Ubiquitous Graph: Two Use Cases from the Real World

Data Science London Meetup

Tareq Abedrabbo

December 19, 2013

Transcript

  1. The Ubiquitous Graph: Two Use Cases from the Real World

    Tareq Abedrabbo - Data Science London December 2013
  2. About me
    • CTO at OpenCredo
    • Working with Neo4j for (almost) 3 years on a number of different projects
    • Co-author of Neo4j in Action (Manning)
  3. “If I'm to believe Twitter, half of the earth's population are importing Wikipedia into Neo4j, for very obscure reasons.”
  4. Agenda • Graph applications • Use cases • Best practices

  5. What type of applications can be built with a graph database?
  6. Domain-centric applications

  7. • Well-defined data model
    • Data changes through user interactions
    • Flexible but predictable data structure(s)
    • Recommendation engines, social networks, etc…
    • Top-down design
  8. Data-centric applications

  9. • Complex connected data that typically models real world networks
    • Integrated from a variety of different sources
    • Data can be unpredictable
    • Telco networks, utility networks, etc…
    • Bottom-up design
  10. Typically applications fall somewhere between these 2 types

  11. How can I use the information available in my graph?

  12. • Search and pattern-matching
      • Find a recommendation based on behaviour
    • Graph algorithms
      • Shortest path, disconnected components
    • Optimisation
      • Maximise oil flow while minimising water
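
The two graph algorithms named on the slide above can be sketched in a few lines of plain Python. This is only an illustration on an in-memory adjacency dictionary, not how Neo4j implements them; the `NETWORK` data and both function names are made up for the example, and the graph is assumed to be undirected and unweighted.

```python
from collections import deque

# Hypothetical undirected network: node -> list of neighbours.
NETWORK = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c"],
    "x": ["y"],
    "y": ["x"],
}

def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path as a list, or None."""
    previous = {start: None}          # node -> node we reached it from
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:   # walk back to the start
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in graph.get(node, ()):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None

def connected_components(graph):
    """Group nodes into sets that are mutually reachable."""
    seen = set()
    components = []
    for root in graph:
        if root in seen:
            continue
        component = set()
        queue = deque([root])
        seen.add(root)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in graph.get(node, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components
```

More than one component in the result means the network contains parts that cannot reach each other, which is the "disconnected components" case from the slide.
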
  13. Graphs are naturally data-driven

  14. Use case 1: Network Impact Analysis

  15. Domain: a telco network. Millions of connected network components, services and customers
  16. None
  17. Requirement: Identify the impact of failing components
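
One way to read this requirement is as a downstream traversal: start at the failed component and collect everything that depends on it, directly or indirectly. The sketch below assumes a hypothetical `DEPENDENTS` mapping and ignores redundancy, so it gives a worst-case picture rather than the model used in the talk.

```python
from collections import deque

# Hypothetical data: each key is depended on by the items in its list.
DEPENDENTS = {
    "router-1": ["switch-1", "switch-2"],
    "switch-1": ["broadband"],
    "switch-2": ["broadband", "voip"],
    "broadband": ["customer-a", "customer-b"],
    "voip": ["customer-b"],
}

def impacted_by(failed, dependents):
    """Return every component, service or customer downstream of `failed`."""
    impacted = set()
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for dependent in dependents.get(node, ()):
            if dependent not in impacted:
                impacted.add(dependent)
                queue.append(dependent)
    return impacted
```

For example, `impacted_by("switch-1", DEPENDENTS)` reports the broadband service and both customers, even though `switch-2` could still carry broadband. Capturing that kind of redundancy needs extra information in the model.
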

  18. None
  19. None
  20. Requirement: Identify interesting patterns, such as single points of failure
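
A single point of failure can be approximated as a node whose removal splits the network. The brute-force sketch below removes each node in turn and checks whether the rest is still connected. The `TOPOLOGY` data is invented, the graph is assumed to be undirected, connected to begin with, and small enough for this quadratic approach; a network with millions of components would need something smarter.

```python
# Hypothetical undirected topology: every node appears as a key.
TOPOLOGY = {
    "core-1": ["core-2", "edge-1"],
    "core-2": ["core-1", "edge-1"],
    "edge-1": ["core-1", "core-2", "cabinet-1"],
    "cabinet-1": ["edge-1"],
}

def is_connected(graph, excluded):
    """True if all nodes other than `excluded` can still reach each other."""
    nodes = [n for n in graph if n != excluded]
    if not nodes:
        return True
    seen = {nodes[0]}
    stack = [nodes[0]]
    while stack:
        node = stack.pop()
        for neighbour in graph[node]:
            if neighbour != excluded and neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return len(seen) == len(nodes)

def single_points_of_failure(graph):
    """Nodes whose removal disconnects the remaining graph."""
    return sorted(n for n in graph if not is_connected(graph, n))
```

Here only `edge-1` is reported: losing either core node leaves a path through the other, but losing `edge-1` cuts off `cabinet-1`.
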

  21. None
  22. The network is “semi-structured”

  23. Labelled property graph is a natural fit for the model

  24. Additional “dimensions” can be added to capture abstract concepts: network redundancy, load-balancing
  25. Cypher queries are a natural way to deliver the different requirements
  26. • Other requirements
      • Multiple starting points
      • Impact on quality of service
      • Abstraction of repeatable patterns
  27. Use case 2: Oil flow optimisation

  28. Domain: an oil extraction network. Hundreds of connected components with complex configuration options
  29. None
  30. Requirement: Identify candidate configurations to maximise flow

  31. Interlude: Genetic Algorithms

  32. “Search heuristic that mimics the process of natural selection” - Wikipedia
  33. 1. Start from an initial population of candidate solutions
    2. Assess each solution using a fitness function
    3. Apply genetic operators to derive a new and potentially fitter generation
    4. Rinse and repeat!
  34. None
  35. In more detail…

  36. • Start from an initial population of candidate solutions (individuals or phenotypes), ideally random and large
    • Attribute a score to each solution using a fitness function
      • The only place with specific business knowledge
    • Apply genetic operators to create a new generation
      • Cross-breeding to retain best characteristics from each parent
      • Mutation to maintain diversity and to avoid converging to a local optimum too quickly
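
The loop described above can be sketched on a toy problem. Here a candidate solution is a list of 0/1 genes and the fitness function simply counts the 1s; in the oil network a real fitness function would score a configuration instead. All names, parameter values and the half-population parent selection are assumptions made for the example, not details from the talk.

```python
import random

def fitness(individual):
    """Toy score: number of 1s. The only place that knows the 'business' goal."""
    return sum(individual)

def crossover(mother, father, rng):
    """Single-point cross-breeding: head of one parent, tail of the other."""
    point = rng.randrange(1, len(mother))
    return mother[:point] + father[point:]

def mutate(individual, rng, rate):
    """Flip each gene with a small probability to keep diversity."""
    return [1 - gene if rng.random() < rate else gene for gene in individual]

def evolve(genes=20, population_size=30, generations=40, rate=0.02, seed=1):
    rng = random.Random(seed)  # seeded so that runs are repeatable
    # 1. Random initial population of candidate solutions.
    population = [[rng.randint(0, 1) for _ in range(genes)]
                  for _ in range(population_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        # 2. Score every solution and keep the fitter half as parents.
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[:population_size // 2]
        # 3. Breed and mutate to create the next generation.
        population = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng),
                   rng, rate)
            for _ in range(population_size)
        ]
        # Remember the best solution seen so far.
        best = max(population + [best], key=fitness)
    return best
```

Calling `evolve()` returns the fittest individual seen across all generations. Because the search is heuristic, it is not guaranteed to be the global optimum.
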
  37. Fitness function

  38. None
  39. Crossbreeding

  40. None
  41. Mutation

  42. None
  43. • There are other genetic operators
      • Copy n fittest solutions unchanged
      • Carry over n unfit candidates
      • Carry over n randomly chosen candidates
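
These carry-over operators can be combined into one selection step. The sketch below is one possible reading: `carry_over` and its parameters are hypothetical names, the random picks are drawn only from candidates not already selected, and `n_fittest + n_unfit` is assumed not to exceed the population size.

```python
import random

def carry_over(population, fitness, rng, n_fittest=2, n_unfit=1, n_random=1):
    """Pick the individuals that pass unchanged into the next generation."""
    ranked = sorted(population, key=fitness, reverse=True)
    # Copy the n fittest solutions unchanged.
    fittest = ranked[:n_fittest]
    # Carry over the n least fit candidates to preserve diversity.
    unfit = ranked[len(ranked) - n_unfit:] if n_unfit else []
    # Carry over n randomly chosen candidates from whatever is left.
    middle = ranked[n_fittest:len(ranked) - n_unfit]
    lucky = rng.sample(middle, min(n_random, len(middle)))
    return fittest + unfit + lucky
```

The remaining slots in the next generation would then be filled by cross-breeding and mutation as usual.
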
  44. • Pros!
      • All domain knowledge is encapsulated in one place
      • Generate interesting solutions including counterintuitive ones
      • Stop when you want!
    • Cons!
      • Fitness function can become really complex
      • Generated solutions are not guaranteed to be practical or pretty
  45. Simply connected graph with complex components

  46. Is this even a use case for Neo4j?

  47. Persist and share calculated solutions

  48. Inspect intermediary steps

  49. Use Cypher queries to interrogate solutions

  50. • Other requirements
      • Identify the most practical and valuable adjustments to the network
  51. Distilled Best Practices

  52. • Know your domain
    • Test non-functional aspects
    • Write code that can handle semi-structured data
  53. Links
    • Twitter: @tareq_abedrabbo
    • Blog: http://www.terminalstate.net
    • OpenCredo: http://www.opencredo.com

    Thank you! Questions?