Using Graph Databases to Operationalize Insights from Big Data

This was a talk given by Tim Williamson and Emil Eifrem at Strata + Hadoop World NYC on September 28, 2016:
http://conferences.oreilly.com/strata/hadoop-big-data-ny/public/schedule/detail/52052

Abstract
Enterprises that pursue data-driven operations and decisions are approaching the conclusion that graph analysis capabilities will yield critical competitive advantages. However, for this impact to be fully realized, the results of any graph analysis must be available, in real time, to operational applications, data scientists, and developers across the enterprise.

Monsanto previously attempted graph analysis using both RDBMS-based and offline batch processing techniques. In the process, Monsanto found that some approaches couldn’t drill deeply enough to yield the necessary insights; others were limited in their expressibility and therefore in their general usefulness outside the data science lab; and still others couldn’t provide answers quickly enough to be useful to the business. Monsanto finally selected a graph database used alongside a broader tech stack that includes Apache Kafka, Spark, and Oracle. This stack allows Monsanto not just to derive but also to operationalize insights that have allowed it to shorten R&D cycles, better understand the dynamics of its business, and carry out certain types of science in silico.

Tim Williamson and Emil Eifrem draw on Monsanto’s real-world experience to explain how organizations can use graph databases to operationalize insights from big data. Tim and Emil discuss Monsanto’s big data stack, using examples from Monsanto’s substantial experience with graphs, and describe the service-oriented graph architecture that has already handled over one billion requests and is available to over 150 developers, data scientists, and applications throughout Monsanto.

Tim Williamson

September 28, 2016

Transcript

  1. Using Graph Databases to Operationalize Insights from Big Data
    Emil Eifrem – CEO @ Neo Technology
    Tim Williamson – Data Scientist @ Monsanto

  2. Why Are We Here Today?
    1. What is a Graph?
    2. Graphs in Real-Time
    3. Graphs are Feeding the World

  3. Data Management in 1980
    @TimWilliate
    Paper Forms
    Tiny RAM
    Spinning Platters
    (Low Capacity / Sequential I/O)

  4. Traditional DBMS Technology

  5. Data Management in 2016
    Dynamic Real-World Systems
    SSD/Flash
    (High-Capacity Storage &
    Ultra-Fast Random I/O)
    Abundant RAM

  6. A Way of Representing Data
    DATA DATA

  7. A Way of Representing Data
    Relational Database (1980s)
    Good for:
    • Well-understood data structures that don’t change too frequently
    • Known problems involving discrete parts of the data, or minimal connectivity

  8. A Way of Representing Data
    Relational Database (1980s)
    Good for:
    • Well-understood data structures that don’t change too frequently
    • Known problems involving discrete parts of the data, or minimal connectivity
    Graph Database (2016)
    Good for:
    • Dynamic systems: where the data topology is difficult to predict
    • Dynamic requirements: that evolve with the business
    • Problems where the relationships in data contribute meaning & value

  9. A Graph Is
    NODE with PROPERTIES (NAME: ANN, AGE: 32)
    RELATIONSHIP (KNOWS)
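
The node, relationship, and property labels on this slide can be sketched as a small data structure. The Python below is only an illustration of the idea, not how Neo4j stores data: the `Node` and `Relationship` classes, the `neighbors` helper, and the second node (Dan) are assumptions added so the KNOWS relationship has somewhere to point.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Node:
    """A graph node carrying free-form key/value properties."""
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Relationship:
    """A directed, typed connection between two nodes; it may carry properties too."""
    start: Node
    type: str
    end: Node
    properties: Dict[str, Any] = field(default_factory=dict)


# Node from the slide: NAME: ANN, AGE: 32.
ann = Node({"name": "Ann", "age": 32})
# Hypothetical second node so the KNOWS relationship has an endpoint.
dan = Node({"name": "Dan"})
knows = Relationship(ann, "KNOWS", dan)


def neighbors(node: Node, rels: List[Relationship], rel_type: str) -> List[Node]:
    """Follow outgoing relationships of one type from a node."""
    return [r.end for r in rels if r.start is node and r.type == rel_type]


names = [n.properties["name"] for n in neighbors(ann, [knows], "KNOWS")]
print(names)
```

The point of the sketch is that a relationship is a first-class record linking two nodes, so following it is a direct lookup rather than a join.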

  10. A Graph Is

  11. A Graph Is

  12. A Graph Is

  13. Describing Graphs
    Business
    Domain
    Ann Dan
    Loves
    Graph Data
    Model
    (Ann) -[:LOVES]-> (Dan)
    Cypher Query

  14. Cypher
    Example HR Query in SQL vs. The Same Query using Cypher
    MATCH (boss)-[:MANAGES*0..3]->(sub),
          (sub)-[:MANAGES*1..3]->(report)
    WHERE boss.name = "John Doe"
    RETURN sub.name AS Subordinate,
           count(report) AS Total
    Project Impact
    Less time writing queries:
    • More time understanding the answers
    • Leaving time to ask the next question
    Less time debugging queries:
    • More time writing the next piece of code
    • Improved quality of overall code base
    Code that’s easier to read:
    • Faster ramp-up for new project members
    • Improved maintainability & troubleshooting
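
To make the Cypher query on this slide concrete, here is a rough Python sketch of what it computes: every subordinate within 0 to 3 MANAGES hops of the boss, paired with a count of the reports within 1 to 3 hops below that subordinate. The org chart, the names other than "John Doe", and the helper functions are hypothetical, and the sketch assumes the hierarchy is a tree with unique names, so counting paths and counting people give the same result.

```python
from collections import deque

# Hypothetical org chart: manager -> direct reports (MANAGES relationships).
MANAGES = {
    "John Doe": ["Alice", "Bob"],
    "Alice": ["Carol", "Dave"],
    "Carol": ["Erin"],
}


def within_hops(start, min_hops, max_hops):
    """Return the people reachable from start in min_hops..max_hops MANAGES steps."""
    found = []
    queue = deque([(start, 0)])
    while queue:
        person, hops = queue.popleft()
        if hops >= min_hops:
            found.append(person)
        if hops < max_hops:
            for report in MANAGES.get(person, []):
                queue.append((report, hops + 1))
    return found


def report_counts(boss):
    """Mimic the query: subs within 0..3 levels of boss, each with its report count."""
    totals = {}
    for sub in within_hops(boss, 0, 3):
        reports = within_hops(sub, 1, 3)
        if reports:  # like MATCH, drop subs that have no reports at all
            totals[sub] = len(reports)
    return totals


print(report_counts("John Doe"))
```

The variable-length pattern `[:MANAGES*0..3]` corresponds to the bounded breadth-first walk in `within_hops`; the Cypher version states the same traversal declaratively in one line.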

  15. Users Love Cypher

  16. openCypher

  17. Low Latency Query Performance
    “We found Neo4j to be literally thousands of times faster
    than our prior MySQL solution, with queries that require
    10-100 times less code. Today, Neo4j provides eBay with
    functionality that was previously impossible.”
    - Volker Pacher, Senior Developer
    “Minutes to milliseconds” performance
    Queries up to 1000x faster than RDBMS or other NoSQL

  18. Fastest Growing Category in Big Data
    [Chart: “Popularity Changes” by database category, Jan 2013 to Sep 2015, axis from 100 to 700. Categories shown: Graph database, Relational database, Wide column stores, RDF stores, Document stores, Search engines, Native XML DBMS, Key-value stores, Object oriented DBMS, Multivalue DBMS, Time Series DBMS. © DB-Engines.com 2015]

  19. Popular Graph Database Use Cases
    • Real-Time Recommendations
    • Fraud Detection
    • Network & IT Operations
    • Master Data Management
    • Graph-Based Search
    • Identity & Access Management

  20. What is Real-Time?

  21. Real-Time
    When Emil Was in School
    “A system is said to be real-time if
    the total correctness of an operation
    depends not only upon its logical
    correctness, but also upon the time
    limit in which it is performed.”
    Shin, K.G.; Ramanathan, P. (Jan 1994). “Real-time computing: a new discipline of computer science and engineering”. Proceedings of the IEEE.

  22. Real-Time
    In Web 2.0
    “My focus will be companies
    exploiting ‘real-time data,’ which is
    ‘the next billion dollar market
    opportunity.’”
    Ron Conway, angel investor and “godfather of Silicon Valley”
    Interview in TechCrunch, 2009

  23. Real-Time
    Emil and Tim’s Definition of Real-Time Data

  24. Real-Time
    Emil and Tim’s Definition of Real-Time Data

  25. Real-Time
    Emil and Tim’s Definition of Real-Time Data

  26. Real-Time
    Emil and Tim’s Definition of Real-Time Data

  27. Graphs Are Feeding the World

  28. Improving Genetics has Scaled Agricultural Output for Millennia

  29. Modern Breeding Techniques Accelerated this Gain
    Source: http://www.ers.usda.gov/data-products/feed-grains-database/feed-grains-yearbook-tables.aspx

  30. Selecting Better Plants via Field Trial

  31. Rapid Breeding Improvement Derives from Cycling

  32. [Slide with no transcribed text]

  33. The Operational Uses for Ancestry are Numerous
    • Which crosses are predicted to be the most effective?
    • Where in the pipeline are the descendants of a cross?
    • Are the results of high-throughput genotyping correct?
    • What is the frequency of commercial success?
    • Etc.
    Questions like these are asked from applications across the pipeline, all serving scientists expecting to make rapid decisions.

  34. Operationalizing Ancestry Requires Low-Latency Reads
    A population at the “advancing” horizon of the pipeline can easily have an ancestry > 50 levels deep

  35. Low Latency Reads + Fresh Data = Real-Time Data

  36. Accessing Genetic Ancestry in a RESTful Style
    • Domain-centric API
    • ~40 API resources
    • ~20 query grammar elements

  37. Accessing Genetic Ancestry in a RESTful Style
    • Domain-centric API
    • ~40 API resources
    • ~20 query grammar elements

  38. Accessing Genetic Ancestry in a RESTful Style
    • Domain-centric API
    • ~40 API resources
    • ~20 query grammar elements
    /population/5/ancestors
    {"nodes": [
      {"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}, {"id": 5}
    ],
    "relationships": [
      {"from": 1, "to": 3, "parental_role": "female"},
      {"from": 2, "to": 3, "parental_role": "male"},
      {"from": 3, "to": 4, "parental_role": "female"},
      {"from": 4, "to": 5, "parental_role": "female"}
    ]}

  39. Accessing Genetic Ancestry in a RESTful Style
    • Domain-centric API
    • ~40 API resources
    • ~20 query grammar elements
    /population/5/ancestors
    {"nodes": [
      {"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}, {"id": 5}
    ],
    "relationships": [
      {"from": 1, "to": 3, "parental_role": "female"},
      {"from": 2, "to": 3, "parental_role": "male"},
      {"from": 3, "to": 4, "parental_role": "female"},
      {"from": 4, "to": 5, "parental_role": "female"}
    ]}
    /population/5/binary-cross
    {
      "female": {"id": 1},
      "male": {"id": 2}
    }
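
The two example responses on this slide can be reproduced with a short traversal over the parent-to-child relationships. The Python below is a sketch under stated assumptions, not Monsanto's implementation: the function names are hypothetical, and the reading of `binary-cross` as "walk up through single-parent generations until a population with both a female and a male parent is found" is inferred only from the sample data shown.

```python
# Parent-to-child edges copied from the slide's example response.
RELATIONSHIPS = [
    {"from": 1, "to": 3, "parental_role": "female"},
    {"from": 2, "to": 3, "parental_role": "male"},
    {"from": 3, "to": 4, "parental_role": "female"},
    {"from": 4, "to": 5, "parental_role": "female"},
]


def parents_of(pop_id):
    """Map parental_role -> parent id for one population."""
    return {r["parental_role"]: r["from"] for r in RELATIONSHIPS if r["to"] == pop_id}


def ancestors(pop_id):
    """Collect the subgraph made of pop_id and everything upstream of it."""
    nodes, rels, stack = {pop_id}, [], [pop_id]
    while stack:
        current = stack.pop()
        for r in RELATIONSHIPS:
            if r["to"] == current:
                rels.append(r)
                if r["from"] not in nodes:
                    nodes.add(r["from"])
                    stack.append(r["from"])
    return {
        "nodes": [{"id": n} for n in sorted(nodes)],
        "relationships": sorted(rels, key=lambda r: (r["to"], r["from"])),
    }


def binary_cross(pop_id):
    """Assumed semantics: nearest ancestor cross that has both a female and a male parent."""
    current = pop_id
    while True:
        parents = parents_of(current)
        if "female" in parents and "male" in parents:
            return {"female": {"id": parents["female"]}, "male": {"id": parents["male"]}}
        if len(parents) != 1:
            return None  # no recorded parents, so no cross to report
        current = next(iter(parents.values()))


print(ancestors(5))
print(binary_cross(5))
```

In this toy form each hop scans the whole relationship list; the slides' point about low-latency reads on ancestries more than 50 levels deep is that a graph database follows those hops directly instead.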

  40. An Ops View of Ancestry-as-a-Service
    • 2 years continuous production operation
    • > 200 application and data scientist users
    • Store Size
      - ~800 million nodes
      - ~1.3 billion relationships
      - ~1.8 billion properties
    Continuous and peaky mixed read/write load

  41. The Ultimate Value of Ancestry is Realized in the Biological Information it Allows to be Linked

  42. Corn Parent Galaxy
    The complete genetic history of every corn parent at Monsanto

  43. Selecting Better Plants via Genome Wide Selection

  44. Thank You!
