Using Graph Databases to Operationalize Insights from Big Data

This was a talk given by Tim Williamson and Emil Eifrem at Strata + Hadoop World NYC on September 28, 2016:
http://conferences.oreilly.com/strata/hadoop-big-data-ny/public/schedule/detail/52052

Abstract
Enterprises that pursue data-driven operations and decisions are approaching the conclusion that graph analysis capabilities will yield critical competitive advantages. However, for this impact to be fully realized, the results of any graph analysis must be available, in real time, to operational applications, data scientists, and developers across the enterprise.

Monsanto previously attempted graph analysis using both RDBMS-based and offline batch processing techniques. Some of these approaches couldn’t drill deeply enough to yield the necessary insights; others were limited in their expressibility and therefore in their general usefulness outside of the data science lab; and still others couldn’t provide answers quickly enough to be useful to the business. Monsanto finally selected a graph database used alongside a broader tech stack that includes Apache Kafka, Spark, and Oracle. This stack allows Monsanto not just to derive but also to operationalize insights that have allowed it to shorten R&D cycles, better understand the dynamics of its business, and carry out certain types of science in silico.

Tim Williamson and Emil Eifrem draw on Monsanto’s real-world experience to explain how organizations can use graph databases to operationalize insights from big data. Tim and Emil discuss Monsanto’s big data stack, using examples from Monsanto’s substantial experience with graphs, and describe the service-oriented graph architecture that has already handled over one billion requests and is available to over 150 developers, data scientists, and applications throughout Monsanto.

Tim Williamson

September 28, 2016

Transcript

  1. 1.

    Using Graph Databases to Operationalize Insights from Big Data

    Emil Eifrem, CEO @ Neo Technology; Tim Williamson, Data Scientist @ Monsanto
  2. 2.

    Why are we here Today?

    1. What is a Graph? 2. Graphs in Real-Time 3. Graphs are Feeding the World
  3. 3.
  4. 7.

    A Way of Representing Data

    Relational Database (1980s). Good for:
    • Well-understood data structures that don’t change too frequently
    • Known problems involving discrete parts of the data, or minimal connectivity
  5. 8.

    A Way of Representing Data

    Graph Database (2016). Good for:
    • Dynamic systems: where the data topology is difficult to predict
    • Dynamic requirements: that evolve with the business
    • Problems where the relationships in data contribute meaning & value

    Relational Database (1980s). Good for:
    • Well-understood data structures that don’t change too frequently
    • Known problems involving discrete parts of the data, or minimal connectivity
  6. 13.
  7. 14.

    Cypher Example

    HR Query in SQL

    The Same Query using Cypher:

    MATCH (boss)-[:MANAGES*0..3]->(sub), (sub)-[:MANAGES*1..3]->(report)
    WHERE boss.name = "John Doe"
    RETURN sub.name AS Subordinate, count(report) AS Total

    Project Impact

    Less time writing queries:
    • More time understanding the answers
    • Leaving time to ask the next question

    Less time debugging queries:
    • More time writing the next piece of code
    • Improved quality of overall code base

    Code that’s easier to read:
    • Faster ramp-up for new project members
    • Improved maintainability & troubleshooting
  8. 17.

    Low Latency Query Performance

    “We found Neo4j to be literally thousands of times faster than our prior MySQL solution, with queries that require 10-100 times less code. Today, Neo4j provides eBay with functionality that was previously impossible.” - Volker Pacher, Senior Developer

    “Minutes to milliseconds” performance

    Queries up to 1000x faster than RDBMS or other NoSQL
  9. 18.

    Fastest Growing Category in Big Data

    [Chart: popularity changes by database category, Jan 2013 to Sep 2015. Series: Graph Database, Relational database, Wide column stores, RDF stores, Document stores, Search engines, Native XML DBMS, Key-value stores, Object oriented DBMS, Multivalue DBMS, Time Series DBMS. © DB-Engines.com 2015]
  10. 19.

    Popular Graph Database Use Cases

    • Real-Time Recommendations
    • Fraud Detection
    • Network & IT Operations
    • Master Data Management
    • Graph-Based Search
    • Identity & Access Management
  11. 21.

    Real-Time When Emil Was in School

    “A system is said to be real-time if the total correctness of an operation depends not only upon its logical correctness, but also upon the time limit in which it is performed.” Shin, K.G.; Ramanathan, P. (Jan 1994). “Real-time computing: a new discipline of computer science and engineering”. Proceedings of the IEEE.
  12. 22.

    Real-Time In Web 2.0

    “My focus will be companies exploiting ‘real-time data,’ which is ‘the next billion dollar market opportunity.’” - Ron Conway, angel investor, “godfather of Silicon Valley” (interview in TechCrunch, 2009)
  13. 32.
  14. 33.

    The Operational Uses for Ancestry are Numerous

    § Which crosses are predicted to be the most effective?
    § Where in the pipeline are the descendants of a cross?
    § Are the results of high-throughput genotyping correct?
    § What is the frequency of commercial success?
    § Etc…

    Questions like these are asked from applications across the pipeline, all serving scientists expecting to make rapid decisions

    @TimWilliate
  15. 34.

    Operationalizing Ancestry Requires Low-Latency Reads

    A population at the “advancing” horizon of the pipeline can easily have an ancestry > 50 levels deep
  16. 36.

    Accessing Genetic Ancestry in a RESTful Style

    § Domain-centric API
    § ~ 40 API resources
    § ~ 20 query grammar elements
  17. 37.
  18. 38.

    Accessing Genetic Ancestry in a RESTful Style

    § Domain-centric API
    § ~ 40 API resources
    § ~ 20 query grammar elements

    /population/5/ancestors

    {"nodes": [{"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}, {"id": 5}],
     "relationships": [
       {"from": 1, "to": 3, "parental_role": "female"},
       {"from": 2, "to": 3, "parental_role": "male"},
       {"from": 3, "to": 4, "parental_role": "female"},
       {"from": 4, "to": 5, "parental_role": "female"}
     ]}
  19. 39.

    Accessing Genetic Ancestry in a RESTful Style

    § Domain-centric API
    § ~ 40 API resources
    § ~ 20 query grammar elements

    /population/5/ancestors

    {"nodes": [{"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}, {"id": 5}],
     "relationships": [
       {"from": 1, "to": 3, "parental_role": "female"},
       {"from": 2, "to": 3, "parental_role": "male"},
       {"from": 3, "to": 4, "parental_role": "female"},
       {"from": 4, "to": 5, "parental_role": "female"}
     ]}

    /population/5/binary-cross

    {"female": {"id": 1}, "male": {"id": 2}}
  20. 40.

    An Ops View of Ancestry-as-a-Service

    § 2 years continuous production operation
    § > 200 application and data scientist users
    § Store Size
      - ~ 800 million nodes
      - ~ 1.3 billion relationships
      - ~ 1.8 billion properties

    Continuous and peaky mixed read/write load
  21. 41.

    The Ultimate Value of Ancestry is Realized in the Biological Information it Allows to be Linked
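
As a rough illustration of what the Cypher query on slide 14 computes, the sketch below walks a small in-memory org chart in plain Python. The names in `MANAGES` and the helper functions are hypothetical and not from the talk. It assumes the hierarchy is a tree, so counting paths (which is what Cypher's `count(report)` does) gives the same result as counting people. As with a non-optional `MATCH`, subordinates with no reports do not appear in the result.

```python
from collections import Counter

# Hypothetical org chart: manager -> direct reports. Assumed to be a tree,
# so each person is reachable by exactly one MANAGES path from an ancestor.
MANAGES = {
    "John Doe": ["Ann", "Bob"],
    "Ann": ["Cid", "Dee"],
    "Bob": ["Eve"],
    "Cid": ["Fay"],
}


def reachable(start, min_hops, max_hops):
    """Yield every person at the end of a MANAGES path of min_hops..max_hops."""
    frontier = [start]
    for hops in range(max_hops + 1):
        if hops >= min_hops:
            yield from frontier
        # Move one level down the hierarchy.
        frontier = [r for person in frontier for r in MANAGES.get(person, [])]


def project_impact(boss):
    """Mirror (boss)-[:MANAGES*0..3]->(sub), (sub)-[:MANAGES*1..3]->(report)."""
    totals = Counter()
    for sub in reachable(boss, 0, 3):          # *0..3 includes the boss
        for _report in reachable(sub, 1, 3):   # *1..3 excludes sub itself
            totals[sub] += 1
    return dict(totals)


if __name__ == "__main__":
    for name, total in project_impact("John Doe").items():
        print(name, total)
```

The two nested loops correspond to the two variable-length patterns in the query; the Cypher version states the same traversal declaratively in one `MATCH` clause.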
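
The example payloads on slides 38 and 39 can be reproduced with a minimal sketch like the one below. It is not Monsanto's implementation: the `PARENTS` table and both function names are hypothetical, and the meaning given to `binary-cross` (the two parents of the nearest ancestor that has both a female and a male parent recorded) is only an assumption inferred from the slide's example output.

```python
# Hypothetical parent records matching the slide's example payload:
# child id -> {parental_role: parent id}
PARENTS = {
    3: {"female": 1, "male": 2},
    4: {"female": 3},
    5: {"female": 4},
}


def ancestors(population_id):
    """Build a body shaped like the /population/{id}/ancestors example."""
    seen = {population_id}
    relationships = []
    stack = [population_id]
    while stack:
        child = stack.pop()
        for role, parent in PARENTS.get(child, {}).items():
            relationships.append(
                {"from": parent, "to": child, "parental_role": role}
            )
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return {
        "nodes": [{"id": i} for i in sorted(seen)],
        "relationships": sorted(
            relationships, key=lambda r: (r["to"], r["from"])
        ),
    }


def binary_cross(population_id):
    """Assumed semantics: parents of the nearest two-parent ancestor (or self)."""
    current = population_id
    while current in PARENTS:
        parents = PARENTS[current]
        if "female" in parents and "male" in parents:
            return {
                "female": {"id": parents["female"]},
                "male": {"id": parents["male"]},
            }
        # Only one parent recorded: step up one generation.
        current = next(iter(parents.values()))
    return None
```

With this data, `ancestors(5)` and `binary_cross(5)` return the same structures as the slide's `/population/5/ancestors` and `/population/5/binary-cross` examples. A production service would run the traversal inside the graph database rather than in application code.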