Using Graph Databases to Operationalize Insights from Big Data

This was a talk given by Tim Williamson and Emil Eifrem at Strata + Hadoop World NYC on September 28, 2016:
http://conferences.oreilly.com/strata/hadoop-big-data-ny/public/schedule/detail/52052

Abstract
Enterprises that pursue data-driven operations and decisions are approaching the conclusion that graph analysis capabilities will yield critical competitive advantages. However, for this impact to be fully realized, the results of any graph analysis must be available, in real time, to operational applications, data scientists, and developers across the enterprise.

Monsanto previously attempted graph analysis using both RDBMS-based and offline batch processing techniques. In the process, it found that some approaches couldn’t drill deeply enough to yield the necessary insights; others were limited in their expressibility and therefore in their general usefulness outside the data science lab; and still others couldn’t provide answers quickly enough to be useful to the business. Monsanto finally selected a graph database, used alongside a broader tech stack that includes Apache Kafka, Spark, and Oracle. This stack allows Monsanto not just to derive but also to operationalize insights that have allowed it to shorten R&D cycles, better understand the dynamics of its business, and carry out certain types of science in silico.

Tim Williamson and Emil Eifrem draw on Monsanto’s real-world experience to explain how organizations can use graph databases to operationalize insights from big data. Tim and Emil discuss Monsanto’s big data stack, using examples from Monsanto’s substantial experience with graphs, and describe the service-oriented graph architecture that has already handled over one billion requests and is available to over 150 developers, data scientists, and applications throughout Monsanto.

Tim Williamson

September 28, 2016

Transcript

  1. Using Graph Databases to Operationalize Insights from Big Data

    Emil Eifrem – CEO @ Neo Technology; Tim Williamson – Data Scientist @ Monsanto
  2. Why are we here Today?

    1. What is a Graph? 2. Graphs in Real-Time 3. Graphs are Feeding the World
  3. @TimWilliate Data Management in 1980

    Paper Forms; Tiny RAM; Spinning Platters (Low Capacity / Sequential IO)
  4. Traditional DBMS Technology

  5. Data Management in 2016

    Dynamic Real-World Systems; SSD/Flash (High-Capacity Storage & Ultra-Fast Random I/O); Abundant RAM
  6. A Way of Representing Data DATA DATA

  7. A Way of Representing Data

    Relational Database (1980s). Good for: • Well-understood data structures that don’t change too frequently • Known problems involving discrete parts of the data, or minimal connectivity
  8. A Way of Representing Data

    Graph Database (2016). Good for: • Dynamic systems: where the data topology is difficult to predict • Dynamic requirements: that evolve with the business • Problems where the relationships in data contribute meaning & value. Relational Database (1980s). Good for: • Well-understood data structures that don’t change too frequently • Known problems involving discrete parts of the data, or minimal connectivity
  9. A Graph Is

    Diagram labels: NODE, RELATIONSHIP (KNOWS), PROPERTIES (NAME: ANN, AGE: 32)
  10. A Graph Is

  11. A Graph Is

  12. A Graph Is

  13. Describing Graphs Business Domain Ann Dan Loves Graph Data Model

    (Dan) (Ann) -[:LOVES]-> Cypher Query
  14. Cypher Example

    HR Query in SQL. The Same Query using Cypher: MATCH (boss)-[:MANAGES*0..3]->(sub), (sub)-[:MANAGES*1..3]->(report) WHERE boss.name = "John Doe" RETURN sub.name AS Subordinate, count(report) AS Total. Project Impact: Less time writing queries: • More time understanding the answers • Leaving time to ask the next question. Less time debugging queries: • More time writing the next piece of code • Improved quality of overall code base. Code that’s easier to read: • Faster ramp-up for new project members • Improved maintainability & troubleshooting
  15. Users Love Cypher

  16. openCypher

  17. Low Latency Query Performance

    “We found Neo4j to be literally thousands of times faster than our prior MySQL solution, with queries that require 10-100 times less code. Today, Neo4j provides eBay with functionality that was previously impossible.” - Volker Pacher, Senior Developer. “Minutes to milliseconds” performance. Queries up to 1000x faster than RDBMS or other NoSQL.
  18. Fastest Growing Category in Big Data

    Chart (© DB-Engines.com 2015): popularity changes by database category, Jan 2013 to Sep 2015. Categories: Graph Database, Relational database, Wide column stores, RDF stores, Document stores, Search engines, Native XML DBMS, Key-value stores, Object oriented DBMS, Multivalue DBMS, Time Series DBMS.
  19. Popular Graph Database Use Cases

    Real-Time Recommendations; Fraud Detection; Network & IT Operations; Master Data Management; Graph-Based Search; Identity & Access Management
  20. What is Real-Time? @TimWilliate

  21. Real-Time When Emil Was in School

    “A system is said to be real-time if the total correctness of an operation depends not only upon its logical correctness, but also upon the time limit in which it is performed.” Shin, K.G.; Ramanathan, P. (Jan 1994). “Real-time computing: a new discipline of computer science and engineering”. Proceedings of the IEEE.
  22. Real-Time In Web 2.0

    “My focus will be companies exploiting ‘real-time data,’ which is ‘the next billion dollar market opportunity.’” Ron Conway, angel investor, godfather of Silicon Valley. Interview in TechCrunch, 2009.
  23. Real-Time Emil and Tim’s Definition of Real-Time Data

  24. Real-Time Emil and Tim’s Definition of Real-Time Data

  25. Real-Time Emil and Tim’s Definition of Real-Time Data

  26. Real-Time Emil and Tim’s Definition of Real-Time Data

  27. Graphs Are Feeding the World @TimWilliate

  28. Improving Genetics has Scaled Agricultural Output for Millennia @TimWilliate

  29. Modern Breeding Techniques Accelerated this Gain Source: http://www.ers.usda.gov/data-products/feed-grains-database/feed-grains-yearbook-tables.aspx @TimWilliate

  30. Selecting Better Plants via Field Trial @TimWilliate

  31. Rapid Breeding Improvement Derives from Cycling @TimWilliate

  32. None
  33. The Operational Uses for Ancestry are Numerous @TimWilliate

    § Which crosses are predicted to be the most effective? § Where in the pipeline are the descendants of a cross? § Are the results of high-throughput genotyping correct? § What is the frequency of commercial success? § Etc… Questions like these are asked from applications across the pipeline, all serving scientists expecting to make rapid decisions
  34. Operationalizing Ancestry Requires Low-Latency Reads @TimWilliate

    A population at the “advancing” horizon of the pipeline can easily have an ancestry > 50 levels deep
  35. Low Latency Reads + Fresh Data = Real-Time Data @TimWilliate

  36. Accessing Genetic Ancestry in a RESTful Style @TimWilliate

    § Domain-centric API § ~ 40 API resources § ~ 20 query grammar elements
  37. Accessing Genetic Ancestry in a RESTful Style @TimWilliate

    § Domain-centric API § ~ 40 API resources § ~ 20 query grammar elements
  38. Accessing Genetic Ancestry in a RESTful Style @TimWilliate

    § Domain-centric API § ~ 40 API resources § ~ 20 query grammar elements. /population/5/ancestors: {"nodes": [ {"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}, {"id": 5} ], "relationships": [ {"from": 1, "to": 3, "parental_role": "female"}, {"from": 2, "to": 3, "parental_role": "male"}, {"from": 3, "to": 4, "parental_role": "female"}, {"from": 4, "to": 5, "parental_role": "female"} ]}
  39. Accessing Genetic Ancestry in a RESTful Style @TimWilliate

    § Domain-centric API § ~ 40 API resources § ~ 20 query grammar elements. /population/5/ancestors: {"nodes": [ {"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}, {"id": 5} ], "relationships": [ {"from": 1, "to": 3, "parental_role": "female"}, {"from": 2, "to": 3, "parental_role": "male"}, {"from": 3, "to": 4, "parental_role": "female"}, {"from": 4, "to": 5, "parental_role": "female"} ]} /population/5/binary-cross: { "female": {"id": 1}, "male": {"id": 2} }
  40. An Ops View of Ancestry-as-a-Service @TimWilliate

    § 2 years continuous production operation § > 200 application and data scientist users § Store Size - ~ 800 million nodes - ~ 1.3 billion relationships - ~ 1.8 billion properties. Continuous and peaky mixed read/write load
  41. The Ultimate Value of Ancestry is Realized in the Biological Information it Allows to be Linked @TimWilliate
  42. Corn Parent Galaxy

    The complete genetic history of every corn parent at Monsanto
  43. Selecting Better Plants via Genome Wide Selection @TimWilliate

  44. Thank You!
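
The sample responses on slides 38 and 39 can be reproduced with a small in-memory sketch. The Python below is illustrative only: the function names (`parents_index`, `ancestors`, `binary_cross`) are hypothetical and are not Monsanto's API, and the rule used for `binary_cross` (walk up single-parent links until a population with both a female and a male parent is found) is an assumption chosen because it matches the sample response on the slide; the talk does not state the actual rule.

```python
from collections import defaultdict

# Sample payload copied from the /population/5/ancestors slide.
ANCESTRY = {
    "nodes": [{"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}, {"id": 5}],
    "relationships": [
        {"from": 1, "to": 3, "parental_role": "female"},
        {"from": 2, "to": 3, "parental_role": "male"},
        {"from": 3, "to": 4, "parental_role": "female"},
        {"from": 4, "to": 5, "parental_role": "female"},
    ],
}


def parents_index(payload):
    """Map each child id to a list of (parent_id, parental_role) pairs."""
    index = defaultdict(list)
    for rel in payload["relationships"]:
        index[rel["to"]].append((rel["from"], rel["parental_role"]))
    return index


def ancestors(payload, population_id):
    """Return the ids of every ancestor reachable by following parent links."""
    index = parents_index(payload)
    seen = set()
    stack = [population_id]
    while stack:
        current = stack.pop()
        for parent_id, _role in index.get(current, []):
            if parent_id not in seen:
                seen.add(parent_id)
                stack.append(parent_id)
    return seen


def binary_cross(payload, population_id):
    """Find the nearest ancestor cross with both a female and a male parent.

    Assumption (hypothetical rule): follow single-parent links upward until
    a population with two parental roles is found; return None otherwise.
    """
    index = parents_index(payload)
    current = population_id
    visited = set()
    while current not in visited:
        visited.add(current)
        parents = index.get(current, [])
        roles = {role: {"id": parent_id} for parent_id, role in parents}
        if "female" in roles and "male" in roles:
            return {"female": roles["female"], "male": roles["male"]}
        if len(parents) != 1:
            return None  # no parents recorded, or an ambiguous record
        current = parents[0][0]
    return None  # cycle guard; a well-formed pedigree should not reach here
```

For population 5 in the sample payload, this sketch yields ancestors 1 through 4 and a binary cross of female 1 and male 2, which agrees with the responses shown on the slides. A production service would push this traversal into the graph database rather than walking a payload in application code.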