Slide 1

Slide 1 text

Using Graph Databases to Operationalize Insights from Big Data
Emil Eifrem – CEO @ Neo Technology
Tim Williamson – Data Scientist @ Monsanto

Slide 2

Slide 2 text

Why Are We Here Today?
1. What is a Graph?
2. Graphs in Real-Time
3. Graphs are Feeding the World

Slide 3

Slide 3 text

Data Management in 1980
• Paper Forms
• Tiny RAM
• Spinning Platters (Low Capacity / Sequential IO)
@TimWilliate

Slide 4

Slide 4 text

Traditional DBMS Technology

Slide 5

Slide 5 text

Data Management in 2016
• Dynamic Real-World Systems
• SSD/Flash (High-Capacity Storage & Ultra-Fast Random I/O)
• Abundant RAM

Slide 6

Slide 6 text

A Way of Representing Data
[Diagram label: DATA]

Slide 7

Slide 7 text

A Way of Representing Data
Relational Database (1980s)
Good for:
• Well-understood data structures that don’t change too frequently
• Known problems involving discrete parts of the data, or minimal connectivity

Slide 8

Slide 8 text

A Way of Representing Data
Relational Database (1980s)
Good for:
• Well-understood data structures that don’t change too frequently
• Known problems involving discrete parts of the data, or minimal connectivity
Graph Database (2016)
Good for:
• Dynamic systems: where the data topology is difficult to predict
• Dynamic requirements: that evolve with the business
• Problems where the relationships in data contribute meaning & value

Slide 9

Slide 9 text

A Graph Is
[Diagram labels: NODE; PROPERTIES (NAME: ANN, AGE: 32); RELATIONSHIP (KNOWS)]
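The three labels on this slide (node, properties, relationship) can be sketched as plain data structures. This is only an illustration of the property graph idea, not how Neo4j stores data; the class names, the `describe` helper, and the second node `dan` are assumptions added so the KNOWS relationship has an endpoint.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A graph node carrying arbitrary key/value properties."""
    node_id: int
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    """A directed, typed connection between two nodes."""
    start: Node
    end: Node
    rel_type: str
    properties: dict = field(default_factory=dict)


# The node from the slide: NAME: ANN, AGE: 32.
ann = Node(1, {"name": "Ann", "age": 32})
# Hypothetical second node (not on this slide) to give KNOWS an endpoint.
dan = Node(2, {"name": "Dan"})
knows = Relationship(ann, dan, "KNOWS")


def describe(rel: Relationship) -> str:
    """Render a relationship in a Cypher-like arrow notation."""
    start = rel.start.properties["name"]
    end = rel.end.properties["name"]
    return f"({start})-[:{rel.rel_type}]->({end})"
```

Calling `describe(knows)` renders the relationship in the same arrow notation that Cypher uses for patterns.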

Slide 10

Slide 10 text

A Graph Is

Slide 11

Slide 11 text

A Graph Is

Slide 12

Slide 12 text

A Graph Is

Slide 13

Slide 13 text

Describing Graphs
Business Domain / Graph Data Model: Ann, Loves, Dan [diagram labels]
Cypher Query: (Ann) -[:LOVES]-> (Dan)

Slide 14

Slide 14 text

Cypher Example
HR Query in SQL [figure]
The Same Query using Cypher:
    MATCH (boss)-[:MANAGES*0..3]->(sub),
          (sub)-[:MANAGES*1..3]->(report)
    WHERE boss.name = "John Doe"
    RETURN sub.name AS Subordinate, count(report) AS Total
Project Impact
Less time writing queries:
• More time understanding the answers
• Leaving time to ask the next question
Less time debugging queries:
• More time writing the next piece of code
• Improved quality of overall code base
Code that’s easier to read:
• Faster ramp-up for new project members
• Improved maintainability & troubleshooting
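To make the variable-length patterns in the Cypher query concrete, here is a rough sketch of the same traversal over an in-memory org chart. The `MANAGES` dictionary and every name in it other than "John Doe" are hypothetical, and the sketch assumes a tree-shaped hierarchy with unique names, so each path corresponds to exactly one row, as it would in the Cypher result.

```python
# Hypothetical org chart: manager -> direct reports (one MANAGES edge each).
MANAGES = {
    "John Doe": ["Alice", "Bob"],
    "Alice": ["Carol", "Dave"],
    "Carol": ["Eve"],
}


def within_hops(start, min_hops, max_hops):
    """Yield the end of every MANAGES path of min_hops..max_hops edges from start."""
    frontier = [start]
    for depth in range(max_hops + 1):
        if depth >= min_hops:
            yield from frontier
        # Move one level further down the hierarchy.
        frontier = [r for person in frontier for r in MANAGES.get(person, [])]


def subordinate_totals(boss):
    """Mimic: (boss)-[:MANAGES*0..3]->(sub), (sub)-[:MANAGES*1..3]->(report)."""
    totals = {}
    for sub in within_hops(boss, 0, 3):
        count = sum(1 for _ in within_hops(sub, 1, 3))
        if count:  # a sub with no reports produces no row, as in MATCH
            totals[sub] = totals.get(sub, 0) + count
    return totals
```

With the sample data, `subordinate_totals("John Doe")` returns one entry per person who has at least one report within three levels, including John Doe himself because the first pattern allows zero hops.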

Slide 15

Slide 15 text

Users Love Cypher

Slide 16

Slide 16 text

openCypher

Slide 17

Slide 17 text

Low Latency Query Performance
“We found Neo4j to be literally thousands of times faster than our prior MySQL solution, with queries that require 10-100 times less code. Today, Neo4j provides eBay with functionality that was previously impossible.”
- Volker Pacher, Senior Developer
• “Minutes to milliseconds” performance
• Queries up to 1000x faster than RDBMS or other NoSQL

Slide 18

Slide 18 text

Fastest Growing Category in Big Data
[Chart: popularity changes by DBMS category, Jan 2013 to Sep 2015, y-axis 100 to 700. Series: Graph Database, Relational Database, Wide column stores, RDF stores, Document stores, Search engines, Native XML DBMS, Key-value stores, Object oriented DBMS, Multivalue DBMS, Time Series DBMS. © DB-Engines.com 2015]

Slide 19

Slide 19 text

Popular Graph Database Use Cases
• Real-Time Recommendations
• Fraud Detection
• Network & IT Operations
• Master Data Management
• Graph-Based Search
• Identity & Access Management

Slide 20

Slide 20 text

What is Real-Time?

Slide 21

Slide 21 text

Real-Time When Emil Was in School
“A system is said to be real-time if the total correctness of an operation depends not only upon its logical correctness, but also upon the time limit in which it is performed.”
Shin, K.G.; Ramanathan, P. (Jan 1994). “Real-time computing: a new discipline of computer science and engineering”. Proceedings of the IEEE.
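The quoted definition can be read as a two-part check: the answer must be right, and it must arrive within the time limit. A minimal sketch of that reading follows; the function name `meets_deadline` and its parameters are illustrative assumptions, not something taken from the talk or the cited paper.

```python
import time


def meets_deadline(operation, expected, deadline_seconds, clock=time.monotonic):
    """Return True only if the result is right AND it arrived in time."""
    started = clock()
    result = operation()
    elapsed = clock() - started

    logically_correct = result == expected
    on_time = elapsed <= deadline_seconds
    # Under the quoted definition, a late correct answer still counts as a failure.
    return logically_correct and on_time
```

The `clock` parameter exists only so the check can be exercised with a fake clock instead of real waiting.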

Slide 22

Slide 22 text

Real-Time In Web 2.0
“My focus will be companies exploiting ‘real-time data,’ which is ‘the next billion dollar market opportunity.’”
Ron Conway, angel investor, “godfather of Silicon Valley”
Interview in TechCrunch, 2009

Slide 23

Slide 23 text

Real-Time Emil and Tim’s Definition of Real-Time Data

Slide 24

Slide 24 text

Real-Time Emil and Tim’s Definition of Real-Time Data

Slide 25

Slide 25 text

Real-Time Emil and Tim’s Definition of Real-Time Data

Slide 26

Slide 26 text

Real-Time Emil and Tim’s Definition of Real-Time Data

Slide 27

Slide 27 text

Graphs Are Feeding the World

Slide 28

Slide 28 text

Improving Genetics has Scaled Agricultural Output for Millennia

Slide 29

Slide 29 text

Modern Breeding Techniques Accelerated this Gain
Source: http://www.ers.usda.gov/data-products/feed-grains-database/feed-grains-yearbook-tables.aspx

Slide 30

Slide 30 text

Selecting Better Plants via Field Trial

Slide 31

Slide 31 text

Rapid Breeding Improvement Derives from Cycling

Slide 32

Slide 32 text

No content

Slide 33

Slide 33 text

The Operational Uses for Ancestry are Numerous
• Which crosses are predicted to be the most effective?
• Where in the pipeline are the descendants of a cross?
• Are the results of high-throughput genotyping correct?
• What is the frequency of commercial success?
• Etc…
Questions like these are asked from applications across the pipeline, all serving scientists expecting to make rapid decisions.

Slide 34

Slide 34 text

Operationalizing Ancestry Requires Low-Latency Reads
A population at the “advancing” horizon of the pipeline can easily have an ancestry > 50 levels deep

Slide 35

Slide 35 text

Low Latency Reads + Fresh Data = Real-Time Data

Slide 36

Slide 36 text

Accessing Genetic Ancestry in a RESTful Style
• Domain-centric API
• ~ 40 API resources
• ~ 20 query grammar elements

Slide 37

Slide 37 text

Accessing Genetic Ancestry in a RESTful Style
• Domain-centric API
• ~ 40 API resources
• ~ 20 query grammar elements

Slide 38

Slide 38 text

Accessing Genetic Ancestry in a RESTful Style
• Domain-centric API
• ~ 40 API resources
• ~ 20 query grammar elements
/population/5/ancestors
    {"nodes": [{"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}, {"id": 5}],
     "relationships": [
       {"from": 1, "to": 3, "parental_role": "female"},
       {"from": 2, "to": 3, "parental_role": "male"},
       {"from": 3, "to": 4, "parental_role": "female"},
       {"from": 4, "to": 5, "parental_role": "female"}
     ]}

Slide 39

Slide 39 text

Accessing Genetic Ancestry in a RESTful Style
• Domain-centric API
• ~ 40 API resources
• ~ 20 query grammar elements
/population/5/ancestors
    {"nodes": [{"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}, {"id": 5}],
     "relationships": [
       {"from": 1, "to": 3, "parental_role": "female"},
       {"from": 2, "to": 3, "parental_role": "male"},
       {"from": 3, "to": 4, "parental_role": "female"},
       {"from": 4, "to": 5, "parental_role": "female"}
     ]}
/population/5/binary-cross
    {"female": {"id": 1}, "male": {"id": 2}}
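The two example responses for population 5 can be reproduced from a small edge list, which may help show what each resource returns. This is a sketch only: the real service's implementation is not described in the slides, the function names are hypothetical, and the rule used for `binary_cross` (climb single-parent links until a population with both a female and a male parent is found) is an assumption inferred from the sample output.

```python
# Parent edges copied from the slide's sample response: (from, to, parental_role).
RELATIONSHIPS = [
    (1, 3, "female"),
    (2, 3, "male"),
    (3, 4, "female"),
    (4, 5, "female"),
]


def parents_of(population_id):
    """Return {parental_role: parent_id} for one population."""
    return {role: src for src, dst, role in RELATIONSHIPS if dst == population_id}


def ancestors(population_id):
    """Build an /ancestors-style payload by walking parent edges iteratively."""
    seen = {population_id}
    stack = [population_id]
    rels = []
    while stack:  # explicit stack, so very deep pedigrees do not hit recursion limits
        child = stack.pop()
        for role, parent in parents_of(child).items():
            rels.append({"from": parent, "to": child, "parental_role": role})
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return {
        "nodes": [{"id": i} for i in sorted(seen)],
        "relationships": sorted(rels, key=lambda r: (r["to"], r["from"])),
    }


def binary_cross(population_id):
    """Assumed semantics: nearest ancestor cross that has both parental roles."""
    current = population_id
    while True:
        parents = parents_of(current)
        if "female" in parents and "male" in parents:
            return {"female": {"id": parents["female"]},
                    "male": {"id": parents["male"]}}
        if not parents:
            return None  # reached a founder without finding a two-parent cross
        current = next(iter(parents.values()))
```

For the sample edges, `ancestors(5)` and `binary_cross(5)` produce the same payloads as the slide. The iterative walk is a deliberate choice given that real pedigrees are described as more than 50 levels deep.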

Slide 40

Slide 40 text

An Ops View of Ancestry-as-a-Service
• 2 years continuous production operation
• > 200 application and data scientist users
• Store Size
  - ~ 800 million nodes
  - ~ 1.3 billion relationships
  - ~ 1.8 billion properties
Continuous and peaky mixed read/write load

Slide 41

Slide 41 text

The Ultimate Value of Ancestry is Realized in the Biological Information it Allows to be Linked

Slide 42

Slide 42 text

Corn Parent Galaxy
The complete genetic history of every corn parent at Monsanto

Slide 43

Slide 43 text

Selecting Better Plants via Genome Wide Selection

Slide 44

Slide 44 text

Thank You!