Graphs are Feeding the World

Tim Williamson (@TimWilliate), Data Scientist, Monsanto

Our Growing Planet Faces Difficult Challenges
• Rising Population: growing enough for a growing world. Global population: 4.4B (1980), 7.1B (today), 9.6B+ (2050)
• Limited Farmland: farmers will need to produce enough food with fewer resources to support our world population. Acres per person: 1 (1961), <1/3 (2050)
• Changing Economies and Diets: a growing global middle class is choosing animal protein (meat, eggs, and dairy) as a larger part of their diet. Dietary percentage of protein: 9% (1965), 14% (2030)
• Changing Climate: farmers are impacted by climate change in many ways: water availability issues, increasingly unpredictable weather, insect range expansion, weed pressure changes, crop disease increases, planting zone shifts
Sources: http://esa.un.org/unpd/wpp/; UN FAO Food Balance Sheet, “World Health Organization Global and regional food consumption patterns and trends”; The World Bank, Food and Agriculture Organization of the United Nations (FAO-STAT), Monsanto Internal Calculations
@TimWilliate #MonDataScience

Improved Genetic Gain is One of Several Tools Humanity has to Address These Challenges
• 8 commodity crops and 18 vegetable crop families, sold in 160 countries
[Chart: Average US Corn Yield, 1866 - 2014. Y axis: Yield (Bushels/Acre), 0 to 180. X axis: Year, 1865 to 2015. Annotation: 10,000 Years]
Sources: http://www.ers.usda.gov/data-products/feed-grains-database/feed-grains-yearbook-tables.aspx

Genetic Gain is Created Through Breeding Cycles
[Diagram: a cross (X) feeds a selection pipeline with stages labeled Screening and Field Trials, drawing on Lab Data (Genotypes) and Field Data (Phenotypes). All progeny of two parents enter; the best one leaves to become a future parent.]
• Select the best, discard the rest
• 1000’s crosses/year
• Dozens progeny/cross
• 5-10 locations/progeny
• $3-5 million/year

Every Breeding Cycle Extends a Tree of Genetic Ancestry
[Diagram: ancestry tree with nodes A, B, C]

A single parent

Forcing Genetic Ancestry Data into Rows and Columns
• In our relational store, genetic ancestry data was spread across a hierarchy of ~11 tables representing a total of ~895 million rows
• Every read became an unpleasant exercise in CONNECT BY PRIOR
[Schema: Plant (plant id, attributes…); Plant:Plant Relationship (plant id, parent plant id, parental role)]

Given a Starting Population, Return All Ancestors
[Chart: Response Time (s), 0 to 30, against Depth, 1 to 15. Series: SQL on Oracle Exadata]

Genetic Ancestry is a Naturally Occurring Graph
• ~700 million nodes
• ~1.2 billion relationships
• ~1.7 billion properties
[Graph model: node labels :Plant, :Plant Inventory, :Planting, :Selection; relationship types :PARENT, :PLANTED, :SELECTED, :HARVESTED, :INVENTORY]

Given a Starting Population, Return All Ancestors
[Chart: Response Time (s), 0 to 30, against Depth, 1 to 15. Series: SQL on Oracle Exadata; Traversal Framework on Neo4j. Annotation: ~90x Difference]

Retrieving Genetic Ancestry in a ‘RESTful’ Style
RESTful Resource: /population/1/ancestors
    {"nodes": [
        {"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}, {"id": 5}, {"id": 6}
     ],
     "relationships": [
        {"from": 1, "to": 2, "relation": "PARENT"},
        {"from": 2, "to": 3, "relation": "PARENT"},
        {"from": 2, "to": 4, "relation": "PARENT"},
        {"from": 3, "to": 5, "relation": "PARENT"},
        {"from": 4, "to": 6, "relation": "PARENT"}
     ]}
[Graph: nodes 1 to 6 joined by five :PARENT relationships, one with {parental_role: male} and four with {parental_role: female}]
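The ancestors resource above can be pictured as a breadth-first walk over :PARENT relationships. The following is a minimal sketch, not the service's actual traversal code: the `PARENTS` dictionary is a hypothetical in-memory stand-in for the graph, and the parental roles are read off the slide's example (the binary-cross slide reports 4 as male and 3 as female).

```python
from collections import deque

# Hypothetical in-memory stand-in for the graph on the slide:
# child id -> list of (parent id, parental role).
PARENTS = {
    1: [(2, "female")],
    2: [(3, "female"), (4, "male")],
    3: [(5, "female")],
    4: [(6, "female")],
}

def ancestors(start, parents=PARENTS):
    """Walk PARENT relationships breadth-first and return a nodes/relationships document."""
    nodes = [start]
    seen = {start}
    relationships = []
    queue = deque([start])
    while queue:
        child = queue.popleft()
        for parent, _role in parents.get(child, []):
            relationships.append({"from": child, "to": parent, "relation": "PARENT"})
            if parent not in seen:  # guard against visiting a shared ancestor twice
                seen.add(parent)
                nodes.append(parent)
                queue.append(parent)
    return {"nodes": [{"id": n} for n in nodes], "relationships": relationships}

result = ancestors(1)
```

For the example graph this yields the six nodes and five relationships shown in the slide's JSON.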

Building a Grammar for Ancestral Milestones
RESTful Resource: /population/1/binary-cross
    {"male": {"id": 4},
     "female": {"id": 3}}
[Graph: nodes 1 to 6 joined by five :PARENT relationships, one with {parental_role: male} and four with {parental_role: female}]
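One way to read the binary-cross milestone is "climb from the starting population until a node has both a male and a female parent". The sketch below works under that assumption; the rule for climbing through single-parent nodes and the `PARENTS` stand-in are illustrative, not taken from the service.

```python
# Hypothetical in-memory stand-in for the example graph:
# child id -> list of (parent id, parental role).
PARENTS = {
    1: [(2, "female")],
    2: [(3, "female"), (4, "male")],
    3: [(5, "female")],
    4: [(6, "female")],
}

def binary_cross(start, parents=PARENTS):
    """Return the first male/female parent pair found while climbing from start, or None."""
    current = start
    visited = set()
    while current not in visited:
        visited.add(current)
        edges = parents.get(current, [])
        by_role = {role: parent for parent, role in edges}
        if len(edges) == 2 and {"male", "female"} <= by_role.keys():
            return {"male": {"id": by_role["male"]}, "female": {"id": by_role["female"]}}
        if len(edges) != 1:
            return None  # no parents, or a shape this sketch does not handle
        current = edges[0][0]  # assumed rule: follow the single parent upward
    return None

cross = binary_cross(1)
```

Starting from population 1, the walk passes through node 2 and reports 4 as the male parent and 3 as the female parent, matching the slide.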

Pruning Genetic Ancestry Trees ‘On the Fly’
RESTful Resource: /population/1/ancestors?until-first=binary-cross
    {"nodes": [
        {"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}
     ],
     "relationships": [
        {"from": 1, "to": 2, "relation": "PARENT"},
        {"from": 2, "to": 3, "relation": "PARENT"},
        {"from": 2, "to": 4, "relation": "PARENT"}
     ]}
[Graph: nodes 1 to 6 joined by five :PARENT relationships, one with {parental_role: male} and four with {parental_role: female}]
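Pruning can be sketched as the same breadth-first walk with a stop condition: once a node turns out to be a binary cross, its two parents are reported but not expanded further. This is an illustrative reading of `until-first=binary-cross`, with a hypothetical `PARENTS` stand-in for the graph; the real evaluator is not shown in the deck.

```python
from collections import deque

# Hypothetical in-memory stand-in for the example graph:
# child id -> list of (parent id, parental role).
PARENTS = {
    1: [(2, "female")],
    2: [(3, "female"), (4, "male")],
    3: [(5, "female")],
    4: [(6, "female")],
}

def ancestors_until_first_binary_cross(start, parents=PARENTS):
    """Collect ancestors, but do not climb past the parents of a binary cross."""
    nodes = [start]
    seen = {start}
    relationships = []
    queue = deque([start])
    while queue:
        child = queue.popleft()
        edges = parents.get(child, [])
        is_cross = {role for _parent, role in edges} == {"male", "female"}
        for parent, _role in edges:
            relationships.append({"from": child, "to": parent, "relation": "PARENT"})
            if parent not in seen:
                seen.add(parent)
                nodes.append(parent)
                if not is_cross:  # parents of a cross are leaves of the pruned tree
                    queue.append(parent)
    return {"nodes": [{"id": n} for n in nodes], "relationships": relationships}

pruned = ancestors_until_first_binary_cross(1)
```

On the example graph the result stops at nodes 3 and 4, so nodes 5 and 6 never appear.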

Ancestry-as-a-Service is Released September 2014
[Diagram: Data Scientists and Application Developers use the REST API (Ancestry-as-a-Service)]
• >30 elements of RESTful grammar
• ~120 applications and data scientists
• >600 million REST requests
• 10x performance boost
• 1 month analysis now takes 3 hours

Real-Time Reads Require Real-Time Data
• Ingestion volume is ~10 million writes/day (not a write-heavy flow)
• https://github.com/MonsantoCo/goldengate-kafka-adapter
[Diagram: Field + Lab Applications, change messages, REST API, REST API (Ancestry-as-a-Service)]
Example change message:
    {"table": "foo",
     "type": "INSERT",
     "columns": [
        {"name": "bar", "before": "fizz", "after": "buzz"}
     ]}
Write endpoints:
    POST /population
    PUT /population/1234
    PUT /population/parents
    DELETE /population
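To make the message shape concrete, here is a small sketch that parses a change message like the one on the slide and picks an HTTP method for it. The mapping from change type to method (`METHOD_FOR_TYPE`) and the function name `to_request` are assumptions for illustration; the slide shows the message format and the write endpoints but not how one is translated into the other.

```python
import json

# Assumed mapping from change type to HTTP method; not stated on the slide.
METHOD_FOR_TYPE = {"INSERT": "POST", "UPDATE": "PUT", "DELETE": "DELETE"}

def to_request(message):
    """Turn one change message into (http_method, table, after-image of the row)."""
    event = json.loads(message)
    after_image = {column["name"]: column["after"] for column in event["columns"]}
    return METHOD_FOR_TYPE[event["type"]], event["table"], after_image

sample = json.dumps({
    "table": "foo",
    "type": "INSERT",
    "columns": [{"name": "bar", "before": "fizz", "after": "buzz"}],
})
request = to_request(sample)
```

A consumer built along these lines would still need to decide which endpoint a given table maps to, which this sketch leaves out.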

We’ve Got Ancestry Figured Out…What’s Next?
[Diagram labels: Genotype, Phenotype, Environment, Ancestry]

Layering Genotype Data Over Ancestry Trees
Genotype nodes act as simple pointers to remote systems which store the raw data
[Graph model: node labels :Plant, :Plant Inventory, :Planting, :Selection, :Genotype; relationship types :PARENT, :PLANTED, :SELECTED, :HARVESTED, :INVENTORY, :HAS_GENOTYPE]

Retrieving Ancestry Trees Annotated with Genotypes
RESTful Resource: /population/1/ancestors?until=genotyped-ancestor&props=genotypes
    {"nodes": [
        {"id": 1, "genotypes": [{"id": 123}]},
        {"id": 2},
        {"id": 3},
        {"id": 4, "genotypes": [{"id": 456}]},
        {"id": 5, "genotypes": [{"id": 789}]}
     ],
     "relationships": [
        {"from": 1, "to": 2, "relation": "PARENT"},
        {"from": 2, "to": 3, "relation": "PARENT"},
        {"from": 3, "to": 4, "relation": "PARENT"},
        {"from": 3, "to": 5, "relation": "PARENT"}
     ]}
[Graph: nodes 1 to 5 with three attached :Genotype nodes, with marker_count 300, 60,000, and 60,000]
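The `until=genotyped-ancestor` query can be sketched as a walk that stops climbing whenever it reaches an ancestor carrying a genotype, then annotates each node with its genotype ids. This is an illustrative reading only: `PARENTS` and `GENOTYPES` are hypothetical stand-ins for the graph, and the edge from 4 to 6 is not on the slide; it is added here so the pruning has something to cut.

```python
from collections import deque

# Hypothetical stand-ins for the slide's graph. The 4 -> 6 edge is NOT on the
# slide; it is included only to show that the walk stops at genotyped ancestors.
PARENTS = {1: [2], 2: [3], 3: [4, 5], 4: [6]}
GENOTYPES = {1: [123], 4: [456], 5: [789]}  # node id -> genotype ids

def ancestors_until_genotyped(start):
    """Climb PARENT relationships, stopping at each genotyped ancestor."""
    order = [start]
    seen = {start}
    relationships = []
    queue = deque([start])
    while queue:
        child = queue.popleft()
        for parent in PARENTS.get(child, []):
            relationships.append({"from": child, "to": parent, "relation": "PARENT"})
            if parent in seen:
                continue
            seen.add(parent)
            order.append(parent)
            if parent not in GENOTYPES:  # keep climbing only through ungenotyped nodes
                queue.append(parent)
    nodes = []
    for node_id in order:
        node = {"id": node_id}
        if node_id in GENOTYPES:
            node["genotypes"] = [{"id": g} for g in GENOTYPES[node_id]]
        nodes.append(node)
    return {"nodes": nodes, "relationships": relationships}

annotated = ancestors_until_genotyped(1)
```

The starting population is always expanded even though it has a genotype of its own, which is why node 1 appears with genotype 123 and the walk still continues upward.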

Estimate the Genotype of Every Seed Produced
[Diagram components: Genotypes, Field + Lab Applications, REST API, REST API (Ancestry-as-a-Service), Genotype Estimation Engine. Flow labels: Genotype Annotated Ancestry Trees, Required Genotype DataSets, Estimated Genotypes, New Estimated Genotypes, Messages]

Let’s Revisit the Flow of a Breeding Cycle
[Diagram: a cross (X) feeds a selection pipeline labeled Genome-Wide Selection, drawing on Lab Data (Genotypes), Estimate Hi-Res Genotypes, and Field Data (Phenotypes). All progeny of two parents enter; the best one leaves to become a future parent. The width of the pipeline increases to accommodate more crosses.]
• Select the best, discard the rest
• 1000’s crosses/year
• Dozens progeny/cross
• 1 genotype/progeny
• < $1 million/year

A Glimpse Inside Our Active ‘Graphy’ Work
Sources: http://biodiversitylibrary.org/page/27066167#page/125/mode/1up

Constructing Coancestry Matrices
[Diagram: ancestry tree with nodes A, B, C, D, E, F, G]
Coancestry(A):
         A     B     C     D     E     F     G
    A    1   0.5   0.5  0.25  0.25  0.25  0.25
    B          1     0   0.5   0.5     0     0
    C                1     0     0   0.5   0.5
    D                      1     0     0     0
    E                            1     0     0
    F                                  1     0
    G                                        1
• Consider a reduced ancestor tree only between crosses
• A progeny inherits 50% of its genetics from each parent
• Key input for a large class of predictive genetic analysis algorithms
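The values in the matrix follow from the 50% rule with a simple recurrence: an individual's entry against any earlier individual is the average of its two parents' entries. The sketch below applies that recurrence, in the style of the tabular method used for pedigree relationship matrices. The pedigree (A from B and C, B from D and E, C from F and G) is inferred from the matrix values rather than stated on the slide, and `relationship_matrix` is a hypothetical name.

```python
def relationship_matrix(pedigree):
    """pedigree: individual -> (parent, parent), or None for a founder.

    Individuals must be listed with parents before their progeny.
    Returns (ids, matrix) where matrix[i][j] relates ids[i] and ids[j].
    """
    ids = list(pedigree)
    index = {name: i for i, name in enumerate(ids)}
    size = len(ids)
    matrix = [[0.0] * size for _ in range(size)]
    for i, name in enumerate(ids):
        parents = pedigree[name]
        if parents is None:
            matrix[i][i] = 1.0  # founders are assumed unrelated and non-inbred
            continue
        p, q = index[parents[0]], index[parents[1]]
        matrix[i][i] = 1.0 + 0.5 * matrix[p][q]
        for j in range(i):
            # Half of the genetics comes from each parent.
            value = 0.5 * (matrix[j][p] + matrix[j][q])
            matrix[i][j] = matrix[j][i] = value
    return ids, matrix

# Pedigree inferred from the slide's matrix (an assumption, see above).
PEDIGREE = {
    "D": None, "E": None, "F": None, "G": None,
    "B": ("D", "E"),
    "C": ("F", "G"),
    "A": ("B", "C"),
}
ids, matrix = relationship_matrix(PEDIGREE)
row_a = dict(zip(ids, matrix[ids.index("A")]))
```

With this pedigree, row A comes out as 1 for A itself, 0.5 for B and C, and 0.25 for D, E, F, and G, matching the first row of the matrix.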

Thank You All
@TimWilliate
http://engineering.monsanto.com/
Special thanks to my teammates
• Jason Clark
• Marshall Marietta
