Materials Project overview - MTAGS 2012

Community Accessible Datastore of High-Throughput Calculations: Experiences from the Materials Project
Dan Gunter, Shreyas Cholia, Anubhav Jain, Michael Kocher, Kristin Persson, Lavanya Ramakrishnan, Shyue Ping Ong, Gerbrand Ceder

BACKGROUND November 12, 2012 Slide 1

Our energy future relies on the rapid development of novel functional materials. But it takes almost twenty years to develop new materials. How can we do it faster? Solar cells, advanced batteries, TCOs, and fuel cells will all play a role in our energy future.

Materials Genome Initiative
June 2011: Materials Genome Initiative, which aims to "fund computational tools, software, new methods for material characterization, and the development of open standards and databases that will make the process of discovery and development of advanced materials faster, less expensive, and more predictable"
Source: "Materials Genome Initiative for Global Competitiveness", http://www.whitehouse.gov/sites/default/files/microsites/ostp/materials_genome_initiative-final.pdf

It's the , stupid!
[Diagram labels: Really hard work on some computations; Fantastic paper in a journal; data; Black Hole; Drink margaritas; DB; Brilliant analysis; Escape velocity?]

Very specialized skill-set
[Diagram labels: Physics; Deep dive on specific software; Computer Science; Really hard work on some computations]

Example
[Figure: Predicted and measured performance of Li9V3(P2O7)3(PO4)2 during cell cycling.]
The Materials Project used quantum chemistry calculations to screen over 20,000 materials as potential cathodes for Li-ion batteries. From the results, three new materials were identified and tested, and currently have patents pending.

COMPONENTS

[Component diagram labels: Parallel computation; Parallel HPC resources; Datastore; Data dissemination; Collaborative tools; Web server; Analysis library; Science apps; Data V&V; Midrange compute resources; Workflow; HPC storage; Data; Data analytics]

NoSQL Datastore
•  Powerful but simple query language
•  Ease of administration
•  Good performance on read-heavy workloads where most of the data can fit into memory
•  Poor performance at huge scale
•  Bad for write-heavy workloads
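
To make "powerful but simple query language" concrete, here is a minimal sketch of a document-style query evaluated in plain Python. The slide does not name the datastore or its query syntax; the operator names (`$lt`, `$gte`, ...) assume a MongoDB-like style, and the documents, field names, and values are made up for illustration.

```python
import operator

# Assumed MongoDB-like comparison operators (not taken from the slides).
OPS = {
    "$lt": operator.lt,
    "$lte": operator.le,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$ne": operator.ne,
}

def matches(doc, query):
    """Return True if every field condition in `query` holds for `doc`."""
    for field, cond in query.items():
        if field not in doc:
            return False
        value = doc[field]
        if isinstance(cond, dict):
            # Operator form, e.g. {"$lt": -2.0}: all operators must hold.
            if not all(OPS[op](value, arg) for op, arg in cond.items()):
                return False
        elif value != cond:
            # Plain form: exact equality.
            return False
    return True

def find(docs, query):
    """Filter a list of documents with a query dictionary."""
    return [d for d in docs if matches(d, query)]

# Hypothetical documents; values are placeholders, not real results.
DOCS = [
    {"formula": "Fe2O3", "nelements": 2, "energy": -3.0},
    {"formula": "LiFePO4", "nelements": 4, "energy": -2.0},
    {"formula": "NaCl", "nelements": 2, "energy": -1.0},
]

binary_low_energy = find(DOCS, {"nelements": 2, "energy": {"$lt": -2.0}})
print([d["formula"] for d in binary_low_energy])
```

The query is itself just a dictionary, which is what keeps this style easy to script: a whole query can be built, stored, and sent as data.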

FireWorks workflow engine
•  Programmability. Scripting, not GUIs and DSLs.
•  Administration overhead. No extra servers.
•  Flexibility. DB support, reconfiguring running workflows.
Re-runs / Branches, Detours, Duplicates, Iteration
Why?! Need to do all this

FireWorks challenges
•  Detours (about 10-20% of jobs fail and must be rerun with different input parameters)
•  Branches (based on the result of a calculation, the entire workflow might need to be modified)
•  Duplicate job detection (if two workflows contain an identical step, ensure that the step is only run once)
The workflow must know when these use cases happen, and act appropriately based on the output of a job. How can the user define these use cases in advance?
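
One way to picture the three use cases is a loop in which each job returns an action that the workflow then acts on. The sketch below is not the FireWorks API; `job_key`, `run_workflow`, `toy_job`, the `detour` and `branch` keys, and the `mixing` parameter are all hypothetical names chosen for illustration.

```python
import hashlib
import json

def job_key(spec):
    """Stable fingerprint of a job spec, used for duplicate detection."""
    return hashlib.sha256(json.dumps(spec, sort_keys=True).encode()).hexdigest()

def run_workflow(specs, run_job):
    """Run job specs in order, reacting to the output of each job."""
    queue = list(specs)
    done = {}  # fingerprint -> result
    log = []   # specs that were actually executed, in order
    while queue:
        spec = queue.pop(0)
        key = job_key(spec)
        if key in done:
            continue  # duplicate: an identical step has already run
        result = run_job(spec)
        done[key] = result
        log.append(spec)
        if "detour" in result:
            # Detour: rerun immediately with different input parameters.
            queue.insert(0, result["detour"])
        # Branch: the job's result adds new steps to the workflow.
        queue.extend(result.get("branch", []))
    return done, log

def toy_job(spec):
    """Stand-in for a real calculation (purely illustrative)."""
    if spec["name"] == "relax" and spec["mixing"] > 0.2:
        return {"status": "failed", "detour": dict(spec, mixing=0.1)}
    if spec["name"] == "relax":
        follow_up = {"name": "band_structure", "mixing": spec["mixing"]}
        return {"status": "ok", "branch": [follow_up]}
    return {"status": "ok"}

specs = [
    {"name": "relax", "mixing": 0.4},
    {"name": "relax", "mixing": 0.4},  # identical step from a second workflow
    {"name": "static", "mixing": 0.4},
]
done, log = run_workflow(specs, toy_job)
print([s["name"] for s in log])
```

In this sketch the user defines the use cases in advance only indirectly: the job function decides what to return, and the loop stays generic. Whether that split suits real workloads is exactly the open question the slide raises.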

Web UI
[Screenshot labels: 3-D model of unit cell; Disqus comment button; Detailed structure; X-ray diffraction pattern (interactive); Bandstructure and density of states (interactive); Calculation iterations; Comments]

https://www.materialsproject.org/rest/v1/materials/Fe2O3/vasp/energy
•  Preamble: https://www.materialsproject.org/rest
•  Version: v1
•  Application: materials
•  I.D.: Fe2O3
•  Datatype: vasp
•  Property: energy
[Diagram labels: MAPI; pymatgen + Application]
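
The URL on this slide is annotated with six labels: Preamble, Version, Application, I.D., Datatype, Property. Assuming the labels map onto the URL segments in order, the scheme can be sketched as a pair of helpers that build and split such a URL. The function names `build_url` and `parse_url` are hypothetical, and no request is sent.

```python
from urllib.parse import urlsplit

PREAMBLE = "https://www.materialsproject.org/rest"

def build_url(identifier, datatype, prop,
              version="v1", application="materials", preamble=PREAMBLE):
    """Assemble a URL from its labeled parts (assumed order, see lead-in)."""
    return "/".join([preamble, version, application, identifier, datatype, prop])

def parse_url(url):
    """Split a URL of the same shape back into its labeled parts."""
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    if len(segments) != 6:
        raise ValueError("expected 6 path segments, got %d" % len(segments))
    rest, version, application, identifier, datatype, prop = segments
    return {
        "preamble": "%s://%s/%s" % (parts.scheme, parts.netloc, rest),
        "version": version,
        "application": application,
        "identifier": identifier,
        "datatype": datatype,
        "property": prop,
    }

url = build_url("Fe2O3", "vasp", "energy")
print(url)
print(parse_url(url))
```

Keeping the version in the path means a client can pin itself to `v1` while the service evolves behind a different version segment.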

pymatgen

LESSONS LEARNED

Running on HPC
•  Batch queues and large numbers of jobs with unpredictable runtimes
•  Talking to the database
   – getting off-node difficult
   – need to get databases white-listed

Data analytics
•  Scaling community contributions to code
   – tough under best of circumstances
   – this ain't the best of circumstances
•  Scaling analytic functions
   – "learning" which compounds are stable
   – need to get data to appropriate programming model (MapReduce, Parallel R, ...)

Data V&V
•  Loading new data into a production DB
   – No dedicated resources for this
   – Automation a must
•  Constant validation and verification
   – (see above)
   – MapReduce
   – Ticket/bug system

Data dissemination
•  Security and privacy
   – OpenID to reduce overhead
   – Sharing model for data ("sandboxes")
   – Shouldn't this build on broader practices?
      •  (A: yes, but how to do this and get something done)
•  Query performance
   – see next slide

[Figure: Query times, August 2012. Panels: number of queries versus time (seconds, 0.1 to 100); query time (seconds, log10 scale) versus date, 13-Aug to 27-Aug.]

FUTURE WORK

Opening up data access

Towards materials design
[Diagram labels: Source ideas; User sandboxes; MP Workflow; Compute properties; Stability and synthesis; pymatgen; MP datastore; Materials Project; steps (a) to (f)]

Questions?
materialsproject.org
