Materials Project Validation, Provenance, and Sandboxes

Dan Gunter
August 06, 2014

Summary of goals, progress, and next steps for three aspects of the Materials Project (materialsproject.org) infrastructure:

* Validation: constantly guard against bugs in core data and imported data

* Provenance: know how data came to be

* Sandboxes: combine public and non-public data; "good fences make good neighbors"

Transcript

  1. Goals
     * Validation: constantly guard against bugs in core data and imported data
     * Provenance: know how data came to be
     * Sandboxes: combine public and non-public data; "good fences make good neighbors"
  2. Validation runs all the time
     * Rules with "constraints" for every database (and sandbox)
     * Test constraints against entire DB every night → email reports
     * Validation engine, etc. all open-source software in pymatgen-db
     [Diagram labels: Remote server, Validation engine, Rules, MP Databases, Reports (email, web pages, ..)]
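One way a nightly run could test a constraint against an entire database is to search for documents that violate it, rather than loading every document and checking it in application code. The sketch below assumes a MongoDB-style query language; `violation_query` and the `NEGATED` table are hypothetical names for illustration and are not taken from pymatgen-db.

```python
# Hypothetical helper, NOT pymatgen-db's API: each rule operator is mapped
# to the MongoDB query operator that matches documents BREAKING the rule.
NEGATED = {
    "<=": "$gt",
    "<": "$gte",
    ">=": "$lt",
    ">": "$lte",
    "=": "$ne",
    "!=": "$eq",
}


def violation_query(constraint):
    """Turn 'field op value' into a query dict that selects violators."""
    field, op, raw = constraint.split(None, 2)
    if op not in NEGATED:
        raise ValueError("unsupported operator: " + op)
    try:
        value = float(raw)            # numeric literal
    except ValueError:
        value = raw.strip("'\"")      # quoted string literal
    return {field: {NEGATED[op]: value}}


print(violation_query("final_energy_per_atom <= 0"))
```

An empty result set for the generated query would mean the constraint holds for the whole collection; any returned documents would go into the report.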
  3. Rules have a simple syntax

     _aliases:
       - snl_id = mps_id
       - energy = analysis.e_above_hull
     materials:
       - filter:
         constraints:
           - final_energy_per_atom <= 0
           - initial_structure.lattice.volume > 0
           - initial_structure.lattice.a > 0
           - initial_structure.lattice.b > 0
           - initial_structure.lattice.c > 0
           - initial_structure.lattice.matrix size 3
           - formation_energy_per_atom <= 5
           - formation_energy_per_atom > -5
           - cpu_time > 5
           - e_above_hull > -0.000001
           - final_energy < 0
           - reduced_cell_formula size$ nelements
       # Check num. ICSD sources for selected compounds
       - filter:
           - task_id = "mp-540081"
         constraints:
           - icsd_id size> 10
       - filter:
           - task_id = "mp-20379"
         constraints:
           - icsd_id size 1
       - filter:
           - task_id = "mp-13634"
         constraints:
           - icsd_id size> 0
       - filter:
           - task_id = "mp-600022"
         constraints:
           - icsd_id size 0
       # NiO2 phases should never become stable
       - filter:
           - e_above_hull = 0
         constraints:
           - pretty_formula != 'NiO2'
     tasks:
       - filter:
           - state = "successful"
         constraints:
           - output.final_energy_per_atom <= 0
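To make the rule syntax concrete, here is a minimal sketch of how a constraint string of the form `field op value` could be evaluated against one document. It is an illustration only, not the pymatgen-db engine: the function names are hypothetical, `size N` is assumed to mean "array length equals N", the sample document is made up, and aliases as well as the `size>` and `size$` operators are not handled.

```python
import operator

# Comparison operators that appear in the rules on the slide.
OPS = {
    "<=": operator.le,
    "<": operator.lt,
    ">=": operator.ge,
    ">": operator.gt,
    "=": operator.eq,
    "!=": operator.ne,
}


def get_path(doc, dotted):
    """Follow a dotted field name such as 'a.b.c' through nested dicts."""
    for key in dotted.split("."):
        doc = doc[key]
    return doc


def parse_value(raw):
    """Interpret the right-hand side as a number, else as a quoted string."""
    try:
        return float(raw)
    except ValueError:
        return raw.strip("'\"")


def check(doc, constraint):
    """Return True if doc satisfies 'field op value' or 'field size N'."""
    field, op, raw = constraint.split(None, 2)
    actual = get_path(doc, field)
    if op == "size":  # assumed meaning: exact array length
        return len(actual) == int(raw)
    return OPS[op](actual, parse_value(raw))


# Made-up document for illustration; the last rule is violated on purpose.
doc = {
    "final_energy_per_atom": -3.2,
    "formation_energy_per_atom": 7.1,
    "initial_structure": {
        "lattice": {"volume": 41.7, "matrix": [[1, 0, 0], [0, 1, 0], [0, 0, 1]]}
    },
}
rules = [
    "final_energy_per_atom <= 0",
    "initial_structure.lattice.volume > 0",
    "initial_structure.lattice.matrix size 3",
    "formation_energy_per_atom <= 5",
]
failures = [r for r in rules if not check(doc, r)]
print(failures)  # ['formation_energy_per_atom <= 5']
```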
  4. Validation summary
     Easy-to-use, integrated, efficient tools to report errors
     Next steps:
     * Record all check results in DB
     * More sophisticated checks (Map/Reduce)
     * Make it easier to add new checks internally
     * Make it easier for anyone to add new checks: per-sandbox or even per-user ("MP Alerts")
  5. Types of provenance in the system
     1) Calculation workflows: FireWorks records calculation inputs, .. results in great detail
     2) External datasets: Structure Notation Language standardizes the naming of data sources and publications
     3) Post-calculation data transformations: new "builders" provide a framework for tracking creation of final database products
  6. Provenance in DB
     Structure Notation Language

     "snl_final": {
       "about": {
         "created_at": {
           "string": "2014-02-22 19:07:00.383869",
           "@class": "datetime",
           "@module": "datetime"
         },
         "_materialsproject": {
           "submission_id": 52621,
           "snl_id": 398676,
           "spacegroup": {
             "lattice_type": "tetragonal",
             "symbol": "P4_2/mmc",
             "number": 131,
             "point_group": "4/mmm",
             "crystal_system": "tetragonal",
             "hall": "-P 4c 2"
           }
         },
         "_cedergroup": {
           "BURP_sids": [409544, 409545, 409546],
           "icsd_ids": [],
           "e_above_hull": 0.075125350000000423734
         },
         "references": "",
         "authors": [
           {"name": "Geoffroy Hautier", "email": "geoffroy.hautier@uclouvain.be"},
           {"name": "Bo Xu", "email": "[email protected]"}
         ],
         "remarks": ["supplementary compounds from MIT matgen database"],
         "projects": ["MIT matgen"],
         "history": [
           {
             "url": "http://www.fiz-karlsruhe.de/icsd_home.html",
             "name": "Inorganic Crystal Structure Database",
             "description": {"Collection code": 24692}
           },
           {
             "url": "",
             "name": "",
             "description": {
               "source": null,
               "orig_name": "Basic substitution code.",
               "formula": "O1 Pd1"
             }
           },
           {
             "url": "http://ceder.mit.edu/",
             "name": "MIT Ceder group research database",
             "description": {"source": 105986, "orig_name": "", "formula": "FeO"}
           },
           {
             "url": "http://www.materialsproject.org",
             "name": "Materials Project structure optimization",
             "description": {
               "fw_id": 820305,
               "task_type": "GGA optimize structure (2x)",
               "task_id": "mp-753682"
             }
           },
           {
             "url": "http://www.materialsproject.org",
             "name": "Materials Project structure optimization",
             "description": {
               "fw_id": 820308,
               "task_type": "GGA +U optimize structure (2x)",
               "task_id": "mp-776678"
             }
           }
         ]
       },

     [Slide annotations: Metadata, Crystal DB sources, References, History of structure optimizations]
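The `history` list in the record above is what carries the structure's lineage. As a rough illustration of how that lineage could be read back, the sketch below walks an SNL-style `about` block in list order and prints one line per entry. The `lineage` function is a hypothetical helper, not part of any Materials Project library, and the sample data is a two-entry excerpt of the record on the slide.

```python
def lineage(about):
    """Return one readable line per entry in the 'history' list, in list order."""
    lines = []
    for step, entry in enumerate(about.get("history", []), start=1):
        name = entry.get("name") or "(unnamed source)"
        desc = entry.get("description") or {}
        # Prefer a task id when present; otherwise list whatever keys exist.
        detail = desc.get("task_id") or ", ".join(
            f"{key}={value}" for key, value in sorted(desc.items())
        )
        lines.append(f"{step}. {name}: {detail}")
    return lines


# Excerpt of the "about" block shown on the slide.
about = {
    "history": [
        {
            "url": "http://www.fiz-karlsruhe.de/icsd_home.html",
            "name": "Inorganic Crystal Structure Database",
            "description": {"Collection code": 24692},
        },
        {
            "url": "http://www.materialsproject.org",
            "name": "Materials Project structure optimization",
            "description": {
                "fw_id": 820305,
                "task_type": "GGA optimize structure (2x)",
                "task_id": "mp-753682",
            },
        },
    ]
}

for line in lineage(about):
    print(line)
# 1. Inorganic Crystal Structure Database: Collection code=24692
# 2. Materials Project structure optimization: mp-753682
```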
  7. Future work: unified view of provenance
     [Diagram labels: VASP result (three instances), ICSD, Post-processing, Material properties, Computation, Data import, processing, e.g., Defects]
  8. Sandboxes = Database + Apps
     [Diagram labels: Core data (Non-JCESR users); Core data + multivalent materials (JCESR users)]
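The idea that a sandbox user sees core data plus the sandbox's additional data can be sketched as a merged, read-only view. Everything below is hypothetical: the record ids, the `visible_records` function, and the in-memory lists stand in for real databases and are not drawn from the Materials Project code.

```python
# Placeholder records; real sandboxes would be separate database collections.
CORE = [{"id": "core-1"}, {"id": "core-2"}]
SANDBOXES = {"jcesr": [{"id": "jcesr-1"}]}


def visible_records(core, sandboxes, memberships):
    """Core data is visible to everyone; sandbox data only to its members."""
    records = [dict(rec, sandbox="core") for rec in core]
    for name in sorted(memberships):
        for rec in sandboxes.get(name, []):
            records.append(dict(rec, sandbox=name))
    return records


print(len(visible_records(CORE, SANDBOXES, set())))       # 2
print(len(visible_records(CORE, SANDBOXES, {"jcesr"})))   # 3
```

Tagging each record with the sandbox it came from keeps the fence visible even after the data is combined.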
  9. Technical challenges
     * Pre-process data for real-time search
     * Interfaces for per-user access control
       - https://materialsproject.org/materials/1234?sandbox=jcesr
       - Web UI elements and
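The `?sandbox=jcesr` query parameter suggests that the sandbox is selected per request and then checked against the user's permissions. A minimal sketch of that check follows, using only the Python standard library; the `MEMBERS` table, the user names, and both function names are assumptions made for illustration.

```python
from urllib.parse import parse_qs, urlparse

# Hypothetical membership table: sandbox name -> set of authorized users.
MEMBERS = {"jcesr": {"alice"}}


def requested_sandbox(url):
    """Read the 'sandbox' query parameter, defaulting to the public core data."""
    query = parse_qs(urlparse(url).query)
    return query.get("sandbox", ["core"])[0]


def may_view(user, url):
    """Allow core data for everyone and sandbox data only for its members."""
    sandbox = requested_sandbox(url)
    return sandbox == "core" or user in MEMBERS.get(sandbox, set())


url = "https://materialsproject.org/materials/1234?sandbox=jcesr"
print(may_view("alice", url))  # True
print(may_view("bob", url))    # False
```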
  10. Future: dynamic sandbox creation
      Current:
      * Large & significant additional data / apps (e.g., JCESR)
      * Longer-term connections to MP data (e.g., porous materials)
      * Companies (e.g., VW/Stanford)
      Future: small collab. per-user? CoD?
  11. Summary
      * Validation: guard against bugs by checking all data daily and at data import/creation time
      * Provenance: universal standard for annotating data provenance
      * Sandboxes: unified view of distinct databases; onramp for new collaborations and data