Open Chemistry, JupyterLab and data: Reproducible quantum chemistry

The Open Chemistry project is developing an ambitious platform to facilitate reproducible quantum chemistry workflows by integrating the best-of-breed open source projects currently available into a cohesive platform, with extensions specific to the needs of quantum chemistry. The core of the project is a Python-based data server capable of storing metadata, executing quantum chemistry calculations, and processing the output. The platform exposes programming-language-agnostic RESTful web endpoints, and uses Linux container technology to package quantum codes that are often difficult to build.

The Jupyter project has been leveraged as a web-based frontend offering reproducibility as a core principle. This has been coupled with the data server to initiate quantum chemistry calculations, cache results, make them searchable, and even visualize the results within a modern browser environment. The Avogadro libraries have been reused for visualization workflows, coupled with Open Babel for file translation, and examples of the use of NWChem and Psi4 will be demonstrated.

The core of the platform is built on JSON data standards, and the project encourages wider adoption of JSON/HDF5 as the principal storage media. A single-page web application using React at its core will be shown for sharing simple views of data output, and for linking to the Jupyter notebooks that document how they were made. Command line tools and links to the Avogadro graphical interface will be shown, demonstrating capabilities from web through to desktop.

Marcus Hanwell

March 31, 2019

Transcript

  1. Reproducible Quantum Chemistry Dr. Marcus D. Hanwell @mhanwell Technical Leader

    American Chemical Society Orlando, FL 31 March, 2019
  2. What Is Open Chemistry? • Umbrella of related projects to

    coordinate and group ◦ Focus on 3-clause BSD permissively licensed projects ◦ Aims for more complete solution • Initially three related projects ◦ Avogadro 2 - editor, visualization, interaction with small number of molecules ◦ MoleQueue - running computational jobs, abstracting local and remote execution ◦ MongoChem - database for interacting with many molecules, summarizing data, informatics • Evolved over the years but still retains many of those goals ◦ GitHub organization with 35 repositories at the last count • Umbrella organization in Google Summer of Code ◦ Four years, with 3, 7, 7, and TBD students over a broad range of projects ◦ Hope to continue this and other community engagement activities https://openchemistry.org/
  3. Why Jupyter? • Supports interactive analysis while preserving the analytic

    steps ◦ Preserves much of the provenance • Familiar environment and language ◦ Many are already familiar with the environment ◦ Python is the language of scientific computing • Simple extension mechanism ◦ Particularly with JupyterLab ◦ Allows for complex domain specific visualization • Vibrant ecosystem and community
  4. Open Chemistry, Avogadro, Jupyter and Web • Making data more

    accessible • Federated, open data repositories • Modern HTML5 interfaces • JSON data format for NWChem data as a prototype, add to other QM codes • What about working with the data? • Can we have chemistry from desktop-to-phone ◦ Create data, upload, organize ◦ Search and analyze data ◦ Share data - email, social media, publications • What if we tied a data server to a Jupyter notebook? • Can we make data a first class citizen in modern workflows?
  5. Increased Reusability • Benefit from a huge number of open

    source packages/projects • Quantum chemistry codes ◦ NWChem, Psi4, ... • Open source libraries/utilities ◦ Avogadro, Open Babel, cclib, RDKit, ... • Visualization, charting, etc ◦ vtk.js, 3Dmol.js, D3, plotly, matplotlib, ... • Web frameworks ◦ React, stencil.js, npm, ... • Languages ◦ C++, Python, JavaScript, TypeScript, ... • Containers ◦ Docker, Singularity, Shifter, ... Also version control such as git, continuous integration such as CircleCI, build systems such as CMake, project hosting such as GitHub, hardware accelerated rendering such as WebGL, queuing systems like Grid Engine, semantic data stores like Jena, format standards such as JSON, MessagePack, HDF5, XML, HTTP, RESTful web service standards, servers such as nginx, CherryPy, Flask, and many other components that are used directly or gave useful input
  6. Increased Reusability • Developed on GitHub under permissive OSI-approved licenses

    ◦ Industry standard 3-clause BSD and Apache 2 mainly • Web widgets using stencil.js to offer web tags • Binary wheels for Python wrapped Avogadro core ◦ pip install avogadro • Pip installable Python modules for standard functions ◦ pip install openchemistry • JupyterLab extensions that can be installed locally • Binder for “live” notebooks hosted in cloud containers • Quantum codes and machine learning models in Docker containers • Establishing data standards for reliable data exchange
  7. Approach and Philosophy • Data is the core of the

    platform ◦ Start with a simple but powerful data model and data server • RESTful APIs are ubiquitous ◦ Use from notebooks, apps, command line, desktop, etc • Jupyter notebooks for interactive analysis ◦ High level domain specific Python API within the notebooks • Web application ◦ Authentication, access control, management tasks ◦ Launching, searching, managing notebooks ◦ Interact with data outside of the notebook
  8. Reproducibility for Chemical-Physics Data • Dream - share results like

    we can currently share code • Links to interactive pages displaying data • Those pages link to workflows/Jupyter notebooks • From input geometry/molecule through to final figure • Docker containers offer known, reproducible binary ◦ Metadata has input parameters, container ID, etc • Aid reproducibility, machine learning, and education • Federate access, offer full worked examples - editable!
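The idea that metadata records the input parameters and container ID can be sketched as a small provenance record. This is only an illustration: the class name, the field names, and the fingerprinting scheme are assumptions made for this sketch, not the platform's actual data model.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class CalculationRecord:
    """Hypothetical provenance record; all field names are illustrative."""

    code: str                      # e.g. "psi4" or "nwchem"
    container_image: str           # image name and tag used for the run
    container_id: str              # identifier of the exact image that ran
    input_parameters: dict = field(default_factory=dict)
    geometry_sha256: str = ""      # hash of the input geometry text

    def fingerprint(self) -> str:
        """Hash a canonical JSON form so identical runs compare equal."""
        canonical = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def hash_geometry(xyz_text: str) -> str:
    """Return the SHA-256 hex digest of an input geometry given as text."""
    return hashlib.sha256(xyz_text.encode("utf-8")).hexdigest()
```

Two records built from the same geometry, parameters, and container produce the same fingerprint, which is one simple way a server could detect that a result is already cached.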
  9. Docker Containers for Chemical-Physics • Developed three containers so far

    to serve the platform ◦ NWChem and Psi4 for computational chemistry ◦ ChemML for machine learning • These containers are self-contained workflow tools ◦ Take JSON and input geometry ◦ Use a Python-based execution script ◦ Output JSON and optionally all output logs/data • Run using Docker, Singularity, soon Shifter on AWS, locally, NERSC • Simple contract making it easy to add more codes to the platform ◦ Take some standard input, translate for your code, translate to standard output ◦ Get workflow management, integration with Jupyter, visualization, ... • The Dockerfile has build instructions, DockerHub hosts images
  10. Running a Psi4 Docker Container • Can be run independently

    of the framework • docker run -v $(pwd):/data openchemistry/psi4:latest ◦ -g /data/geometry.xyz ◦ -p /data/parameters.json ◦ -o /data/out.cjson ◦ -s /data/scratch • Runs a Python driver script that interprets switches • Performs input/output translation, input generation, etc • Packages a code for use in a larger workflow
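The command on this slide can also be assembled from Python, for example from a notebook or a workflow script. The sketch below only builds the argument list shown on the slide; the function name and its default file names are assumptions for illustration, and nothing is executed.

```python
import shlex


def build_docker_command(image, workdir, geometry="geometry.xyz",
                         parameters="parameters.json", output="out.cjson",
                         scratch="scratch"):
    """Build the `docker run` argument list shown on the slide.

    `workdir` is the host directory mounted at /data inside the container.
    """
    mount = "/data"
    return [
        "docker", "run",
        "-v", f"{workdir}:{mount}",          # bind-mount the job directory
        image,
        "-g", f"{mount}/{geometry}",         # input geometry
        "-p", f"{mount}/{parameters}",       # calculation parameters (JSON)
        "-o", f"{mount}/{output}",           # Chemical JSON output
        "-s", f"{mount}/{scratch}",          # scratch directory
    ]


command = build_docker_command("openchemistry/psi4:latest", "/tmp/job")
print(shlex.join(command))
```

Passing the list to `subprocess.run(command, check=True)` would launch the container, assuming Docker is installed and the image is available. The same helper covers the NWChem image by changing the `image` argument.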
  11. Running an NWChem Docker Container • Can be run independently

    of the framework • docker run -v $(pwd):/data openchemistry/nwchem:latest ◦ -g /data/geometry.xyz ◦ -p /data/parameters.json ◦ -o /data/out.cjson ◦ -s /data/scratch • Runs a Python driver script that interprets switches • Performs input/output translation, input generation, etc • Packages a code for use in a larger workflow
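A driver script that interprets these switches might look roughly like the following. This is a minimal sketch and not the project's actual driver: the short switches `-g`, `-p`, `-o`, and `-s` come from the slide, while the long option names, the function names, and the `run_code` callback are assumptions.

```python
import argparse
import json


def parse_switches(argv):
    """Parse the -g/-p/-o/-s switches shown on the slide."""
    parser = argparse.ArgumentParser(description="Sketch of a container driver")
    parser.add_argument("-g", "--geometry", required=True, help="input geometry file")
    parser.add_argument("-p", "--parameters", required=True, help="JSON parameters file")
    parser.add_argument("-o", "--output", required=True, help="output JSON file")
    parser.add_argument("-s", "--scratch", default=None, help="scratch directory")
    return parser.parse_args(argv)


def run_driver(argv, run_code):
    """Read standard inputs, call the wrapped code, write standard output.

    `run_code(geometry_text, parameters, scratch)` is a hypothetical callback
    that generates code-specific input, runs the quantum code, and returns a
    JSON-serializable dict.
    """
    args = parse_switches(argv)
    with open(args.parameters) as handle:
        parameters = json.load(handle)
    with open(args.geometry) as handle:
        geometry_text = handle.read()
    result = run_code(geometry_text, parameters, args.scratch)
    with open(args.output, "w") as handle:
        json.dump(result, handle)
    return result
```

Keeping the code-specific work behind a single callback mirrors the simple contract described earlier: standard input in, translation for the wrapped code, standard output back.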
  12. Export to Binder • Goes beyond simply showing the static

    notebook • Specific GitHub repository layout ◦ Install custom Python modules ◦ Install JupyterLab extensions • Service builds a container on the fly • Can click on a link and run the example container http://mybinder.org/v2/gh/openchemistry/jupyter-examples/master?urlpath=lab/tree/caffeine.ipynb
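A link of this form can be generated for any notebook in a Binder-ready GitHub repository. The helper below is a small sketch; the function name is an assumption, and it uses `https` rather than the `http` scheme shown on the slide.

```python
from urllib.parse import quote


def binder_lab_url(org, repo, ref, notebook, base="https://mybinder.org"):
    """Build a Binder URL that opens `notebook` in JupyterLab."""
    # Keep "/" unescaped so the result matches the form used on the slide.
    urlpath = quote(f"lab/tree/{notebook}", safe="/")
    return f"{base}/v2/gh/{org}/{repo}/{ref}?urlpath={urlpath}"


url = binder_lab_url("openchemistry", "jupyter-examples", "master", "caffeine.ipynb")
print(url)
```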
  13. Machine Learning • What happens after your model is trained

    and published? • Can we treat machine learning models like other codes making predictions? • Lots of new moving parts that need to be managed ◦ The actual machine learning code, possible accelerator access, etc ◦ The trained model, loading it, executing it reproducibly ◦ Generation of relevant descriptors as part of the input ◦ Extracting output, storing, displaying, and visualizing data • Starts to share a number of commonalities with other simulations • Important differences too ◦ Narrower focus for most models ◦ Possibility to augment trained models, create derived models
  14. Data Mining • When running calculations all data, metadata, workflows

    are captured • Creation of a structured data store with a friendly frontend • Possible to perform queries and perform analytics on the data generated • Machine learning can feed off of this data ◦ Reuse the same infrastructure to initiate and generate new data ◦ Comparison of predicted data to computational codes, experimental data ◦ Use of a familiar JupyterLab interface • Augmenting the notebook with a data server that can access compute ◦ Notebook acts as initiator for large jobs ◦ Returning to the notebook later to check on progress • Independent RESTful APIs, web frontend, batch export of data
  15. Chemical JSON • Developed to support projects (~2011) • Stores

    structure, geometry, identifiers, descriptors, other useful data • Benefits: ◦ More compact than XML/CML ◦ Native to MongoDB, JSON-RPC, REST ◦ Easily converted to binary representation • Now features basis sets, MOs, sets • MessagePack a good option for binary • Maps easily to HDF5 binary data store • MolSSI JSON schema collaboration
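To make the format concrete, here is a small water molecule written as a Chemical JSON style document and read back with the standard library. The key names (`chemicalJson`, `atoms.elements.number`, `atoms.coords.3d`, `bonds.connections.index`, `bonds.order`) reflect the layout as I understand it from Avogadro's usage and should be treated as an assumption; the coordinates are approximate and only illustrative, and the helper functions are hypothetical.

```python
import json

# Illustrative document; key names are an assumption about the CJSON layout.
water = {
    "chemicalJson": 1,
    "name": "water",
    "atoms": {
        "elements": {"number": [8, 1, 1]},
        # Flat x, y, z list, three values per atom (approximate, in angstrom).
        "coords": {"3d": [0.0, 0.0, 0.1173,
                          0.0, 0.7572, -0.4692,
                          0.0, -0.7572, -0.4692]},
    },
    "bonds": {
        # Flat list of atom index pairs: (0, 1) and (0, 2).
        "connections": {"index": [0, 1, 0, 2]},
        "order": [1, 1],
    },
}


def atom_count(cjson):
    """Number of atoms, taken from the atomic number array."""
    return len(cjson["atoms"]["elements"]["number"])


def iter_atoms(cjson):
    """Yield (atomic_number, (x, y, z)) for each atom."""
    numbers = cjson["atoms"]["elements"]["number"]
    coords = cjson["atoms"]["coords"]["3d"]
    for i, number in enumerate(numbers):
        yield number, tuple(coords[3 * i:3 * i + 3])


# Compact serialization, as would be sent over a RESTful API.
compact = json.dumps(water, separators=(",", ":"))
```

Storing coordinates and bond indices as flat numeric arrays is what makes the mapping to binary containers such as MessagePack or HDF5 straightforward.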
  16. Papers and a Little History on Chemical JSON • Quixote

    collaboration with Peter Murray-Rust (2011) ◦ “The Quixote project: Collaborative and Open Quantum Chemistry data management in the Internet age”, https://doi.org/10.1186/1758-2946-3-38 • Early work in CML with NWChem and Avogadro (2013) ◦ “From data to analysis: linking NWChem and Avogadro with the syntax and semantics of Chemical Markup Language” https://doi.org/10.1186/1758-2946-5-25 • Later moved to JSON, RESTful API, visualization (2017) ◦ “Open chemistry: RESTful web APIs, JSON, NWChem and the modern web application” ◦ https://doi.org/10.1186/s13321-017-0241-z • Interested in Linked Data, JSON-LD, and how they might be layered on top • Use of BSON, HDF5, and related technologies for binary data • BSD licensed reference implementations
  17. Pillars of Phase II SBIR Project 1. Data and metadata

    ◦ JSON, JSON-LD, HDF5 and semantic web 2. Server platform ◦ RESTful APIs, computational chemistry, data, machine learning, HPC/cloud, and triple store 3. Jupyter integration ◦ Computational chemistry, data, machine learning, query, analytics, and data visualization 4. Web application ◦ Management interfaces, single-page interface, notebook/data browser, and search 5. Avogadro and local Python ◦ Python shell integration, extension of Avogadro to use server interface, editing data on server Regular automated software deployments, releases with Docker containers
  18. Closing Thoughts • Nearly halfway through the Phase II project

    • Data and software are both central and core to the platform • Highly reusable through licensing, modular nature, data standards, containers • Augmented by abstracted access to compute resources • Open source, developing entry points for customization and extension • Building on best-of-breed open source community projects • Extending to better support the chemistry community ◦ Just at the start of making machine learning and data mining first class citizens • User friendly interfaces, Python at the core, visualization, data analytics • SBIR funding from DOE Office of Science contract DE-SC0017193 ◦ Collaborating with Bert de Jong at Berkeley Lab and Johannes Hachmann at SUNY Buffalo