Open Chemistry, JupyterLab and data: Reproducible quantum chemistry

Reproducible Quantum Chemistry
Dr. Marcus D. Hanwell (@mhanwell), Technical Leader
American Chemical Society, Orlando, FL, 31 March 2019

What Is Open Chemistry?
• Umbrella of related projects to coordinate and group
  ◦ Focus on 3-clause BSD permissively licensed projects
  ◦ Aims for a more complete solution
• Initially three related projects
  ◦ Avogadro 2 - editor, visualization, interaction with a small number of molecules
  ◦ MoleQueue - running computational jobs, abstracting local and remote execution
  ◦ MongoChem - database for interacting with many molecules, summarizing data, informatics
• Evolved over the years but still retains many of those goals
  ◦ GitHub organization with 35 repositories at the last count
• Umbrella organization in Google Summer of Code
  ◦ Four years, with 3, 7, 7, and TBD students over a broad range of projects
  ◦ Hope to continue this and other community engagement activities
https://openchemistry.org/

Why Jupyter?
• Supports interactive analysis while preserving the analytic steps
  ◦ Preserves much of the provenance
• Familiar environment and language
  ◦ Many are already familiar with the environment
  ◦ Python is the language of scientific computing
• Simple extension mechanism
  ◦ Particularly with JupyterLab
  ◦ Allows for complex domain specific visualization
• Vibrant ecosystem and community

Open Chemistry, Avogadro, Jupyter and Web
• Making data more accessible
• Federated, open data repositories
• Modern HTML5 interfaces
• JSON data format for NWChem data as a prototype, add to other QM codes
• What about working with the data?
• Can we have chemistry from desktop to phone?
  ◦ Create data, upload, organize
  ◦ Search and analyze data
  ◦ Share data - email, social media, publications
• What if we tied a data server to a Jupyter notebook?
• Can we make data a first class citizen in modern workflows?

Increased Reusability
• Benefit from a huge number of open source packages/projects
• Quantum chemistry codes
  ◦ NWChem, Psi4, ...
• Open source libraries/utilities
  ◦ Avogadro, Open Babel, cclib, RDKit, ...
• Visualization, charting, etc
  ◦ vtk.js, 3DMol.js, D3, plotly, matplotlib, ...
• Web frameworks
  ◦ React, stencil.js, npm, ...
• Languages
  ◦ C++, Python, JavaScript, TypeScript, ...
• Containers
  ◦ Docker, singularity, shifter, ...
Also version control such as git, continuous integration such as CircleCI, build systems such as CMake, project hosting such as GitHub, hardware accelerated rendering such as WebGL, queuing systems like grid engine, semantic data stores like Jena, format standards such as JSON, MessagePack, HDF5, XML, HTTP, RESTful web service standards, servers such as nginx, CherryPy, Flask, and many other components that are used directly or gave useful input

Increased Reusability
• Developed on GitHub under permissive OSI-approved licenses
  ◦ Industry standard 3-clause BSD and Apache 2 mainly
• Web widgets using stencil.js to offer web tags
• Binary wheels for Python wrapped Avogadro core
  ◦ pip install avogadro
• Pip installable Python modules for standard functions
  ◦ pip install openchemistry
• JupyterLab extensions that can be installed locally
• Binder for “live” notebooks hosted in cloud containers
• Quantum codes and machine learning models in Docker containers
• Establishing data standards for reliable data exchange

Approach and Philosophy
• Data is the core of the platform
  ◦ Start with a simple but powerful data model and data server
• RESTful APIs are ubiquitous
  ◦ Use from notebooks, apps, command line, desktop, etc
• Jupyter notebooks for interactive analysis
  ◦ High level domain specific Python API within the notebooks
• Web application
  ◦ Authentication, access control, management tasks
  ◦ Launching, searching, managing notebooks
  ◦ Interact with data outside of the notebook

Reusable Web Visualization Widgets

Data, Python, Jupyter, Chemistry

Responsive Design

Getting the Platform

Containers and the Swarm

Reproducibility for Chemical-Physics Data
• Dream - share results like we can currently share code
• Links to interactive pages displaying data
• Those pages link to workflows/Jupyter notebooks
• From input geometry/molecule through to final figure
• Docker containers offer known, reproducible binary
  ◦ Metadata has input parameters, container ID, etc
• Aid reproducibility, machine learning, and education
• Federate access, offer full worked examples - editable!
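As a rough illustration of the metadata bullet above, the following sketch bundles the pieces a reader would need to re-run a calculation. The record layout, the key names, and the example parameter values are all assumptions made for illustration; the slides do not specify the platform's actual metadata schema.

```python
import hashlib
import json


def provenance_record(image, container_id, parameters, geometry_text):
    """Bundle what is needed to re-run a calculation (hypothetical layout)."""
    return {
        "image": image,               # container image name and tag
        "containerId": container_id,  # identifies the exact binary that ran
        "parameters": parameters,     # input parameters, stored verbatim
        # A hash of the input geometry lets a reader confirm they start
        # from the same structure without storing the file twice.
        "geometrySha256": hashlib.sha256(
            geometry_text.encode("utf-8")
        ).hexdigest(),
    }


# Illustrative water geometry in XYZ format (approximate coordinates).
geometry = """3
water
O 0.0000 0.0000 0.1173
H 0.0000 0.7572 -0.4692
H 0.0000 -0.7572 -0.4692
"""

record = provenance_record(
    image="openchemistry/psi4:latest",
    container_id="<container-id>",                   # placeholder, not a real ID
    parameters={"theory": "hf", "basis": "sto-3g"},  # hypothetical keys
    geometry_text=geometry,
)
print(json.dumps(record, indent=2, sort_keys=True))
```

Storing a record like this next to each result is one way a link to a data page could lead back to the inputs and binary that produced it.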

Docker Containers for Chemical-Physics
• Developed three containers so far to serve the platform
  ◦ NWChem and Psi4 for computational chemistry
  ◦ ChemML for machine learning
• These containers are self-contained workflow tools
  ◦ Take JSON and input geometry
  ◦ Use a Python-based execution script
  ◦ Output JSON and optionally all output logs/data
• Run using Docker, Singularity, soon Shifter on AWS, locally, NERSC
• Simple contract making it easy to add more codes to the platform
  ◦ Take some standard input, translate for your code, translate to standard output
  ◦ Get workflow management, integration with Jupyter, visualization, ...
• The Dockerfile has build instructions, DockerHub hosts images
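The contract above (standard input in, translate, run, standard output out) can be sketched as a small driver script. This is a minimal sketch, not the platform's actual driver: the -g, -p, -o and -s switches come from the container invocations in this deck, while the function names, the translation logic, and the output layout are placeholders invented for illustration.

```python
import argparse
import json


def build_parser():
    """Switches mirror those shown for the containers in this deck."""
    parser = argparse.ArgumentParser(description="Hypothetical container driver")
    parser.add_argument("-g", dest="geometry", required=True, help="input geometry file")
    parser.add_argument("-p", dest="parameters", required=True, help="JSON parameters file")
    parser.add_argument("-o", dest="output", required=True, help="output JSON file")
    parser.add_argument("-s", dest="scratch", default=None, help="scratch directory")
    return parser


def to_native_input(geometry_text, parameters):
    """Placeholder: translate standard input into a code-specific input deck."""
    header = "# parameters: " + json.dumps(parameters, sort_keys=True)
    return header + "\n" + geometry_text


def to_standard_output(native_output):
    """Placeholder: translate code-specific output into a JSON-ready dict."""
    return {"rawOutputLength": len(native_output)}


def run(argv, execute):
    """Parse switches, translate input, call the code, write standard output.

    `execute` is any callable that takes the native input text and returns
    the native output text; a real driver would launch the chemistry code.
    """
    args = build_parser().parse_args(argv)
    with open(args.geometry) as handle:
        geometry = handle.read()
    with open(args.parameters) as handle:
        parameters = json.load(handle)
    native_output = execute(to_native_input(geometry, parameters))
    with open(args.output, "w") as handle:
        json.dump(to_standard_output(native_output), handle)


# Parsing only; no files are touched here.
args = build_parser().parse_args(
    ["-g", "/data/geometry.xyz", "-p", "/data/parameters.json", "-o", "/data/out.cjson"]
)
print(args.geometry, args.parameters, args.output, args.scratch)
```

Adding a new code under this sketch would mean supplying only the two translation functions and the `execute` callable, which is the sense in which the contract is simple.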

Psi4 Dockerfile

Running a Psi4 Docker Container
• Can be run independently of the framework
• docker run -v $(pwd):/data openchemistry/psi4:latest
  ◦ -g /data/geometry.xyz
  ◦ -p /data/parameters.json
  ◦ -o /data/out.cjson
  ◦ -s /data/scratch
• Runs a Python driver script that interprets switches
• Performs input/output translation, input generation, etc
• Packages a code for use in a larger workflow

Running a NWChem Docker Container
• Can be run independently of the framework
• docker run -v $(pwd):/data openchemistry/nwchem:latest
  ◦ -g /data/geometry.xyz
  ◦ -p /data/parameters.json
  ◦ -o /data/out.cjson
  ◦ -s /data/scratch
• Runs a Python driver script that interprets switches
• Performs input/output translation, input generation, etc
• Packages a code for use in a larger workflow
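The Psi4 and NWChem invocations differ only in the image name, so a workflow tool could assemble either one from the same template. The sketch below builds the argument list in Python; the image names, the /data mount, and the four switches are taken from the slides, while the helper name `docker_run_command` and the example host directory are assumptions.

```python
import os
import shlex


def docker_run_command(image, host_dir=None, mount="/data"):
    """Return the argv list for the container invocation shown on the slides."""
    # The slides use $(pwd); here the host directory is resolved in Python.
    host_dir = os.path.abspath(host_dir or os.getcwd())
    return [
        "docker", "run",
        "-v", f"{host_dir}:{mount}",           # bind-mount the working directory
        image,
        "-g", f"{mount}/geometry.xyz",         # input geometry
        "-p", f"{mount}/parameters.json",      # calculation parameters
        "-o", f"{mount}/out.cjson",            # Chemical JSON output
        "-s", f"{mount}/scratch",              # scratch directory
    ]


for image in ("openchemistry/psi4:latest", "openchemistry/nwchem:latest"):
    print(shlex.join(docker_run_command(image, host_dir="/tmp/job")))
```

On a machine with Docker installed, the returned list could be passed to `subprocess.run(cmd, check=True)`; the sketch only prints the commands so that it runs anywhere.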

Export to Binder
• Goes beyond simply showing the static notebook
• Specific GitHub repository layout
  ◦ Install custom Python modules
  ◦ Install JupyterLab extensions
• Service builds a container on the fly
• Can click on a link and run the example container
http://mybinder.org/v2/gh/openchemistry/jupyter-examples/master?urlpath=lab/tree/caffeine.ipynb
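The link above follows a regular pattern: GitHub organization, repository, branch, and a `urlpath` that opens a notebook in JupyterLab. A minimal sketch of composing such links is shown below; the helper name `binder_lab_url` is made up for illustration, and the sketch uses https where the slide shows http.

```python
from urllib.parse import quote


def binder_lab_url(org, repo, ref, notebook, base="https://mybinder.org"):
    """Build a Binder link that opens `notebook` in JupyterLab."""
    # Keep "/" unescaped so the path matches the form shown on the slide.
    urlpath = quote(f"lab/tree/{notebook}", safe="/")
    return f"{base}/v2/gh/{org}/{repo}/{ref}?urlpath={urlpath}"


print(binder_lab_url("openchemistry", "jupyter-examples", "master", "caffeine.ipynb"))
```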

Export to Binder

Machine Learning
• What happens after your model is trained and published?
• Can we treat machine learning models like other codes making predictions?
• Lots of new moving parts that need to be managed
  ◦ The actual machine learning code, possible accelerator access, etc
  ◦ The trained model, loading it, executing it reproducibly
  ◦ Generation of relevant descriptors as part of the input
  ◦ Extracting output, storing, displaying, and visualizing data
• Starts to share a number of commonalities with other simulations
• Important differences too
  ◦ Narrower focus for most models
  ◦ Possibility to augment trained models, create derived models

Running ChemML in a Jupyter Notebook

Data Mining
• When running calculations all data, metadata, workflows are captured
• Creation of a structured data store with a friendly frontend
• Possible to perform queries and perform analytics on the data generated
• Machine learning can feed off of this data
  ◦ Reuse the same infrastructure to initiate and generate new data
  ◦ Comparison of predicted data to computational codes, experimental data
  ◦ Use of a familiar JupyterLab interface
• Augmenting the notebook with a data server that can access compute
  ◦ Notebook acts as initiator for large jobs
  ◦ Returning to the notebook later to check on progress
• Independent RESTful APIs, web frontend, batch export of data

Chemical JSON
• Developed to support projects (~2011)
• Stores structure, geometry, identifiers, descriptors, other useful data
• Benefits:
  ◦ More compact than XML/CML
  ◦ Native to MongoDB, JSON-RPC, REST
  ◦ Easily converted to binary representation
• Now features basis sets, MOs, sets
• MessagePack a good option for binary
• Maps easily to HDF5 binary data store
• MolSSI JSON schema collaboration
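To make the format concrete, here is a small water molecule written in the style of Chemical JSON and round-tripped through Python's standard `json` module. The key names follow the layout used by Avogadro 2 CJSON files as best recalled and should be treated as an assumption to check against the current schema; the coordinates are approximate and purely illustrative.

```python
import json

# Key names are assumed from Avogadro-style CJSON; verify against the schema.
water = {
    "chemicalJson": 1,
    "name": "water",
    "atoms": {
        "elements": {"number": [8, 1, 1]},  # atomic numbers: O, H, H
        "coords": {
            # Flat x, y, z array: three values per atom.
            "3d": [0.0, 0.0, 0.1173,
                   0.0, 0.7572, -0.4692,
                   0.0, -0.7572, -0.4692],
        },
    },
    "bonds": {
        "connections": {"index": [0, 1, 0, 2]},  # atom index pairs, flattened
        "order": [1, 1],
    },
}


def atom_positions(cjson):
    """Regroup the flat coordinate array into one (x, y, z) tuple per atom."""
    flat = cjson["atoms"]["coords"]["3d"]
    return [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]


# Compact serialization and a lossless round trip.
text = json.dumps(water, separators=(",", ":"))
assert json.loads(text) == water
print(atom_positions(water))
```

Flat numeric arrays of this kind are what make the mapping to binary encodings such as MessagePack or HDF5 straightforward.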

Papers and a Little History on Chemical JSON
• Quixote collaboration with Peter Murray-Rust (2011)
  ◦ “The Quixote project: Collaborative and Open Quantum Chemistry data management in the Internet age”, https://doi.org/10.1186/1758-2946-3-38
• Early work in CML with NWChem and Avogadro (2013)
  ◦ “From data to analysis: linking NWChem and Avogadro with the syntax and semantics of Chemical Markup Language”, https://doi.org/10.1186/1758-2946-5-25
• Later moved to JSON, RESTful API, visualization (2017)
  ◦ “Open chemistry: RESTful web APIs, JSON, NWChem and the modern web application”, https://doi.org/10.1186/s13321-017-0241-z
• Interested in Linked Data, JSON-LD, and how they might be layered on top
• Use of BSON, HDF5, and related technologies for binary data
• BSD licensed reference implementations

Pillars of Phase II SBIR Project
1. Data and metadata
  ◦ JSON, JSON-LD, HDF5 and semantic web
2. Server platform
  ◦ RESTful APIs, computational chemistry, data, machine learning, HPC/cloud, and triple store
3. Jupyter integration
  ◦ Computational chemistry, data, machine learning, query, analytics, and data visualization
4. Web application
  ◦ Management interfaces, single-page interface, notebook/data browser, and search
5. Avogadro and local Python
  ◦ Python shell integration, extension of Avogadro to use server interface, editing data on server
Regular automated software deployments, releases with Docker containers

Closing Thoughts
• Nearly halfway through the Phase II project
• Data and software are both central and core to the platform
• Highly reusable through licensing, modular nature, data standards, containers
• Augmented by abstracted access to compute resources
• Open source, developing entry points for customization and extension
• Building on best-of-breed open source community projects
• Extending to better support the chemistry community
  ◦ Just at the start of making machine learning and data mining first class citizens
• User friendly interfaces, Python at the core, visualization, data analytics
• SBIR funding from DOE Office of Science contract DE-SC0017193
  ◦ Collaborating with Bert de Jong at Berkeley Lab and Johannes Hachmann at SUNY Buffalo
