Genomic data analysis workflow as a chain of docker containers on high-performance computing system

Tazro Ohta (Database Center for Life Science, Research Organization of Information and Systems), Osamu Ogasawara
ISM High Performance Computing Conference, October 09, 2015

Agenda
1. Introduction of DBCLS/DDBJ
   Institutes for bioinformatics and HPC
2. Genomics as an application of HPC
   “Next-generation” of genomics and computing platform
3. Docker container on HPC for biology
   Machine readable format and sharing platform

Introduction of DBCLS/DDBJ
Institutes for bioinformatics and HPC

Introduction of DBCLS
Database Center for Life Science, Research Organization of Information and Systems
• Providing biological databases and related services
• Availability, Accessibility, and Accountability of DBs
• Linked Data and Semantic Web
• Data analysis platform
• Text mining
• Funding from NBDC
• Collaboration with DDBJ


Introduction of DDBJ
DNA Data Bank of Japan, National Institute of Genetics, Research Organization of Information and Systems
• Public data repository under National Institute of Genetics
• Collaborating with NCBI, EBI
• Supercomputer system *for free* to academic users

Genomics as an application of HPC
“Next-generation” of genomics and computing platform

Genomics as an application of HPC
1. DNA sequencing and big data for biology
   Current status of “next generation”
2. General research workflow
   How biologists use computers
3. Problems and challenges
   What biologists expect from a “high-performance” computing platform

DNA sequencing and big data for biology
1. Rise of high-throughput sequencing
   From the Human Genome Project to next-generation sequencing
2. Public repository for data sharing
   Reference genome, 1000 Genomes, Sequence Read Archive

https://flxlexblog.wordpress.com/2014/06/11/developments-in-next-generation-sequencing-june-2014-edition/

http://www.ncbi.nlm.nih.gov/Traces/sra/

General research workflow
1. Data processing
   De novo assembly and reference alignment: we need HPC
2. Statistical analysis and visualisation
   R, SciPy, and other scripting languages: we do this on our laptops

https://speakerdeck.com/michaelbarton/ranking-genome-assemblers-with-docker-containers-dockercon-eu-2014

http://www.historyofnimr.org.uk/mill-hill-essays/essays-yearly-volumes/2010-2/bringing-it-all-back-home-next-generation-sequencing-technology-and-you/

http://circos.ca/intro/genomic_data/
http://pages.genemania.org/plugin/

Problems and challenges
1. Storage, RAM and network
   Volume, Disk I/O, data cache, data transfer
2. Library and Software
   Workflow, versioning, repeatability/reproducibility

length(tools) > 680 http://seqanswers.com/wiki/Software

http://www.slideshare.net/sjcockell/reproducibility-the-myths-and-truths-of-pipeline-bioinformatics

Docker container on HPC for biology
Machine readable format and sharing platform

Docker container on HPC for biology
1. Technologies
   Base technologies for proposed system
2. Implementation
   Format and prototype engine
3. Challenges
   Toward the future of database and HPC

Technologies
1. Docker
   Container based virtualization
2. Apache Mesos
   Resource management system

https://www.docker.com/whatisdocker/

https://mesosphere.com/learn/

Implementation
1. Workflow execution engine
   Container workflow engine
2. Sharable workflow description
   Data exchange format
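
To make the idea of a sharable workflow description concrete, here is a minimal sketch in Python. The slides do not show the actual workflow.json schema, so every field name below (`steps`, `id`, `image`, `inputs`, `outputs`) and every image name is a hypothetical placeholder. The sketch only shows how an engine could derive the execution order of a chain of containers from the files each step consumes and produces.

```python
import json
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow description; the real workflow.json schema is not
# given in the slides. Steps are deliberately listed out of order.
WORKFLOW_JSON = """
{
  "name": "example-variant-calling",
  "steps": [
    {"id": "call",  "image": "example/caller:1.0",
     "inputs": ["aligned.bam", "reference.fa"], "outputs": ["variants.vcf"]},
    {"id": "qc",    "image": "example/qc:1.0",
     "inputs": ["reads.fastq"], "outputs": ["reads.qc.fastq"]},
    {"id": "align", "image": "example/aligner:1.0",
     "inputs": ["reads.qc.fastq", "reference.fa"], "outputs": ["aligned.bam"]}
  ]
}
"""


def execution_order(workflow):
    """Order step ids so each step runs after the steps that produce its inputs."""
    # Map every output file to the step that produces it.
    producer = {}
    for step in workflow["steps"]:
        for output in step["outputs"]:
            producer[output] = step["id"]

    # Build the dependency graph: step id -> set of step ids it depends on.
    # Inputs without a producer are treated as primary data already in storage.
    graph = {}
    for step in workflow["steps"]:
        graph[step["id"]] = {producer[i] for i in step["inputs"] if i in producer}

    return list(TopologicalSorter(graph).static_order())


workflow = json.loads(WORKFLOW_JSON)
order = execution_order(workflow)
print(order)
```

Because the order is derived from the description itself, the same file could be handed to a different engine or a different site without editing any scripts, which is the point of treating it as a data exchange format.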

[Figure: system architecture. Components: User, Dockerfiles, workflow.json, Data, Public/Private Docker Registry, Storage, Apache Mesos + Chronos manager, Node (×5). Arrow labels: post, post/get, transfer, push, pull, run, mount.]
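
As an illustration of how a workflow could be posted to the Chronos manager as a chain of containers, the sketch below turns an ordered list of steps into Chronos-style job definitions. The step list, image names, owner address and `/data` mount are hypothetical. The job fields (`name`, `command`, `owner`, `schedule`, `parents`, `container`) follow the Chronos job JSON as I understand it and should be checked against the Chronos version in use. Nothing is sent over the network here.

```python
import json

# Hypothetical steps of a workflow, already in execution order.
STEPS = [
    {"id": "qc",    "image": "example/qc:1.0",      "command": "qc /data/reads.fastq"},
    {"id": "align", "image": "example/aligner:1.0", "command": "align /data/reads.qc.fastq"},
    {"id": "call",  "image": "example/caller:1.0",  "command": "call /data/aligned.bam"},
]


def chronos_jobs(steps, owner="user@example.org", data_dir="/data"):
    """Build one job definition per step; each job depends on the previous one."""
    jobs = []
    previous = None
    for step in steps:
        job = {
            "name": step["id"],
            "command": step["command"],
            "owner": owner,
            # Run the step inside a Docker container with shared storage mounted.
            "container": {
                "type": "DOCKER",
                "image": step["image"],
                "volumes": [
                    {"hostPath": data_dir, "containerPath": data_dir, "mode": "RW"}
                ],
            },
        }
        if previous is None:
            # First job: ISO 8601 repeating interval, run once, start immediately
            # (assumed semantics of an empty start time).
            job["schedule"] = "R1//PT1M"
        else:
            # Later jobs: triggered when the parent job finishes successfully.
            job["parents"] = [previous]
        jobs.append(job)
        previous = step["id"]
    return jobs


jobs = chronos_jobs(STEPS)
print(json.dumps(jobs, indent=2))
```

Each dictionary would be serialized to JSON and posted to the Chronos REST API by the workflow engine. Mesos then decides which node pulls the image and runs the container.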

PAST: workflow.sh
• Post to GridEngine
• Run binary software
• Pre-install/build required
FUTURE: workflow.json
• Post to workflow manager
• Run docker container
• Improved portability
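
The portability gain can be seen in what a single step needs from the host. In the sketch below, the image name, tool command and `/data` directory are hypothetical. The only assumption about the compute node is that Docker is installed, because the software and its version travel with the image tag rather than being pre-installed or built on the cluster.

```python
import shlex


def docker_run_command(image, command, data_dir="/data"):
    """Return a shell command that runs one workflow step in a container."""
    args = [
        "docker", "run",
        "--rm",                          # remove the container when the step ends
        "-v", f"{data_dir}:{data_dir}",  # mount shared storage at the same path
        image,                           # pinned image tag fixes the tool version
    ] + shlex.split(command)
    return shlex.join(args)              # shlex.join requires Python 3.8+


cmd = docker_run_command(
    "example/aligner:1.0",
    "align --ref /data/reference.fa /data/reads.fastq",
)
print(cmd)
```

With a job script posted to GridEngine, the same step would call a binary that must already exist on every node, so moving the workflow to another system means repeating the installation there.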

Challenges
1. World standard format for data exchange
   Common Workflow Language project
2. HPC as a data sharing platform
   Data, Processing, and Publication

https://github.com/common-workflow-language/common-workflow-language

HPC as a data sharing platform
Packaging Whole Research Activities
[Figure: Research Activity Time Course. Details of: Project Design; Sampling; Wet Experiments; Primary Data; Data Processing & Analysis; Publication (Text, Figs).]

HPC as a data sharing platform
Packaging Whole Research Activities
[Figure: Research Activity Time Course. Details of: Project Design; Sampling; Wet Experiments; Primary Data; Data Processing & Analysis; Publication (Text, Figs). Database labels: BioProject, BioSample, GenBank, DRA.]

Plan and Experiments model
Users will:
1. Describe lab specifications for each general process model, using the study plan construction client to generate a process model
2. Allocate resources for each process, using the resource allocation client
3. <execute experiments>
4. Input data to get output

HPC as a data sharing platform
Packaging Whole Research Activities
[Figure: Research Activity Time Course. Details of: Project Design; Sampling; Wet Experiments; Primary Data; Data Processing & Analysis; Publication (Text, Figs). Overall label: supercool framework.]


[Figure: Details of: Project Design; Sampling; Wet Experiments; Primary Data; Data Processing & Analysis; Publication (Text, Figs). Overall label: supercool framework.]
Image from: http://www.aist.go.jp/Portals/0/resource_images/aist_j/aistinfo/aist_today/vol14_11/vol14_11_p06.pdf

Summary
1. Genomics requires HPC and a good interface to it
2. Container virtualization on HPC will solve the software portability and reproducibility problems
3. DB+HPC for Reproducible Research

Acknowledgement
• Organizers of HPCCON 2015
• This work was supported by ROIS URA Grant “融合シーズ探索” 2014.
• The Institute of Statistical Mathematics
  • Dr. Yoshiyasu Tamura
  • Dr. Junji Nakano
  • Dr. Keisuke Honda
• National Institute of Informatics
  • Dr. Kenjiro Yamanaka
  • Dr. Kento Aida
  • Dr. Shigetoshi Yokoyama
  • Dr. Yoshinobu Masatani
• National Institute of Genetics
  • Dr. Osamu Ogasawara
  • Dr. Takeshi Tsurusawa
  • NIG SuperComputer Facilities SE team
• Information and Mathematical Science and Bioinformatics Co., Ltd.
  • Tatsuya Nishizawa
• Tokyo Institute of Technology
  • Dr. Shinichi Miura
  • Dr. Satoshi Matsuoka
• Colleagues and Members of DBCLS, DDBJ, and Open-Bio
• BioHackathon Hackers
• Jean-Luc Perret, Alexander Garcia, and Erick Antezana for wet-lab protocols
• Bruno Vieira, Jeremy Nguyen, and Evan Bolton for their fantastic help in writing my abstract
