Genomic data analysis workflow as a chain of docker containers on high-performance computing system

ISM High Performance Computing Conference, Oct 9-10

Tazro Inutano Ohta

October 09, 2015

Transcript

  1. Research Institute for Information and Systems, Database Center for Life Science

    Tazro Ohta <[email protected]>. Genomic data analysis workflow as a chain of docker containers on high-performance computing system. Tazro Ohta, Osamu Ogasawara. ISM High Performance Computing Conference, October 09, 2015
  2. Agenda

    1. Introduction of DBCLS/DDBJ: Institutes for bioinformatics and HPC
    2. Genomics as an application of HPC: “Next-generation” of genomics and computing platform
    3. Docker container on HPC for biology: Machine readable format and sharing platform
  3. Introduction of DBCLS

    • Providing biological database and related services
    • Availability, Accessibility, and Accountability of DBs
    • Linked Data and Semantic Web
    • Data analysis platform
    • Text mining
    • Funding from NBDC
    • Collaboration with DDBJ

    Database Center for Life Science, Research Organization of Information and Systems
  4. Introduction of DDBJ

    • Public data repository under National Institute of Genetics
    • Collaborating with NCBI, EBI
    • Supercomputer system *for free* to academic users

    DNA Data Bank of Japan, National Institute of Genetics, Research Organization of Information and Systems
  5. Genomics as an application of HPC

    1. DNA sequencing and big data for biology: Current status of “next generation”
    2. General research workflow: How biologists use computers
    3. Problems and challenges: What biologists expect from a “high-performance” computing platform
  6. DNA sequencing and big data for biology

    1. Rise of high-throughput sequencing: From the human genome project to next-generation sequencing
    2. Public repository for data sharing: Reference genome, 1000 genomes, Sequence Read Archive
  7. General research workflow

    1. Data processing: de novo assembly and reference alignment. We need HPC
    2. Statistical analysis and Visualisation: R, scipy, and other scripting languages. We do this on our laptops
  8. Problems and challenges

    1. Storage, RAM and network: Volume, Disk I/O, data cache, data transfer
    2. Library and Software: Workflow, versioning, repeatability/reproducibility
  9. Docker container on HPC for biology

    1. Technologies: Base technologies for proposed system
    2. Implementation: Format and prototype engine
    3. Challenges: Toward the future of database and HPC
  10. Implementation

    1. Workflow execution engine: Container workflow engine
    2. Sharable workflow description: Data exchange format
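The two parts named on this slide, a sharable workflow description and an engine that executes it, can be illustrated with a minimal Python sketch. The JSON layout below (`steps`, `id`, `image`, `command`, `after`) and the image names are hypothetical and invented for illustration. The slides do not show the actual workflow.json format.

```python
import json
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow description: each step names a container image,
# a command, and the steps that must finish before it starts.
WORKFLOW_JSON = """
{
  "name": "example-alignment",
  "steps": [
    {"id": "align", "image": "example/aligner:0.1",
     "command": "align /data/reads.fastq", "after": ["fetch"]},
    {"id": "fetch", "image": "example/fetcher:0.1",
     "command": "fetch /data/reads.fastq", "after": []},
    {"id": "report", "image": "example/report:0.1",
     "command": "report /data/aligned.bam", "after": ["align"]}
  ]
}
"""


def execution_order(workflow):
    """Return step ids in an order that respects every 'after' dependency."""
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    graph = {step["id"]: set(step["after"]) for step in workflow["steps"]}
    return list(TopologicalSorter(graph).static_order())


workflow = json.loads(WORKFLOW_JSON)
order = execution_order(workflow)
print(order)
```

Because the description is plain data, the same file could in principle be exchanged between sites and handed to any engine that understands the format, which is the portability argument the slide makes.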
  11. [Architecture diagram] Apache Mesos + Chronos manager with five Nodes; Public/Private Docker Registry; Dockerfiles; workflow.json; Data Storage; User

    Arrow labels: post, post/get, transfer, push, pull, run, mount
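As a rough illustration of how a chain of containers might be handed to the Chronos manager in this diagram, the sketch below builds one job definition per step. The field names (`schedule`, `parents`, `container`, `volumes`) follow the Chronos job JSON as commonly documented, but they should be checked against the deployed Chronos version. The function name, image names, schedule string, and host path are assumptions made for this example, and nothing is actually posted to a scheduler.

```python
import json


def chronos_jobs(steps, owner, first_schedule="R1/2015-10-09T00:00:00Z/PT24H"):
    """Build Chronos-style job definitions for a linear chain of steps.

    The first job carries a schedule; each later job lists the previous
    job as its parent, so the jobs run one after another.
    """
    jobs = []
    previous = None
    for step in steps:
        job = {
            "name": step["id"],
            "owner": owner,
            "command": step["command"],
            "container": {
                "type": "DOCKER",
                "image": step["image"],
                # Assumed shared storage path, mounted into every container.
                "volumes": [
                    {"containerPath": "/data",
                     "hostPath": "/mnt/shared/data",
                     "mode": "RW"}
                ],
            },
        }
        if previous is None:
            job["schedule"] = first_schedule
        else:
            job["parents"] = [previous]
        jobs.append(job)
        previous = step["id"]
    return jobs


steps = [
    {"id": "fetch", "image": "example/fetcher:0.1", "command": "fetch /data/reads.fastq"},
    {"id": "align", "image": "example/aligner:0.1", "command": "align /data/reads.fastq"},
]
jobs = chronos_jobs(steps, owner="user@example.org")
print(json.dumps(jobs, indent=2))
```

Mounting the same storage path into every container is one way to pass data between steps; the diagram's "mount" and "transfer" arrows suggest something similar, but the actual mechanism is not described in the slides.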
  12. PAST (workflow.sh) • Post to GridEngine • Run binary software • Pre-install/build required

    FUTURE (workflow.json) • Post to workflow manager • Run docker container • Improved portability
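The contrast on this slide can be sketched as the command each approach ends up running. This is a minimal illustration, assuming a GridEngine `qsub` submission for the past case and a plain `docker run` with a mounted data directory for the future case. The helper names, image name, and paths are hypothetical, and the commands are only built, not executed.

```python
import shlex


def gridengine_argv(script):
    """Past: submit a shell script; the software must already be installed on the nodes."""
    return ["qsub", script]


def docker_argv(image, command, host_dir, container_dir="/data"):
    """Future: run the step inside a container that ships its own software."""
    return [
        "docker", "run", "--rm",
        "-v", f"{host_dir}:{container_dir}",  # mount shared data into the container
        image,
        *shlex.split(command),
    ]


past = gridengine_argv("workflow.sh")
future = docker_argv("example/aligner:0.1", "align /data/reads.fastq", "/mnt/shared/data")
print(shlex.join(past))
print(shlex.join(future))
```

The portability gain comes from the image reference: the node only needs Docker, while the tool and its libraries travel with the image instead of being pre-installed or built on each system.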
  13. Challenges

    1. World standard format for data exchange: Common Workflow Language project
    2. HPC as a data sharing platform: Data, Processing, and Publication
  14. HPC as a data sharing platform: Packaging Whole Research Activities

    [Diagram] Research Activity Time Course. Details of: Project Design, Sampling, Wet Experiments, Primary Data, Data Processing & Analysis, Publication (Text, Figs)
  15. HPC as a data sharing platform: Packaging Whole Research Activities

    [Diagram] Research Activity Time Course. Details of: Project Design, Sampling, Wet Experiments, Primary Data, Data Processing & Analysis, Publication (Text, Figs). Database labels: BioProject, BioSample, Genbank, DRA
  16. Plan and Experiments model

    Users will:
    1. Describe lab specifications for each general process model with the study plan construction client, to generate a process model
    2. Allocate resources for each process with the resource allocation client
    3. <execute experiments>
    4. Input data to get output
  17. HPC as a data sharing platform: Packaging Whole Research Activities

    [Diagram] Research Activity Time Course. Details of: Project Design, Sampling, Wet Experiments, Primary Data, Data Processing & Analysis, Publication (Text, Figs). Overlay label: supercool framework
  18. HPC as a data sharing platform: Packaging Whole Research Activities

    [Diagram] Details of: Project Design, Sampling, Wet Experiments, Primary Data, Data Processing & Analysis, Publication (Text, Figs). Overlay label: supercool framework
  19. [Diagram] Details of: Project Design, Sampling, Wet Experiments, Primary Data, Data Processing & Analysis, Publication (Text, Figs). Overlay label: supercool framework

    Image from: http://www.aist.go.jp/Portals/0/resource_images/aist_j/aistinfo/aist_today/vol14_11/vol14_11_p06.pdf
  20. Summary

    1. Genomics requires HPC and a good interface to it
    2. Container virtualization on HPC will solve problems
    3. DB+HPC for Reproducible Research
  21. Acknowledgement

    • Organizers of HPCCON 2015
    • This work was supported by ROIS URA Grant “༥߹γʔζ୳ࡧ” 2014.
    • The Institute of Statistical Mathematics
    • Dr. Yoshiyasu Tamura
    • Dr. Junji Nakano
    • Dr. Keisuke Honda
    • National Institute of Informatics
    • Dr. Kenjiro Yamanaka
    • Dr. Kento Aida
    • Dr. Shigetoshi Yokoyama
    • Dr. Yoshinobu Masatani
    • National Institute of Genetics
    • Dr. Osamu Ogasawara
    • Dr. Takeshi Tsurusawa
    • NIG SuperComputer Facilities SE team
    • Information and Mathematical Science and Bioinformatics Co., Ltd.
    • Tatsuya Nishizawa
    • Tokyo Institute of Technology
    • Dr. Shinichi Miura
    • Dr. Satoshi Matsuoka
    • Colleagues and Members of DBCLS, DDBJ, and Open-Bio
    • BioHackathon Hackers
    • Jean-Luc Perret, Alexander Garcia, and Erick Antezana for wet-lab protocols
    • Bruno Vieira, Jeremy Nguyen, and Evan Bolton for the fantastic help to write my abstract