Genomic data analysis workflow as a chain of docker containers on high-performance computing system

ISM High Performance Computing Conference, Oct 9-10

Tazro Inutano Ohta

October 09, 2015

Transcript

  1. Genomic data analysis workflow as a chain of docker containers on high-performance computing system

    Tazro Ohta, Osamu Ogasawara
    Research Institute for Information and Systems, Database Center for Life Science, Tazro Ohta <t.ohta@dbcls.rois.ac.jp>
    ISM High Performance Computing Conference, October 09, 2015
  2. Agenda

    1. Introduction of DBCLS/DDBJ: Institutes for bioinformatics and HPC
    2. Genomics as an application of HPC: “Next-generation” of genomics and computing platform
    3. Docker container on HPC for biology: Machine readable format and sharing platform
  3. Introduction of DBCLS/DDBJ: Institutes for bioinformatics and HPC

  4. Introduction of DBCLS

    • Providing biological database and related services
    • Availability, Accessibility, and Accountability of DBs
    • Linked Data and Semantic Web
    • Data analysis platform
    • Text mining
    • Funding from NBDC
    • Collaboration with DDBJ
    Database Center for Life Science, Research Organization of Information and Systems
  5. Introduction of DBCLS: Database Center for Life Science, Research Organization of Information and Systems
  6. Introduction of DDBJ

    • Public data repository under National Institute of Genetics
    • Collaborating with NCBI, EBI
    • Supercomputer system *for free* to academic users
    DNA Data Bank of Japan, National Institute of Genetics, Research Organization of Information and Systems
  7. Genomics as an application of HPC: “Next-generation” of genomics and computing platform
  8. Genomics as an application of HPC

    1. DNA sequencing and big data for biology: Current status of “next generation”
    2. General research workflow: How biologists use computers
    3. Problems and challenges: What biologists expect from a “high-performance” computing platform
  9. DNA sequencing and big data for biology

    1. Rise of high-throughput sequencing: From the human genome project to next-generation sequencing
    2. Public repository for data sharing: Reference genome, 1000 genomes, Sequence Read Archive
  10. https://flxlexblog.wordpress.com/2014/06/11/developments-in-next-generation-sequencing-june-2014-edition/

  11. None
  12. http://www.ncbi.nlm.nih.gov/Traces/sra/

  13. General research workflow

    1. Data processing: de novo assembly and reference alignment. We need HPC
    2. Statistical analysis and visualisation: R, SciPy, and other scripting languages. We do this on our laptops
  14. https://speakerdeck.com/michaelbarton/ranking-genome-assemblers-with-docker-containers-dockercon-eu-2014

  15. https://speakerdeck.com/michaelbarton/ranking-genome-assemblers-with-docker-containers-dockercon-eu-2014

  16. https://speakerdeck.com/michaelbarton/ranking-genome-assemblers-with-docker-containers-dockercon-eu-2014

  17. https://speakerdeck.com/michaelbarton/ranking-genome-assemblers-with-docker-containers-dockercon-eu-2014

  18. http://www.historyofnimr.org.uk/mill-hill-essays/essays-yearly-volumes/2010-2/bringing-it-all-back-home-next-generation-sequencing-technology-and-you/

  19. http://circos.ca/intro/genomic_data/ http://pages.genemania.org/plugin/

  20. Problems and challenges

    1. Storage, RAM and network: Volume, Disk I/O, data cache, data transfer
    2. Library and Software: Workflow, versioning, repeatability/reproducibility
  21. None
  22. None
  23. None
  24. None
  25. None
  26. None
  27. None
  28. length(tools) > 680 http://seqanswers.com/wiki/Software

  29. None
  30. http://www.slideshare.net/sjcockell/reproducibility-the-myths-and-truths-of-pipeline-bioinformatics

  31. Docker container on HPC for biology: Machine readable format and sharing platform
  32. Docker container on HPC for biology

    1. Technologies: Base technologies for proposed system
    2. Implementation: Format and prototype engine
    3. Challenges: Toward the future of database and HPC
  33. Technologies

    1. Docker: Container based virtualization
    2. Apache Mesos: Resource management system
  34. https://www.docker.com/whatisdocker/

  35. https://mesosphere.com/learn/

  36. Implementation

    1. Workflow execution engine: Container workflow engine
    2. Sharable workflow description: Data exchange format
  37. Apache Mesos + Chronos manager (diagram)

    Labels: User; Apache Mesos + Chronos manager; Node (×5); Public/Private Docker Registry; Dockerfiles; workflow.json; Data Storage
    Arrow labels: post, post/get, transfer, push, pull, run, mount
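One plausible reading of the diagram is that the user posts a workflow description to the manager, which then runs each step as a container on a node with the shared data storage mounted. The sketch below shows how a linear chain of steps might be turned into dependent job definitions. It is an illustration under assumptions, not the prototype from the talk: the `chain_to_jobs` helper, the step names, the image names, and the commands are all hypothetical, and the field names (`parents`, `container`, `volumes`) are only modelled loosely on Chronos job JSON and should be checked against the Chronos documentation before use.

```python
import json


def chain_to_jobs(steps, data_dir="/data"):
    """Build one job definition per step; each job depends on the previous one."""
    jobs = []
    previous_name = None
    for step in steps:
        job = {
            "name": step["name"],
            "command": step["command"],
            "container": {
                "type": "DOCKER",
                "image": step["image"],
                # Mount the shared data storage at the same path in the container.
                "volumes": [
                    {"hostPath": data_dir, "containerPath": data_dir, "mode": "RW"}
                ],
            },
        }
        if previous_name is not None:
            # A dependent job is meant to run only after its parent has finished.
            job["parents"] = [previous_name]
        jobs.append(job)
        previous_name = step["name"]
    return jobs


# Hypothetical two-step chain; image names and commands are placeholders.
steps = [
    {"name": "align", "image": "example/aligner:1.0",
     "command": "align /data/reads.fastq"},
    {"name": "call-variants", "image": "example/caller:1.0",
     "command": "call /data/aligned.bam"},
]

jobs = chain_to_jobs(steps)
print(json.dumps(jobs, indent=2))
```

The first job carries no `parents` entry; a real scheduler would also need a schedule or trigger for it, which is left out here on purpose.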
  38. PAST: workflow.sh

    • Post to GridEngine
    • Run binary software
    • Pre-install/build required

    FUTURE: workflow.json

    • Post to workflow manager
    • Run docker container
    • Improved portability
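To make the contrast between a shell script and a machine-readable description concrete, here is a minimal sketch that reads a workflow description and derives one `docker run` command line per step. The talk does not show the actual workflow.json schema, so the keys used here (`data_dir`, `steps`, `image`, `command`), the image names, and the `docker_commands` helper are assumptions made for illustration only.

```python
import json
import shlex

# Hypothetical workflow description; the real workflow.json schema is not given in the slides.
WORKFLOW_JSON = """
{
  "name": "example-workflow",
  "data_dir": "/data",
  "steps": [
    {"image": "example/qc:1.0", "command": ["run_qc", "/data/reads.fastq"]},
    {"image": "example/align:1.0", "command": ["run_align", "/data/reads.fastq"]}
  ]
}
"""


def docker_commands(workflow):
    """Return one `docker run` argument list per step, in execution order."""
    data_dir = workflow["data_dir"]
    commands = []
    for step in workflow["steps"]:
        argv = [
            "docker", "run", "--rm",
            # Share the data directory so each step can read the previous step's output.
            "-v", f"{data_dir}:{data_dir}",
            step["image"],
            *step["command"],
        ]
        commands.append(argv)
    return commands


workflow = json.loads(WORKFLOW_JSON)
for argv in docker_commands(workflow):
    print(shlex.join(argv))
```

Because each tool ships inside its image, nothing has to be pre-installed on the node beyond Docker itself, which is the portability gain the slide points to.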
  39. Challenges

    1. World standard format for data exchange: Common workflow language project
    2. HPC as a data sharing platform: Data, Processing, and Publication
  40. https://github.com/common-workflow-language/common-workflow-language

  41. HPC as a data sharing platform: Packaging Whole Research Activities

    Research Activity Time Course: Details of Project Design; Publication (Text, Figs); Sampling; Primary Data; Data Processing & Analysis; Wet Experiments
  42. HPC as a data sharing platform: Packaging Whole Research Activities

    Research Activity Time Course: Details of Project Design; Sampling; Primary Data; Data Processing & Analysis; Publication (Text, Figs); Wet Experiments
    BioProject; BioSample; Genbank, DRA
  43. Plan and Experiments model

    Users will:
    1. Describe lab specifications for each general process model with the study plan construction client, to generate a process model
    2. Allocate resources for each process with the resource allocation client
    3. <execute experiments>
    4. Input data to get output
  44. HPC as a data sharing platform: Packaging Whole Research Activities

    Research Activity Time Course: Details of Project Design; Sampling; Primary Data; Data Processing & Analysis; Publication (Text, Figs); Wet Experiments
    supercool framework
  45. HPC as a data sharing platform: Packaging Whole Research Activities

    Details of Project Design; Sampling; Primary Data; Data Processing & Analysis; Publication (Text, Figs); Wet Experiments
    supercool framework
  46. Details of Project Design; Sampling; Primary Data; Data Processing & Analysis; Publication (Text, Figs); Wet Experiments

    supercool framework
    まほろ (Mahoro)
    Image from: http://www.aist.go.jp/Portals/0/resource_images/aist_j/aistinfo/aist_today/vol14_11/vol14_11_p06.pdf
  47. Summary

    1. Genomics requires HPC and a good interface to it
    2. Container virtualization on HPC will solve problems
    3. DB+HPC for Reproducible Research
  48. Acknowledgement

    • Organizers of HPCCON 2015
    • This work was supported by ROIS URA Grant “融合シーズ探索” 2014.
    • The Institute of Statistical Mathematics
      • Dr. Yoshiyasu Tamura
      • Dr. Junji Nakano
      • Dr. Keisuke Honda
    • National Institute of Informatics
      • Dr. Kenjiro Yamanaka
      • Dr. Kento Aida
      • Dr. Shigetoshi Yokoyama
      • Dr. Yoshinobu Masatani
    • National Institute of Genetics
      • Dr. Osamu Ogasawara
      • Dr. Takeshi Tsurusawa
      • NIG SuperComputer Facilities SE team
    • Information and Mathematical Science and Bioinformatics Co., Ltd.
      • Tatsuya Nishizawa
    • Tokyo Institute of Technology
      • Dr. Shinichi Miura
      • Dr. Satoshi Matsuoka
    • Colleagues and Members of DBCLS, DDBJ, and Open-Bio
    • BioHackathon Hackers
    • Jean-Luc Perret, Alexander Garcia, and Erick Antezana for wet-lab protocols
    • Bruno Vieira, Jeremy Nguyen, and Evan Bolton for the fantastic help to write my abstract