
Reproducibility as a Service: Virtual Appliance for NGS data analysis


Docker containers, Apache Mesos, and the NIG SuperComputer Facilities. Prepared for the 31st DDBJing training course (第31回 DDBJing 講習会) in Tokyo. See also: http://www.slideshare.net/dritoshi/20150612-ddb-jingjst

Tazro Inutano Ohta

June 12, 2015


Transcript

  1. Ensuring the reproducibility of NGS data analysis using lightweight virtual environments
    Reproducibility as a Service: Virtual Appliance for NGS data analysis
    Tazro Ohta [email protected] Database Center for Life Science, Japan
    Prepared for the 31st DDBJing training course in Tokyo, 12 Jun. 2015
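  2. Challenges
    1. Computational Resource Management: computational resources are insufficient; however much there is, it is never enough
    2. Sharing Analysis Protocols and History: sharing and re-running analysis protocols is costly; neither you nor anyone else can reproduce the analysis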
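  3. Challenges #1 Computational Resource Management
    • Explosive growth of data from the sequencers once called "next-generation"
    • Genome science is turning into a data science that makes full use of computers and mathematics
    • The NIG supercomputer system was upgraded for genome analysis (2012)
    • !!! Completely free of charge !!!
    • Users increased, jobs increased, data increased, and complaints increased too
    • User requests were collected at the NIG user meeting
    • Not enough resources
    • Jobs cannot be run
    • Not enough storage
    • The limits of SGE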
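  4. Challenges #1 Computational Resource Management
    • What is required of the NIG supercomputer
    • Efficiency of computational resource usage is maximized
    • The environment gives software developers a high degree of freedom
    • Large-scale data analysis can be run without computing expertise
    • The security of the data being handled is ensured
    • The major databases are accessible in an easy-to-use form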
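  5. Challenges #2 Sharing Analysis Protocols and History
    • Data analysis performed interactively on the supercomputer cannot be reproduced (by yourself | by others)
    • On shared computers, building the environment for a data analysis becomes costly
    • Reproducing an analysis, from environment setup through execution, requires manual work
    • There is no unified way to share and re-run an analysis's environment configuration and execution history
    • A feature like Galaxy's History is needed for every kind of data analysis
    • At present, this information is exchanged in natural language
    • Materials and Methods
    • Supplementary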
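  6. Technologies
    1. Virtual Appliance: Linux Containers: container-based virtualization and the description of an "executable computing environment"
    2. Apache Mesos: Abstract Computational Resources: realizing a "virtual giant data center" by abstracting computational resources
    3. Data Handling and Workflow Execution: huge data and workflow execution, technologies that meet the demands of genome science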
  7. Technologies #1 Virtual Appliance: Linux Containers
    • Linux Containers (LXC)
    • Runs multiple isolated Linux systems (containers) on a single Linux host
    • Isolates resources from the host without running a hypervisor
    • Isolation of CPU, memory, block I/O, and network via cgroups, together with kernel namespaces
    • A technology in the lineage of earlier container-style virtualization
    • IBM Mainframes LPAR, Parallels Virtuozzo, Solaris Containers
    • http://en.wikipedia.org/wiki/LXC
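  8. Technologies #1 Virtual Appliance: Linux Containers To Docker
    • Docker
    • From a container virtualization technology to a platform
    • Combines LXC with AUFS (Another Unionfs) to manage container images
    • AUFS layers diff data on top of a filesystem and handles it transparently
    • Current Docker (v0.9 and later) uses Docker's own libcontainer instead of LXC
    • Docker turned container virtualization into a trend
    • www.docker.com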
  9. Dockerfile: describe the container build procedure as code (Infrastructure as Code)
    http://hub.docker.com/u/inutano/fastqc/
    $ nano Dockerfile # edit Dockerfile
    $ docker build -t inutano/fastqc .
    $ docker run -it inutano/fastqc fastqc sample.fastq
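  10. Technologies #2 Apache Mesos: Abstract Computational Resources
    • Running containers on a supercomputer requires a resource manager and a scheduler
    • SGE, a conventional distributed job execution engine, is somewhat lacking in flexibility
    • Some projects do it anyway: http://www.nextflow.io/
    • We want to support dynamic scale-out and snapshots
    • Apache Mesos http://mesos.apache.org/
    • Bundles multiple data centers and compute nodes and presents them as a single machine
    • Abstracts CPU, memory, storage, and so on
    • Supports not only conventional MPI but also newer distributed processing such as Hadoop and Spark
    • Has no scheduler of its own; it provides frameworks with an API for resource allocation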
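The resource-allocation idea in the last bullet can be pictured with a toy model: nodes advertise free CPU and memory, and a framework-side scheduler decides which task goes where. The following is a minimal sketch in plain Python and is not the Mesos API; `Offer`, `Task`, and `first_fit` are hypothetical names introduced only for illustration, and first-fit is just one simple placement policy.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    """Free resources advertised by one node (hypothetical model)."""
    node: str
    cpus: float
    mem_mb: int


@dataclass
class Task:
    """Resources one container needs (hypothetical model)."""
    name: str
    cpus: float
    mem_mb: int


def first_fit(offers, tasks):
    """Place each task on the first node that still has enough CPU and memory.

    Returns a dict mapping task name to node name, or to None when no
    offer can hold the task.
    """
    remaining = {o.node: [o.cpus, o.mem_mb] for o in offers}
    placement = {}
    for task in tasks:
        placement[task.name] = None
        for node, free in remaining.items():
            if task.cpus <= free[0] and task.mem_mb <= free[1]:
                # Consume the resources so later tasks see the reduced offer.
                free[0] -= task.cpus
                free[1] -= task.mem_mb
                placement[task.name] = node
                break
    return placement


offers = [Offer("node1", 2, 4096), Offer("node2", 8, 16384)]
tasks = [Task("fastqc", 1, 2048), Task("bwa", 4, 8192), Task("big", 16, 1024)]
placement = first_fit(offers, tasks)
print(placement)
```

In a real deployment the offers would come from the Mesos master and the decision logic would live in a framework such as the ones listed on the next slide.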
  11. Technologies #2 Apache Mesos: Various Middleware for Scheduling Containers
    • Several general-purpose schedulers that run on Mesos have been developed
    • Chronos: a scheduler that provides cron on Mesos
    • Marathon: a scheduler for long-running applications on Mesos
    • Apache Aurora: combines the features of Chronos and Marathon and has a Python-based DSL
    • Anyone can write their own Mesos scheduler
    • http://www.slideshare.net/wallyqs/mesos-scheduler
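  12. Technologies #3 Data Handling and Workflow Execution
    • Running analyses in containers raises the problem of data persistence
    • Mount a directory on the host filesystem and write to it
    • How should data be handled when multiple remote data centers are abstracted with Mesos?
    • Cloud Burst Buffer by TITech
    • MMCFTP by NII
    • How should a workflow that combines containers be executed?
    • Standardization of analysis workflow descriptions
    • Common Workflow Language
    • https://github.com/common-workflow-language/common-workflow-language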
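The host-directory mount mentioned above can be sketched as a small helper that assembles a `docker run` command line. This is an illustrative sketch only: `docker_run_argv`, the host path `/home/user/project`, and the container path `/data` are assumptions, while `-v`, `--rm`, and FastQC's `--outdir` are existing options. The helper only builds the argument list and does not invoke Docker.

```python
import shlex


def docker_run_argv(image, command, host_dir, container_dir="/data"):
    """Build a `docker run` argv that bind-mounts host_dir into the container.

    Files the tool writes under container_dir land in host_dir, so they
    survive after the container is removed.
    """
    return [
        "docker", "run", "--rm",
        "-v", f"{host_dir}:{container_dir}",  # host path : container path
        image,
        *command,
    ]


argv = docker_run_argv(
    "inutano/fastqc",
    ["fastqc", "/data/sample.fastq", "--outdir", "/data"],
    host_dir="/home/user/project",  # hypothetical host directory
)
print(shlex.join(argv))
```

Passing the resulting list to `subprocess.run(argv, check=True)` would execute it on a host where Docker is installed.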
  13. System overview
    [Diagram. Components: User; Apache Mesos + Chronos manager; five Nodes; Public/Private Docker Registry; Dockerfiles; workflow.json; Data; Storage. Arrow labels: post, post/get, transfer, push, pull, run, mount.]
  14. PAST: workflow.sh
    • Post to GridEngine
    • Run binary software
    • Pre-install/build required
    FUTURE: workflow.json
    • Post to workflow manager
    • Run docker container
    • Improved portability
  15. workflow.json
    • JSON-format configuration file
    • Describes a workflow containing multiple steps
    • One container per application
    • Includes the directories to be mounted in the containers
  16. Chronos dependent jobs
    curl -X POST -d @workflow.json
    • Repeat 1
    • Shipped with a suicide job
    • Containers should finish within a week
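To make the shape of such a job file concrete, here is a minimal sketch that builds two Chronos-style job definitions in Python, one scheduled to run once and one that depends on it. The field names (`schedule`, `parents`, `container`, `volumes`) follow the Chronos REST API as I recall it and should be checked against the deployed version; the job names, the image `example/report`, the owner address, and the paths are all hypothetical, and this is not the author's actual workflow.json. In Chronos, scheduled and dependent jobs are posted to separate endpoints, which is why the sketch keeps them as separate objects.

```python
import json


def chronos_job(name, image, command, host_dir, parents=None):
    """Return a Chronos-style job dict that runs `command` in a Docker container."""
    job = {
        "name": name,
        "command": command,
        "owner": "user@example.org",  # hypothetical owner
        "container": {
            "type": "DOCKER",
            "image": image,
            # Bind-mount the host directory so results persist.
            "volumes": [
                {"hostPath": host_dir, "containerPath": "/data", "mode": "RW"}
            ],
        },
    }
    if parents:
        # Dependent job: starts after all parent jobs have succeeded.
        job["parents"] = list(parents)
    else:
        # ISO 8601 repeating interval: run once (R1), start immediately.
        job["schedule"] = "R1//PT24H"
    return job


workflow = [
    chronos_job("qc", "inutano/fastqc", "fastqc /data/sample.fastq",
                "/home/user/project"),
    chronos_job("report", "example/report", "make-report /data",
                "/home/user/project", parents=["qc"]),
]
print(json.dumps(workflow, indent=2))
```

Each dictionary would be serialized to its own JSON file and posted with a `curl -X POST -d @file.json` call like the one on the slide.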
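  17. Implementation for proof-of-concept: Challenges
    • Scheduler issues
    • The limits of Chronos; the mismatch between Aurora's task model and genome data analysis
    • !!! No choice but to write a scheduler in Scala !!!
    • Workflow description issues
    • Easy for humans to write vs. exchangeable and guaranteeing reproducibility
    • The need for exchange formats and tools that anticipate DSLs and alt-CWL
    • Issues with containers that use special architectures
    • MPI, GPGPU, and FPGA
    • Coexistence with existing schedulers such as UGE and distributed processing frameworks such as Hadoop
    • Operating dynamic scale-out and checkpointing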
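  18. The future #1 Packaging Whole Research Activities
    • Describe every task in the research process in a machine-readable format
    • Store what has been described in a database with its relationships preserved
    • "Research Package Submission"
    • Research plan
    • Sampling and wet experiments
    • Primary data
    • Data processing and data analysis
    • Secondary data
    • The paper, and the data behind its figures, plots, and tables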
  19. The future #1 Packaging Whole Research Activities
    [Figure: Research Activity Time Course. Details of: Project Design; Sampling; Wet Experiments; Primary Data; Data Processing & Analysis; Publication (Text, Figs).]
  20. The future #1 Packaging Whole Research Activities
    [Figure: Research Activity Time Course. Details of: Project Design; Sampling; Wet Experiments; Primary Data; Data Processing & Analysis; Publication (Text, Figs). Additional labels: BioProject, BioSample, Genbank, DRA.]
  21. The future #1 Packaging Whole Research Activities
    [Figure: Research Activity Time Course, same labels as slide 19.]
  22. The future #1 Packaging Whole Research Activities
    [Figure: Research Activity Time Course, same labels as slide 19.]
  23. The future #1 Packaging Whole Research Activities
    [Figure: Details of Project Design; Sampling; Wet Experiments; Primary Data; Data Processing & Analysis; Publication (Text, Figs).]
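  24. The future #2 Continuous Integration for the Research Process
    • "Continuous Integration" by the database over Archived Packages
    • Build, test, and commit data and protocols
    • The final commit is registered in the DB as-is
    • Packages are accessible as objects, module by module
    • "Apply the data processing of study B to the data of study A" with a single command
    • New methods registered in the repository are automatically applied to all existing data
    • The DB grows on its own
    • DBGrowthRate/submission takes the place of IF
    • The value of novel methods and novel samples for proving new hypotheses increases
    • DataCenter + Database = "Reproducibility as a Service"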
  25. Summary
    1. Infrastructure needs to change for data science: today's computing infrastructure has many problems and needs to change substantially
    2. Virtualized environments run on abstracted computational resources: run virtualized environments on abstracted hardware
    3. The DB integrates data and processes for reproducible research: the database integrates research and ensures reproducibility
  26. Acknowledgement
    • This work was supported by ROIS URA Grant “融合シーズ探索” 2014.
    • The Institute of Statistical Mathematics: Dr. Yoshiyasu Tamura, Dr. Junji Nakano, Dr. Keisuke Honda
    • National Institute of Informatics: Dr. Kenjiro Yamanaka, Dr. Kento Aida, Dr. Shigetoshi Yokoyama, Dr. Yoshinobu Masatani
    • National Institute of Genetics: Dr. Osamu Ogasawara, Dr. Takeshi Tsurusawa, NIG SuperComputer Facilities SE team
    • Information and Mathematical Science and Bioinformatics Co., Ltd.: Tatsuya Nishizawa
    • Tokyo Institute of Technology: Dr. Shinichi Miura, Dr. Satoshi Matsuoka
    • Colleagues and Members of DBCLS, DDBJ, Open-Bio and BioHackathon