Reproducibility as a Service: Virtual Appliance for NGS data analysis


Docker Container, Apache Mesos, and NIG SuperComputer Facilities. Prepared for the 31st DDBJing training course in Tokyo. See also: http://www.slideshare.net/dritoshi/20150612-ddb-jingjst


Tazro Inutano Ohta

June 12, 2015

Transcript

  1. Ensuring the reproducibility of NGS data analysis with lightweight virtual environments. Reproducibility as a Service: Virtual Appliance for NGS data analysis. Tazro Ohta, t.ohta@dbcls.rois.ac.jp, Database Center for Life Science, Japan. Prepared for the 31st DDBJing training course in Tokyo, 12 Jun. 2015
  2. Tazro Inutano Ohta DBCLS, ROIS twitter.com/inutano github.com/inutano speakerdeck.com/inutano

  3. 1. Challenges: infrastructure problems in data science
    2. Technologies: solving those problems with new technologies
    3. The future: further challenges and where we are heading

  4. Challenges: infrastructure problems in data science, the case of the NIG supercomputer

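  5. Challenges
    1. Computational Resource Management: computational resources are insufficient; however much there is, it is never enough
    2. Sharing Analysis Protocols and History: sharing and re-running analysis protocols is costly; neither you nor anyone else can reproduce the analysis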
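  6. Challenges #1 Computational Resource Management
    • Explosive growth of data from the sequencers once called "next-generation"
    • Genome science is becoming a data science that makes full use of computers and mathematics
    • The NIG supercomputer system was renewed for genome analysis (2012)
    • !!! Completely free of charge !!!
    • Users increased, jobs increased, data increased, and complaints increased too
    • User requests were collected at the NIG users' meeting
    • Not enough resources
    • Jobs cannot be run
    • Not enough storage
    • The limits of SGE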
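  7. Inter-University Research Institute Corporation, Research Organization of Information and Systems, National Institute of Genetics: SuperComputer Facilities of National Institute of Genetics. Photo from http://sc.ddbj.nig.ac.jp/index.php/ja-gallery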
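  8. The number of users keeps growing (from presentation material by Ogasawara-san, NIG DDBJ)

  9. Job execution status over the past month http://sc.ddbj.nig.ac.jp/util/wk2/wk2-sge_status_UGER_rb.html

  10. Disk space under pressure http://sc.ddbj.nig.ac.jp/index.php/ja-nig-statistics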

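  11. Challenges #1 Computational Resource Management
    • What is required of the NIG supercomputer
    • The efficiency of computational resource usage is maximized
    • The environment gives software developers a high degree of freedom
    • Large-scale data analysis can be run without computing expertise
    • The security of the data being handled is ensured
    • The major DBs are accessible in an easy-to-use state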
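  12. Challenges #2 Sharing Analysis Protocols and History
    • Data analysis performed interactively on the supercomputer cannot be reproduced (by oneself | by others)
    • On a shared computer, setting up the environment for data analysis is costly
    • Reproducing an analysis, from environment setup through execution, requires manual work
    • There is no unified way to share and re-run the environment configuration and execution history of an analysis
    • A feature like Galaxy's History is needed for every kind of data analysis
    • At present, the information is exchanged through natural language
    • Materials and Methods
    • Supplementary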
  13. The kinds and versions of software available on a shared supercomputer are limited http://sc.ddbj.nig.ac.jp/index.php/ja-avail-oss or /usr/local/pkg/oss_info.tsv (DDBJ supercomputer)

  14. Building software without root privileges is painful http://qiita.com/inutano/items/990ba12eb4d8220977f2

  15. LPM by Kasahara-san (University of Tokyo) is very convenient, but ideally we want to build the whole environment http://www.kasahara.ws/lpm/

  16. The importance of reproducible data analysis has been stressed for a long time http://blogs.biomedcentral.com/gigablog/2012/12/16/gigascience-special-session-at-iscb-asia-on-workflows-cloud-for-reproducible-bioinformatics/

  17. Repeatability versus reproducibility: remarks on the concepts http://www.slideshare.net/sjcockell/reproducibility-the-myths-and-truths-of-pipeline-bioinformatics

  18. In large projects, publishing detailed protocols has become the norm http://fantom.gsc.riken.jp/5/sstar/Protocols:HeliScopeCAGE_read_alignment

  19. "If you solve socially a problem that engineering could solve, you lose." Dr. Itoshi Nikaido

  20. Technologies: connecting applications and infrastructure, solving the problems with new technologies

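  21. Technologies
    1. Virtual Appliance: Linux Containers. Container-based virtualization and the description of an "executable computing environment"
    2. Apache Mesos: Abstract Computational Resources. Realizing a "virtual giant data center" by abstracting computational resources
    3. Data Handling and Workflow Execution. Huge data and workflow execution: technologies that meet the demands of genome science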
  22. Technologies #1 Virtual Appliance: Linux Containers
    • Linux Containers (LXC)
    • Runs multiple isolated Linux systems (containers) on a single Linux host
    • Isolates resources from the host without running a hypervisor
    • Isolation of CPU, memory, block I/O, and network with cgroups, combined with kernel namespaces
    • A technology in the lineage of earlier container-based virtualization
    • IBM Mainframes LPAR, Parallels Virtuozzo, Solaris Containers
    • http://en.wikipedia.org/wiki/LXC
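The cgroup bullet can be made concrete with a small sketch. On Linux, the cgroup membership of a process is listed in `/proc/<pid>/cgroup`, one line per hierarchy in the form `hierarchy-ID:controller-list:path`. The function name `parse_cgroup` and the sample text below are illustrative assumptions, not material from the slides.

```python
def parse_cgroup(text):
    """Map each cgroup controller to the cgroup path of a process.

    Each line of /proc/<pid>/cgroup has the form
    hierarchy-ID:controller-list:cgroup-path
    """
    membership = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        _hierarchy_id, controllers, path = line.split(":", 2)
        # cgroup v2 reports an empty controller list; file it under "unified".
        for name in (controllers.split(",") if controllers else ["unified"]):
            membership[name] = path
    return membership


# Hypothetical sample in the cgroup v1 layout; the paths are made up.
SAMPLE = """\
4:memory:/lxc/container1
3:cpu,cpuacct:/lxc/container1
2:blkio:/lxc/container1
"""

if __name__ == "__main__":
    for controller, path in sorted(parse_cgroup(SAMPLE).items()):
        print(controller, path)
```

On a Linux host, passing the contents of `/proc/self/cgroup` to `parse_cgroup` shows which resource controllers constrain the current process.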
  23. LinuxContainers.org: the OSS project that develops LXC and its surrounding tools http://linuxcontainers.org/

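  24. Technologies #1 Virtual Appliance: Linux Containers To Docker
    • Docker
    • From a container virtualization technology to a platform
    • Combines LXC with AUFS (Another Unionfs) to manage container images
    • AUFS layers difference data on a file system and handles the layers transparently
    • Current Docker (v0.9 and later) uses Docker's own libcontainer instead of LXC
    • Docker made container virtualization a trend
    • www.docker.com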
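The layering idea behind AUFS can be pictured with a toy model. The sketch below is only a conceptual illustration in Python, using `collections.ChainMap` to let upper layers shadow lower ones; it is not how AUFS or Docker is implemented, and the layer contents, the `WHITEOUT` marker, and the `union_view` name are all hypothetical.

```python
from collections import ChainMap

WHITEOUT = object()  # marker meaning "deleted in an upper layer"


def union_view(*layers):
    """Merge layers (topmost first) into the file tree a container sees."""
    merged = ChainMap(*layers)
    return {path: data for path, data in merged.items() if data is not WHITEOUT}


# Hypothetical read-only image layers: a base OS, then an installed tool.
base = {"/bin/sh": "shell", "/etc/motd": "hello"}
tool = {"/usr/bin/fastqc": "fastqc launcher"}
# The writable container layer stores only the differences from the image.
container = {"/etc/motd": WHITEOUT, "/data/report.html": "QC report"}

view = union_view(container, tool, base)

if __name__ == "__main__":
    for path in sorted(view):
        print(path)
```

The lower layers are never modified, which is why many containers can share one image and why an image can be distributed as a stack of differences.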
  25. Docker

  26. https://www.docker.com/whatisdocker/

  27. Dockerfile: describe the container build procedure as code (Infrastructure as Code) http://hub.docker.com/u/inutano/fastqc/

  28. Dockerfile: describe the container build procedure as code (Infrastructure as Code) http://hub.docker.com/u/inutano/fastqc/
    $ nano Dockerfile # edit Dockerfile
    $ docker build -t inutano/fastqc .
    $ docker run -it inutano/fastqc fastqc sample.fastq
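The transcript does not include the contents of the Dockerfile itself. As a rough illustration of what "build procedure as code" means, the Python sketch below renders a minimal Dockerfile from three inputs. The base image, the package name, and the helper name `render_dockerfile` are assumptions chosen for the example (it presumes a Debian-family base image in which a `fastqc` package can be installed); this is not the author's actual Dockerfile.

```python
import json


def render_dockerfile(base_image, packages, command):
    """Return Dockerfile text: a base image, one install step, a default command."""
    lines = [
        f"FROM {base_image}",
        # Each RUN instruction adds one layer to the image.
        "RUN apt-get update && apt-get install -y " + " ".join(packages),
        # Exec form of CMD: a JSON array of strings.
        "CMD " + json.dumps(command),
    ]
    return "\n".join(lines) + "\n"


# Hypothetical values for illustration only.
dockerfile_text = render_dockerfile("ubuntu:14.04", ["fastqc"], ["fastqc", "--help"])

if __name__ == "__main__":
    print(dockerfile_text, end="")
```

Saving the rendered text as `Dockerfile` and running the `docker build` command from the slide would turn the description into an image.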
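  29. Technologies #2 Apache Mesos: Abstract Computational Resources
    • Running containers on a supercomputer requires a resource manager and a scheduler
    • SGE, a conventional distributed job execution engine, is somewhat lacking in flexibility
    • Some are doing it anyway: http://www.nextflow.io/
    • We want to support dynamic scale-out and snapshots
    • Apache Mesos http://mesos.apache.org/
    • Bundles multiple DCs and compute nodes so that they look like a single machine
    • Abstracts CPU, memory, storage, and so on
    • Supports not only conventional MPI but also newer distributed processing such as Hadoop and Spark
    • Has no scheduler of its own; it provides frameworks with an API for resource allocation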
  30. Apache Mesos: bundles multiple compute nodes and data centers into one virtual computer https://mesosphere.com/learn/

  31. Apache Mesos: developed at UC Berkeley AMPLab for resource sharing in compute clusters https://amplab.cs.berkeley.edu/projects/mesos-dynamic-resource-sharing-for-clusters/

  32. Apache Mesos: the paper is openly available (the project later became an Apache project) http://mesos.berkeley.edu/mesos_tech_report.pdf

  33. Apache Mesos: slaves report their available resources to the master, and the master offers them to frameworks http://mesos.apache.org/documentation/latest/mesos-architecture/
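The resource-offer cycle can be sketched as a toy simulation. The Python below is a deliberately simplified model with hypothetical names (`Offer`, `Task`, `framework_accept`, `master_round`) and made-up node sizes; real Mesos adds an allocation policy, offer declines, and task status updates that are left out here.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    slave: str
    cpus: float
    mem_mb: int


@dataclass
class Task:
    name: str
    cpus: float
    mem_mb: int


def framework_accept(offer, pending):
    """Framework side: launch the pending tasks that fit inside one offer."""
    launched, cpus, mem = [], offer.cpus, offer.mem_mb
    for task in list(pending):
        if task.cpus <= cpus and task.mem_mb <= mem:
            launched.append(task)
            pending.remove(task)
            cpus -= task.cpus
            mem -= task.mem_mb
    return launched


def master_round(free_resources, pending):
    """Master side: turn each slave's reported free resources into an offer."""
    placement = {}
    for slave, (cpus, mem_mb) in free_resources.items():
        launched = framework_accept(Offer(slave, cpus, mem_mb), pending)
        if launched:
            placement[slave] = [task.name for task in launched]
    return placement


# Hypothetical cluster state and task list.
free = {"node1": (4.0, 8192), "node2": (2.0, 4096)}
tasks = [Task("fastqc", 1.0, 2048), Task("bwa", 4.0, 8192), Task("samtools", 2.0, 4096)]
placement = master_round(free, tasks)

if __name__ == "__main__":
    print(placement)
    print("still pending:", [task.name for task in tasks])
```

The point of the design is that the master only decides what to offer, while each framework decides what to run, so schedulers with different policies can share one cluster.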

  34. Technologies #2 Apache Mesos: Various MiddleWares for Scheduling Containers
    • Several general-purpose schedulers that run on Mesos have been developed
    • Chronos: a scheduler that provides cron on Mesos
    • Marathon: a scheduler for long-running applications on Mesos
    • Apache Aurora: combines the features of Chronos and Marathon and has a Python-based DSL
    • Even a monkey could write a Mesos scheduler
    • http://www.slideshare.net/wallyqs/mesos-scheduler
  35. Chronos: a framework that behaves like cron on top of resources managed by Mesos http://mesos.github.io/chronos/

  36. Chronos: cron for the Mesos abstraction kernel, developed by Airbnb; jobs can be run through a REST API http://mesos.github.io/chronos/

  37. Marathon: persistence for tasks that go through Chronos; runs web frameworks such as RoR on Mesos https://mesosphere.github.io/marathon/

  38. Marathon: started by init or similar, like the Mesos master; apps scale easily from the GUI https://mesosphere.github.io/marathon/

  39. Apache Aurora: supports quotas and multiple users, and has a DSL for describing the commands to execute http://aurora.apache.org

  40. Apache Aurora: Jobs, Tasks, and Processes are each described in a Python-based DSL. However, resource allocation is per task, which may make it a little awkward to use?

  41. Mesosphere: a company that contributes to Apache Mesos-related OSS and provides services https://mesosphere.com/about/

  42. Mesosphere: develops the DataCenter Operating System (DCOS) https://mesosphere.com/learn/

  43. Technologies #3 Data Handling and Workflow Execution
    • One problem with running containers is data persistence
    • Mount a directory on the host file system and write to it
    • How should data be handled when multiple remote DCs are abstracted with Mesos?
    • Cloud Burst Buffer by TITech
    • MMCFTP by NII
    • How should a workflow that combines containers be executed?
    • Standardization of analysis workflow descriptions
    • Common Workflow Language
    • https://github.com/common-workflow-language/common-workflow-language
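Mounting a host directory is done with the `-v host_dir:container_dir` option of `docker run`. The Python sketch below only builds the command line and does not execute it; the helper name `docker_run_command`, the host path, and the mount point are hypothetical, while the image name and `sample.fastq` come from the earlier slide.

```python
import shlex


def docker_run_command(image, command, host_dir, container_dir):
    """Build a `docker run` argv that bind-mounts host_dir at container_dir."""
    return [
        "docker", "run", "--rm",
        "-v", f"{host_dir}:{container_dir}",
        image,
    ] + list(command)


argv = docker_run_command(
    "inutano/fastqc",
    ["fastqc", "/data/sample.fastq", "--outdir", "/data"],
    "/home/user/project",  # hypothetical host directory holding the input
    "/data",               # hypothetical mount point inside the container
)

if __name__ == "__main__":
    # Print the command as it would be typed in a shell.
    print(" ".join(shlex.quote(part) for part in argv))
```

Because the output directory is inside the mount, the report survives after the container is removed, which is the persistence point raised in the slide.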
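  44. MMCFTP: Massively MultiConnection FTP developed by NII (currently participating as a trial user) http://ci.nii.ac.jp/naid/110009886191

  45. Cloud-based I/O Burst Buffer: fast access from a supercomputer to data in the cloud. Quoted from the lecture slides of Prof. Matsuoka (Tokyo Institute of Technology), The Institute of Statistical Mathematics Open Lecture 2014 (http://www.ism.ac.jp/kouenkai/)

  46. Common Workflow Language: a framework for standard workflow descriptions, developed as OSS https://github.com/common-workflow-language/common-workflow-language

  47. Implementation for proof-of-concept: run Docker containers on Mesos on nodes separated from the existing environment of the NIG supercomputer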

  48. System overview (diagram). Components: User, Public/Private Docker Registry, Dockerfiles, workflow.json, Data Storage, Apache Mesos + Chronos manager, five Nodes. Arrow labels: post, post/get, transfer, push, pull, run, mount
  49. PAST: workflow.sh
    • Post to GridEngine
    • Run binary software
    • Pre-install/build required
    FUTURE: workflow.json
    • Post to workflow manager
    • Run docker container
    • Improved portability
  50. workflow.json
    • JSON format configuration file
    • Describes a workflow that contains multiple steps
    • 1 container for 1 app
    • Includes the directory to be mounted on the containers
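The transcript does not show the schema of workflow.json. Purely as an illustration of the four properties listed, the Python sketch below defines a hypothetical layout (every field name, the `example/aligner` image, and the paths are assumptions) and checks that each step's dependencies refer to steps that exist.

```python
import json

# Hypothetical layout; the field names are illustrative, not the author's schema.
WORKFLOW_JSON = """
{
  "name": "qc-and-align",
  "mount": {"host": "/home/user/project", "container": "/data"},
  "steps": [
    {"name": "qc", "image": "inutano/fastqc",
     "command": "fastqc /data/sample.fastq"},
    {"name": "align", "image": "example/aligner",
     "command": "align /data/sample.fastq", "after": ["qc"]}
  ]
}
"""


def load_workflow(text):
    """Parse the workflow and check that every dependency names a known step."""
    workflow = json.loads(text)
    names = [step["name"] for step in workflow["steps"]]
    if len(names) != len(set(names)):
        raise ValueError("step names must be unique")
    for step in workflow["steps"]:
        for parent in step.get("after", []):
            if parent not in names:
                raise ValueError(f"unknown dependency: {parent}")
    return workflow


workflow = load_workflow(WORKFLOW_JSON)

if __name__ == "__main__":
    for step in workflow["steps"]:
        print(step["name"], "->", step["image"])
```

Keeping one image per step is what makes each step replaceable and portable, since a step carries its whole software environment with it.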
  51. None
  52. chronos dependent jobs
    curl -X POST -d @workflow.json
    • Repeat 1
    • Shipped with a suicide job
    • Containers should finish within a week
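A rough sketch of how a step list might be turned into Chronos job definitions follows. It assumes the Chronos REST API of that period, with `/scheduler/iso8601` for scheduled jobs and `/scheduler/dependency` for dependent jobs, a `parents` list on dependent jobs, and a Docker `container` field; the step list, the owner address, and the helper names are hypothetical, and the `R1//PT1H` schedule is an assumed reading of "Repeat 1". Check the paths and fields against the Chronos version actually deployed.

```python
import json
import urllib.request


def chronos_jobs(steps, owner):
    """Translate workflow steps into (endpoint, job definition) pairs."""
    jobs = []
    for step in steps:
        job = {
            "name": step["name"],
            "command": step["command"],
            "owner": owner,
            "container": {"type": "DOCKER", "image": step["image"]},
        }
        if step.get("after"):
            job["parents"] = step["after"]  # runs after its parents succeed
            endpoint = "/scheduler/dependency"
        else:
            # Assumed meaning: one run, start immediately, 1-hour interval.
            job["schedule"] = "R1//PT1H"
            endpoint = "/scheduler/iso8601"
        jobs.append((endpoint, job))
    return jobs


def post_job(base_url, endpoint, job):
    """POST one job definition to a Chronos server (not called in this sketch)."""
    request = urllib.request.Request(
        base_url + endpoint,
        data=json.dumps(job).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    return urllib.request.urlopen(request)


# Hypothetical two-step workflow.
STEPS = [
    {"name": "qc", "image": "inutano/fastqc", "command": "fastqc /data/sample.fastq"},
    {"name": "report", "image": "example/report", "command": "report /data", "after": ["qc"]},
]
jobs = chronos_jobs(STEPS, "user@example.org")

if __name__ == "__main__":
    for endpoint, job in jobs:
        print(endpoint, json.dumps(job))
```

Posting the root job first and the dependent jobs afterwards mirrors the `curl -X POST -d @workflow.json` call on the slide, with one request per job.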
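  53. Implementation for proof-of-concept: Challenges
    • Scheduler issues
    • The limits of Chronos; the mismatch between Aurora's task model and genome data analysis
    • !!! No choice but to write a scheduler in Scala !!!
    • Workflow description issues
    • Easy for humans to write vs. exchangeable and guaranteeing reproducibility
    • The need for exchange formats and tools in anticipation of DSLs and alt-CWLs
    • Issues with containers that use special architectures
    • MPI, GPGPU, and FPGA
    • Coexistence with existing schedulers such as UGE and distributed processing frameworks such as Hadoop
    • Operating dynamic scale-out and checkpointing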
  54. The future: further challenges and where we are heading, what a DB should be in order to ensure reproducibility

  55. The future
    1. Packaging Whole Research Activities: describe the entire research process in a machine-readable form and archive it
    2. Continuous Integration and Automated Build: DBs become platforms for ensuring research reproducibility
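  56. The future #1 Packaging Whole Research Activities
    • Describe every task in the research process in a machine-readable form
    • Store the descriptions in a database with their relationships preserved
    • "Research Package Submission"
    • Research plan
    • Sampling and wet experiments
    • Primary data
    • Data processing and data analysis
    • Secondary data
    • Papers, and the data in them such as figs, plots, and tables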
  57. The future #1 Packaging Whole Research Activities (diagram). Research Activity Time Course: Details of Project Design, Sampling, Wet Experiments, Primary Data, Data Processing & Analysis, Publication (Text, Figs)
  58. The future #1 Packaging Whole Research Activities (diagram). Research Activity Time Course: Details of Project Design, Sampling, Wet Experiments, Primary Data, Data Processing & Analysis, Publication (Text, Figs). Repositories shown: BioProject, BioSample, Genbank, DRA
  59. The future #1 Packaging Whole Research Activities (diagram). Research Activity Time Course: Details of Project Design, Sampling, Wet Experiments, Primary Data, Data Processing & Analysis, Publication (Text, Figs)
  60. SMART Protocols: describe experimental protocols in a machine-readable form http://ceur-ws.org/Vol-1282/lisc2014_submission_2.pdf

  61. The future #1 Packaging Whole Research Activities (diagram). Research Activity Time Course: Details of Project Design, Sampling, Wet Experiments, Primary Data, Data Processing & Analysis, Publication (Text, Figs)
  62. The future #1 Packaging Whole Research Activities (diagram). Details of Project Design, Sampling, Wet Experiments, Primary Data, Data Processing & Analysis, Publication (Text, Figs)
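  63. The future #2 Continuous Integration for the Research Process
    • "Continuous Integration" by the DB over archived packages
    • Build the data and protocols, test them, and commit
    • The final commit is registered in the DB as it is
    • Packages are accessible as objects, module by module
    • "Apply the data processing of study B to the data of study A" with a single command
    • New methods registered in the repository are automatically applied to all existing data
    • The DB grows on its own
    • DBGrowthRate/submission replaces the IF
    • The value of novel methods and novel samples for proving new hypotheses increases
    • DataCenter + Database = "Reproducibility as a Service"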
  64. BioCI: continuously package the progress of research and run automated builds and tests in the cloud. fig: http://blog.jki.net/news/niweek-2012-fire-and-forget-bulletproof-builds-using-continuous-integration-with-labview-video-slides-now-available/ and togopic http://g86.dbcls.jp/~togoriv/ (diagram labels: Do Research, Collect Data and Packaging, Package Hosting, Build and Test, Report / Current Result)
  65. Summary
    1. Infrastructure needs to change for data science: today's computing infrastructure has many problems and needs to change substantially
    2. Virtualized environments run on abstracted computational resources: run virtualized environments on abstracted hardware
    3. DBs integrate data and processes for reproducible research: databases integrate research and ensure reproducibility
  66. Acknowledgement
    • This work was supported by the ROIS URA Grant “融合シーズ探索” (exploration of interdisciplinary research seeds), 2014.
    • The Institute of Statistical Mathematics: Dr. Yoshiyasu Tamura, Dr. Junji Nakano, Dr. Keisuke Honda
    • National Institute of Informatics: Dr. Kenjiro Yamanaka, Dr. Kento Aida, Dr. Shigetoshi Yokoyama, Dr. Yoshinobu Masatani
    • National Institute of Genetics: Dr. Osamu Ogasawara, Dr. Takeshi Tsurusawa, NIG SuperComputer Facilities SE team
    • Information and Mathematical Science and Bioinformatics Co., Ltd.: Tatsuya Nishizawa
    • Tokyo Institute of Technology: Dr. Shinichi Miura, Dr. Satoshi Matsuoka
    • Colleagues and Members of DBCLS, DDBJ, Open-Bio and BioHackathon