Genomon2_Azure_JSBi _2016

Yuichi Shiraishi
November 24, 2017

Part of Presentation titled "Reproducible and large-scale cancer genome sequencing analysis using Genomon2" at the Microsoft session at JSBi 2016

Transcript

  1. Genomon2 on Microsoft Azure
    Yuichi Shiraishi, Kenichi Chiba
    DNA Information Analysis, Human Genome Center, Institute of Medical Science, The University of Tokyo
    © 2011 Microsoft Corporation. All Rights Reserved.
  2. The conventional sequencing analysis model
    Standard Model of Computational Analysis
    Diagram labels: University (x2); Local Data; Locally Developed Software; Publicly Available Software; Local storage and compute resources; Network Download; Public Data
    Source: https://www.genome.gov/multimedia/slides/tcga4/23_davidsen.pdf
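  3. Problems with the conventional analysis model
    • Large-scale analysis of public data
      – TCGA data totals 2.5 PB (as of May 2015)
        • The RNA-seq BAM files alone come to about 70 TB
      – Just downloading the data is a major undertaking...
      – Building mirror sites is difficult, both technically and ethically.
        • Each research group has to apply for access to TCGA data (an approval cannot be reused by others).
        • Is negotiation with TCGA required??
      – Only large laboratories can carry out large-scale analyses...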
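To put the data volumes on slide 3 in perspective, the following is a minimal sketch of the download time they imply. The sizes (70 TB and 2.5 PB) come from the slide; the sustained bandwidth of 100 MB/s is a hypothetical value chosen only for illustration, and decimal units (1 TB = 10^12 bytes) are assumed.

```python
# Rough download-time estimate for the data volumes quoted on slide 3.
# ASSUMPTION: decimal units and a hypothetical sustained bandwidth;
# neither is stated in the slides.

TB = 10**12
PB = 10**15
SECONDS_PER_DAY = 86_400

ASSUMED_BANDWIDTH = 100 * 10**6  # 100 MB/s, hypothetical


def download_days(size_bytes: float, bandwidth_bytes_per_sec: float) -> float:
    """Return the number of days needed to move size_bytes at a constant rate."""
    return size_bytes / bandwidth_bytes_per_sec / SECONDS_PER_DAY


rna_seq_days = download_days(70 * TB, ASSUMED_BANDWIDTH)   # RNA-seq BAM files
tcga_days = download_days(2.5 * PB, ASSUMED_BANDWIDTH)     # all TCGA data

print(f"RNA-seq BAMs (70 TB): {rna_seq_days:.1f} days")
print(f"All TCGA data (2.5 PB): {tcga_days:.1f} days")
```

Under these assumptions the RNA-seq BAM files alone take about eight days of uninterrupted transfer, and the full data set most of a year, which is the practical obstacle the slide describes.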
  4. The cloud-based analysis model
    Co-located Compute & Data
    Diagram labels: API; Data Access; Security; Resource Access; Core Data (TCGA); User Data; Computational Capacity; Standard tools; User uploaded tools
    Source: https://www.genome.gov/multimedia/slides/tcga4/23_davidsen.pdf
    Data no longer has to be downloaded, and anyone can access large-scale genome data!
  5. Democratize Cancer Genomics!
    • NCI cloud pilot
      – Model cases are being developed at three research institutions
      – So that no monopoly arises...
    The goals of the NCI Cloud Pilots are to democratize access to NCI-generated genomic and related data, and to create a cost-effective way to provide scalable computational capacity to the cancer research community.
    Institute for Systems Biology (www.isb-cgc.org): The Institute for Systems Biology (ISB) Cloud provides interactive and programmatic access to data, leveraging many aspects of the Google Cloud Platform. The interactive ISB-CGC web-app allows scientists to interactively define and compare cohorts, examine underlying molecular data for specific genes or pathways of interest, and share insights with collaborators. For computational users, programmatic interfaces and GCP tools such as BigQuery, Genomics, and Compute Engine allow users to perform complex queries from R or Python scripts, or run Dockerized workflows on sequence data available in cloud storage.
    Seven Bridges Genomics (www.cancergenomicscloud.org)
    Broad Institute (www.firecloud.org)
  6. Papers on cloud-based analysis
    • STAR + RSEM & kallisto
      • $1.3 per sample
    • kallisto
      • $0.09 per sample
    http://biorxiv.org/content/early/2016/07/07/062497
    http://biorxiv.org/content/early/2016/07/12/063552
  7. EDITORIAL: Peer review in the cloud
    The migration of cancer genomics data to cloud computing is a great encouragement for data reuse and integration by bioinformaticians and other data symbionts. Because the cloud allows rapid, transparent and reproducible research on large data sets, we are keen to consider articles and analyses submitted to the journal that provide peer referee access to their constituent cloud projects.
    Recent funding initiatives to improve cancer diagnosis and treatment have been likened to a ‘moonshot’ (Nat. Biotechnol. 34, 119, 2016). Although we do not think that the metaphor of a single engineering feat to achieve a defined goal is entirely appropriate to the aim of controlling cancers, the cloud computing infrastructure for the upcoming Genomic Data Commons (https://gdc.nci.nih.gov/index.html) and the three recently launched cancer cloud pilots (https://cbiit.nci.nih.gov/ncip/nci-cancer-genomics-cloud-pilots) is very much equivalent to building Mission Control to coordinate multifaceted and coherent programs. Not only does a cloud commons give broad access to petabyte data sets, which are beyond the capacity of many research institutes to even download, [...]
    We too have our wish, to enable peer review in the cloud, as we see enormous potential for cloud commons research to improve the precision, transparency and reproducibility of research publications that provide periodic key results from and updated guides to the continuous knowledge production within the data commons. The publications also provide incentive and credit within the wider scientific community, above and beyond the reputation researchers can gain for coding and data deposition within their own commons. In the interest of refining the idea of a publishable unit and using expert review judiciously, some new peer refereeing conventions, tools and cloud pilots are therefore a priority. Unlike supplementary data summaries and disparate data resources, [...]
    Review in the cloud era
    • In a cloud environment, the analysis behind a paper can be fully packaged together with its hardware, OS, software, and version dependencies. Reviewers can then verify reproducibility themselves.
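  8. Cloud and sequencing
    • Once cloud environments that host sequencing data are in place, anyone can access large-scale data.
      – NCI cloud pilot
    • Hardware, OS, and so on can be specified freely, and software version management becomes easier.
      – Making analysis results reproducible becomes comparatively easy
    • Optimizing a program translates directly into lower costs.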
  9. We measured file transfer speed from the supercomputer at the Human Genome Center (HGC), Institute of Medical Science, The University of Tokyo, to Azure.
    Command used:
    rsync -avc --progress --partial --append {source} {destination}
    Transfer speed of FASTQ (.gz) files
    Sample     File        File size  Transfer speed  Transfer time
    WM1799     1.fastq.gz  4.8G       85.03MB/s       0:00:57
    WM1799     2.fastq.gz  5.0G       85.63MB/s       0:00:59
    ZR_75_30   1.fastq.gz  3.8G       80.22MB/s       0:00:48
    ZR_75_30   2.fastq.gz  3.8G       87.81MB/s       0:00:43
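As a consistency check on the slide 9 table, the sketch below recomputes each transfer time as file size divided by transfer speed and compares it with the reported time. The figures are copied from the slide; treating "G" as 1000 MB is an assumption, since the slide does not say whether the units are decimal or binary.

```python
# Cross-check of the slide 9 transfer table: time should be roughly size / speed.
# ASSUMPTION: 1 G is treated as 1000 MB; the slide does not specify the unit base.

# (sample, file, size in G, speed in MB/s, reported time "H:MM:SS")
TRANSFERS = [
    ("WM1799", "1.fastq.gz", 4.8, 85.03, "0:00:57"),
    ("WM1799", "2.fastq.gz", 5.0, 85.63, "0:00:59"),
    ("ZR_75_30", "1.fastq.gz", 3.8, 80.22, "0:00:48"),
    ("ZR_75_30", "2.fastq.gz", 3.8, 87.81, "0:00:43"),
]


def to_seconds(hms: str) -> int:
    """Convert an H:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def estimated_seconds(size_g: float, speed_mb_s: float) -> float:
    """Estimate the transfer time from the rounded size and the average speed."""
    return size_g * 1000 / speed_mb_s


mean_speed = sum(row[3] for row in TRANSFERS) / len(TRANSFERS)

for sample, name, size_g, speed, reported in TRANSFERS:
    estimate = estimated_seconds(size_g, speed)
    print(f"{sample} {name}: estimated {estimate:.1f} s, reported {to_seconds(reported)} s")
print(f"Mean transfer speed: {mean_speed:.2f} MB/s")
```

The estimates agree with the reported times to within about a second, which is what one would expect given that the file sizes are rounded to one decimal place.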
  10. FASTQ files used
    Paired-end reads
    read length: 76
    total_reads: 153,266,818 (x2)
    file size: 15GB (x2)
    Alignment tool used
    BWA mem, Version: 0.7.8-r455
    BWA mem run time
    Configuration                   Real Time      CPU Time
    HGC Shirokane3 (supercomputer)  18076.274 sec  18061.225 sec
    Azure D13v2                     19414.886 sec  18803.459 sec
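The run times on slide 10 are easier to compare as a relative difference. The sketch below computes how much slower the Azure D13v2 run was than the HGC Shirokane3 run, using only the four timings from the slide; the variable names are illustrative.

```python
# Relative BWA mem run time, Azure D13v2 versus HGC Shirokane3 (figures from slide 10).

RUN_TIMES_SEC = {
    "HGC Shirokane3": {"real": 18076.274, "cpu": 18061.225},
    "Azure D13v2": {"real": 19414.886, "cpu": 18803.459},
}


def slowdown_percent(baseline_sec: float, other_sec: float) -> float:
    """Return how much longer other_sec is than baseline_sec, in percent."""
    return (other_sec / baseline_sec - 1.0) * 100.0


hgc = RUN_TIMES_SEC["HGC Shirokane3"]
azure = RUN_TIMES_SEC["Azure D13v2"]

real_slowdown = slowdown_percent(hgc["real"], azure["real"])
cpu_slowdown = slowdown_percent(hgc["cpu"], azure["cpu"])

for name, times in RUN_TIMES_SEC.items():
    print(f"{name}: real {times['real'] / 3600:.2f} h, CPU {times['cpu'] / 3600:.2f} h")
print(f"Azure real time is {real_slowdown:.1f}% longer; CPU time is {cpu_slowdown:.1f}% longer")
```

On these numbers the Azure run took roughly 7% longer in wall-clock time and roughly 4% longer in CPU time, so the two environments are broadly comparable for this single alignment job.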
  11. Running Genomon2 RNA on Azure
    • Fusion genes were detected from RNA-seq data of 774 cell line samples from the Cancer Cell Line Encyclopedia (CCLE).
    • STAR + fusionfusion (https://github.com/Genomon-Project/fusionfusion)
  12. Azure Virtual Machines created
    Intel Xeon® E5-2673 v3, 2.4GHz (16 CPU cores, 112GB) X 13
    Intel Xeon® E5-2673 v3, 2.4GHz
      (8 CPU cores, 28GB) X 9: OSS, MDS
      (16 CPU cores, 112GB) X 1: MGS
    Diagram labels: Local PC; Internet; Torque Master; Torque Worker; Lustre Clients; Lustre Servers (MGS, MDS, OSS)
    A configuration template was used for the Lustre servers, clients, and so on:
    https://azure.microsoft.com/en-us/marketplace/partners/intel/lustre-cloud-edition-evaleval-lustre-2-7
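The aggregate size of the cluster on slide 12 follows directly from the VM counts. The sketch below simply sums cores and memory across the three VM groups listed on the slide; the group labels in the code are illustrative, and the role of the 13-node group is not stated in the transcript.

```python
# Aggregate resources of the Azure cluster listed on slide 12.
# Group labels are illustrative; the transcript does not label the 13-node group.

from typing import NamedTuple


class VmGroup(NamedTuple):
    label: str
    count: int
    cores: int       # CPU cores per VM
    memory_gb: int   # memory per VM in GB


VM_GROUPS = [
    VmGroup("16-core nodes (role not labeled in transcript)", 13, 16, 112),
    VmGroup("Lustre OSS/MDS", 9, 8, 28),
    VmGroup("Lustre MGS", 1, 16, 112),
]

total_vms = sum(group.count for group in VM_GROUPS)
total_cores = sum(group.count * group.cores for group in VM_GROUPS)
total_memory_gb = sum(group.count * group.memory_gb for group in VM_GROUPS)

print(f"{total_vms} VMs, {total_cores} CPU cores, {total_memory_gb} GB memory")
```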
  13. So what about the price?
    Analysis time: 6/25 0:10 to 6/26 13:54
    Billing period: 6/25 0:00 to 6/26 23:59
    Resource                   Charge
    Data Transfer (outbound)   ¥3.4
    IP address hours           ¥31.8
    Standard IO                ¥7,702.8
    Storage transactions       ¥443.4
    Compute hours              ¥163,577.2
    Total                      ¥171,758.6
    Number of samples run      749
    Price per sample           ¥229.32
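The total and the per-sample price on slide 13 can be reproduced from the individual charges. The sketch below does this with Python's decimal module to avoid floating-point rounding on currency values; all amounts and the sample count are taken from the slide, while the dictionary keys are English renderings of the slide's labels.

```python
# Reproduce the total and per-sample cost from the line items on slide 13.

from decimal import Decimal

CHARGES_YEN = {
    "Data Transfer (outbound)": Decimal("3.4"),
    "IP address hours": Decimal("31.8"),
    "Standard IO": Decimal("7702.8"),
    "Storage transactions": Decimal("443.4"),
    "Compute hours": Decimal("163577.2"),
}
SAMPLES_RUN = 749

total_yen = sum(CHARGES_YEN.values(), Decimal("0"))
per_sample_yen = (total_yen / SAMPLES_RUN).quantize(Decimal("0.01"))
compute_share_pct = (CHARGES_YEN["Compute hours"] / total_yen * 100).quantize(Decimal("0.1"))

print(f"Total: ¥{total_yen}")
print(f"Per sample: ¥{per_sample_yen}")
print(f"Compute share of total: {compute_share_pct}%")
```

The line items sum to the stated total, and compute hours account for about 95% of it, which fits the earlier point that optimizing a program feeds directly into cost.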