Development of a Genome Analysis Pipeline Based on the ETL Approach

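Development of a Cancer Genome Analysis Pipeline Based on the Extraction Transformation Load (ETL) Approach
Yuichi Shiraishi, Genome Analysis Section, Center for Cancer Genomics and Advanced Therapeutics, National Cancer Center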

Self-introduction
I spent seven years developing a cancer genome analysis platform at the Institute of Medical Science, The University of Tokyo. On June 1, I became head of the Genome Analysis Section at the newly established Center for Cancer Genomics and Advanced Therapeutics at the National Cancer Center.
[Slide images: pages from Kataoka et al., Nature, 2016 and Yoshida et al., Nature, 2011]

Outline of this talk
1.  Large-scale cancer genome data analysis
2.  About Genomon2 (non-cloud based)
3.  About the On Demand Extraction Transformation Load approach
4.  About Genomon3? (cloud based)

LARGE-SCALE CANCER GENOME DATA ANALYSIS

Pan-cancer studies
Pan-Cancer Atlas
•  Data from 11,000 tumors across 33 cancer types
•  Focused on exome data
•  >27 papers
ICGC PCAWG (Pan-Cancer Analysis of Whole Genomes)
•  Data from >2,600 tumors across 39 cancer types
•  Focused on whole genome
•  Ongoing

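How cancer evades the immune system
•  Immune evasion through PD-1/PD-L1 (activation of PD-L1 in cancer cells)
•  Anti-PD-1 antibodies have revolutionized cancer treatment
•  Why PD-L1 is activated in cancer cells was not understood
Daisuke Sugiyama and Hiroyoshi Nishikawa, Ryoiki Yugo Review (領域融合レビュー), 4, e005 (2015)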

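Frequent SVs in the 3′-UTR of PD-L1 in ATL
•  Whole-genome analysis of 49 adult T-cell leukemia (ATL) samples (Kataoka et al., Nature Genetics, 2015) detected SVs in the 3′-UTR region of PD-L1 in multiple samples
•  Detected in about 27% of patients
•  The position of the partner breakpoint of each SV varied
•  What is ATL?
–  A leukemia/malignant lymphoma caused by HTLV-1 virus infection
–  In Japan it is common in Okinawa and southern Kyushu; overseas it is common in Central and South American countries
Kataoka, Shiraishi, Takeda et al., Nature, 2016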

Mechanism of PD-L1 activation by SVs in the 3′-UTR
Keisuke Kataoka, Life Science First Author's review (ライフサイエンス 新着論文レビュー), DOI: 10.7875/first.author.2016.050

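Mass analysis of TCGA data
•  Analysis of ~10,000 TCGA samples (on the HGC supercomputer)
•  We built a framework that runs the following steps seamlessly:
1.  Download the samples
2.  Check the samples (sample QC; single-end or paired-end?)
3.  Create the sample sheet
4.  Run with (a variant of) Genomon2 RNA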
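Steps 2 and 3 above (checking single-end vs. paired-end and writing the sample sheet) could look roughly like the following. This is a minimal Python sketch, not the framework used for the TCGA analysis: the file naming convention (`SAMPLE_1.fastq` / `SAMPLE_2.fastq` for paired-end, `SAMPLE.fastq` for single-end) and all function names are assumptions made for illustration. The `[fastq]` section layout follows the sample.csv example on the "Genomon pipeline cloud" slide.

```python
import re
from collections import defaultdict

# Assumed naming: SAMPLE_1.fastq / SAMPLE_2.fastq (paired) or SAMPLE.fastq (single)
FASTQ_NAME = re.compile(r"(?P<sample>.+?)(?:_(?P<mate>[12]))?\.(?:fastq|fq)(?:\.gz)?$")


def group_fastqs(paths):
    """Group FASTQ paths by sample name, with mates in order (1 before 2)."""
    samples = defaultdict(list)
    for path in paths:
        match = FASTQ_NAME.match(path.rsplit("/", 1)[-1])
        if match is None:
            continue  # not a FASTQ file, skip it
        samples[match.group("sample")].append((match.group("mate") or "1", path))
    return {name: [p for _, p in sorted(mates)] for name, mates in samples.items()}


def layout(fastqs):
    """Classify a sample as single-end or paired-end from its number of FASTQ files."""
    return {1: "single", 2: "paired"}.get(len(fastqs), "unknown")


def sample_sheet(paths):
    """Write a [fastq] section with one line per sample; skip unclear samples."""
    lines = ["[fastq]"]
    for name, fastqs in sorted(group_fastqs(paths).items()):
        if layout(fastqs) != "unknown":
            lines.append(",".join([name] + fastqs))
    return "\n".join(lines)
```

Samples whose layout cannot be determined are skipped rather than guessed, so they can be reviewed by hand before the pipeline is launched.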

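TCGA screening results
•  Samples with extremely high PD-L1 expression almost always carry an SV.
•  This suggests that SVs are an important mechanism of PD-L1 overexpression.
•  Besides SVs, insertions of HPV and EB virus were also observed.
•  Frequency
–  B-cell lymphoma: 8%; stomach cancer: 2%
Kataoka, Shiraishi, Takeda et al., Nature, 2016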

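Comprehensive detection of splicing mutations
•  The 2 bp at each end of an intron (GT – AG) are critical for the regulation of splicing.
–  In hereditary diseases, mutations outside GT-AG are also known to be important.
•  We developed a statistical method to detect splicing mutations comprehensively.
•  We applied the method to cancer genome and transcriptome sequencing data from ~9,000 samples.
Shiraishi et al., bioRxiv, 2017; Genome Research, accepted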

Substitution patterns of splicing mutations
•  Splicing donor motif
–  Besides the essential splice site (GT), mutations concentrate at the last base of the exon and at the fifth base of the intron.
•  Splicing acceptor motif
–  Most mutations concentrate at the essential splice site (AG).
Shiraishi et al., bioRxiv, 2017; Genome Research, accepted
[Figure: sequence motifs around the exon-intron-exon boundaries; panels: Donor disruption, Acceptor disruption]
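To make the positions named above concrete, here is a small Python sketch that labels a mutation by its distance from an exon-intron junction. It is only an illustration of the coordinate bookkeeping, not the statistical method from the paper; the offset convention and function names are assumptions.

```python
def has_canonical_ends(intron_seq):
    """True if the intron starts with GT and ends with AG (the essential 2 bp at each end)."""
    seq = intron_seq.upper()
    return len(seq) >= 4 and seq.startswith("GT") and seq.endswith("AG")


def classify_splice_site_mutation(site, offset):
    """Label a mutation by its position relative to an exon-intron junction.

    Assumed convention: positive offsets count bases into the intron from the
    junction, negative offsets count bases into the exon, and 0 is not used.
      donor:    -1 = last exonic base, +1/+2 = GT, +5 = fifth intronic base
      acceptor: +1/+2 = the AG at the end of the intron
    """
    if offset == 0:
        raise ValueError("offset 0 is not used in this convention")
    if site == "donor":
        if offset in (1, 2):
            return "essential splice site (GT)"
        if offset == -1:
            return "last exonic base"
        if offset == 5:
            return "fifth intronic base"
        return "other"
    if site == "acceptor":
        return "essential splice site (AG)" if offset in (1, 2) else "other"
    raise ValueError(f"unknown site: {site!r}")
```

With this labeling, the donor-side hotspots described on the slide are the categories other than "other".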

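Genes with frequent splicing mutations
•  Genes with frequent splicing mutations are concentrated among cancer genes.
•  Mutations outside GT-AG were identified in many cancer genes.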

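About pan-cancer analysis
•  Analysis across cancer types
–  tells us how general a phenomenon found in one cancer type is.
–  leads to the discovery of new phenomena.
–  raises the signal-to-noise ratio by handling many samples, which leads to the discovery of important phenomena and mutations.
•  Because so much data is handled, building the computing environment becomes important.
–  Downloading the data and so on is hard work...
–  As much automation as possible is needed.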

The conventional sequence analysis model
[Figure: Standard Model of Computational Analysis. Labels: Public Data, Network Download, Local Data, Local storage and compute resources, Locally Developed Software, Publicly Available Software]
https://www.genome.gov/multimedia/slides/tcga4/23_davidsen.pdf

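Problems with the conventional analysis model
•  Large-scale analysis of public data
–  TCGA data totals 2.5 PB (as of May 2015)
   •  The RNA-seq BAM files alone are about 70 TB
–  Just downloading the data is hard...
–  Building a mirror site is technically and ethically difficult.
   •  Each research group must apply for access to TCGA data separately (access cannot be shared).
   •  Negotiation with TCGA would be needed??
–  Only large laboratories can do large-scale analysis...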
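A back-of-the-envelope calculation shows why downloading is the bottleneck. The sketch below assumes a sustained 1 Gbit/s link and decimal units (1 TB = 10^12 bytes); both are assumptions for illustration, and real throughput is usually lower.

```python
TB = 10**12           # bytes, decimal units (assumption)
PB = 10**15
GBIT_PER_S = 10**9    # assumed sustained bandwidth of 1 Gbit/s


def transfer_days(size_bytes, bandwidth_bits_per_s):
    """Days needed to move size_bytes over a link of the given bandwidth."""
    seconds = size_bytes * 8 / bandwidth_bits_per_s
    return seconds / 86400


rna_seq_days = transfer_days(70 * TB, GBIT_PER_S)    # RNA-seq BAM files only
all_tcga_days = transfer_days(2.5 * PB, GBIT_PER_S)  # all TCGA data (May 2015)
print(f"RNA-seq BAMs: {rna_seq_days:.1f} days, all TCGA: {all_tcga_days:.0f} days")
```

Under these assumptions the RNA-seq BAM files alone take about 6.5 days of uninterrupted transfer, and the full 2.5 PB takes about 231 days.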

The cloud-based analysis model
[Figure: Co-located Compute & Data. Labels: Core Data (TCGA), User Data, Computational Capacity, Standard tools, User uploaded tools, API, Data Access Security, Resource Access]
https://www.genome.gov/multimedia/slides/tcga4/23_davidsen.pdf
Data no longer has to be downloaded, and anyone can access large-scale genome data!

Democratize Cancer Genomics!
•  NCI cloud pilot
–  Model cases developed by three institutions
–  So that no monopoly arises...
The goals of the NCI Cloud Pilots are to democratize access to NCI-generated genomic and related data, and to create a cost-effective way to provide scalable computational capacity to the cancer research community.
•  Institute for Systems Biology (www.isb-cgc.org): The Institute for Systems Biology (ISB) Cloud provides interactive and programmatic access to data, leveraging many aspects of the Google Cloud Platform. The interactive ISB-CGC web-app allows scientists to interactively define and compare cohorts, examine underlying molecular data for specific genes or pathways of interest, and share insights with collaborators. For computational users, programmatic interfaces and GCP tools such as BigQuery, Genomics, and Compute Engine allow users to perform complex queries from R or Python scripts, or run Dockerized workflows on sequence data available in cloud storage.
•  Seven Bridges Genomics (www.cancergenomicscloud.org)
•  Broad Institute (www.firecloud.org)

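“bring the analysis to the data”
•  Downloading massive sequence data and analyzing it locally is becoming impossible.
•  In the GA4GH discussions, data providers are expected not only to host the data but also to provide an "environment" for analyzing it.
•  It should be an IaaS-type cloud environment in which analysts can run their own workflows.
•  Data security must be kept "safe".
•  It should provide a system through which data users can easily be billed.
Data Biosphere, by Benedict Paten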

ABOUT GENOMON2

Genomon-exome (2012)

Results from the Seishi Ogawa laboratory, Kyoto University
Nature 2, Science 1, N Engl J Med 1, Nature Genetics 8, Blood 6, Nat Commun 2, and many others
[Slide image: first page of Kataoka et al., "Aberrant PD-L1 expression through 3′-UTR disruption in multiple cancers", Nature, doi:10.1038/nature18294]

Genomon: The Zen of Cancer Genome Analysis

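About Genomon2
•  Genomon DNA
–  Supports WGS, WES and targeted sequencing
–  Detects SNVs, indels and SVs
   •  FLT3-ITD can also be detected
•  Genomon RNA
–  Fusion gene detection
–  Expression quantification
•  Automatic generation of interactive reports
•  Jointly developed by the Satoru Miyano laboratory (Institute of Medical Science, The University of Tokyo) and the Seishi Ogawa laboratory (Kyoto University Graduate School of Medicine)
•  Many users in Japan
–  Kyoto University Graduate School of Medicine
–  Institute of Medical Science, The University of Tokyo
–  Department of Pediatrics, The University of Tokyo
–  Kyushu University Beppu Hospital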

Displaying analysis results with paplot
Okada et al., JOSS, 2017, https://github.com/Genomon-Project/paplot

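Lessons from Genomon2
•  Choice of job management system (ruffus)
–  Many bugs.
   •  For now, with the version pinned (2.6.3) and some manual steps added, it runs stably.
–  Development is not active.
–  High memory usage.
•  Dependence on Univa Grid Engine
–  We used DRMAA to try to make the analysis workflow easy to port.
   •  Even so, the UGE-specific code could not be removed completely.
•  Dependence on many libraries and software packages
–  Users have to set all of them up themselves.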

Genomon's dependencies
•  Python (2.7.10)
•  Perl (5.14.4)
•  R (3.3.1)
•  bwa (0.7.8)
•  blat (v34)
•  samtools (1.2)
•  Biobambam (0.0.191)
•  PCAP-core (20150511)
•  htslib (1.3)
•  bedtools (2.24.0)
•  GenomonPipeline (2.5.3)
•  GenomonSV (0.4.2rc)
•  GenomonFisher (0.2.0)
•  GenomonMutationFilter (0.2.1)
•  EBFilter (0.2.1)
•  GenomonPostAnalysis (1.4.0)
•  GenomonQC (2.0.1)
•  GenomonExpression (0.3.0)
•  fusionfusion (0.3.0)
•  paplot (0.5.5)
•  sv_utils (0.4.0b2)
•  annot_utils (0.1.0)
•  fusion_utils (0.2.0)
Plus a huge number of base libraries and the OS itself. All of these have to be set up when porting!

Genomon 3.0?

ABOUT THE ON DEMAND EXTRACTION TRANSFORMATION LOAD APPROACH

On Demand Extraction Transformation Load (ETL) approach
[Figure: three VMs in the virtual machine area; sequence data 1-3 and analytical results 1-3 in the storage area]
1.  A virtual machine (VM) starts up.
2.  The input data is transferred from storage to the VM.
3.  The analysis runs in a Docker container on the VM.
4.  The analysis results are transferred from the VM to storage.
5.  The VM is removed.
Batch services
•  dsub (Google Cloud Platform)
•  Amazon AWS Batch
•  Azure Batch
Storage
•  Amazon AWS S3
•  Google Cloud Storage
•  Microsoft Azure Storage
Key points
•  Every job starts from storage and ends in storage.
•  The VM is removed when the job finishes.
•  Docker is used.
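The five steps can be mimicked locally to make the lifecycle concrete. The following Python sketch is a toy model under stated assumptions: a plain directory stands in for cloud storage, a temporary directory stands in for the VM, and `run_etl_job` and `to_upper` are hypothetical names. It is not how dsub or AWS Batch are implemented.

```python
import shutil
import tempfile
from pathlib import Path


def run_etl_job(storage, input_name, output_name, transform):
    """Run one job that starts from storage and ends in storage."""
    workdir = Path(tempfile.mkdtemp(prefix="vm-"))      # 1. the "VM" starts up
    try:
        local_in = workdir / input_name
        shutil.copy(storage / input_name, local_in)      # 2. storage -> VM
        local_out = workdir / output_name
        transform(local_in, local_out)                   # 3. analysis on the VM
        shutil.copy(local_out, storage / output_name)    # 4. VM -> storage
    finally:
        shutil.rmtree(workdir)                           # 5. the "VM" is removed


def to_upper(src, dst):
    """Toy stand-in for the analysis step: upper-case the input file."""
    dst.write_text(src.read_text().upper())
```

The `finally` clause reflects the second key point: the working area is discarded whether or not the analysis succeeds, so nothing of value may live only on the VM.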

dsub (https://github.com/DataBiosphere/dsub)
    $ dsub \
        --script star-alignment.sh \
        --image friend1ws/star-alignment \
        --tasks cellline.tsv \
        --disk-size 200 --min-cores 6 --min-ram 36 \
        --project genomondevel1 \
        --zones asia-east1-a
•  --script: the script to run
•  --image: the Docker image
•  --tasks: the arguments for each run of the shell script (storage paths of the input and output files, etc.)
•  --disk-size, --min-cores, --min-ram: the instance spec
•  --project: the project name
•  --zones: the instance region
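When such a command is launched from a program rather than typed by hand, it helps to build it as an argument list. The Python sketch below only assembles and prints the command from the slide; the function name and parameter names are hypothetical, and only the flags shown on the slide are used.

```python
import shlex


def build_dsub_command(script, image, tasks, project, zones,
                       disk_size, min_cores, min_ram):
    """Assemble the dsub invocation from the slide as an argument list."""
    return [
        "dsub",
        "--script", script,
        "--image", image,
        "--tasks", tasks,
        "--disk-size", str(disk_size),
        "--min-cores", str(min_cores),
        "--min-ram", str(min_ram),
        "--project", project,
        "--zones", zones,
    ]


cmd = build_dsub_command(
    script="star-alignment.sh",
    image="friend1ws/star-alignment",
    tasks="cellline.tsv",
    project="genomondevel1",
    zones="asia-east1-a",
    disk_size=200, min_cores=6, min_ram=36,
)
print(shlex.join(cmd))  # show the command without running it
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would submit the job, provided dsub is installed and the Google Cloud project is configured.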

Files required to run On Demand ETL
Shell script
    /usr/local/bin/STAR --genomeDir ${REFERENCE} --readFilesIn ${FASTQ1} ${FASTQ2} \
        --outFileNamePrefix ${OUTPUT_PREF}.
    /usr/local/bin/samtools sort -T ${OUTPUT_PREF}.Aligned.sortedByCoord.out \
        ${OUTPUT_PREF}.Aligned.out.bam -O bam > \
        ${OUTPUT_PREF}.Aligned.sortedByCoord.out.bam
    /usr/local/bin/samtools index ${OUTPUT_PREF}.Aligned.sortedByCoord.out.bam
Docker image
    FROM ubuntu:16.04
    RUN apt-get update && apt-get install -y wget bzip2 make gcc zlib1g-dev
    RUN wget https://github.com/alexdobin/STAR/archive/2.5.3a.tar.gz && \
        tar xzvf 2.5.3a.tar.gz && \
        mv STAR-2.5.3a/bin/Linux_x86_64_static/STAR /usr/local/bin
    . . .
Task file
    --env SAMPLE    --input FASTQ1            --input FASTQ2            --output-recursive OUTPUT_DIR    --input REFERENCE
    MCF-7           gs://input/MCF-7/1.fq     gs://input/MCF-7/2.fq     gs://output/MCF-7                gs://star_ref
    K-562           gs://input/K-562/1.fq     gs://input/K-562/2.fq     gs://output/K-562                gs://star_ref
    He-la           gs://input/He-la/1.fq     gs://input/He-la/2.fq     gs://output/He-la                gs://star_ref
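A task file like the one above is easy to generate from a list of sample names. This is a minimal Python sketch: the header columns and path layout are copied from the slide, while the function name and the assumption that every sample follows the same `1.fq` / `2.fq` layout are mine.

```python
import csv
import io

HEADER = [
    "--env SAMPLE",
    "--input FASTQ1",
    "--input FASTQ2",
    "--output-recursive OUTPUT_DIR",
    "--input REFERENCE",
]


def make_tasks_tsv(samples, input_root, output_root, reference):
    """Return a tab-separated task file with one row per sample."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)
    for sample in samples:
        writer.writerow([
            sample,
            f"{input_root}/{sample}/1.fq",
            f"{input_root}/{sample}/2.fq",
            f"{output_root}/{sample}",
            reference,
        ])
    return buf.getvalue()


tasks_tsv = make_tasks_tsv(
    ["MCF-7", "K-562", "He-la"], "gs://input", "gs://output", "gs://star_ref"
)
```

Each row becomes one independent job, and each header column names the environment variable that the shell script reads.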

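Advantages of On Demand ETL
•  The VM shuts down as soon as the job finishes, which reduces cost.
•  A batch job can be defined by a shell script + a Docker image.
–  The Docker image encapsulates the environment, making the processing reproducible.
–  The shell script may eventually be replaced by CWL (Common Workflow Language).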

Packages that implement On Demand ETL
•  ecsub (by Ai Okada)
–  https://github.com/aokad/ecsub
–  Uses Amazon ECS
   •  We actually wanted to use AWS Batch...
•  azurebatchmon (by Kenichi Chiba)
–  https://github.com/Genomon-Project/azurebatchmon
–  Uses Microsoft Azure Batch.
•  awsub (by Hiromu Ochiai)
–  https://github.com/otiai10/awsub
–  Based on docker-machine
–  Implements Extended ETL (use of shared instances)
All of them are almost compatible with dsub's batch job definition!

ABOUT GENOMON3

Successive ETL as a Pipeline (SEaaP)
[Figure: in the storage area, fastq 1-3 are turned into bam 1-3 and then into vcf 1-3; each step is run by its own group of VMs in the virtual machine area, reading its input from storage and writing its output back to storage]
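The chaining in the figure can be sketched as a plan in which the output paths of one step are the input paths of the next. The Python below is illustrative only: the step names, the bucket layout and the function name are assumptions, not the pipeline's actual naming.

```python
# Each step: (name, input file type, output file type); names are hypothetical.
STEPS = [
    ("alignment", "fastq", "bam"),
    ("mutation_call", "bam", "vcf"),
]


def plan_successive_etl(samples, bucket):
    """Return, per step, one task per sample with its storage input and output."""
    plan = []
    for step, in_ext, out_ext in STEPS:
        tasks = [
            {
                "sample": sample,
                "input": f"{bucket}/{in_ext}/{sample}.{in_ext}",
                "output": f"{bucket}/{out_ext}/{sample}.{out_ext}",
            }
            for sample in samples
        ]
        plan.append((step, tasks))
    return plan


plan = plan_successive_etl(["sample1", "sample2", "sample3"], "gs://bucket")
```

Because every intermediate file lands in storage, a failed step can be rerun without repeating the steps before it.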

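Features of genomon_pipeline_cloud (tentative name)
•  Runs ETL jobs successively.
–  The task file for each step's On Demand ETL job is generated dynamically and executed.
–  Branching is handled with multiprocessing.
–  The compute environment of every step is fully encapsulated in a Docker image, so runs are fully reproducible.
–  The VM spec can be specified for each step.
   •  This reduces cost.
•  Supports four ETL engines (dsub, ecsub, azurebatchmon, awsub).
–  Implemented with an abstract class; engines are switched with the --engine option.
–  AWS, Google Cloud Platform and Microsoft Azure can all be used.
•  Very easy to prepare and install.
–  genomon_pipeline_cloud itself
–  An ETL engine package (one of dsub, ecsub, azurebatchmon, awsub)
–  Plus a little setup on the cloud side
•  Porting of the Genomon2 modules is almost complete. (https://github.com/Genomon-Project/genomon_pipeline_cloud)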
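The abstract-class design with an `--engine` switch might look like the following Python sketch. It is not the code of genomon_pipeline_cloud: the class names are hypothetical, only the dsub flags shown earlier in the talk are used, and `EchoEngine` is a made-up stand-in backend so that the example does not guess at the command lines of the other engines.

```python
import argparse
from abc import ABC, abstractmethod


class ETLEngine(ABC):
    """Common interface: turn one batch job definition into a command line."""

    @abstractmethod
    def build_command(self, script, image, tasks):
        ...


class DsubEngine(ETLEngine):
    def build_command(self, script, image, tasks):
        return ["dsub", "--script", script, "--image", image, "--tasks", tasks]


class EchoEngine(ETLEngine):
    """Hypothetical stand-in backend, used only to demonstrate the switch."""

    def build_command(self, script, image, tasks):
        return ["echo", script, image, tasks]


ENGINES = {"dsub": DsubEngine, "echo": EchoEngine}


def parse_engine(argv):
    """Pick the engine class named by --engine and instantiate it."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--engine", choices=sorted(ENGINES), default="dsub")
    args = parser.parse_args(argv)
    return ENGINES[args.engine]()
```

The pipeline code only calls `build_command`, so supporting another cloud means adding one subclass and one entry in `ENGINES`.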

Genomon pipeline cloud
Command
    $ genomon_pipeline_cloud \
        --engine dsub \
        sample.csv \
        output_bucket \
        param.cfg
•  --engine: dsub, awsub, ecsub or azmon
sample.csv: storage paths of the input sequence data
    [fastq]
    K562,s3://input/K562/1.fastq, s3://input/K562/2.fastq
    MCF7,s3://input/MCF7/1.fastq, s3://input/MCF7/2.fastq
    [fusion]
    K562,None
    MCF7,None
param.cfg: parameters of each module
    [general]
    instance_option = --project genomondevel1 --zones asia-east1-a
    [star_alignment]
    resource = --disk-size 200 --min-cores 6 --min-ram 36
    image = genomon/star_alignment:0.1.0
    star_option = --runThreadN 6 --outSAMstrandField intronMotif
    star_reference = gs://genomon_rna_gce/db/GRCh37.STAR-2.5.2a
    [fusionfusion]
    resource = --disk-size 200 --min-cores 2 --min-ram 8
    image = genomon/fusionfusion:0.1.0
    reference = gs://genomon_rna_gce/db/GRCh37/GRCh37.fa
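Both input files are simple to read. The Python sketch below parses the sectioned sample.csv by hand (its lines contain `:` inside the storage URLs, which the standard `configparser` would treat as a key/value delimiter) and reads param.cfg with `configparser`. The function names are hypothetical and this is not the parser used by genomon_pipeline_cloud.

```python
import configparser


def parse_sample_sheet(text):
    """Parse '[section]' headers followed by 'sample,value,...' lines."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            sections[current] = {}
            continue
        if current is None:
            raise ValueError(f"line outside of any section: {line!r}")
        sample, *values = [field.strip() for field in line.split(",")]
        sections[current][sample] = values
    return sections


def parse_params(text):
    """Read param.cfg; interpolation is disabled so option strings stay literal."""
    config = configparser.ConfigParser(interpolation=None)
    config.read_string(text)
    return config
```

For example, `parse_params(...)["star_alignment"]["resource"].split()` yields the list of resource flags to append to the engine command for that step.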

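Open issues for genomon pipeline cloud (tentative name)
•  Handling of non-matched controls
–  Using many control samples to remove false positives is common in cancer genome analysis.
–  Downloading many BAM files to every instance makes the transfer time a serious problem...
–  Some preprocessing, such as fixing a blacklist in advance, is needed.
•  Use of multiple mutation callers
–  Meta-programs that combine many mutation callers are becoming common.
•  Evaluating the performance of analysis workflows
•  Ethics and security
–  How can data transfer and computation be run "securely"?
•  The name??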
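The blacklist idea can be sketched as follows: variants seen in several control samples are collected once, and each tumor job then downloads only that small list instead of all control BAM files. This Python sketch is an assumption about how such a filter could work; the threshold `min_controls` and the variant representation (chromosome, position, reference, alternative) are illustrative choices.

```python
from collections import Counter


def build_blacklist(control_calls, min_controls=2):
    """Collect variants seen in at least `min_controls` control samples."""
    counts = Counter()
    for calls in control_calls.values():
        counts.update(set(calls))  # count each variant once per control sample
    return {variant for variant, n in counts.items() if n >= min_controls}


def filter_with_blacklist(tumor_calls, blacklist):
    """Drop tumor calls that appear in the precomputed blacklist."""
    return [variant for variant in tumor_calls if variant not in blacklist]
```

The trade-off is that a fixed blacklist is computed from calls rather than from raw reads, so it is less sensitive than inspecting every control BAM directly.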

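Building an ETL-based evaluation platform for genome analysis programs
•  Evaluating genome analysis programs is hard work...
–  Preparing the computing resources is hard.
–  The programs depend on many libraries.
–  Every program writes its output in a different file format.
–  Post-processing filters are needed.
•  Define each analysis workflow as an ETL-based job (shell script + Docker image).
•  A mechanism that automatically runs the ETL jobs on test data, computes the results with an evaluation program and visualizes them.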
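The evaluation program at the end of such a job could compute standard metrics against a truth set. The Python sketch below assumes that calls and truth are given as sets of comparable variant keys after format normalization; the function name and the choice of precision, recall and F1 are assumptions, since the talk does not specify the metrics.

```python
def evaluate_calls(called, truth):
    """Compare called variants with a truth set and return basic metrics."""
    called, truth = set(called), set(truth)
    tp = len(called & truth)
    precision = tp / len(called) if called else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "tp": tp,
        "fp": len(called - truth),
        "fn": len(truth - called),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Because every workflow is wrapped as the same kind of ETL job, the same function can score any caller once its output is converted to the common variant keys.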

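A new cloud-based analysis application (SeqPod)
•  A proposal for a new way of analyzing sequence data
1.  Download and install the software in your local environment.
2.  Start the software.
3.  Drag and drop the sequence data.
4.  A virtual machine starts on the Amazon cloud in the back end and the computation begins.
5.  When the computation finishes, the results are sent by email.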

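Summary
•  Using the cloud will become essential for future genome research.
–  It is indispensable for smooth sharing of "genome data" and "analysis workflows".
•  How to build analysis workflows on the cloud is not a simple problem.
–  Cloud computing technology is advancing remarkably.
   •  Serverless (AWS Lambda, Google Cloud Functions)
   •  Kubernetes
–  Various international efforts are under way (Common Workflow Language, etc.).
–  It is fun, but also hard work.
The Genome Analysis Section of the Center for Cancer Genomics and Advanced Therapeutics is hiring!!
•  People who will develop a genome data analysis platform with us.
•  People who will think through cloud usage with us.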

Acknowledgement
•  National Cancer Center
–  Kenichi Chiba
–  Ai Okada
–  Hiromu Ochiai
–  Keisuke Kataoka
–  Yasunori Kogure
•  The University of Tokyo, Human Genome Center
–  Satoru Miyano
•  Kyoto University
–  Seishi Ogawa
