HiTSeq '16 slides
verve
July 08, 2016
Slides from my HiTSeq '16 talk on Rail-RNA (http://rail.bio).
Transcript
Scalable analysis of RNA-seq splicing and coverage
@AbhiNellore at HiTSeq '16
Langmead & Leek Labs, Johns Hopkins University
http://rail.bio
Alignment
Sometimes, a read correctly aligns to the reference genome end to end.
chr1: …ATACATCAGACTAGACCGTACCACTCATAGACCTAGACCAGATACAG…
read: CAGACTAGACCGTACCACTCATAGACCTAGACCAGATAC
Spliced alignment
Other times, exon-exon junctions are overlapped. Rail-RNA divides the read into readlets…
read:
ATACATCAGACTAGACCGTACCACACAGCATGACAGTCATTCGACGTACT
readlets:
ATACATCAGACTAGACCGTACCACA
ATCAGACTAGACCGTACCACACAGC
GACTAGACCGTACCACACAGCATGA
AGACCGTACCACACAGCATGACAGT
CGTACCACACAGCATGACAGTCATT
CCACACAGCATGACAGTCATTCGAC
ACAGCATGACAGTCATTCGACGTAC
CAGCATGACAGTCATTCGACGTACT
ATACATCAGACTAGA
ATACATCAGACTAGAC
ATACATCAGACTAGACCG
ATACATCAGACTAGACCGT
ATACATCAGACTAGACCGTAC
ATACATCAGACTAGACCGTACAGC
AGCATGACAGTCATTCGACGTACT
ATGACAGTCATTCGACGTACT
GACAGTCATTCGACGTACT
ACAGTCATTCGACGTACT
AGTCATTCGACGTACT
GTCATTCGACGTACT
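The readlet idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not Rail-RNA's own code: the function name `make_readlets` and the parameters `max_len`, `min_len`, and `interval` are assumptions chosen for the example, and the slide does not state the values Rail-RNA actually uses.

```python
def make_readlets(read, max_len=25, min_len=15, interval=4):
    """Split a read into overlapping readlets.

    All parameter values are illustrative assumptions, not values
    taken from the slide. Returns (offset, sequence) pairs sorted
    by offset, then by length.
    """
    n = len(read)
    if n <= max_len:
        return [(0, read)]
    spans = set()
    # Full-length windows taken at a fixed stride along the read.
    for start in range(0, n - max_len + 1, interval):
        spans.add((start, start + max_len))
    # Always include the window that ends at the last base.
    spans.add((n - max_len, n))
    # Shorter "capping" readlets anchored at each end of the read,
    # so bases near the ends are covered by several readlets.
    for size in range(min_len, max_len, interval):
        spans.add((0, size))
        spans.add((n - size, n))
    return [(s, read[s:e]) for s, e in sorted(spans)]


read = "ATACATCAGACTAGACCGTACCACACAGCATGACAGTCATTCGACGTACT"
readlets = make_readlets(read)
for offset, seq in readlets:
    print(" " * offset + seq)
```

Printing each readlet indented by its offset gives a staircase picture similar to the slide: full-length windows step across the read, and shorter readlets pile up at both ends.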
Spliced alignment
…and align readlets to the genome to infer introns. Realignment may be necessary.
chr1 (intron marked): …ATACATCAGACTAGACCGTACCACAGTAGTTCATGACCCTCAGCAGCATGACAGTCATTCGACGTACTCGTATCGATACAGTACAGTAGCC…
read 1: ATACATCAGACTAGACCGTACCACACAGCATGACAGTCATTCGACGTACT
read 2 (needs realignment to find junction): CACAGCATGACAGTCATTCGACGTACTCGTATCGATACAGTACAGTAGCC
Why Rail-RNA
• Works on many samples, many cores
• Easy to deploy in different computing environments
• Borrows strength across samples
• Writes many compact, queryable outputs
Many samples, many cores
Scaling
Use MapReduce. Example:
• Divide computer cluster into workers controlled by a master
• Divide problem up into sequence of aggregation and computation steps
Rail-RNA workflow (labels from the flow diagram):
• Filter junctions
• Detect junctions
• Preprocess reads
• Align reads with Bowtie 2 / segment into readlets
• Align readlets with Bowtie 1
• Finalize junction combos with Bowtie 2
• Enumerate intron configurations
• Retrieve and index isofrags
• Realign reads with Bowtie 2
• Collect & compare alignments
• Write BAMs
• Compile coverage vectors / write bigWigs
• Write junctions & indels
• Distribute Bowtie 2 index of isofrags across cluster
• Aggregate reads by nucleotide sequence
• Aggregate readlets by nucleotide sequence
• Aggregate readlets by read sequence
Legend: data flow; redundancy reduction; intermediate step; output step
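The "Aggregate reads by nucleotide sequence" label in the diagram suggests how an aggregation step can reduce redundancy before a computation step. The sketch below is a single-process Python illustration of that pattern under stated assumptions: `aggregate_by_sequence`, `align_once_per_sequence`, and `toy_align` are hypothetical names, and `toy_align` is a stand-in for a real aligner call, not Bowtie.

```python
from collections import defaultdict


def aggregate_by_sequence(records):
    """Aggregation step: group sample names by read sequence."""
    groups = defaultdict(list)
    for sample, read in records:
        groups[read].append(sample)
    return groups


def align_once_per_sequence(groups, align):
    """Computation step: align each distinct sequence once,
    then fan the result back out to every sample that had it."""
    results = []
    for read, samples in groups.items():
        alignment = align(read)
        for sample in samples:
            results.append((sample, read, alignment))
    return results


align_calls = []


def toy_align(read):
    # Hypothetical stand-in for an aligner; records each call.
    align_calls.append(read)
    return len(read)


records = [("sample1", "ACGT"), ("sample2", "ACGT"), ("sample2", "TTGA")]
results = align_once_per_sequence(aggregate_by_sequence(records), toy_align)
print(len(records), "reads,", len(align_calls), "aligner calls")
```

In a MapReduce setting the grouping would be done by the framework's shuffle across workers rather than by an in-memory dictionary; the point here is only that identical sequences from different samples trigger one alignment.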
Easy to deploy
http://rail.bio
Cloud w/ AWS EMR:
    rail-rna go elastic --manifest URLsOf500Samples.txt --assembly hg38 --output s3://your-bucket/output_folder --core-instance-count 20 --core-instance-type c3.2xlarge
Local cluster w/ SGE:
    rail-rna go parallel --manifest URLsOf500Samples.txt -x /path/to/hg38_bowtie_basename --output /path/to/output_folder
Same outputs, different environments, reproducible
Ran Rail-RNA on 49,849 RNA-seq runs from the Sequence Read Archive (over 150 terabases of reads)
• Rapid: 2 weeks to results
• Repeatable: http://github.com/nellore/runs for commands
• Inexpensive: ~$1.40/sample
Borrows strength across samples
Borrowing strength
Realignment after collecting and filtering a list of junctions across samples.
chr1 (intron marked): …ATACATCAGACTAGACCGTACCACAGTAGTTCATGACCCTCAGCAGCATGACAGTCATTCGACGTACTCGTATCGATACAGTACAGTAGCC…
read 1: ATACATCAGACTAGACCGTACCACACAGCATGACAGTCATTCGACGTACT
read 2 (found to overlap junction on realignment): CATAGCATGACAGTCATTCGACGTACTCGTATCGATACAGTACAGTAGCC
Figure labels: sample 1, sample 2
81,066,376 junctions across 49,849 SRA samples vs. 540,746 annotated junctions
Why discrepancy?
On single sample, every aligner finds some good junctions and some duds.
[Figure: junctions split into goods and duds]
Why discrepancy?
But much more overlap between goods than between duds across many samples.
Why discrepancy?
So as you add samples…
[Figure: goods vs. duds among junctions]
Junction filter
Keep a junction if and only if it's initially detected in:
(1) 5% of samples OR
(2) at least 5 reads in any one sample
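The filter rule above can be sketched as a small Python function. This is an illustration under stated assumptions, not Rail-RNA's implementation: the function name `filter_junctions` and the input layout (a dict mapping each junction to per-sample read counts) are hypothetical, and "5% of samples" is read here as "at least 5% of samples".

```python
def filter_junctions(counts, n_samples, min_fraction=0.05, min_reads=5):
    """Return the set of junctions that pass the filter.

    counts maps junction -> {sample: number of reads supporting it}.
    Assumption: "5% of samples" means at least 5% of all samples.
    """
    kept = set()
    for junction, per_sample in counts.items():
        n_detected = sum(1 for c in per_sample.values() if c > 0)
        best = max(per_sample.values(), default=0)
        # Criterion (1): detected in enough samples, OR
        # criterion (2): well supported in at least one sample.
        if n_detected >= min_fraction * n_samples or best >= min_reads:
            kept.add(junction)
    return kept


# Toy input with 100 samples; junction names are made up.
counts = {
    "chr1:100-200": {f"s{i}": 1 for i in range(6)},  # 6 samples, 1 read each
    "chr1:300-400": {"s0": 7},                       # 1 sample, 7 reads
    "chr1:500-600": {"s0": 2, "s1": 2},              # 2 samples, 2 reads each
}
print(sorted(filter_junctions(counts, n_samples=100)))
```

With this toy input the first junction passes on criterion (1), the second on criterion (2), and the third is dropped because it is both rare and weakly supported.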
Rail-RNA: accuracy
Exon-exon junction accuracy metrics (mean ± stdev) across 20 GEUVADIS-based simulations

                      Precision      Recall         F-score
Rail single           .984 ± .000    .880 ± .004    .929 ± .002
Rail all no filter    .846 ± .002    .957 ± .001    .898 ± .001
Rail all filter       .976 ± .000    .939 ± .003    .957 ± .002
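As a quick consistency check on the table, the F-score can be recomputed from the mean precision and recall, assuming it is the usual harmonic mean F = 2PR / (P + R). The slide does not state the formula, and the reported values are means over simulations rather than F-scores of the means, so this is only an approximate check.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (assumed definition)."""
    return 2 * precision * recall / (precision + recall)


# Mean precision and recall copied from the table above.
rows = {
    "Rail single": (0.984, 0.880),
    "Rail all no filter": (0.846, 0.957),
    "Rail all filter": (0.976, 0.939),
}
for name, (p, r) in rows.items():
    print(f"{name}: {f_score(p, r):.3f}")
```

The recomputed values agree with the F-score column to three decimal places.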
Writes compact outputs
Compact outputs
• junction X sample table
  • 17 GB compressed for 50k SRA samples
  • v1 spans 21.5k samples: available at http://intropolis.rail.bio
  • v2 w/ 50k coming
• coverage bigWigs
  • 10x smaller than BAM
Annotation-agnostic pipeline
http://rail.bio
derfinder (Leo Collado-Torres, Alyssa Frazee): biocLite("derfinder")
• sidesteps assembly & annotation limitations
• resolves isoform-level features
http://docs.rail.bio
https://github.com/nellore/rail tested!
Rail-RNA: Scalable analysis of RNA-seq splicing and coverage
http://rail.bio
Ben Langmead; Jeff Leek; Leo Collado-Torres; Andrew Jaffe; José Alquicira Hernández; Summer intern: Jamie Morton; Chris Wilks; Jacob Pritt