HiTSeq '16 slides

Scalable analysis of RNA-seq splicing and coverage
@AbhiNellore at HiTSeq '16
Langmead & Leek Labs, Johns Hopkins University
http://rail.bio

Alignment
Sometimes, a read correctly aligns to the reference genome end to end.
chr1: …ATACATCAGACTAGACCGTACCACTCATAGACCTAGACCAGATACAG…
read:        CAGACTAGACCGTACCACTCATAGACCTAGACCAGATAC

Spliced alignment
Other times, exon-exon junctions are overlapped. Rail-RNA divides the read into readlets…
read:
  ATACATCAGACTAGACCGTACCACACAGCATGACAGTCATTCGACGTACT
readlets:
  ATACATCAGACTAGACCGTACCACA
  ATCAGACTAGACCGTACCACACAGC
  GACTAGACCGTACCACACAGCATGA
  AGACCGTACCACACAGCATGACAGT
  CGTACCACACAGCATGACAGTCATT
  CCACACAGCATGACAGTCATTCGAC
  ACAGCATGACAGTCATTCGACGTAC
  CAGCATGACAGTCATTCGACGTACT
  ATACATCAGACTAGA
  ATACATCAGACTAGAC
  ATACATCAGACTAGACCG
  ATACATCAGACTAGACCGT
  ATACATCAGACTAGACCGTAC
  ATACATCAGACTAGACCGTACAGC
  AGCATGACAGTCATTCGACGTACT
  ATGACAGTCATTCGACGTACT
  GACAGTCATTCGACGTACT
  ACAGTCATTCGACGTACT
  AGTCATTCGACGTACT
  GTCATTCGACGTACT
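A minimal sketch of the segmentation shown on this slide, in Python. The readlet length (25) and the step between readlets (4) are read off the example above and are assumptions, not confirmed Rail-RNA settings; `sliding_readlets` is a hypothetical name. Only the eight full-length readlets are produced here; the shorter readlets at the two ends of the read are not modeled.

```python
def sliding_readlets(read, length=25, interval=4):
    """Return overlapping substrings of `read`.

    Windows of `length` bases start every `interval` bases; a final
    window anchored at the end of the read is added so the last bases
    are always covered. Both defaults are assumptions taken from the
    slide example.
    """
    if len(read) <= length:
        return [read]
    last_start = len(read) - length
    starts = list(range(0, last_start + 1, interval))
    if starts[-1] != last_start:
        starts.append(last_start)
    return [read[s:s + length] for s in starts]


read = "ATACATCAGACTAGACCGTACCACACAGCATGACAGTCATTCGACGTACT"
for readlet in sliding_readlets(read):
    print(readlet)
```

For the 50-base read on the slide this yields windows starting at offsets 0, 4, 8, 12, 16, 20, 24 and 25.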

Spliced alignment
…and align readlets to the genome to infer introns. Realignment may be necessary.
chr1 (with intron): …ATACATCAGACTAGACCGTACCACAGTAGTTCATGACCCTCAGCAGCATGACAGTCATTCGACGTACTCGTATCGATACAGTACAGTAGCC…
read 1: ATACATCAGACTAGACCGTACCACACAGCATGACAGTCATTCGACGTACT
read 2: CACAGCATGACAGTCATTCGACGTACTCGTATCGATACAGTACAGTAGCC (needs realignment to find junction)

Why Rail-RNA
• Works on many samples, many cores
• Easy to deploy in different computing environments
• Borrows strength across samples
• Writes many compact, queryable outputs

Many samples, many cores

Scaling
Use MapReduce. Example:
• Divide computer cluster into workers controlled by a master
• Divide problem up into sequence of aggregation and computation steps

Rail-RNA workflow (diagram)
Computation steps:
• Preprocess reads
• Align reads with Bowtie 2 / segment into readlets
• Align readlets with Bowtie 1
• Detect junctions
• Filter junctions
• Enumerate intron configurations
• Retrieve and index isofrags
• Finalize junction combos with Bowtie 2
• Realign reads with Bowtie 2
• Collect & compare alignments
• Write BAMs
• Compile coverage vectors / write bigWigs
• Write junctions & indels
Aggregation and distribution steps:
• Aggregate reads by nucleotide sequence
• Aggregate readlets by nucleotide sequence
• Aggregate readlets by read sequence
• Distribute Bowtie 2 index of isofrags across cluster
Legend: data flow, redundancy reduction, intermediate step, output step
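A rough sketch of what an "aggregate by nucleotide sequence" step could look like, in Python. This is an illustration of the redundancy-reduction idea (group identical sequences so each unique sequence is aligned once), not Rail-RNA's actual code; the record layout `(sample, read_name, sequence)` and the function name are assumptions.

```python
from collections import defaultdict


def aggregate_by_sequence(records):
    """Group (sample, read_name, sequence) records by sequence.

    Returns a dict mapping each unique sequence to the list of
    (sample, read_name) pairs that carry it, so a downstream step can
    align each unique sequence once and copy the result to every owner.
    """
    groups = defaultdict(list)
    for sample, read_name, sequence in records:
        groups[sequence].append((sample, read_name))
    return dict(groups)


# Hypothetical toy input: two samples share one identical read sequence.
records = [
    ("sample1", "r1", "ACGTACGT"),
    ("sample2", "r7", "ACGTACGT"),
    ("sample2", "r8", "TTGACCAA"),
]
unique = aggregate_by_sequence(records)
print(len(records), "reads,", len(unique), "unique sequences to align")
```

In a MapReduce setting the sequence would serve as the key, so all copies of a sequence land on the same worker.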

Easy to deploy

http://rail.bio
Cloud w/ AWS EMR:
  rail-rna go elastic --manifest URLsOf500Samples.txt --assembly hg38 --output s3://your-bucket/output_folder --core-instance-count 20 --core-instance-type c3.2xlarge
Local cluster w/ SGE:
  rail-rna go parallel --manifest URLsOf500Samples.txt -x /path/to/hg38_bowtie_basename --output /path/to/output_folder
Same outputs, different environments, reproducible

Ran Rail-RNA on 49,849 RNA-seq runs from the Sequence Read
Archive (over 150 terabases of reads)

• Rapid: 2 weeks to results
• Repeatable: http://github.com/nellore/runs for commands
• Inexpensive: ~$1.40/sample

Borrows strength across samples

Borrowing strength
Realignment after collecting and filtering a list of junctions across samples.
chr1 (with intron): …ATACATCAGACTAGACCGTACCACAGTAGTTCATGACCCTCAGCAGCATGACAGTCATTCGACGTACTCGTATCGATACAGTACAGTAGCC…
read 1: ATACATCAGACTAGACCGTACCACACAGCATGACAGTCATTCGACGTACT
read 2: CATAGCATGACAGTCATTCGACGTACTCGTATCGATACAGTACAGTAGCC (found to overlap junction on realignment)
Figure labels: sample 1, sample 2

81,066,376 junctions across 49,849 SRA samples vs. 540,746 annotated junctions

Why discrepancy?
On a single sample, every aligner finds some good junctions and some duds.
[Figure: junctions split into goods and duds]

Why discrepancy?
But there is much more overlap between goods than between duds across many samples.
[Figure: overlap of goods vs. overlap of duds]

Why discrepancy?
So as you add samples…
[Figure: goods and duds among junctions, shown at two sample counts]

Junction filter
Keep a junction if and only if it's initially detected in:
(1) at least 5% of samples OR
(2) at least 5 reads in any one sample
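The rule above can be sketched in Python as follows. This is an illustrative implementation of the stated criteria only, not Rail-RNA's code; the input layout (junction mapped to per-sample read counts), the function name, and the parameter names are assumptions.

```python
def filter_junctions(counts, num_samples, min_sample_fraction=0.05, min_reads=5):
    """Apply the two-part junction filter from the slide.

    counts: dict mapping junction -> dict mapping sample -> number of
            reads in which the junction was initially detected.
    A junction is kept if it is detected in at least
    `min_sample_fraction` of all samples, OR if at least `min_reads`
    reads support it in any one sample.
    """
    kept = set()
    for junction, per_sample in counts.items():
        samples_with_junction = sum(1 for n in per_sample.values() if n > 0)
        most_reads = max(per_sample.values(), default=0)
        if (samples_with_junction >= min_sample_fraction * num_samples
                or most_reads >= min_reads):
            kept.add(junction)
    return kept


# Hypothetical toy data for a 100-sample run.
counts = {
    "chr1:100-500": {f"s{i}": 1 for i in range(6)},  # 6 samples, 1 read each
    "chr1:900-1500": {"s0": 5},                      # 1 sample, 5 reads
    "chr2:40-300": {"s0": 2, "s1": 2},               # weak support: dropped
}
print(sorted(filter_junctions(counts, num_samples=100)))
```

The first junction passes on breadth (criterion 1), the second on depth (criterion 2), and the third fails both.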

Rail-RNA: accuracy
Exon-exon junction accuracy metrics (mean ± stdev) across 20 GEUVADIS-based simulations
                      Precision      Recall         F-score
Rail single           .984 ± .000    .880 ± .004    .929 ± .002
Rail all no filter    .846 ± .002    .957 ± .001    .898 ± .001
Rail all filter       .976 ± .000    .939 ± .003    .957 ± .002
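As a quick check on how the three columns relate, the sketch below computes the F-score as the harmonic mean of precision and recall, assuming the standard F1 definition (the slide does not state which F-score is used). Because the table reports means over 20 simulations, plugging in mean precision and mean recall only approximately reproduces the mean F-score.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Mean precision and recall from the table above.
rows = {
    "Rail single": (0.984, 0.880),
    "Rail all no filter": (0.846, 0.957),
    "Rail all filter": (0.976, 0.939),
}
for name, (p, r) in rows.items():
    print(f"{name}: {f_score(p, r):.3f}")
```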

Writes compact outputs

Compact outputs
• junction X sample table
  • 17 GB compressed for 50k SRA samples
  • v1 spans 21.5k samples: available at http://intropolis.rail.bio
  • v2 w/ 50k coming
• coverage bigWigs
  • 10x smaller than BAM

Annotation-agnostic pipeline
• Rail-RNA: http://rail.bio
• derfinder (Leo Collado-Torres, Alyssa Frazee): biocLite("derfinder")
• sidesteps assembly & annotation limitations
• resolves isoform-level features

http://docs.rail.bio

https://github.com/nellore/rail tested!

Rail-RNA: Scalable analysis of RNA-seq splicing and coverage
http://rail.bio
Ben Langmead, Jeff Leek, Leo Collado-Torres, Andrew Jaffe, José Alquicira Hernández
Summer intern: Jamie Morton Chris Wilks Jacob Pritt
