Slide 1

Slide 1 text

Scalable analysis of many sequencing datasets at once
Ben Langmead, Assistant Professor, JHU Computer Science
langmead-lab.org, @BenLangmead
University of Utah, November 11, 2016

Slide 2

Slide 2 text

Langmead lab
Efficiency: Bowtie, Bowtie 2, Lighter, Arioc, HISAT
Scalability: Rail-RNA, Boiler, Rail-dbGaP
Resources: Intropolis, recount, Snaptron

Slide 3

Slide 3 text

Sequence Read Archive (SRA) growth
[Figure: SRA size in terabases over time, with "Open access" and "Total" curves; the 1 Pbp level is marked]
3 -> 6 Pbp in ~18 months
https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=announcement

Slide 4

Slide 4 text

Sequence Read Archive (SRA) growth
[Figure: SRA size in terabases over time, with "Open access" and "Total" curves; the 1 Pbp level is marked]
3 -> 6 Pbp in ~18 months
6.15 -> 8.72 Pbp in October
https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=announcement

Slide 5

Slide 5 text

MapReduce Elastic MapReduce Spot Marketplace

Slide 6

Slide 6 text

Abhinav Nellore, JHU & OHSU
Jeff Leek, JHU
Website: http://rail.bio
Paper: http://bit.ly/rail-aa

Slide 7

Slide 7 text

Spliced RNA-seq aligner for analyzing many samples at once
• From reads to alignments, coverage vectors & junctions
• Aggregate across samples to borrow strength and eliminate redundant work
• Annotation agnostic: let data, not annotation, prune the junction space
Website: http://rail.bio, Paper: http://bit.ly/rail-aa

Slide 8

Slide 8 text

Pass 1: align to genome, make junction calls
Pass 2: re-align to genome with putative junctions
[Figure: reads and readlets shown against the reference; rows labeled "Reads", "Ref" and "Readlets"]
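
The readlet idea from the figure can be sketched in a few lines. This is a minimal illustration, assuming readlets are fixed-length overlapping substrings taken at a regular stride, with one extra readlet anchored at the end of the read so the whole read is covered. The function name and the `readlet_len` and `stride` defaults are hypothetical and are not taken from Rail-RNA.

```python
def split_into_readlets(read, readlet_len=25, stride=7):
    """Return (offset, substring) pairs that together cover the read.

    readlet_len and stride are illustrative values, not Rail-RNA defaults.
    """
    if len(read) <= readlet_len:
        # A short read is its own single readlet.
        return [(0, read)]
    last_offset = len(read) - readlet_len
    offsets = list(range(0, last_offset + 1, stride))
    if offsets[-1] != last_offset:
        # Make sure the final bases of the read are covered too.
        offsets.append(last_offset)
    return [(o, read[o:o + readlet_len]) for o in offsets]


if __name__ == "__main__":
    for offset, readlet in split_into_readlets("ACGT" * 25):
        print(offset, readlet)
```

Short pieces like these can be aligned without gaps, and readlets from the two ends of a read that land far apart on the genome hint at a junction between them.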

Slide 9

Slide 9 text

[Figure: samples 1-5 shown against candidate junctions 1-3]
Aggregating across samples adds a dimension to junction call confidence
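
One way to picture the extra dimension is to count, for each candidate junction, how many samples support it and keep only junctions seen in enough samples. The sketch below assumes junction calls arrive as (sample, junction) pairs; the function names, the tuple layout and the `min_samples` threshold are hypothetical and do not reproduce Rail-RNA's actual filter.

```python
from collections import defaultdict


def count_supporting_samples(calls):
    """Map each junction to the number of distinct samples that call it.

    calls: iterable of (sample_id, junction) pairs, where junction is any
    hashable key such as (chrom, start, end, strand).
    """
    samples_by_junction = defaultdict(set)
    for sample_id, junction in calls:
        samples_by_junction[junction].add(sample_id)
    return {jx: len(samples) for jx, samples in samples_by_junction.items()}


def filter_junctions(calls, min_samples=2):
    """Keep junctions called in at least min_samples distinct samples."""
    support = count_supporting_samples(calls)
    return {jx for jx, n in support.items() if n >= min_samples}


if __name__ == "__main__":
    toy_calls = [
        ("S1", ("chr1", 100, 200, "+")),
        ("S2", ("chr1", 100, 200, "+")),
        ("S3", ("chr1", 150, 900, "+")),
    ]
    print(filter_junctions(toy_calls, min_samples=2))
```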

Slide 10

Slide 10 text

Rail-RNA: design
Workflow steps:
• Preprocess
• Aggregate duplicate reads
• Split into readlets
• Aggregate duplicate readlets
• Correlation clustering for readlet alignments
• Call splice junctions
• Merge exon differentials
• Compile sample coverages
• Write bigWigs
• Write normalization factors
• Write spliced alignment BAMs
• Write junction & indel BEDs
Alignment steps (aligner):
• Align reads end-to-end to genome (Bowtie 2)
• Align readlets to genome (Bowtie)
• Align readlets to junction co-occurrence index (Bowtie)

Slide 11

Slide 11 text

[Figure: log coverage tracks for Sample 1, Sample 2 and Sample 3]

Slide 12

Slide 12 text

Marginal cost of analyzing 1 additional sample decreases as we add more samples
Nellore A, Collado-Torres L, Jaffe AE, Alquicira-Hernández J, Wilks C, Pritt J, Morton J, Leek JT, Langmead B. Rail-RNA: scalable analysis of RNA-seq splicing and coverage. Bioinformatics. 2016 Sep 4.

Slide 13

Slide 13 text

Website: http://rail.bio
Paper: http://bit.ly/rail-aa
Repo: https://github.com/nellore/rail
Chat: https://gitter.im/nellore/rail
dbGaP website: http://docs.rail.bio/dbgap/
dbGaP paper: http://bit.ly/rail_dbgap
Abhinav Nellore, JHU & OHSU
Jeff Leek, JHU

Slide 14

Slide 14 text

What is the annotation tradeoff?
• Some new tools start with annotation; no attempt to discover junctions / isoforms
• Major projects (GTEx, GEUVADIS, TCGA) quantitate directly from annotated transcripts
• What is the nature of this tradeoff? How complete are the annotations?

Slide 15

Slide 15 text

What is the annotation tradeoff?
• Analyzed ~50,000 human RNA-seq samples with Rail-RNA; about 150 Tbp
• Repeatable: http://github.com/nellore/runs (exact commands we used to run on AWS)
• ~$1.40 / sample (compare to sequencing costs)
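
The rounded figures on this slide imply a total compute bill and an average sample size, worked out below. This is back-of-the-envelope arithmetic on the slide's approximate numbers only; the variable and function names are made up for the example.

```python
# Approximate figures quoted on the slide.
NUM_SAMPLES = 50_000          # ~50,000 human RNA-seq samples
COST_PER_SAMPLE_USD = 1.40    # ~$1.40 per sample
TOTAL_BASES = 150e12          # ~150 Tbp analyzed


def total_cost_usd(num_samples, cost_per_sample):
    """Total compute cost implied by a flat per-sample cost."""
    return num_samples * cost_per_sample


def mean_bases_per_sample(total_bases, num_samples):
    """Average amount of sequence per sample."""
    return total_bases / num_samples


if __name__ == "__main__":
    cost = total_cost_usd(NUM_SAMPLES, COST_PER_SAMPLE_USD)
    size = mean_bases_per_sample(TOTAL_BASES, NUM_SAMPLES)
    print(f"total cost: about ${cost:,.0f}")
    print(f"mean sample size: about {size / 1e9:.1f} Gbp")
```

With these inputs the total comes to roughly $70,000 for the whole run, at an average of about 3 Gbp per sample.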

Slide 16

Slide 16 text

[Figure, panels a-c: junction (jx) count J plotted against the minimum number S of samples in which a jx is called. Junction categories: novel, alternative donor/acceptor, exon skip, fully annotated. Callouts on the plot: 18.6%, 56,861 jx, 100%, 96.5%, 81.4%, 85.8%. Panels b and c zoom in on parts of the curve]
Annotation includes: UCSC, GENCODE v19 & v24, RefSeq, CCDS, MGC, lincRNAs, SIB genes, AceView, Vega
Nellore A, et al. Human splicing diversity across the Sequence Read Archive. bioRxiv (2016): 038224.

Slide 17

Slide 17 text

[Figure: % of called junctions that are annotated (y-axis) across samples (x-axis); annotation: GENCODE v19]
• For ~2.5% of samples, <50% of junction calls are annotated
• Median fraction of junction calls that are annotated: ~80%
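
The per-sample statistic behind this plot can be sketched as a set intersection: for each sample, the share of its called junctions that also appear in the annotation. The sketch below uses toy data and hypothetical function names; it is not the analysis code behind the slide.

```python
from statistics import median


def percent_annotated(sample_junctions, annotated):
    """Percent of each sample's called junctions found in the annotation.

    sample_junctions: dict mapping sample id -> set of junction keys.
    annotated: set of junction keys taken from an annotation.
    Samples with no junction calls are skipped.
    """
    return {
        sample: 100.0 * len(junctions & annotated) / len(junctions)
        for sample, junctions in sample_junctions.items()
        if junctions
    }


def share_of_samples_below(percentages, threshold=50.0):
    """Fraction of samples whose annotated percentage is under threshold."""
    values = list(percentages.values())
    return sum(v < threshold for v in values) / len(values)


if __name__ == "__main__":
    annotation = {"jx1", "jx2", "jx3"}
    calls = {
        "S1": {"jx1", "jx2", "jx3", "jx9"},
        "S2": {"jx1", "jx8", "jx9"},
        "S3": {"jx2", "jx3"},
    }
    pct = percent_annotated(calls, annotation)
    print(pct, median(pct.values()), share_of_samples_below(pct))
```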

Slide 18

Slide 18 text

Djebali, Sarah, et al. "Landscape of transcription in human cells." Nature 489.7414 (2012): 101-108.

Slide 19

Slide 19 text

RNA-seq & annotation
• Spliced alignment
• Isoform assembly
• Isoform quantitation
• Count overlaps w/ annotated features
• Differential gene / exon expression
Callouts: (often with annotation), (quasi-, pseudo-)

Slide 20

Slide 20 text

RNA-seq: a third way
• Spliced alignment: Rail-RNA (accurate, annotation-agnostic)
• Differentially expressed region finder: derfinder (region-based, annotation-agnostic)
See also: GEUVADIS analysis in sec 2.4: http://bit.ly/rail-aa
Collado-Torres L et al. Flexible expressed region analysis for RNA-seq with derfinder. Nucleic Acids Res. 2016 Sep 29.
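
A region-based, annotation-agnostic analysis starts by finding where the genome is expressed at all. The sketch below shows one simple way to do that: scan a vector of mean coverage and report contiguous runs above a cutoff. It is loosely in the spirit of an expressed-region finder and is not derfinder's implementation or API; the function name and the `cutoff` value are hypothetical.

```python
def expressed_regions(mean_coverage, cutoff=5.0):
    """Return half-open (start, end) intervals where coverage > cutoff.

    mean_coverage: per-base mean coverage across samples for one
    chromosome, as a sequence of numbers.
    """
    regions = []
    start = None
    for pos, cov in enumerate(mean_coverage):
        if cov > cutoff:
            if start is None:
                start = pos  # a new region opens here
        elif start is not None:
            regions.append((start, pos))  # the region closed just before pos
            start = None
    if start is not None:
        # The last region runs to the end of the vector.
        regions.append((start, len(mean_coverage)))
    return regions


if __name__ == "__main__":
    print(expressed_regions([0, 6, 7, 0, 0, 9, 9, 9], cutoff=5.0))
```

Each region found this way can then be tested for differential expression without reference to annotated exons or genes.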

Slide 21

Slide 21 text

RNA-seq: a third way
See also: GEUVADIS analysis in sec 2.4: http://bit.ly/rail-aa
Collado-Torres L et al. Flexible expressed region analysis for RNA-seq with derfinder. Nucleic Acids Res. 2016 Sep 29.

Slide 22

Slide 22 text

RNA-seq: a third way
• Spliced alignment: Rail-RNA (accurate, annotation-agnostic)
• bigWigs
• Differentially expressed region finder: derfinder (region-based, annotation-agnostic)
See also: GEUVADIS analysis in sec 2.4: http://bit.ly/rail-aa
Collado-Torres L et al. Flexible expressed region analysis for RNA-seq with derfinder. Nucleic Acids Res. 2016 Sep 29.

Slide 23

Slide 23 text

Resources
• Intropolis (in press, Genome Biology)
  Nellore A, et al. Human splicing diversity across the Sequence Read Archive. bioRxiv (2016): 038224.
• Snaptron (in preparation)
• recount (in revision)
  Collado-Torres L, et al. recount: A large-scale resource of analysis-ready RNA-seq expression data. bioRxiv (2016): 068478.

Slide 24

Slide 24 text

Intropolis
Site: http://intropolis.rail.bio, preprint: http://j.mp/rail-sra-pre
Nellore A, et al. Human splicing diversity across the Sequence Read Archive. bioRxiv (2016): 038224.
Abhinav Nellore, JHU & OHSU

Slide 25

Slide 25 text

Intropolis
• Discovery of novel splicing events has leveled off
Nellore A, et al. Human splicing diversity across the Sequence Read Archive. bioRxiv (2016): 038224.

Slide 26

Slide 26 text

Intropolis
[Figure: principal component plot, PC1 vs. PC2, with points labeled ABRF, SEQC and GEU]
Nellore A, et al. Human splicing diversity across the Sequence Read Archive. bioRxiv (2016): 038224.

Slide 27

Slide 27 text

Snaptron
• Users pose flexible queries about splicing
• Query planner delegates to appropriate systems (sqlite, tabix, lucene) and indexes (R-tree, B-tree, inverted full text)
[Figure: Query -> Snaptron Query Planner -> Data Store/Index -> Output. Data stores/indexes: Lucene/document inverted index, SQLite/B-tree index, Tabix/R-tree index. Other labels: sample filter, limited region, limited & filtered region, filtered region, filtered samples, junction records, sample metadata records. Inverted index example (sample metadata term -> samples): "Brain" -> 1,2,3,6; "Liver" -> 4,6,9,11]
Chris Wilks
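
The sample-filter path in the figure can be imitated with plain Python sets. The sketch below reuses the two inverted-index entries shown on the slide ("Brain" and "Liver"), looks up the samples matching some metadata terms, and then restricts junction records to those samples. It assumes in-memory dicts in place of Lucene, SQLite and Tabix; the function names and the record layout are hypothetical and are not Snaptron's API.

```python
# Inverted index from the slide: sample metadata term -> sample ids.
SLIDE_INDEX = {
    "brain": {1, 2, 3, 6},
    "liver": {4, 6, 9, 11},
}


def samples_matching(index, terms, require_all=False):
    """Return ids of samples whose metadata matches the terms.

    require_all=False takes the union of the posting sets (any term);
    require_all=True takes their intersection (every term).
    """
    postings = [index.get(term.lower(), set()) for term in terms]
    if not postings:
        return set()
    if require_all:
        return set.intersection(*postings)
    return set.union(*postings)


def junctions_in_samples(junction_records, sample_ids):
    """Keep junction records seen in the given samples.

    Each record is a dict with a "coverage" entry mapping sample id to
    read count (a made-up layout for this example).
    """
    kept = []
    for record in junction_records:
        shared = {s: n for s, n in record["coverage"].items() if s in sample_ids}
        if shared:
            kept.append({**record, "coverage": shared})
    return kept


if __name__ == "__main__":
    records = [
        {"junction": "chr1:100-200", "coverage": {1: 5, 4: 2}},
        {"junction": "chr1:300-450", "coverage": {9: 7}},
    ]
    brain = samples_matching(SLIDE_INDEX, ["Brain"])
    print(junctions_in_samples(records, brain))
```

Resolving the metadata filter to a set of sample ids first keeps the junction lookup independent of how the metadata is stored, which is the kind of separation a query planner needs.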

Slide 28

Slide 28 text

Snaptron
• Web service and UI currently available, preprint soon
• Example: two unannotated junctions on either side of an exonized repetitive element discovered by colleague Sarven Sabunciyan
[Figure: panels B and C showing the example junctions]
Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen KD, Jaffe AE, Langmead B, Leek JT. recount: A large-scale resource of analysis-ready RNA-seq expression data. bioRxiv doi: 10.1101/068478.

Slide 29

Slide 29 text

recount
• Provides expression summaries at levels of genes, junctions, exons and coverage vectors
[Figure: panels labeled Junctions, Genes, Coverage and Exons]
Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen KD, Jaffe AE, Langmead B, Leek JT. recount: A large-scale resource of analysis-ready RNA-seq expression data. bioRxiv doi: 10.1101/068478.

Slide 30

Slide 30 text

recount
recount: expression data for ~70,000 human samples
• GTEx N=9,962
• TCGA N=10,327
• SRA N=49,848
[Figure: a samples x expression estimates matrix (gene, exon, junctions, ERs) next to a samples x phenotypes matrix marked "?"]
Questions:
1. Have "novel" isoforms ever been seen previously?
2. What regions of the human genome are transcribed in humans?
Questions, extended with biological phenotypes:
1. Have "novel" isoforms ever been seen previously? In what tissue? At what levels?
2. What regions of the human genome are transcribed in humans and in what tissues?
3. Do the same genes escape X inactivation across all tissues?
4. What expression changes occur as we age?
5. ...
Biological phenotypes:
- Sex
- Age
- Tissue
recount: A large-scale resource of analysis-ready RNA-seq expression data. Leonardo Collado-Torres, Abhinav Nellore, Kai Kammers, Shannon E Ellis, Margaret A Taub, Kasper D Hansen, Andrew E Jaffe, Ben Langmead, Jeffrey Leek
Slide courtesy of Shannon Ellis
Abhinav Nellore, Leo Collado Torres

Slide 31

Slide 31 text

recount studies
• Tissue meta-analysis: compare colon & blood tissues from SRA, do same from GTEx, compare differential expression rankings
• Compare our gene expression measurements with those from GTEx project; high concordance
• Compare differential expression results when analysis is performed at the level of gene, exon, junction or DER
Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen KD, Jaffe AE, Langmead B, Leek JT. recount: A large-scale resource of analysis-ready RNA-seq expression data. bioRxiv doi: 10.1101/068478.

Slide 32

Slide 32 text

recount studies https://rpubs.com/crazyhottommy/heatmap_demystified

Slide 33

Slide 33 text

recount
• Shiny-app front-end: https://jhubiostatistics.shinyapps.io/recount/
• SciServer Compute lets users work with locally hosted data in a Jupyter notebook: http://compute.sciserver.org/dashboard/
• Bioconductor 3.4 package: https://www.bioconductor.org/packages/recount/
• Preprint: http://bit.ly/recount_pre
Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen KD, Jaffe AE, Langmead B, Leek JT. recount: A large-scale resource of analysis-ready RNA-seq expression data. bioRxiv doi: 10.1101/068478.

Slide 34

Slide 34 text

Thank you:
Jeff Leek, Jacob Pritt, Abhinav Nellore, Kasper Hansen, Alyssa Frazee, Leo Collado Torres, Chris Wilks, Andrew Jaffe, José Alquicira-Hernández, Jamie Morton, Kai Kammers, Shannon Ellis, Margaret Taub
Also for DERfinder: Rafa Irizarry, Sarven Sabunciyan, Mike Love
• NIH R01GM118568
• NSF CAREER IIS-1349906
• Sloan Research Fellowship
• IDIES Seed Funding program
• Amazon Web Services
SciServer, SciServer Compute
langmead-lab.org, @BenLangmead