HiTSeq '16 slides

verve
July 08, 2016

Slides from my HiTSeq '16 talk on Rail-RNA (http://rail.bio).

Transcript

  1. Scalable analysis of RNA-seq splicing and coverage. @AbhiNellore at HiTSeq '16. Langmead & Leek Labs, Johns Hopkins University. http://rail.bio
  2. Spliced alignment. Other times, exon-exon junctions are overlapped. Rail-RNA divides the read into readlets…

    read:
    ATACATCAGACTAGACCGTACCACACAGCATGACAGTCATTCGACGTACT

    readlets:
    ATACATCAGACTAGACCGTACCACA
    ATCAGACTAGACCGTACCACACAGC
    GACTAGACCGTACCACACAGCATGA
    AGACCGTACCACACAGCATGACAGT
    CGTACCACACAGCATGACAGTCATT
    CCACACAGCATGACAGTCATTCGAC
    ACAGCATGACAGTCATTCGACGTAC
    CAGCATGACAGTCATTCGACGTACT
    ATACATCAGACTAGA
    ATACATCAGACTAGAC
    ATACATCAGACTAGACCG
    ATACATCAGACTAGACCGT
    ATACATCAGACTAGACCGTAC
    ATACATCAGACTAGACCGTACAGC
    AGCATGACAGTCATTCGACGTACT
    ATGACAGTCATTCGACGTACT
    GACAGTCATTCGACGTACT
    ACAGTCATTCGACGTACT
    AGTCATTCGACGTACT
    GTCATTCGACGTACT
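
The readlets on this slide follow a visible pattern, which can be sketched in a few lines of Python. This is an illustration inferred from the slide's example only, not Rail-RNA's actual code. The function name and all parameter defaults (25-nt readlets every 4 nt, plus shorter "capping" readlets at both ends whose lengths grow from 15 by a factor of 1.1) are assumptions read off the sequences shown.

```python
def segment_into_readlets(read, min_len=15, max_len=25, interval=4,
                          cap_multiplier=1.1):
    """Split a read into overlapping readlets.

    Hypothetical helper: the defaults are inferred from the slide's
    50-nt example and are not taken from Rail-RNA's source.
    Returns (tiled, prefixes, suffixes).
    """
    if min_len < 1 or cap_multiplier <= 1:
        raise ValueError("need min_len >= 1 and cap_multiplier > 1")
    n = len(read)
    if n <= max_len:
        return [read], [], []

    # Full-length readlets tiled every `interval` nt, plus one flush
    # with the right end of the read if the tiling does not land there.
    last_start = n - max_len
    starts = list(range(0, last_start + 1, interval))
    if starts[-1] != last_start:
        starts.append(last_start)
    tiled = [read[s:s + max_len] for s in starts]

    # Capping readlets: shorter pieces anchored at each end of the read,
    # with lengths min_len, min_len * m, min_len * m^2, ... (rounded down).
    cap_lengths = []
    size = float(min_len)
    while int(size) < max_len:
        cap_lengths.append(int(size))
        size *= cap_multiplier
    prefixes = [read[:k] for k in cap_lengths]
    suffixes = [read[-k:] for k in cap_lengths]
    return tiled, prefixes, suffixes


READ = "ATACATCAGACTAGACCGTACCACACAGCATGACAGTCATTCGACGTACT"
tiled, prefixes, suffixes = segment_into_readlets(READ)
for readlet in tiled + prefixes + suffixes:
    print(readlet)
```

With these assumed defaults the eight 25-nt readlets match the slide, and the capping readlets have lengths 15, 16, 18, 19, 21 and 24 at each end.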
  3. Why Rail-RNA
    • Works on many samples, many cores
    • Easy to deploy in different computing environments
    • Borrows strength across samples
    • Writes many compact, queryable outputs
  4. Scaling: use MapReduce. Example:
    • Divide computer cluster into workers controlled by a master
    • Divide problem up into sequence of aggregation and computation steps
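
The alternation of computation and aggregation steps can be illustrated with a single-process sketch. It is a toy under stated assumptions, not Rail-RNA's implementation: the three helper functions and the sample data are hypothetical, and a real MapReduce system spreads the same steps across workers. The example aggregates reads by nucleotide sequence, one of the aggregation steps named in the workflow slide, so that each distinct sequence is handled once.

```python
from collections import defaultdict


def map_step(records, mapper):
    """Computation: turn each input record into (key, value) pairs."""
    return [pair for record in records for pair in mapper(record)]


def aggregate_step(pairs):
    """Aggregation: bring all values that share a key together."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_step(groups, reducer):
    """Computation: process each key's group independently."""
    return {key: reducer(key, values) for key, values in groups.items()}


# Toy input: (sample name, read sequence). Identical sequences recur
# across samples, so grouping by sequence removes redundant work.
reads = [
    ("sampleA", "ACGT"),
    ("sampleB", "ACGT"),
    ("sampleA", "TTGA"),
    ("sampleC", "ACGT"),
]

pairs = map_step(reads, lambda r: [(r[1], r[0])])   # key = sequence
groups = aggregate_step(pairs)                      # one group per sequence
summary = reduce_step(groups, lambda seq, samples: sorted(set(samples)))

print(len(reads), "reads ->", len(groups), "distinct sequences")
print(summary)
```

Because each group is processed independently in the reduce step, groups can be handed to different workers without coordination.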
  5. [Workflow diagram] Step labels: Filter junctions; Detect junctions; Preprocess reads; Align reads with Bowtie 2 / segment into readlets; Align readlets with Bowtie 1; Finalize junction combos with Bowtie 2; Enumerate intron configurations; Retrieve and index isofrags; Realign reads with Bowtie 2; Collect & compare alignments; Write BAMs; Compile coverage vectors / write bigWigs; Write junctions & indels; Distribute Bowtie 2 index of isofrags across cluster; Aggregate reads by nucleotide sequence; Aggregate readlets by nucleotide sequence; Aggregate readlets by read sequence.

    Legend: data flow, redundancy reduction, intermediate step, output step
  8. http://rail.bio

    Cloud w/ AWS EMR:
    rail-rna go elastic --manifest URLsOf500Samples.txt --assembly hg38 --output s3://your-bucket/output_folder --core-instance-count 20 --core-instance-type c3.2xlarge

    Local cluster w/ SGE:
    rail-rna go parallel --manifest URLsOf500Samples.txt -x /path/to/hg38_bowtie_basename --output /path/to/output_folder

    Same outputs, different environments, reproducible
  9. Ran Rail-RNA on 49,849 RNA-seq runs from the Sequence Read Archive (over 150 terabases of reads)
  10. Why discrepancy? On a single sample, every aligner finds some good junctions and some duds. [Figure labels: goods, duds, junctions]
  11. Junction filter: keep a junction if and only if it's initially detected in
    (1) 5% of samples, OR
    (2) at least 5 reads in any one sample
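
The filter rule above can be sketched as follows. This is a minimal illustration, not Rail-RNA's code: the function name and the input layout (a dict mapping each sample to its per-junction read counts) are hypothetical, and criterion (1) is read here as "at least 5% of samples", which is an assumption about the slide's wording.

```python
from collections import defaultdict


def filter_junctions(counts, min_sample_fraction=0.05, min_reads=5):
    """Return the set of junctions that pass the filter.

    counts: {sample: {junction: read_count}} (hypothetical layout).
    A junction is kept if it appears in at least `min_sample_fraction`
    of all samples (assumed reading of "5% of samples"), OR if at least
    `min_reads` reads support it in any one sample.
    """
    n_samples = len(counts)
    samples_with = defaultdict(int)   # junction -> number of samples
    max_reads = defaultdict(int)      # junction -> best single-sample count
    for junctions in counts.values():
        for junction, n_reads in junctions.items():
            if n_reads > 0:
                samples_with[junction] += 1
                max_reads[junction] = max(max_reads[junction], n_reads)
    return {
        junction
        for junction in samples_with
        if samples_with[junction] / n_samples >= min_sample_fraction
        or max_reads[junction] >= min_reads
    }


# Toy data: 100 samples, three junctions.
counts = {f"s{i}": {} for i in range(100)}
for i in range(10):
    counts[f"s{i}"]["A"] = 1   # A: 1 read in each of 10 samples (10%)
counts["s50"]["B"] = 1         # B: 1 read in a single sample
counts["s60"]["C"] = 7         # C: 7 reads in a single sample

kept = filter_junctions(counts)
print(sorted(kept))  # ['A', 'C']
```

Junction A passes on breadth across samples, C passes on depth in one sample, and B, seen once in one sample, is dropped.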
  12. Rail-RNA: accuracy. Exon-exon junction accuracy metrics (mean ± stdev) across 20 GEUVADIS-based simulations

                          Precision      Recall         F-score
    Rail single           .984 ± .000    .880 ± .004    .929 ± .002
    Rail all no filter    .846 ± .002    .957 ± .001    .898 ± .001
    Rail all filter       .976 ± .000    .939 ± .003    .957 ± .002
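
As a consistency check on the figures above, the sketch below assumes "F-score" means the usual F1 score, the harmonic mean of precision and recall; the slide does not define it. The slide reports means over 20 simulations, so applying the formula to mean precision and mean recall need not reproduce the mean F-score exactly, although here it agrees to three decimals.

```python
def f_score(precision, recall):
    """F1 score: harmonic mean of precision and recall (assumed definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Mean precision and recall as reported on the slide.
reported = {
    "Rail single": (0.984, 0.880),
    "Rail all no filter": (0.846, 0.957),
    "Rail all filter": (0.976, 0.939),
}

for name, (precision, recall) in reported.items():
    print(f"{name}: {f_score(precision, recall):.3f}")
# Rail single: 0.929
# Rail all no filter: 0.898
# Rail all filter: 0.957
```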
  13. Compact outputs
    • junction × sample table
      • 17 GB compressed for 50k SRA samples
      • v1 spans 21.5k samples: available at http://intropolis.rail.bio
      • v2 w/ 50k coming
    • coverage bigWigs
      • 10x smaller than BAM
  14. Rail-RNA: Scalable analysis of RNA-seq splicing and coverage. http://rail.bio

    Ben Langmead; Jeff Leek; Leo Collado-Torres; Andrew Jaffe; José Alquicira Hernández; Summer intern: Jamie Morton; Chris Wilks; Jacob Pritt