(genomic) cause, if any, of X disease? Currently: What is the difference between two groups? Tools – Large number of biological samples – High-throughput sequencing – Reverse transcriptase – Biostatistics – Computers
sequence fragment end(s) or paired-‐end: single-‐end: map reads to reference genome summarize mapped reads into feature-‐ level abundances (gene, exon, transcript…) normalize abundances test each feature for differenJal expression Oshlack et al. Genome Biology 2010 Adapted from @acfrazee
abundances 4. analyze Pros: • analogous to microarray, • processed data easy to handle Cons: • incorrect/variable annotation • gene model choices have a big impact assemble-identify 1. align 2. assemble 3. abundances 4. analyze Pros: • alternative transcription • (potentially) less annotation dependent Cons: • ambiguity/variation in assembly Adapted from @leekgroup
(Box-Cox, log(+32) etc.) Yi,j = coverage on sample i at base j lj = genomic location j α() = baseline coverage β() = change in coverage between groups Xi = covariate of interest for sample i γk ()= adjustment’s for confounders Wik = value of kth confounder on ith sample Frazee et al. Biostatistics in review Adapted from @leekgroup base-‐pair model (case/control)
= 0 s(l) = statistic for location l • HMM with three states D(l) = hidden state at location l D(l) = 0 if α(l) = β(l) = 0 (no expression) D(l) = 1 if α(l) ≠ 0 and β(l) = 0 (expressed, but no group difference) D(l) = 2 if β(l) ≠ 0 (group difference!) • Assume statistics conditional on state is Normal s(l)|D(l) = d ~ (µd ,σd 2) f0 : µ0 =0, σ0 2=ε f1 : µ1 =0, σ1 2 f2 : µ2 , σ2 2 • Estimate parameters using a mixture model t ~ π0 f0 + π1 f1 + π2 f2 Frazee et al. Biostatistics in review Adapted from @leekgroup
log2(count+32) chrY: 15016699 - 15017219 female male 1 2 3 4 5 6 7 t statistic xaxinds exons states 15016742 15016842 15016942 15017119 15017219 genomic position Frazee et al. Biostatistics in review Adapted from @leekgroup
log2(count+32) chrY: 2715932-2716691 female male 2.0 2.5 3.0 3.5 4.0 4.5 t statistic xaxinds exons states 2715882 2716082 2716282 2716482 2716682 genomic position Frazee et al. Biostatistics in review Adapted from @leekgroup
…) • Processing coverage by chr into R via Rsamtools • Memory reduction via IRanges – 3 to 43 GB RAM per chr for 30 samples – Alternative: sql database • Linear model fitting via limma • HMM via HiddenMarkov • Custom plotting functions • To-do: reduce wallclock time