to millions of samples)
  – Large experimental crosses
  – Population studies
• BIG Multidimensional Data (Gb to Pb of data)
  – DNA/RNA/Methylation Sequencing
  – Shotgun Proteomics
  – Metabolomics
  – Large-scale phenotyping
• BIG Complexity
  – Dealing with (and exploiting) high genetic diversity
  – Computational challenges (must use cloud or HPC resources)
  – Analytical/Statistical challenges
  – Multiple testing problem
  – What is significant?
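To make the "What is significant?" question concrete, here is a minimal sketch of how false positives pile up when many tests are run at one fixed threshold. It assumes the tests are independent and that every null hypothesis is true. The function names and the 0.05 threshold are illustrative choices, not taken from the slides.

```python
def family_wise_error_rate(alpha: float, n_tests: int) -> float:
    """P(at least one false positive) across n independent tests of true nulls."""
    return 1.0 - (1.0 - alpha) ** n_tests


def expected_false_positives(alpha: float, n_tests: int) -> float:
    """Expected number of tests that pass the threshold by chance alone."""
    return alpha * n_tests


if __name__ == "__main__":
    # One test, a small panel, and a genome-scale screen (sizes are made up).
    for n in (1, 20, 20_000):
        fwer = family_wise_error_rate(0.05, n)
        print(f"{n} tests: FWER={fwer:.3f}, "
              f"expected false positives={expected_false_positives(0.05, n):.0f}")
```

With 20 tests the chance of at least one false positive is already well over one half, which is why an uncorrected p < 0.05 means little in a genome-wide screen.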
• R statistical language
• Python or Perl
• Bash/Linux
• Visualization

Basic Statistics for BIG data
• Distributions, variance, significance, normalization, transformation
• Multiple testing problem
• Linear regression, mixed models, residuals, principal components analysis/singular value decomposition
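As a small illustration of the "linear regression" and "residuals" items, the sketch below fits a straight line by ordinary least squares and computes the residuals, using only the Python standard library. The data values and the names `fit_line`, `residuals`, `dose`, and `response` are made up for illustration. A real analysis would more likely use R's `lm` or a Python statistics package.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def residuals(x, y, intercept, slope):
    """Observed minus fitted values for each data point."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Made-up data for illustration only (e.g. dose vs. response).
dose = [0.0, 1.0, 2.0, 3.0, 4.0]
response = [1.1, 2.9, 5.2, 6.8, 9.0]

intercept, slope = fit_line(dose, response)
res = residuals(dose, response, intercept, slope)
```

For a least squares fit with an intercept, the residuals sum to zero. What matters in practice is whether they show leftover structure, such as a trend or unequal spread, which would suggest the model or the transformation is inadequate.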
trying to publish) BS.
• Understand what a distribution is and how to plot/characterize one: Normal/Gaussian, Poisson, negative binomial (NB)
• What assumptions about distribution variance are made by specific significance tests (e.g. two-tailed Student’s t-test)?
  – Are you comparing two groups (treated/untreated) or a population?
• How do you adjust significance thresholds to correct for multiple tests?
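One way to approach the last question is sketched below with two common corrections: Bonferroni, which controls the family-wise error rate, and Benjamini-Hochberg, which controls the false discovery rate. The slides do not name a specific method, so treating these two as the answer is an assumption. The function names and the example p-values are made up for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag p-values at or below alpha / m, where m is the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values for illustration only.
p = [0.001, 0.008, 0.039, 0.041, 0.6]
bonf_flags = bonferroni(p)
bh_adjusted = benjamini_hochberg(p)
```

Bonferroni is the stricter of the two. With thousands of tests it can discard many real effects, which is why false discovery rate methods are common in sequencing and other high-throughput analyses.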