FAQ, course material, etc.
• 90% hands-on, live coding in class.
  - Bring your laptop and charger every day.
• Heavily focused on R
  - Scientific computing is a lab skill, just like pipetting, cell culture, etc.
  - R’s ecosystem for data analysis, especially in bioinformatics, is simply amazing (more later…)
to R environment
  - Advanced data manipulation
  - Advanced data visualization
  - Reproducible research
• Weeks 5-6: use R for actual analysis
  - Lecture/overview of RNA-seq
  - Analyzing RNA-seq data in R
• ~200 R packages for gene expression
• ~100 R packages just for RNA-seq!
• Each has its own idiosyncrasies, usage, strengths/weaknesses, and goals.
• Many tools that were state of the art in 2015 are obsolete in 2016.
orig: @aaronquinlan
tool X for analysis Y, because you probably won’t do Y, and you almost certainly won’t use tool X next year.
Goal: get comfortable with the scientific computing environment (data manipulation, analysis, reproducible research, external packages, finding help) so you can figure out how to do analysis Y with tool X when you need to.
data analysts inside corporations and academia.
Norman Nie, scholar and co-founder of SPSS: R is “the most powerful and flexible statistical programming language in the world.”
large to process using traditional processing applications (Wikipedia)
  - “Volume, velocity, variety” (Doug Laney, 2001)
  - “When computing the answer takes longer than the cognitive process of designing the model” (Hadley Wickham, R developer)
to fit into memory
• bigmemory: store large objects in memory and in files via external pointers, enabling transparent access from R to large objects.
• pbdMPI: interface to MPI.
• pbdNCDF4: multiple processes can read/write the same file.
• snow (simple network of workstations): abstraction layer, hiding communication details from parallelized processes.
• foreach: iterate over a collection without a loop counter.
• multicore: run parallel computation on computers with multiple cores without explicit user request.
• RHIPE: interface between R and Hadoop.
• BatchJobs: Map/Reduce functionality for HPC systems using Torque/PBS, SGE, LSF, etc.
• gputools: common data-mining algorithms implemented using the NVIDIA CUDA language/library.
• Many, many more at http://cran.r-project.org/web/views/HighPerformanceComputing.html
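To make the snow-style idea concrete, here is a minimal sketch using the `parallel` package that ships with base R (it builds on the snow and multicore work listed above). The function `slow_square` and the choice of two workers are assumptions made purely for illustration, not part of the course material.

```r
library(parallel)  # ships with base R; no installation needed

# Hypothetical stand-in for an expensive per-item computation.
slow_square <- function(x) {
  Sys.sleep(0.1)  # pretend this takes a while
  x^2
}

# Start two worker processes (an arbitrary number chosen for illustration).
cl <- makeCluster(2)

# Distribute the inputs across the workers; results come back in input order.
res <- parLapply(cl, 1:8, slow_square)

# Always release the workers when done.
stopCluster(cl)

squares <- unlist(res)
print(squares)
```

The calling code never deals with how the workers communicate, which is the "abstraction layer" point from the snow bullet. On a real problem, the number of workers would normally be chosen based on the machine, for example with `detectCores()`.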
written in R, shared, and used by others.
• Open-source.
  - Don’t know what a function does? Look at the code yourself.
  - Don’t like how a function works? Hack the code and re-write how it works yourself.
• R packages: extend R with more functions, data, graphics.
  - CRAN: >9,000 packages
  - Bioconductor: >1,000 packages
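As a small illustration of "look at the code yourself", the sketch below inspects the built-in `sd` function and then writes a modified variant. The name `pop_sd` and the example vector are hypothetical, chosen only for this example.

```r
# Typing a function's name without parentheses prints its source code.
sd

# The same information is available programmatically.
args(sd)                   # the arguments the function accepts
src <- deparse(body(sd))   # the function body as text
print(src)

# Don't like how it works? Write your own version.
# pop_sd (hypothetical name) divides by n instead of n - 1,
# giving the population rather than the sample standard deviation.
pop_sd <- function(x) sqrt(mean((x - mean(x))^2))

vals <- c(2, 4, 4, 4, 5, 5, 7, 9)
sd(vals)       # sample standard deviation
pop_sd(vals)   # population standard deviation
```

The same trick works for most functions in installed packages, although some are implemented in compiled code and show only a short wrapper.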
& click interfaces are NOT reproducible.
  - R code is written in a plain text file. Running the same code on the same data should reproduce the exact results.
  - R “scripts” are easily shared.
  - LaTeX, knitr: allow seamless integration of R code into a self-documenting report.
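A minimal sketch of what "same code, same data, same results" looks like in practice. The seed value and the simulated data are arbitrary assumptions for illustration; in a real analysis the data would come from a file.

```r
# A hypothetical analysis step: simulate data, then summarize it.
run_analysis <- function() {
  set.seed(42)               # fix the random number generator (arbitrary seed)
  x <- rnorm(100)            # stand-in for real measurements
  c(mean = mean(x), sd = sd(x))
}

run1 <- run_analysis()
run2 <- run_analysis()

# Because every step is written down as code, rerunning gives identical output.
print(identical(run1, run2))
```

Anything involving random numbers needs a fixed seed to be exactly repeatable; without the `set.seed()` call the two runs would differ.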
jump right in. Learn and run code right in the browser.
http://tryr.codeschool.com/
http://www.rseek.org/
A custom Google search engine for R-related topics.