BIMS 8382 Intro

Stephen Turner
February 15, 2016

Slides for first meeting of BIMS 8382: Introduction to Biological Data Science

Transcript

  1. INTRODUCTION TO BIOLOGICAL DATA SCIENCE
    BIMS 8382
    bioconnector.org/bims8382
    Stephen D. Turner, Ph.D.
    University of Virginia Bioinformatics Core Director
    bioinformatics.virginia.edu
  2. About the course
    • See course website (bioconnector.org/bims8382) for syllabus, FAQ, course material, etc.
    • 90% hands-on, live coding in class.
      - Bring your laptop and charger every day.
    • Heavily focused on R
      - Scientific computing is a lab skill, just like pipetting, cell culture, etc.
      - R’s ecosystem for data analysis, especially in bioinformatics, is simply amazing (more later…)
  3. About the course
    • Weeks 1-4: learn R
      - Introduction to R environment
      - Advanced data manipulation
      - Advanced data visualization
      - Reproducible research
    • Weeks 5-6: use R for actual analysis
      - Lecture/overview of RNA-seq
      - Analyzing RNA-seq data in R
  4. We Are in the Middle of a New Movement in Genomics
    Genomics & bioinformatics are advancing at a grueling pace
    • New questions
    • New study designs
    • New technologies, new [fill-in-the-blank]-seq
  5. [Image-only slide; no text in transcript]
  6. dsRNA-Seq FRAG-Seq SHAPE-Seq PARTE-Seq PARS-Seq Structure-Seq DMS-Seq Cir-Seq Dup-Seq Nucleo-Seq DNAse-Seq DNAseI-Seq Sono-Seq FAIRE-Seq NOMe-Seq ATAC-Seq RAD-Seq Freq-Seq CNV-Seq Novel-Seq TAm-Seq Repli-Seq ARS-Seq Sort-Seq Pool-Seq Bubble-Seq RNA-Seq GRO-Seq Quartz-Seq CAGE-Seq Nascent-Seq Cel-Seq 3P-Seq NET-Seq SS3-Seq FRT-Seq 3-Seq PRO-Seq Bru-Seq TIF-Seq TIVA-Seq Smart-Seq PAS-Seq PAL-Seq Ribo-Seq Frac-Seq GTI-Seq SELEX-Seq CRE-Seq STARR-Seq SRE-Seq CLASH-Seq ChIRP-Seq CHART-Seq RAP-Seq RIP-Seq iCLIP-Seq PTB-Seq ChIP-Seq ChIP-Seq PB-Seq PDZ-Seq PD-Seq Chem-Seq CAB-Seq HELP-Seq TAB-Seq TAmC-Seq fCAB-Seq MeDIP-Seq Methyl-Seq oxBS-Seq RBBS-Seq BS-Seq BisChIP-Seq Bar-Seq TraDI-Seq Tn-Seq IN-Seq Immuno-Seq mutARS-Seq Ig-Seq Ig-seq Ren-Seq Mu-Seq Stable-Seq WIMP-Seq BOINC-Seq
    https://liorpachter.wordpress.com/seq/
  7. E.g.: workflows for RNA-seq
    Eyras et al. Methods to Study Splicing from RNA-Seq. http://dx.doi.org/10.6084/m9.figshare.679993
  8. • 9000+ R packages available
    • >1000 bioinformatics-specific R packages
    • ~200 R packages for gene expression
    • ~100 R packages just for RNA-seq!
    • Each has its own idiosyncrasies, usage, strengths/weaknesses, goals.
    • Many tools that were state of the art in 2015 are obsolete in 2016.
    orig: @aaronquinlan
  9. Goal of this course:
    I’m NOT going to teach you tool X for analysis Y, because you probably won’t do Y, and you almost certainly won’t use tool X next year.
    Goal: get comfortable with the scientific computing environment (data manipulation, analysis, reproducible research, external packages, finding help) so you can figure out how to do analysis Y with tool X when you need to.
  10. R is FREE.
    Software cost:
    • $1,140 - $4,370 + maintenance
    • $8,700 - $140,000 / year
    • $2,390 - $40,600 / year
    • $2,150 + $1,000s for modules
    • $0
  11. R Community
    NYT: R is the “lingua franca” of data analysts inside corporations and academia.
    Norman Nie, scholar and co-founder of SPSS: R is “the most powerful and flexible statistical programming language in the world.”
  12. R Community
    • Over 1000 free packages for bioinformatics analysis using R.
    • NGS analysis:
      - Manipulate: import FASTQ/bam, trim, transform, align, manipulate sequences, …
      - Applications: Quality Assessment, ChIP-seq, differential expression, RNA-seq, much more.
      - Annotation: gene, pathway, GO, homology, … Access GO, KEGG, NCBI, Biomart, UCSC, …
    • Much, much more: flow cytometry, DNA methylation, microarrays, TFBS analysis, eQTL analysis, functional annotation, …
    • BioC Community: Conferences (since 2002), mailing list, …
    • http://bioconductor.org/
  13. R and Big Data
    • What’s big data?
      - Too large to process using traditional processing applications (Wikipedia)
      - “Volume, velocity, variety” (Doug Laney, 2001)
      - “When computing the answer takes longer than the cognitive process of designing the model” (Hadley Wickham, R developer)
  14. R and Big Data
    • ff: access datasets too large to fit into memory
    • bigmemory: store large objects in memory and files with external pointer, enabling transparent access from R to large objects.
    • pbdMPI: Interface to MPI
    • pbdNCDF4: multiple processes can read/write same file
    • snow (simple network of workstations): abstraction layer, hiding communication details from parallelized processes.
    • foreach: iterate over a collection without loop counter.
    • multicore: run parallel computation on computers with multiple cores without explicit user request.
    • RHIPE: interface between R and Hadoop
    • BatchJobs: Map/Reduce functionality to HPC systems using Torque/PBS, SGE, LSF, etc.
    • gputools: common data-mining algorithms implemented using nVidia CUDA language/library
    • Many, many more at http://cran.r-project.org/web/views/HighPerformanceComputing.html
  15. R as a programming language
    • New tools/procedures can be written in R, shared, and used by others.
    • Open-source.
      - Don’t know what a function does? Look at the code yourself.
      - Don’t like how a function works? Hack the code and re-write how it works yourself.
    • R packages: Extend R with more functions, data, graphics.
      - CRAN: >9,000 packages
      - Bioconductor: >1,000 packages
  16. R as a programming language
    • Reproducible research
      - Point & click interfaces are NOT reproducible.
      - R code is written in a plain text file. Running the same code on the same data should reproduce exact results.
      - R “scripts” are easily shared.
      - LaTeX, knitr: Allow seamless integration of R code into a self-documenting report.
  17. Resources
    Programming Q&A Site. Over 40,000 questions tagged with “R”: http://stackoverflow.com/
    CrossValidated Statistics Q&A Site. Over 1,000 questions tagged with “R”: http://stats.stackexchange.com/
  18. Resources
    TryR: A short, interactive course to let you jump right in. Learn and run code right in the browser. http://tryr.codeschool.com/
    A custom Google search engine for R-related topics. http://www.rseek.org/