BIMS 8382 Intro

Stephen Turner
February 15, 2016

Slides for first meeting of BIMS 8382: Introduction to Biological Data Science

Transcript

  1. INTRODUCTION TO BIOLOGICAL DATA SCIENCE. BIMS 8382, bioconnector.org/bims8382. Stephen D. Turner, Ph.D., University of Virginia Bioinformatics Core Director, bioinformatics.virginia.edu
  2. About the course
    • See the course website (bioconnector.org/bims8382) for syllabus, FAQ, course material, etc.
    • 90% hands-on, live coding in class.
      - Bring your laptop and charger every day.
    • Heavily focused on R
      - Scientific computing is a lab skill, just like pipetting, cell culture, etc.
      - R’s ecosystem for data analysis, especially in bioinformatics, is simply amazing (more later…)
  3. About the course
    • Weeks 1-4: learn R
      - Introduction to R environment
      - Advanced data manipulation
      - Advanced data visualization
      - Reproducible research
    • Weeks 5-6: use R for actual analysis
      - Lecture/overview of RNA-seq
      - Analyzing RNA-seq data in R
  4. But why don’t we spend more time learning tool X for analysis Y?
  5. We Are in the Middle of a New Movement in Genomics
    Genomics & bioinformatics are advancing at a grueling pace
    • New questions
    • New study designs
    • New technologies, new [fill-in-the-blank]-seq
  6.

  7. dsRNA-Seq FRAG-Seq SHAPE-Seq PARTE-Seq PARS-Seq Structure-Seq DMS-Seq Cir-Seq Dup-Seq Nucleo-Seq DNAse-Seq DNAseI-Seq Sono-Seq FAIRE-Seq NOMe-Seq ATAC-Seq RAD-Seq Freq-Seq CNV-Seq Novel-Seq TAm-Seq Repli-Seq ARS-Seq Sort-Seq Pool-Seq Bubble-Seq RNA-Seq GRO-Seq Quartz-Seq CAGE-Seq Nascent-Seq Cel-Seq 3P-Seq NET-Seq SS3-Seq FRT-Seq 3-Seq PRO-Seq Bru-Seq TIF-Seq TIVA-Seq Smart-Seq PAS-Seq PAL-Seq Ribo-Seq Frac-Seq GTI-Seq SELEX-Seq CRE-Seq STARR-Seq SRE-Seq CLASH-Seq ChIRP-Seq CHART-Seq RAP-Seq RIP-Seq iCLIP-Seq PTB-Seq ChIP-Seq ChIP-Seq PB-Seq PDZ-Seq PD-Seq Chem-Seq CAB-Seq HELP-Seq TAB-Seq TAmC-Seq fCAB-Seq MeDIP-Seq Methyl-Seq oxBS-Seq RBBS-Seq BS-Seq BisChIP-Seq Bar-Seq TraDI-Seq Tn-Seq IN-Seq Immuno-Seq mutARS-Seq Ig-Seq Ig-seq Ren-Seq Mu-Seq Stable-Seq WIMP-Seq BOINC-Seq
    https://liorpachter.wordpress.com/seq/
  8. E.g.: tools for short read alignment. http://www.ebi.ac.uk/~nf/hts_mappers/

  9. E.g.: workflows for RNA-seq. Eyras et al. Methods to Study Splicing from RNA-Seq. http://dx.doi.org/10.6084/m9.figshare.679993
  10. Software tool catalogs
    • http://omictools.com/ : >10,000 software tools for -omics data analysis
    • http://www.mybiosoftware.com/biology-software-list : >12,000 software tools, publications
  11. • 9000+ R packages available
    • >1000 bioinformatics-specific R packages
    • ~200 R packages for gene expression
    • ~100 R packages just for RNA-seq!
    • Each has its own idiosyncrasies, usage, strengths/weaknesses, and goals.
    • Many tools that were state of the art in 2015 are obsolete in 2016. (orig: @aaronquinlan)
  12. Goal of this course: I’m NOT going to teach you tool X for analysis Y, because you probably won’t do Y, and you almost certainly won’t use tool X next year. Goal: get comfortable with the scientific computing environment (data manipulation, analysis, reproducible research, external packages, finding help) so you can figure out how to do analysis Y with tool X when you need to.
  13. Let’s get started with R!

  14. Why R? Because R is awesome.

  15. R is FREE. Free, as in beer. Free, as in speech.
  16. R is FREE.
    Software cost:
    • $1,140 - $4,370 + maintenance
    • $8,700 - $140,000 / year
    • $2,390 - $40,600 / year
    • $2,150 + $1,000s for modules
    • $0
  17. R Community

  18. R Community
    • NYT: R is the “lingua franca” of data analysts inside corporations and academia.
    • Norman Nie, scholar and co-founder of SPSS: R is “the most powerful and flexible statistical programming language in the world.”
  19. R Community
    CRAN = Comprehensive R Archive Network. http://cran.us.r-project.org/
    Over 9,000 free add-on packages.
  20. R Community
    • Over 1000 free packages for bioinformatics analysis using R.
    • NGS analysis:
      - Manipulate: import FASTQ/bam, trim, transform, align, manipulate sequences, …
      - Applications: Quality Assessment, ChIP-seq, differential expression, RNA-seq, much more.
      - Annotation: gene, pathway, GO, homology, … Access GO, KEGG, NCBI, Biomart, UCSC, …
    • Much, much more: flow cytometry, DNA methylation, microarrays, TFBS analysis, eQTL analysis, functional annotation, …
    • BioC Community: Conferences (since 2002), mailing list, …
    • http://bioconductor.org/
  21. Amazing Graphics edwardtufte.com

  22. ggplot(data=diamonds, aes(x=carat, y=price)) + geom_point() + facet_grid(clarity~color) GettingGeneticsDone.com

  23. manhattan(data, annotate=snps) cran.r-project.org/web/packages/qqman/

  24. http://nyti.ms/1pvhu2t http://nyti.ms/1ff2bWa

  25. “Visualizing Friendships” http://on.fb.me/MlI0NI

  26. The Arteries of the World, in Tweets https://blog.twitter.com/2013/the-geography-of-tweets

  27. K-means Clustering 86 Single Malt Scotch Whiskies http://blog.revolutionanalytics.com/2013/12/k-means-clustering-86-single-malt-scotch-whiskies.html
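
As a rough illustration of the technique named on this slide, here is a minimal k-means sketch using base R's `kmeans()`. It does not use the whisky data from the linked post. The built-in `iris` measurements stand in as an assumed example dataset, and the choice of three clusters is likewise an assumption made for illustration.

```r
# Stand-in data (assumption): the four numeric columns of the built-in iris
# dataset, scaled so each variable contributes equally to distances.
measurements <- scale(iris[, 1:4])

# k-means starts from random centers, so fix the seed for repeatable output.
set.seed(42)
fit <- kmeans(measurements, centers = 3, nstart = 25)

# Cross-tabulate cluster assignments against the known species labels.
print(table(cluster = fit$cluster, species = iris$Species))
```

The `nstart = 25` argument reruns the algorithm from 25 random starts and keeps the best solution, which helps avoid a poor local optimum.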

  28. R and BIG DATA

  29. R and Big Data
    • What’s big data?
      - Too large to process using traditional processing applications (Wikipedia)
      - “Volume, velocity, variety” (Doug Laney, 2001)
      - “When computing the answer takes longer than the cognitive process of designing the model” (Hadley Wickham, R developer)
  30. R and Big Data
    • ff: access datasets too large to fit into memory
    • bigmemory: store large objects in memory and files with external pointer, enabling transparent access from R to large objects.
    • pbdMPI: interface to MPI
    • pbdNCDF4: multiple processes can read/write the same file
    • snow (simple network of workstations): abstraction layer, hiding communication details from parallelized processes.
    • foreach: iterate over a collection without a loop counter.
    • multicore: run parallel computation on computers with multiple cores without explicit user request.
    • RHIPE: interface between R and Hadoop
    • BatchJobs: Map/Reduce functionality for HPC systems using Torque/PBS, SGE, LSF, etc.
    • gputools: common data-mining algorithms implemented using the NVIDIA CUDA language/library
    • Many, many more at http://cran.r-project.org/web/views/HighPerformanceComputing.html
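
To make the multi-core idea above concrete, here is a small sketch using the `parallel` package that ships with R, which took over much of the functionality of multicore and snow. The bootstrap workload, the helper name `boot_mean`, and the choice of two cores are all assumptions made for illustration, not something from the slides.

```r
library(parallel)  # part of the standard R distribution

# Hypothetical workload: the mean of one bootstrap resample of x.
boot_mean <- function(i, x) mean(sample(x, replace = TRUE))

RNGkind("L'Ecuyer-CMRG")  # random number generator suited to parallel streams
set.seed(1)
x <- rnorm(1000)

# mclapply() forks the R process, which Windows cannot do,
# so fall back to a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

boot_means <- unlist(mclapply(seq_len(200), boot_mean, x = x, mc.cores = n_cores))

summary(boot_means)
```

Because `mclapply()` has the same calling pattern as `lapply()`, a serial loop can often be parallelized by swapping the function name and adding `mc.cores`.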
  31. R as a programming language

  32. R as a programming language
    • New tools/procedures can be written in R, shared, and used by others.
    • Open-source.
      - Don’t know what a function does? Look at the code yourself.
      - Don’t like how a function works? Hack the code and re-write how it works yourself.
    • R packages: Extend R with more functions, data, graphics.
      - CRAN: >9,000 packages
      - Bioconductor: >1,000 packages
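
The "look at the code yourself" point can be tried directly at the console. The sketch below inspects the built-in `sd()` and then writes a modified version. The variant `pop_sd` and the `scores` vector are hypothetical names introduced only for this example.

```r
# Typing a function's name without parentheses prints its source code.
sd

# body() and formals() return the pieces programmatically.
body(sd)
formals(sd)

# Hypothetical rewrite: a population standard deviation (divides by n),
# where the built-in sd() divides by n - 1.
pop_sd <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]
  sqrt(mean((x - mean(x))^2))
}

scores <- c(2, 4, 4, 4, 5, 5, 7, 9)
sd(scores)      # sample standard deviation
pop_sd(scores)  # population standard deviation
```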
  33. R as a programming language
    • Reproducible research
      - Point & click interfaces are NOT reproducible.
      - R code is written in a plain text file. Running the same code on the same data should reproduce the exact results.
      - R “scripts” are easily shared.
      - LaTeX, knitr: allow seamless integration of R code into a self-documenting report.
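
A minimal sketch of what "same code, same data, same results" can look like in a script. The function name `run_analysis`, the simulated data, and the seed value are assumptions made for illustration. Fixing the random seed is one common way to make an analysis that involves randomness repeatable.

```r
# Hypothetical analysis step that involves randomness.
run_analysis <- function(seed) {
  set.seed(seed)            # fix the random number stream
  x <- rnorm(50)
  y <- 2 * x + rnorm(50)
  coef(lm(y ~ x))           # fitted intercept and slope
}

first_run  <- run_analysis(2016)
second_run <- run_analysis(2016)

# Running the same code on the same inputs gives identical estimates.
identical(first_run, second_run)

# Record the R version and loaded packages alongside the results.
sessionInfo()
```

Saving the `sessionInfo()` output with a report makes it easier for someone else to rebuild the same software environment later.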
  34. Demo: Reproducible Research with R

  35. Resources

  36. Resources
    • http://blog.revolutionanalytics.com/
    • R Mailing List: http://www.r-project.org/mail.html
    • Bioconductor Mailing List: http://www.bioconductor.org/help/mailing-list/
  37. Resources
    • Programming Q&A Site. Over 40,000 questions tagged with “R”: http://stackoverflow.com/
    • CrossValidated Statistics Q&A Site. Over 1,000 questions tagged with “R”: http://stats.stackexchange.com/
  38. Resources
    • Computing for Data Analysis: https://www.coursera.org/course/compdata
    • R Programming: https://www.coursera.org/course/rprog
    • Roger Peng: All videos on YouTube: http://www.youtube.com/user/rdpeng/videos
  39. Resources
    • TryR: A short, interactive course to let you jump right in. Learn and run code right in the browser. http://tryr.codeschool.com/
    • http://www.rseek.org/ : A custom Google search engine for R-related topics.
  40. Resources
    RStudio: A beautiful, free, full-featured IDE (panes: Editor, Console, Workspace, Graphics). http://www.rstudio.com/
  41. Resources: bioconnector.org/bims8382/help

  42. Local Resources
    • StatLab: statlab.library.virginia.edu
    • PHS@HSL (Stats Questions?): bioconnector.virginia.edu/phs
    • DASH (Data Analysis Support Hub): bioconnector.virginia.edu/dash
  43. Local Resources
    • Web: bioinformatics.virginia.edu
    • E-Mail: bioinformatics@virginia.edu
    • Blog: GettingGeneticsDone.com
    • Twitter: @genetics_blog
    • Facebook: facebook.com/UVABioinformaticsCore