Introduction to R for Life Scientists

Introduction to R workshop sponsored by the UVA Health Sciences Library.


Stephen Turner

June 25, 2014


  1. Introduction to R. Stephen D. Turner, Ph.D., Bioinformatics Core Director.
    Slides available at
  2. Why R? Because R is awesome.

  3. R is FREE. Free, as in beer. Free, as in speech.
  4. R is FREE. Software cost: $1,140 - $4,370 + maintenance; $8,700 - $140,000 / year; $2,390 - $40,600 / year; $2,150 + $1,000s for modules; $0.
  5. R Community

  6. R Community
    • NYT: R is the “lingua franca” of data analysts inside corporations and academia.
    • Norman Nie, scholar and co-founder of SPSS: R is “the most powerful and flexible statistical programming language in the world.”
  7. R Community
    KDNuggets Poll: Languages used for analytics, data mining, data science.
  8. R Community
    CRAN = Comprehensive R Archive Network. Over 5,000 free add-on packages.
  9. R Community
    • Nearly 1,000 free packages for bioinformatics analysis using R.
    • NGS analysis:
      - Manipulate: import FASTQ/BAM, trim, transform, align, manipulate sequences, …
      - Applications: Quality Assessment, ChIP-seq, differential expression, RNA-seq, much more.
      - Annotation: gene, pathway, GO, homology, … Access GO, KEGG, NCBI, Biomart, UCSC, …
    • Much, much more: flow cytometry, DNA methylation, microarrays, TFBS analysis, eQTL analysis, functional annotation, …
    • BioC Community: Conferences (since 2002), mailing list, …
  10. R Community
    Companies using R:

  11. Amazing Graphics

  12. qplot(carat, price, data=diamonds, facets=clarity~color)

  13. manhattan(data, annotate=snps)



  16. “Visualizing Friendships”

  17. The Arteries of the World, in Tweets


  19. Wind speeds. Inspired by, currently on display at NYC MOMA. @cambecc
    Prosperity in France. 2011/12/11/ah-36-000-communes/ (Article in French). @coulmont
  20. K-means Clustering 86 Single Malt Scotch Whiskies clustering-86-single-malt-scotch-whiskies.html

  21. R and BIG DATA

  22. R and Big Data
    • What’s big data?
      - Too large to process using traditional processing applications (Wikipedia)
      - “Volume, velocity, variety” (Doug Laney, 2001)
      - “When computing the answer takes longer than the cognitive process of designing the model” (Hadley Wickham, R developer)
  23. R and Big Data
    • ff: access datasets too large to fit into memory
    • bigmemory: store large objects in memory and files with external pointers, enabling transparent access from R to large objects.
    • pbdMPI: interface to MPI
    • pbdNCDF4: multiple processes can read/write the same file
    • snow (simple network of workstations): abstraction layer, hiding communication details from parallelized processes.
    • foreach: iterate over a collection without a loop counter.
    • multicore: run parallel computation on computers with multiple cores without explicit user request.
    • RHIPE: interface between R and Hadoop
    • BatchJobs: Map/Reduce functionality for HPC systems using Torque/PBS, SGE, LSF, etc.
    • gputools: common data-mining algorithms implemented using the nVidia CUDA language/library
    • Many, many more at
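
As a rough illustration of the snow-style approach listed above, here is a minimal sketch using the `parallel` package that ships with R and provides a snow-style cluster interface. The worker count and the toy squaring function are assumptions chosen only for the example, not something taken from the slides.

```r
library(parallel)

# Start two local worker processes (the count is arbitrary for this sketch).
cl <- makeCluster(2)

# Apply a function to each element; the work is spread across the workers,
# and the communication details are hidden behind parLapply().
squares <- parLapply(cl, 1:4, function(i) i^2)

# Shut the workers down when finished.
stopCluster(cl)

# parLapply() returns a list; flatten it to a numeric vector.
squares <- unlist(squares)
```

The same pattern should scale from a toy vector to a list of files or samples, though the best number of workers depends on the machine.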
  24. R as a programming language

  25. R as a programming language
    • New tools/procedures can be written in R, shared, and used by others.
    • Open-source.
      - Don’t know what a function does? Look at the code yourself.
      - Don’t like how a function works? Hack the code and re-write how it works yourself.
    • R packages: Extend R with more functions, data, graphics.
      - CRAN: >5,000 packages
      - Bioconductor: ~1,000 packages
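
"Look at the code yourself" can be tried directly at the console. The sketch below uses the base function `sd` purely as an example; the variable names are hypothetical. Functions implemented in compiled code will show a call such as `.Internal()` or `.Primitive()` instead of readable R source.

```r
# Typing a function's name without parentheses prints its definition.
sd

# deparse() returns the same source as a character vector,
# which is handy for searching it programmatically.
sd_source <- deparse(sd)

# Check whether sd() is built on top of var().
uses_var <- any(grepl("var(", sd_source, fixed = TRUE))
```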
  26. R as a programming language: Integration with other tools
    • twitteR: integration with Twitter
    • Call R from Python
    • Call Python from R
      - http://rpython.r-forge.r-
    • Access a MySQL database (RMySQL)
    • Google Maps API (RgoogleMaps)
    • Interact with Garmin data / Strava API
  27. R as a programming language
    • Reproducible research
      - Point & click interfaces are NOT reproducible.
      - R code is written in a plain text file. Running the same code on the same data should reproduce the exact results.
      - R “scripts” are easily shared.
      - LaTeX, knitr: Allow seamless integration of R code into a self-documenting report.
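
A minimal sketch of what "same code, same data, same results" can look like in a script. The function name `run_analysis` and the simulated data are hypothetical stand-ins for a real analysis; fixing the random seed is one common way to make steps that involve random numbers repeatable.

```r
# Hypothetical analysis step: simulate 10 measurements and summarize them.
run_analysis <- function(seed) {
  set.seed(seed)            # fix the random number generator state
  x <- rnorm(10, mean = 5)  # simulated data stands in for a real dataset
  mean(x)
}

# Running the same code with the same seed twice gives identical results.
first_run   <- run_analysis(seed = 2014)
second_run  <- run_analysis(seed = 2014)
same_result <- identical(first_run, second_run)
```

Saved as a plain text script, this can be shared and re-run by anyone with R installed.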
  28. Demo: Reproducible Research with R

  29. Resources

  30. Resources
    R Mailing List:
    Bioconductor Mailing List:

  31. Resources
    Programming Q&A Site. Over 40,000 questions tagged with “R”:
    CrossValidated Statistics Q&A Site. Over 1,000 questions tagged with “R”:
  32. Resources
    Computing for Data Analysis. R Programming: Roger Peng: All videos on YouTube:
  33. Resources
    TryR: A short, interactive course to let you jump right in. Learn and run code right in the browser.
    A custom Google search engine for R-related topics.
  34. Resources
    RStudio: A beautiful, free, full-featured IDE. Panes: Editor, Console, Workspace, Graphics.
  35. Resources
    • Quick-R: short examples, code:
    • University Resources:
    • Find the right package:
      - CRAN Tasks:
      - Bioconductor:
      - CRANtastic:
    • Cheat Sheets:
    • Aggregated feed of 450 R blogs:
    • More:
  36. Local Resources: StatLab PHS@HSL Stats Questions?

  37. Local Resources
    Web: E-Mail: Blog: Twitter: @genetics_blog Facebook:
  38. Local Resources Turner Arnold Ragon & Harrison T A R