Slide 1

Slide 1 text

Introduction to R
Stephen D. Turner, Ph.D., Bioinformatics Core Director
Slides available at stephenturner.us/slides

Slide 2

Slide 2 text

Why R? Because R is awesome.

Slide 3

Slide 3 text

R is FREE. Free, as in beer. Free, as in speech.

Slide 4

Slide 4 text

R is FREE.
Software cost:
• $1,140 - $4,370 + maintenance
• $8,700 - $140,000 / year
• $2,390 - $40,600 / year
• $2,150 + $1,000s for modules
• $0

Slide 5

Slide 5 text

R Community

Slide 6

Slide 6 text

R Community
NYT: R is the “lingua franca” of data analysts inside corporations and academia.
Norman Nie, scholar and co-founder of SPSS: R is “the most powerful and flexible statistical programming language in the world.”

Slide 7

Slide 7 text

R Community
KDNuggets Poll: Languages used for analytics, data mining, data science.
http://www.kdnuggets.com/2013/08/languages-for-analytics-data-mining-data-science.html
http://r4stats.com/

Slide 8

Slide 8 text

R Community
CRAN = Comprehensive R Archive Network
http://cran.us.r-project.org/
Over 5,000 free add-on packages.

Slide 9

Slide 9 text

R Community
• Nearly 1,000 free packages for bioinformatics analysis using R.
• NGS analysis:
- Manipulate: import FASTQ/BAM, trim, transform, align, manipulate sequences, …
- Applications: Quality Assessment, ChIP-seq, differential expression, RNA-seq, much more.
- Annotation: gene, pathway, GO, homology, … Access GO, KEGG, NCBI, Biomart, UCSC, …
• Much, much more: flow cytometry, DNA methylation, microarrays, TFBS analysis, eQTL analysis, functional annotation, …
• BioC Community: Conferences (since 2002), mailing list, …
• http://bioconductor.org/

Slide 10

Slide 10 text

R Community
Companies using R: revolutionanalytics.com/companies-using-r

Slide 11

Slide 11 text

Amazing Graphics edwardtufte.com

Slide 12

Slide 12 text

qplot(carat, price, data=diamonds, facets=clarity~color) GettingGeneticsDone.com

Slide 13

Slide 13 text

manhattan(data, annotate=snps) github.com/stephenturner/qqman

Slide 14

Slide 14 text

http://nyti.ms/1ff2bWa

Slide 15

Slide 15 text

http://nyti.ms/1pvhu2t

Slide 16

Slide 16 text

“Visualizing Friendships” http://on.fb.me/MlI0NI

Slide 17

Slide 17 text

The Arteries of the World, in Tweets https://blog.twitter.com/2013/the-geography-of-tweets

Slide 18

Slide 18 text

http://spatial.ly/

Slide 19

Slide 19 text

Wind speeds. Inspired by http://hint.fm/wind/, currently on display at NYC MoMA. @cambecc
Prosperity in France. http://coulmont.com/blog/2011/12/11/ah-36-000-communes/ (Article in French). @coulmont

Slide 20

Slide 20 text

K-means Clustering 86 Single Malt Scotch Whiskies
http://blog.revolutionanalytics.com/2013/12/k-means-clustering-86-single-malt-scotch-whiskies.html
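The whisky data set from the linked post is not reproduced here, so as a minimal sketch of the same technique the example below runs base R's kmeans() on the built-in iris measurements. The choice of iris, the three clusters, and the seed are assumptions made purely for illustration.

```r
# Stand-in data (assumption): the four numeric iris columns, not the whisky data.
features <- scale(iris[, 1:4])   # standardize so no column dominates the distance

set.seed(1)                      # k-means starts from random centers
fit <- kmeans(features, centers = 3, nstart = 25)

# Cross-tabulate cluster assignments against the known species labels.
table(cluster = fit$cluster, species = iris$Species)
```

Standardizing first matters because k-means groups points by Euclidean distance, so a column on a larger scale would otherwise dominate the result.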

Slide 21

Slide 21 text

R and BIG DATA

Slide 22

Slide 22 text

R and Big Data
• What’s big data?
- Too large to process using traditional processing applications (Wikipedia)
- “Volume, velocity, variety” (Doug Laney, 2001)
- “When computing the answer takes longer than the cognitive process of designing the model” (Hadley Wickham, R developer)

Slide 23

Slide 23 text

R and Big Data
• ff: access datasets too large to fit into memory.
• bigmemory: store large objects in memory and files with external pointer, enabling transparent access from R to large objects.
• pbdMPI: interface to MPI.
• pbdNCDF4: multiple processes can read/write the same file.
• snow (simple network of workstations): abstraction layer, hiding communication details from parallelized processes.
• foreach: iterate over a collection without a loop counter.
• multicore: run parallel computation on computers with multiple cores without explicit user request.
• RHIPE: interface between R and Hadoop.
• BatchJobs: Map/Reduce functionality for HPC systems using Torque/PBS, SGE, LSF, etc.
• gputools: common data-mining algorithms implemented using the NVIDIA CUDA language/library.
• Many, many more at http://cran.r-project.org/web/views/HighPerformanceComputing.html
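As a rough sketch of what these packages make possible: much of snow and multicore was later folded into the parallel package that ships with R (since version 2.14.0). The example below assumes two local worker processes and a toy squaring function, and spreads the computation over a small cluster.

```r
library(parallel)

# Assumption: two local worker processes; adjust to the cores available.
cl <- makeCluster(2)

# Apply a toy function to each input on the workers, then collect the results.
squares <- parLapply(cl, 1:4, function(i) i^2)

stopCluster(cl)                  # always release the workers

unlist(squares)                  # 1 4 9 16
```

For a task this small the cost of starting workers outweighs any gain; the pattern pays off when each call does substantial work.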

Slide 24

Slide 24 text

R as a programming language

Slide 25

Slide 25 text

R as a programming language
• New tools/procedures can be written in R, shared, and used by others.
• Open-source.
- Don’t know what a function does? Look at the code yourself.
- Don’t like how a function works? Hack the code and re-write how it works yourself.
• R packages: Extend R with more functions, data, graphics.
- CRAN: >5,000 packages
- Bioconductor: ~1,000 packages
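To make the open-source point concrete, here is a small sketch using base R's sd(). The wrapper name sd_na is hypothetical and not part of R; it only illustrates changing a default you do not like.

```r
# Typing a function's name without parentheses prints its source code.
sd

# A hypothetical variant (not part of R): same calculation, but missing
# values are dropped by default.
sd_na <- function(x) sd(x, na.rm = TRUE)

sd_na(c(1, 2, 3, NA))            # 1
```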

Slide 26

Slide 26 text

R as a programming language
Integration with other tools
• twitteR: integration with Twitter
- github.com/stephenturner/twitterchive
• Call R from Python
- http://rpy.sourceforge.net/
• Python from R
- http://rpython.r-forge.r-project.org/
• Access a MySQL database (RMySQL)
• Google Maps API (RgoogleMaps)
• Interact with Garmin data / Strava API
- github.com/stephenturner/trailprofile

Slide 27

Slide 27 text

R as a programming language
• Reproducible research
- Point & click interfaces are NOT reproducible.
- R code is written in a plain text file. Running the same code on the same data should reproduce the exact results.
- R “scripts” are easily shared.
- LaTeX, knitr: Allow seamless integration of R code into a self-documenting report.
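A minimal sketch of the "same code, same data, same results" idea follows. The function name run_analysis, the simulated data, and the seed value are all assumptions chosen for illustration; the point is only that a script with a fixed seed gives identical output on every run.

```r
# Hypothetical analysis: simulate data and fit a line. The seed fixes the
# random numbers, so every run produces the same coefficients.
run_analysis <- function(seed = 42) {
  set.seed(seed)
  x <- rnorm(100)
  y <- 2 * x + rnorm(100)
  coef(lm(y ~ x))
}

first  <- run_analysis()
second <- run_analysis()
identical(first, second)         # TRUE
```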

Slide 28

Slide 28 text

Demo: Reproducible Research with R

Slide 29

Slide 29 text

Resources

Slide 30

Slide 30 text

Resources
http://blog.revolutionanalytics.com/
R Mailing List: http://www.r-project.org/mail.html
Bioconductor Mailing List: http://www.bioconductor.org/help/mailing-list/

Slide 31

Slide 31 text

Resources
Stack Overflow: Programming Q&A Site. Over 40,000 questions tagged with “R”: http://stackoverflow.com/
CrossValidated: Statistics Q&A Site. Over 1,000 questions tagged with “R”: http://stats.stackexchange.com/

Slide 32

Slide 32 text

Resources
Computing for Data Analysis: https://www.coursera.org/course/compdata
R Programming: https://www.coursera.org/course/rprog
Roger Peng: All videos on YouTube: http://www.youtube.com/user/rdpeng/videos

Slide 33

Slide 33 text

Resources
TryR: A short, interactive course to let you jump right in. Learn and run code right in the browser. http://tryr.codeschool.com/
RSeek: A custom Google search engine for R-related topics. http://www.rseek.org/

Slide 34

Slide 34 text

Resources
RStudio: A beautiful, free, full-featured IDE. http://www.rstudio.com/
Panes: Editor, Console, Workspace, Graphics

Slide 35

Slide 35 text

Resources
• Quick-R: short examples, code: http://www.statmethods.net/
• University Resources:
- http://www.ats.ucla.edu/stat/r/
- http://data.princeton.edu/R/gettingStarted.html
- http://biostat.mc.vanderbilt.edu/wiki/Main/RS
• Find the right package:
- CRAN Tasks: http://cran.r-project.org/web/views/
- Bioconductor: http://www.bioconductor.org/packages/2.13/bioc/
- CRANtastic: http://crantastic.org/
• Cheat Sheets:
- http://cran.r-project.org/doc/contrib/Short-refcard.pdf
- http://cran.r-project.org/doc/contrib/Baggott-refcard-v2.pdf
• Aggregated feed of 450 R blogs: http://www.r-bloggers.com/
• More: http://www.revolutionanalytics.com/r-language-resources

Slide 36

Slide 36 text

Local Resources
Stats Questions?
StatLab: statlab.library.virginia.edu
PHS@HSL: bioconnector.virginia.edu/phs

Slide 37

Slide 37 text

Local Resources
Web: bioinformatics.virginia.edu
E-Mail: bioinformatics@virginia.edu
Blog: GettingGeneticsDone.com
Twitter: @genetics_blog
Facebook: facebook.com/UVABioinformaticsCore

Slide 38

Slide 38 text

Local Resources
Turner, Arnold, Ragon & Harrison
bioconnector.virginia.edu