Slide 1

Slide 1 text

STAT 545A Class meeting 001 Course intro + prompts to install lots of software and sign up for lots of accounts web companion: STAT 545 web home > Syllabus > cm001 Tuesday, September 8, 2015

Slide 2

Slide 2 text

Dr. Jennifer (Jenny) Bryan Department of Statistics and Michael Smith Laboratories University of British Columbia https://github.com/jennybc http://www.stat.ubc.ca/~jenny/ @JennyBryan ← personal, professional Twitter https://github.com/STAT545-UBC http://stat545-ubc.github.io @STAT545 ← Twitter as lead instructor of this course

Slide 3

Slide 3 text

statistical theory real world data STAT 545A

Slide 4

Slide 4 text

The Big Data Brain Drain: Why Science is in Trouble http://jakevdp.github.io/blog/2013/10/26/big-data-brain-drain/ in a wide array of academic fields, the ability to effectively process data is superseding other more classical modes of research

Slide 5

Slide 5 text

what is data science?

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

The data science Venn diagram http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Slide 8

Slide 8 text

Munge Visualise Model Communicate Tidy Question Collect Slides from Hadley Wickham's talk in the Simply Statistics Unconference http://t.co/D931Og8mq3 a horizontal data science workflow

Slide 9

Slide 9 text

Munge Visualise Model Communicate Tidy Question Collect Wednesday, October 30, 13 Slides from Hadley Wickham's talk in the Simply Statistics Unconference http://t.co/D931Og8mq3 We can’t focus just on this!

Slide 10

Slide 10 text

http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html?partner=rss&emc=rss&smid=tw-nytimesscience&_r=0 Data scientists spend 50 - 80% of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets. ... what data scientists call “data wrangling,” “data munging” and “data janitor work” ...

Slide 11

Slide 11 text

http://mimno.infosci.cornell.edu/b/articles/carpentry/ As data science becomes all the more relevant and indeed, profitable, attention has been placed on the value of cleaning a data set. David Mimno unpicks the term and the process and suggests that data carpentry may be a more suitable description. There is no such thing as pure or clean data buried in a thin layer of non-clean data. In reality, the process is more like deciding how to cut and join a piece of material. Data carpentry is not a single process but a thousand little skills and techniques. http://blogs.lse.ac.uk/impactofsocialsciences/2014/09/01/data-carpentry-skilled-craft-data-science/

Slide 12

Slide 12 text

Complexity must be justified. http://www.john-foreman.com/blog/the-forgotten-job-of-a-data-scientist-editing Before you analyze your data with computers, be sure to plot it Problem first not solution backward http://simplystatistics.org/2014/05/22/10-things-statistics-taught-us-about-big-data-analysis/ http://simplystatistics.org/2014/05/22/10-things-statistics-taught-us-about-big-data-analysis/ ... the first reasonable thing you can do to a set of data often is 80% of the way to the optimal solution. Everything after that is working on getting the last 20%... (maybe could even be the 90/10 rule) http://simplystatistics.org/2014/03/20/the-8020-rule-of-statistical-methods-development/

Slide 13

Slide 13 text

“All models are wrong, some models are useful.” Box, G.E.P., Robustness in the strategy of scientific model building, in Robustness in Statistics, R.L. Launer and G.N. Wilkinson, Editors. 1979, Academic Press: New York. Entia non sunt multiplicanda praeter necessitatem The principle, known as Occam’s Razor, that says: when there are two competing theories or explanations -- both compatible with observed data, known facts -- the simpler one is better. Implication for statistical analysis: if two models are equally wrong-but-compatible-with-data, the simpler one is more useful!

Slide 14

Slide 14 text

small medium big size of dataset * figure is fictional but I stand by this claim There are LOTS of small to medium datasets, even though it’s more trendy to talk about the big ones.

Slide 15

Slide 15 text

small medium big size of dataset * figure is fictional but I stand by this claim http://www.chrisstucchio.com/blog/2013/hadoop_hatred.html http://qz.com/81661/most-data-isnt-big-and-businesses-are-wasting-money-pretending-it-is/ https://www.facebook.com/dan.ariely/posts/904383595868 “Big data has become a synonym for ‘data analysis,’ which is confusing and counter-productive.” “Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it.” “Too big for Excel is not ‘Big Data’.”

Slide 16

Slide 16 text

big data ≠ data science big data 㱬 data science

Slide 17

Slide 17 text

Two inter-related goals • Foster your development of a personal philosophy for data analysis, esp. exploratory and descriptive analysis. • Help you assemble a modern toolchain and workflows for data analysis. My hope: You’ll leave this course with (at least the beginnings of) a confident, deliberate attitude about how to approach data analysis and the practical skills to put your attitude into action.

Slide 18

Slide 18 text

you’re going to make your own data science sampler

Slide 19

Slide 19 text

“A picture is worth a thousand words”

Slide 20

Slide 20 text

http://msnbcmedia1.msn.com/j/msnbc/Components/Photos/050709/050609_columbia_hmed_6p.hmedium.jpg 1986 Challenger space shuttle disaster Favorite example of Edward Tufte

Slide 21

Slide 21 text

No content

Slide 22

Slide 22 text

“A picture is worth a thousand words”

Slide 23

Slide 23 text

“A picture is worth a thousand words” Siddhartha R. Dalal; Edward B. Fowlkes; Bruce Hoadley. Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure. JASA, Vol. 84, No. 408 (Dec., 1989), pp. 945-957. Access via JSTOR.

Slide 24

Slide 24 text

Edward Tufte http://www.edwardtufte.com BOOK: Visual Explanations: Images and Quantities, Evidence and Narrative Ch. 5 deals with the Challenger disaster That chapter is available for $7 as a downloadable booklet: http://www.edwardtufte.com/tufte/books_textb

Slide 25

Slide 25 text

“A picture is worth a thousand words” Always, always, always plot the data. Replace (or complement) ‘typical’ tables of data or statistical results with figures that are more compelling and accessible. Whenever possible, generate figures that overlay / juxtapose observed data and analytical results, e.g. the ‘fit’.
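A minimal sketch of the "overlay observed data and the fit" advice, in base R. The data are simulated and the straight-line model is only an assumption made for illustration; the names `x`, `y`, and `fit` are hypothetical.

```r
# Simulated data: an assumed linear relationship plus noise (illustration only).
set.seed(545)
x <- runif(50, min = 0, max = 10)
y <- 2 + 0.5 * x + rnorm(50, sd = 1)

# Fit a simple linear model to the simulated data.
fit <- lm(y ~ x)

# Plot the observed data as points, then overlay the fitted line on the same axes.
plot(x, y, main = "Observed data with fitted line")
abline(fit, col = "red", lwd = 2)
```

Seeing the points and the line together makes it easy to spot outliers or curvature that a table of coefficients would hide.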

Slide 26

Slide 26 text

“A picture is worth a thousand words” Why? • find bizarre data and results when it is least embarrassing and painful • facilitate comparisons and reveal trends Recommended reference: Gelman A, Pasarica C, Dodhia R. “Let's Practice What We Preach: Turning Tables into Graphs”. The American Statistician, Volume 56, Number 2, 1 May 2002, pp. 121-130(10). via JSTOR

Slide 27

Slide 27 text

weak links in the chain: process, packaging and presentation

Slide 28

Slide 28 text

we watched a video about the importance of reproducible research, data and analytical documentation, data sharing https://www.youtube.com/watch?v=N2zK3sAtr-4&feature=youtu.be

Slide 29

Slide 29 text

project organization literate programming reproducible research version control / back up / archive collaboration / open science

Slide 30

Slide 30 text

project organization / literate programming / reproducible research version control / back up / archive collaboration / open science Sweave knitr R markdown R packages GitHub Rforge sourceforge git subversion mercurial What the cool kids seem to be doing .... RStudio

Slide 31

Slide 31 text

RStudio is an integrated development environment (IDE) for R

Slide 32

Slide 32 text

R ≠ RStudio RStudio mediates your interaction with R; it would replace Emacs + ESS or Tinn-R, but not R itself RStudio is a product of -- actually, more a driver of -- the emergence of R Markdown, knitr, R + Git(Hub)

Slide 33

Slide 33 text

from Hadley Wickham’s talk in the Simply Statistics Unconference on the Future of Statistics Web PDF → HTML Latex → Markdown Static → Interactive 3 Open Open source ↑↑ Open science ↑↑ Open research ↑↑ Wednesday, October 30, 13

Slide 34

Slide 34 text

Markdown HTML foo.md foo.html easy to write (and read!) easy to publish easy to read in browser

Slide 35

Slide 35 text

Title (header 1, actually)
=====================================

This is a Markdown document.

## Medium header (header 2, actually)

It's easy to do *italics* or __make things bold__.

> All models are wrong, but some are useful. An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem. Absolute certainty is a privilege of uneducated minds-and fanatics. It is, for scientific folk, an unattainable ideal. What you do every day matters more than what you do once in a while. We cannot expect anyone to know anything we didn't teach them ourselves. Enthusiasm is a form of social courage.

Code block below. Just affects formatting here but we'll get to R Markdown for the real fun soon!

```
x <- 3 * 4
```

I can haz equations. Inline equations, such as ... the average is computed as $\frac{1}{n} \sum_{i=1}^{n} x_{i}$. Or display equations like this:

$$
\begin{equation*}
|x|=
\begin{cases} x & \text{if $x\ge 0$,} \\\\ -x &\text{if $x\le 0$.} \end{cases}
\end{equation*}
$$

Markdown HTML

Slide 36

Slide 36 text

R markdown

Slide 37

Slide 37 text

R Markdown

R Markdown rocks
=====================================

This is an R Markdown document.

```{r}
x <- rnorm(1000)
head(x)
```

See how the R code gets executed and a representation thereof appears in the document? `knitr` gives you control over how to represent all conceivable types of output. In case you care, the average of the `r length(x)` random normal variates we just generated is `r round(mean(x), 3)`. Those numbers are NOT hard-wired but are computed on-the-fly. As is this figure. No more copy-paste ... copy-paste ... oops forgot to copy-paste.

```{r}
plot(density(x))
```

Note that all the previously demonstrated math typesetting still works. You don't have to choose between having math cred and being web-friendly!

Inline equations, such as ... the average is computed as $\frac{1}{n} \sum_{i=1}^{n} x_{i}$. Or display equations like this:

$$
\begin{equation*}
|x|=
\begin{cases} x & \text{if $x\ge 0$,} \\\\ -x &\text{if $x\le 0$.} \end{cases}
\end{equation*}
$$

Markdown

R Markdown rocks
=====================================

This is an R Markdown document.

```r
x <- rnorm(1000)
head(x)
```

```
## [1] -1.3007 0.7715 0.5585 -1.2854 1.1973 2.4157
```

See how the R code gets executed and a representation thereof appears in the document? `knitr` gives you control over how to represent all conceivable types of output. In case you care, the average of the 1000 random normal variates we just generated is -0.081. Those numbers are NOT hard-wired but are computed on-the-fly. As is this figure. No more copy-paste ... copy-paste ... oops forgot to copy-paste.

```r
plot(density(x))
```

![plot of chunk unnamed-chunk-2](figure/unnamed-chunk-2.png)

...

Slide 38

Slide 38 text

R Markdown rocks
=====================================

This is an R Markdown document.

```r
x <- rnorm(1000)
head(x)
```

```
## [1] -1.3007 0.7715 0.5585 -1.2854 1.1973 2.4157
```

See how the R code gets executed and a representation thereof appears in the document? `knitr` gives you control over how to represent all conceivable types of output. In case you care, the average of the 1000 random normal variates we just generated is -0.081. Those numbers are NOT hard-wired but are computed on-the-fly. As is this figure. No more copy-paste ... copy-paste ... oops forgot to copy-paste.

```r
plot(density(x))
```

![plot of chunk unnamed-chunk-2](figure/unnamed-chunk-2.png)

...

Markdown HTML

Slide 39

Slide 39 text

Markdown HTML foo.md foo.html easy to write (and read!) easy to publish easy to read in browser R Markdown foo.rmd

Slide 40

Slide 40 text

how do I put my source on the web? for the world or select collaborators? maybe I’ll share my data and my prose too .... how should I marshall all of that stuff? how can I collaborate with others on an analysis or package development? Advice to preserve sanity: Stop doing this via email, attachments, and tracking changes in Word. Get that stuff into plain text, put it under version control and get it out on the web.

Slide 41

Slide 41 text

http://www.phdcomics.com/comics/archive.php?comicid=1531 via Ram, 2013 doi:10.1186/1751-0473-8-7

Slide 42

Slide 42 text

Version control systems (VCS) were created to help groups of people develop software Git, in particular, is being “repurposed” for activities other than pure software development ... like the messy hybrid of writing, coding and data wrangling Git repository = a bunch of files you want to manage in a sane way repo = repository

Slide 43

Slide 43 text

collaboration = the “killer app” of version control Learning Git has been -- and continues to be -- painful. But not nearly as crazy-making as the alternatives: - documents as email attachments - uncertainty about which version is “master” - am I working with the most recent data? - archaeological “digs” on old email threads - uncertainty about how/if certain changes have been made or issues solved - hair-raising ZIP archives containing file salad

Slide 44

Slide 44 text

Git repository = a bunch of files you want to manage in a sane way repo = repository you can set up a repo ... then start your work or you can take a set of existing files and make them into a repo

Slide 45

Slide 45 text

GitHub = a place to host Git repositories on the web GitHub ≠ Git

Slide 46

Slide 46 text

possible, in theory more typical GitHub me you Image from https://www.atlassian.com/git/tutorial/git-basics#!clone ✗

Slide 47

Slide 47 text

Many R packages are developed in the open on GitHub Nice option when someone tells you to “read the source”!

Slide 48

Slide 48 text

You can see exactly how files have changed, when, and by whom. If the commit message is good, you’ll see why. Commit = a formal “checkpoint” or snapshot of the state of the repository

Slide 49

Slide 49 text

GitHub provides a fantastic visual “diff” view of exactly what changed. Incredibly useful.

Slide 50

Slide 50 text

GitHub issues: think “bug tracker”, “to do list”.

Slide 51

Slide 51 text

GitHub renders Markdown files nicely Example: links.md in workshop repo of mine

Slide 52

Slide 52 text

You can see the raw Markdown too!

Slide 53

Slide 53 text

GitHub renders comma (.csv) and tab (.tsv) delimited files nicely Example: Lord of the Rings data I found for STAT 545A
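A small sketch of what comma- and tab-delimited files look like from the R side, using base R only. The data frame `films` and its columns are made up for illustration (this is not the Lord of the Rings data), and the files go to temporary paths.

```r
# Hypothetical toy data; the column names and values are invented for illustration.
films <- data.frame(
  film  = c("film_1", "film_2", "film_3"),
  words = c(100L, 250L, 75L),
  stringsAsFactors = FALSE
)

csv_path <- tempfile(fileext = ".csv")
tsv_path <- tempfile(fileext = ".tsv")

# Write the same table as comma-delimited and as tab-delimited plain text.
write.csv(films, csv_path, row.names = FALSE)
write.table(films, tsv_path, sep = "\t", row.names = FALSE, quote = FALSE)

# Read both back; read.delim() assumes tab as the separator.
from_csv <- read.csv(csv_path, stringsAsFactors = FALSE)
from_tsv <- read.delim(tsv_path, stringsAsFactors = FALSE)
```

Both files are plain text, which is what makes them friendly to version control and to rendering on the web.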

Slide 54

Slide 54 text

http://www.wired.com/design/2013/08/how-segregated-is-your-city-this-eye-opening-map-shows-you/?viewall=true NYC

Slide 55

Slide 55 text

http://www.wired.com/design/2013/08/how-segregated-is-your-city-this-eye-opening-map-shows-you/?viewall=true NYC Detroit

Slide 56

Slide 56 text

http://www.coopercenter.org/demographics/Racial-Dot-Map

Slide 57

Slide 57 text

Cool result is accompanied by explanation of how it was done

Slide 58

Slide 58 text

https://github.com/unorthodox123/RacialDotMap

Slide 59

Slide 59 text

http://blog.revolutionanalytics.com/2013/08/foodborne-chicago.html

Slide 60

Slide 60 text

No content

Slide 61

Slide 61 text

https://github.com/corynissen/foodborne_classifier

Slide 62

Slide 62 text

project organization / literate programming / reproducible research version control / back up / archive collaboration / open science knitr R markdown R packages GitHub Rforge sourceforge git subversion mercurial Now you know what the fuss is about! RStudio

Slide 63

Slide 63 text

≪ source is real

Slide 64

Slide 64 text

Measured Data Analytic Data Computational Results Article Tables Figures Numerical Summaries Text Processing code Analytic code Presentation code slide modified from Roger Peng http://www.biostat.jhsph.edu/~rpeng/ R (or Python) scripts

Slide 65

Slide 65 text

Measured Data Analytic Data Computational Results Article Tables Figures Numerical Summaries Text Processing code Analytic code Presentation code slide modified from Roger Peng http://www.biostat.jhsph.edu/~rpeng/ delimited or other structured, agnostic files

Slide 66

Slide 66 text

Measured Data Analytic Data Computational Results Article Tables Figures Numerical Summaries Text Processing code Analytic code Presentation code slide modified from Roger Peng http://www.biostat.jhsph.edu/~rpeng/ How to make end products more integrated and more reproducible?

Slide 67

Slide 67 text

Measured Data Analytic Data Computational Results Article Tables Figures Numerical Summaries Text Processing code Analytic code Presentation code slide modified from Roger Peng http://www.biostat.jhsph.edu/~rpeng/ How to keep everything up-to-date? If the data changes, how do we remember to re-make the figures 2B and 4?

Slide 68

Slide 68 text

http://zmjones.com/make.html Makefile Like Git, GNU Make is another old school tool that is being repurposed to meet a need in data-intensive workflows. Originally intended to orchestrate compiling complicated software, it’s now used to express what depends on what and keep everything “in sync”.
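The core idea Make relies on, "rebuild a target when it is missing or older than what it depends on", can be sketched in plain R. This is only an illustration of the dependency logic, not how GNU Make is implemented; `needs_rebuild()`, `build_figure()`, and the file names are hypothetical.

```r
# Hypothetical helper: TRUE if `target` is missing or older than any of `deps`.
needs_rebuild <- function(target, deps) {
  if (!file.exists(target)) return(TRUE)
  any(file.mtime(deps) > file.mtime(target))
}

# Work in a temporary directory with made-up file names.
work_dir <- tempfile("make-sketch-")
dir.create(work_dir)
data_file   <- file.path(work_dir, "data.csv")
figure_file <- file.path(work_dir, "figure.txt")

writeLines("x\n1\n2\n3", data_file)

# Stand-in for "make the figure": writes a summary that depends on the data.
build_figure <- function() {
  dat <- read.csv(data_file)
  writeLines(paste("mean of x:", mean(dat$x)), figure_file)
}

# The target does not exist yet, so it must be built.
first_check <- needs_rebuild(figure_file, data_file)
if (first_check) build_figure()

# Right after building, the target is not older than its dependency.
second_check <- needs_rebuild(figure_file, data_file)

# Simulate the data changing later by pushing its modification time forward.
Sys.setFileTime(data_file, file.mtime(figure_file) + 60)
third_check <- needs_rebuild(figure_file, data_file)
```

A Makefile states the same "figure depends on data" relationship declaratively, so nobody has to remember which outputs are stale.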

Slide 69

Slide 69 text

course stuff

Slide 70

Slide 70 text

who am I? BA in econ and German management consultant PhD biostatistics assoc prof @ UBC 50% Statistics / 50% Michael Smith Laboratories I teach and perform lots of data analysis

Slide 71

Slide 71 text

Bernhard Konrad Dean Attali Luolan (Gloria) Li Jenny Bryan Julia Gustavsen Shaun Jackman http://stat545-ubc.github.io/people.html 2014 2014 2014 2015 2014 2014 2014 Andrew MacDonald 2015 team is being assembled!

Slide 72

Slide 72 text

Culture of the class
• Teaching you to fish (vs. giving you a fish)
  - It’s amazing what a determined individual can learn from documentation, small learning examples, and ... Googling. And also stackoverflow.
• Rewarding engagement, intellectual generosity and curiosity
  - Speaking up, sharing success OR failure, showing some interest in something will earn marks.
• Zero tolerance of plagiarism
  - Generating your own approach, writing some code, and describing the process is the whole point.
Process is generally more important than product.

Slide 73

Slide 73 text

Where marks will come from
• Weekly homework; marked coarsely (think check, check minus, check plus), with peer evaluation
• Eventually, flexibility to work with a dataset you choose or to spin the problem a certain way
  - Think about datasets you’d like to prepare and analyze!
• Adjust the difficulty level relative to where you are now (and where you need/want to be!)
• Peer review; marked coarsely (good review vs. “needs more”)
• Engagement and participation: in class, in our GitHub world

Slide 74

Slide 74 text

No content

Slide 75

Slide 75 text

No content

Slide 76

Slide 76 text

No content

Slide 77

Slide 77 text

No content

Slide 78

Slide 78 text

http://stat545-ubc.github.io

Slide 79

Slide 79 text

https://twitter.com/STAT545

Slide 80

Slide 80 text

No content

Slide 81

Slide 81 text

No content

Slide 82

Slide 82 text

Homework! web companion: STAT 545 web home > Syllabus > cm001

Slide 83

Slide 83 text

Twitter, GitHub ... are PUBLIC cultivate your professional / scholarly profile with intention if you join, make sure @STAT545 follows you back

Slide 84

Slide 84 text

how to get help
Office hours Tues/Thurs after class (or Wed this week only)
Open an issue on a GitHub repository and tag one or more instructors
Tweet to @STAT545
Direct Twitter message to @STAT545
Email to an instructor
http://stat545-ubc.github.io/help-STAT545.html

Slide 85

Slide 85 text

respond to our prompt to figure out who you are! we want to match up various info with what UBC provides (e.g. Twitter handle, GitHub username)

Slide 86

Slide 86 text

what class meetings will look like … sort of? 9:30 - 9:50 “lecture” 9:50 - 10:35 hands-on work! 10:35 - 10:55 “lecture”

Slide 87

Slide 87 text

rhythm of each week submit a unit of work tues class thurs class work on your own work in class consult peers and instructors in class office hrs online interaction via GitHub, Twitter

Slide 88

Slide 88 text

the end