Laboratories University of British Columbia [email protected] https://github.com/jennybc http://www.stat.ubc.ca/~jenny/ @JennyBryan ← personal, professional Twitter https://github.com/STAT545-UBC http://stat545-ubc.github.io @STAT545 ← Twitter as lead instructor of this course
http://jakevdp.github.io/blog/2013/10/26/big-data-brain-drain/ in a wide array of academic fields, the ability to effectively process data is superseding other more classical modes of research
mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets. ... what data scientists call “data wrangling,” “data munging” and “data janitor work” ...
indeed, profitable, attention has been placed on the value of cleaning a data set. David Mimno unpicks the term and the process and suggests that data carpentry may be a more suitable description. There is no such thing as pure or clean data buried in a thin layer of non-clean data. In reality, the process is more like deciding how to cut and join a piece of material. Data carpentry is not a single process but a thousand little skills and techniques. http://blogs.lse.ac.uk/impactofsocialsciences/2014/09/01/data-carpentry-skilled-craft-data-science/
with computers, be sure to plot it. Problem first, not solution backward. http://simplystatistics.org/2014/05/22/10-things-statistics-taught-us-about-big-data-analysis/ ... the first reasonable thing you can do to a set of data often is 80% of the way to the optimal solution. Everything after that is working on getting the last 20%... (maybe could even be the 90/10 rule) http://simplystatistics.org/2014/03/20/the-8020-rule-of-statistical-methods-development/
Robustness in the strategy of scientific model building, in Robustness in Statistics, R.L. Launer and G.N. Wilkinson, Editors. 1979, Academic Press: New York. Entia non sunt multiplicanda praeter necessitatem The principle, known as Occam’s Razor, that says: when there are two competing theories or explanations -- both compatible with observed data, known facts -- the simpler one is better. Implication for statistical analysis: if two models are equally wrong-but-compatible-with-data, the simpler one is more useful!
but I stand by this claim http://www.chrisstucchio.com/blog/2013/hadoop_hatred.html http://qz.com/81661/most-data-isnt-big-and-businesses-are-wasting-money-pretending-it-is/ https://www.facebook.com/dan.ariely/posts/904383595868 “Big data has become a synonym for ‘data analysis,’ which is confusing and counter-productive.” “Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it.” “Too big for Excel is not ‘Big Data’.”
philosophy for data analysis, esp. exploratory and descriptive analysis. • Help you assemble a modern toolchain and workflows for data analysis. You’ll leave this course with (at least the beginnings of) a confident, deliberate attitude about how to approach data analysis and the practical skills to put your attitude into action. My hope:
Edward B. Fowlkes; Bruce Hoadley. Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure. JASA, Vol. 84, No. 408 (Dec., 1989), pp. 945-957. Access via JSTOR.
and Narrative Ch. 5 deals with the Challenger disaster That chapter is available for $7 as a downloadable booklet: http://www.edwardtufte.com/tufte/books_textb
plot the data. Replace (or complement) ‘typical’ tables of data or statistical results with figures that are more compelling and accessible. Whenever possible, generate figures that overlay / juxtapose observed data and analytical results, e.g. the ‘fit’.
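The advice to overlay observed data and a fit can be sketched in a few lines of base R. This is only an illustration: the data are simulated, and the variable names, seed, and linear model are assumptions made for the example, not anything taken from the course materials.

```r
# Simulated data, for illustration only (assumed, not course data).
set.seed(545)
n <- 50
x <- runif(n, min = 0, max = 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 1)

# A simple linear fit to the simulated data.
fit <- lm(y ~ x)

# Observed data as points, with the fitted line drawn on the same axes.
plot(x, y, pch = 19, col = "grey40",
     xlab = "x", ylab = "y", main = "Observed data with fitted line")
abline(fit, col = "red", lwd = 2)
```

Putting the points and the line in one figure lets a reader judge the fit directly, which a table of coefficients alone does not allow.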
data and results when it is least embarrassing and painful
• facilitate comparisons and reveal trends
Recommended reference: Gelman A, Pasarica C, Dodhia R. “Let's Practice What We Preach: Turning Tables into Graphs”. The American Statistician, Volume 56, Number 2, 1 May 2002, pp. 121-130. via JSTOR
/ back up / archive collaboration / open science Sweave knitr R markdown R packages GitHub Rforge sourceforge git subversion mercurial What the cool kids seem to be doing .... RStudio
would replace Emacs + ESS or Tinn-R, but not R itself. RStudio is a product of -- actually, more a driver of -- the emergence of R Markdown, knitr, R + Git(Hub)
the Future of Statistics Web PDF → HTML Latex → Markdown Static → Interactive 3 Open Open source ↑↑ Open science ↑↑ Open research ↑↑ Wednesday, October 30, 13
## Medium header (header 2, actually)

It's easy to do *italics* or __make things bold__.

> All models are wrong, but some are useful. An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem. Absolute certainty is a privilege of uneducated minds-and fanatics. It is, for scientific folk, an unattainable ideal. What you do every day matters more than what you do once in a while. We cannot expect anyone to know anything we didn't teach them ourselves. Enthusiasm is a form of social courage.

Code block below. Just affects formatting here but we'll get to R Markdown for the real fun soon!

```
x <- 3 * 4
```

I can haz equations. Inline equations, such as ... the average is computed as $\frac{1}{n} \sum_{i=1}^{n} x_{i}$. Or display equations like this:

$$
\begin{equation*}
|x|=
\begin{cases} x & \text{if $x≥0$,} \\\\ -x &\text{if $x\le 0$.} \end{cases}
\end{equation*}
$$

Markdown HTML
```{r}
x <- rnorm(1000)
head(x)
```

See how the R code gets executed and a representation thereof appears in the document? `knitr` gives you control over how to represent all conceivable types of output. In case you care, the average of the `r length(x)` random normal variates we just generated is `r round(mean(x), 3)`. Those numbers are NOT hard-wired but are computed on-the-fly. As is this figure. No more copy-paste ... copy-paste ... oops forgot to copy-paste.

```{r}
plot(density(x))
```

Note that all the previously demonstrated math typesetting still works. You don't have to choose between having math cred and being web-friendly!

Inline equations, such as ... the average is computed as $\frac{1}{n} \sum_{i=1}^{n} x_{i}$. Or display equations like this:

$$
\begin{equation*}
|x|=
\begin{cases} x & \text{if $x≥0$,} \\\\ -x &\text{if $x\le 0$.} \end{cases}
\end{equation*}
$$

R Markdown rocks
=====================================

This is an R Markdown document.

```r
x <- rnorm(1000)
head(x)
```

```
## [1] -1.3007 0.7715 0.5585 -1.2854 1.1973 2.4157
```

See how the R code gets executed and a representation thereof appears in the document? `knitr` gives you control over how to represent all conceivable types of output. In case you care, the average of the 1000 random normal variates we just generated is -0.081. Those numbers are NOT hard-wired but are computed on-the-fly. As is this figure. No more copy-paste ... copy-paste ... oops forgot to copy-paste.

```r
plot(density(x))
```

![plot of chunk unnamed-chunk-2](figure/unnamed-chunk-2.png)

...

R Markdown Markdown
```r
x <- rnorm(1000)
head(x)
```

```
## [1] -1.3007 0.7715 0.5585 -1.2854 1.1973 2.4157
```

See how the R code gets executed and a representation thereof appears in the document? `knitr` gives you control over how to represent all conceivable types of output. In case you care, the average of the 1000 random normal variates we just generated is -0.081. Those numbers are NOT hard-wired but are computed on-the-fly. As is this figure. No more copy-paste ... copy-paste ... oops forgot to copy-paste.

```r
plot(density(x))
```

![plot of chunk unnamed-chunk-2](figure/unnamed-chunk-2.png)

...

Markdown HTML
the world or select collaborators? maybe I’ll share my data and my prose too .... how should I marshall all of that stuff? how can I collaborate with others on an analysis or package development? Advice to preserve sanity: Stop doing this via email, attachments, and tracking changes in Word. Get that stuff into plain text, put it under version control and get it out on the web.
people develop software Git, in particular, is being “repurposed” for activities other than pure software development ... like the messy hybrid of writing, coding and data wrangling Git repository = a bunch of files you want to manage in a sane way repo = repository
has been -- and continues to be -- painful. But not nearly as crazy-making as the alternatives:
- documents as email attachments
- uncertainty about which version is “master”
- am I working with the most recent data?
- archaeological “digs” on old email threads
- uncertainty about how/if certain changes have been made or issues solved
- hair-raising ZIP archives containing file salad
/ back up / archive collaboration / open science knitr R markdown R packages GitHub Rforge sourceforge git subversion mercurial Now you know what the fuss is about! RStudio
tool that is being repurposed to meet a need in data-intensive workflows. Originally intended to orchestrate compiling complicated software, it’s now used to express what depends on what and keep everything “in sync”.
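The idea of "what depends on what" can be illustrated with a rough R sketch of the staleness rule such a tool applies: a target needs rebuilding when it is missing or older than something it depends on. This is not the tool itself; `needs_rebuild()` and the temporary files are hypothetical names introduced only for this illustration.

```r
# Hypothetical helper: TRUE if `target` is missing or older than any dependency.
needs_rebuild <- function(target, deps) {
  if (!file.exists(target)) {
    return(TRUE)
  }
  any(file.mtime(deps) > file.mtime(target))
}

# Temporary files standing in for raw data and a derived result.
raw    <- tempfile(fileext = ".csv")
result <- tempfile(fileext = ".rds")
write.csv(data.frame(x = 1:3), raw, row.names = FALSE)

# The result does not exist yet, so it must be built.
before_build <- needs_rebuild(result, raw)
saveRDS(read.csv(raw), result)

# Just built from the current raw data, so it is up to date.
after_build <- needs_rebuild(result, raw)

# Simulate a later edit to the raw data by moving its timestamp forward.
Sys.setFileTime(raw, Sys.time() + 60)
after_edit <- needs_rebuild(result, raw)
```

In a real workflow the dependency graph usually spans many files (raw data, cleaning scripts, figures, the report), which is why a dedicated tool is used rather than hand-written checks like this one.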
giving you a fish) - It’s amazing what a determined individual can learn from documentation, small learning examples, and ... <gasp> Googling. And also Stack Overflow. • Rewarding engagement, intellectual generosity and curiosity - Speaking up, sharing success OR failure, showing some interest in something will earn marks. • Zero tolerance of plagiarism - Generating your own approach, writing some code, and describing the process is the whole point. Process is generally more important than product.
(think check, check minus, check plus), with peer evaluation • Eventually, flexibility to work with a dataset you choose or to spin the problem a certain way - Think about datasets you’d like to prepare and analyze! • Adjust the difficulty level relative to where you are now (and where you need/want to be!) • Peer review; marked coarsely (good review vs. “needs more”) • Engagement and participation: in class, in our GitHub world
Wed this week only) Open an issue on a GitHub repository and tag one or more instructors Tweet to @STAT545 Direct Twitter message to @STAT545 Email to an instructor http://stat545-ubc.github.io/help-STAT545.html