Slide 1

Slide 1 text

@JennyBryan @jennybc spreadsheets @STAT545 http://stat545.com

Slide 2

Slide 2 text

relevant links, credits, and slides: https://github.com/jennybc/2016-06_spreadsheets

Slide 3

Slide 3 text

Rich FitzJohn Research Software Engineer University College London @rgfitzjohn @richfitz

Slide 4

Slide 4 text

spreadsheets: a dystopian moonscape of unrecorded user actions — Gordon Shotwell

Slide 5

Slide 5 text

some of my best friends use spreadsheets

Slide 6

Slide 6 text

I supported myself for ~4 years doing spreadsheets

Slide 7

Slide 7 text

~1 billion use Microsoft Office ~650 million use spreadsheets >50% use formulas 1 - 5 million people use Python 250K - 1 million people use R

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

you go into data analysis with the tools you know, not the tools you need

Slide 10

Slide 10 text

spreadsheets combine: data logic figures formatted tables + reactivity

Slide 11

Slide 11 text

spreadsheets users use workbooks like I would use a data analytic git repo

Slide 12

Slide 12 text

a data analytic project: data .R, .Rmd .png, .svg .md, .html, .pdf, Shiny app + build and deploy

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

syntax bullshittery

Slide 15

Slide 15 text

spreadsheets are not going away deal with it

Slide 16

Slide 16 text

No content

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

No content

Slide 19

Slide 19 text

what you THINK people are doing != what you think people SHOULD be doing != what people ARE ACTUALLY doing

Slide 20

Slide 20 text

The Enron Corpus 600K emails > 15K spreadsheets ~ 80K worksheets

Slide 21

Slide 21 text

No content

Slide 22

Slide 22 text

No content

Slide 23

Slide 23 text

from Hermans, Murphy-Hill

Slide 24

Slide 24 text

data in formatting

Slide 25

Slide 25 text

small multiples

Slide 26

Slide 26 text

data in formulas =(3.6946*10^-6)/'Old snails'!J26

Slide 27

Slide 27 text

data in (merged) column headers

Slide 28

Slide 28 text

My workbook has 17 sheets, but I transposed the data matrix in sheet 4, at random, for absolutely no reason.

Slide 29

Slide 29 text

We have formulas that refer to cells in the other. But you will only ever get one of us.

Slide 30

Slide 30 text

Columns of intermediate computations are so boring. I like to hide them!

Slide 31

Slide 31 text

No content

Slide 32

Slide 32 text

machine readable & human readable

Slide 33

Slide 33 text

code can be machine & human readable

Slide 34

Slide 34 text

data can be machine & human readable

Slide 35

Slide 35 text

No content

Slide 36

Slide 36 text

a spreadsheet is often neither machine nor human readable

Slide 37

Slide 37 text

programming logic data formatting

Slide 38

Slide 38 text

what are the problems? which ones can we solve? via training via tooling be realistic, be fair, be precise

Slide 39

Slide 39 text

let 1,000 flowers bloom!

Slide 40

Slide 40 text

Two angles on the Spreadsheet Problem: Create new spreadsheet implementations that use, e.g., R for computation and visualization. Accept spreadsheets as they are. Create tools to get goodies out and into, e.g., R. Maybe write back into sheets?

Slide 41

Slide 41 text

No content

Slide 42

Slide 42 text

AlphaSheets “collaborative, programmable spreadsheets”

Slide 43

Slide 43 text

No content

Slide 44

Slide 44 text

No content

Slide 45

Slide 45 text

No content

Slide 46

Slide 46 text

~150 lines of code later …

Slide 47

Slide 47 text

readxl: CRAN, GitHub openxlsx: CRAN, GitHub XLConnect: CRAN, GitHub xlsx: CRAN, GitHub gdata: CRAN, R-Forge … and more

Slide 48

Slide 48 text

What do we want? no tricky dependency … no Java agnostic re: Excel, Google Sheet, ill-formed csv expose (unformatted) data (unevaluated) formulas formatting detect / propose views handle merged cells, weird headers

Slide 49

Slide 49 text

How are we doing it? define the linen object = spreadsheet receptacle document meta-data worksheet meta-data cell data, broadly defined rexcel & googlesheets create linen objects simple? return a data frame! not? expose linen object for more processing …

Slide 50

Slide 50 text

rexcel googlesheets data frame data frame Sheets API v3 (XML) Google Apps Script / Sheets API v4 (JSON) (XML) linen workbook worksheets cells

Slide 51

Slide 51 text

rexcel googlesheets linen workbook worksheet cell jailbreakr multiple views, data frames unformatted data, formatting unevaluated formulas figures?

Slide 52

Slide 52 text

No content

Slide 53

Slide 53 text

No content

Slide 54

Slide 54 text

No content

Slide 55

Slide 55 text

No content

Slide 56

Slide 56 text

No content

Slide 57

Slide 57 text

No content

Slide 58

Slide 58 text

No content

Slide 59

Slide 59 text

https://github.com/rsheets

Slide 60

Slide 60 text

rexcel googlesheets data frame jailbreakr multiple data frames formulas, formatting, figures? raw object linen cellranger data frame raw object

Slide 61

Slide 61 text

bonus content

Slide 62

Slide 62 text

googlesheets

Slide 63

Slide 63 text

No content

Slide 64

Slide 64 text

default read does not necessarily give you what you want with numeric formatting and formulas

Slide 65

Slide 65 text

cf <- gs_read_cellfeed(gs_ff()) cf %>% filter(row > 1, col == 2) %>% select(value, input_value, numeric_value) %>% readr::type_convert() #> #> value input_value numeric_value #> #> 1 654,321 6.543210e+05 6.543210e+05 #> 2 12.34% 1.234000e+01 1.234000e-01 #> 3 1.23E+09 1.234568e+09 1.234568e+09 #> 4 3 1/7 3.141593e+00 3.141593e+00 #> 5 $0.36 3.600000e-01 3.600000e-01

Slide 66

Slide 66 text

cf <- gs_read_cellfeed(gs_ff()) cf %>% filter(row > 1, col == 3) %>% select(value, input_value, numeric_value) %>% readr::type_convert() #> #> value input_value numeric_value #> #> 1 1.23 1.2345 1.2345 #> 2 2.35 2.3456 2.3456 #> 3 3.46 3.4567 3.4567 #> 4 4.57 4.5678 4.5678 #> 5 5.68 5.6789 5.6789

Slide 67

Slide 67 text

cf <- gs_read_cellfeed(gs_ff()) cf %>% filter(row > 1, col == 5) %>% select(value, input_value, numeric_value) %>% mutate(input_value = substr(input_value, 1, 43)) %>% readr::type_convert() #> #> value input_value numeric_value #> #> 1 Google =HYPERLINK("http://www.google.com/","Google NA #> 2 1,271,591.00 =sum(R[-1]C[-4]:R[3]C[-4]) 1271591 #> 3 =IMAGE("https://www.google.com/images/srpr/ NA #> 4 $A$1 =ADDRESS(1,1) NA #> 5 =SPARKLINE(R[-4]C[-4]:R[0]C[-4]) NA

Slide 68

Slide 68 text

cf <- gs_read_cellfeed(gs_ff()) cf %>% filter(row > 1, col == 6) %>% select(value, input_value, numeric_value) %>% readr::type_convert() #> #> value input_value numeric_value #> #> 1 3.18E+05 =average(R[0]C[-5]:R[4]C[-5]) 3.178978e+05 #> 2 52.63% =R[-1]C[-5]/R[1]C[-5] 5.263144e-01 #> 3 0.22 =R[-2]C[-5]/R[2]C[-5] 2.173942e-01 #> 4 123,456.00 =min(R[-3]C[-5]:R[1]C[-5]) 1.234560e+05 #> 5 317,898 =average(R2C1:R6C1) 3.178978e+05