Data rectangling in R: a journey from JSON to CSV

Slide 2

About me

Sensory scientist @ Sensolution.ID
Instructor @ R Academy Telkom University
Initiator of Komunitas R Indonesia (est. 13 August 2016)
sensehubr, nusandata, bandungjuara, prakiraan, etc.
sensehub, thermostats, aquastats, bcrp, bandungjuara, etc.
aswansyahputra, aswansyahputra_

Slide 3

Know your neighbours! (03:00)

Who are you?
What do you do with data?
How do you describe your experience with R?

Slide 4

Let's play with some basics!

    (x1 <- "useR! Yogyakarta")
    #> [1] "useR! Yogyakarta"
    (x2 <- TRUE)
    #> [1] TRUE
    (x3 <- 1.43)
    #> [1] 1.43
    (x4 <- 1L:5L)
    #> [1] 1 2 3 4 5

Can you guess the type of x1, x2, x3, and x4? How about their length?

    typeof(x1)
    #> [1] "character"
    length(x1)
    #> [1] 1
    typeof(x2)
    #> [1] "logical"
    length(x2)
    #> [1] 1
    typeof(x3)
    #> [1] "double"
    length(x3)
    #> [1] 1

What about x4?

Slide 5

How to combine x1, x2, x3, and x4 without losing their properties?

Slide 6

The use of c()

    (xs_c <- c(x1, x2, x3, x4))
    #> [1] "useR! Yogyakarta" "TRUE"             "1.43"
    #> [4] "1"                "2"                "3"
    #> [7] "4"                "5"

It seems off, doesn't it? Can you explain? Let's check it!

    length(xs_c)
    #> [1] 8
    typeof(xs_c)
    #> [1] "character"

length ❌, type ❓ What's happening?

Slide 7

The use of list()

    (xs_list <- list(x1, x2, x3, x4))
    #> [[1]]
    #> [1] "useR! Yogyakarta"
    #>
    #> [[2]]
    #> [1] TRUE
    #>
    #> [[3]]
    #> [1] 1.43
    #>
    #> [[4]]
    #> [1] 1 2 3 4 5

Hmm, not so familiar but it seems to be what we wanted, right? Let's also check it!

    length(xs_list)
    #> [1] 4
    typeof(xs_list)
    #> [1] "list"

length ✅, type ❓ What is list?

Slide 8

Hold on! How to check if the type and length of each element are preserved?

Slide 9

The good old for loop

    types_xs_c <- vector("character", length = length(xs_c))
    for (i in seq_along(xs_c)) {
      types_xs_c[[i]] <- typeof(xs_c[[i]])
    }
    types_xs_c
    #> [1] "character" "character" "character" "character" "character" "character"
    #> [7] "character" "character"

    lengths_xs_c <- vector("integer", length = length(xs_c))
    for (i in seq_along(xs_c)) {
      lengths_xs_c[[i]] <- length(xs_c[[i]])
    }
    lengths_xs_c
    #> [1] 1 1 1 1 1 1 1 1

How would you perform the same procedure for xs_list? Save your results as types_xs_list and lengths_xs_list!
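The loop output, where every element of xs_c is a length-one character, follows from how atomic vectors work: they hold a single type, so c() coerces its inputs to the most general type present, in the order logical, integer, double, character. Here is a minimal sketch of that rule. The object names (lgl_int, lgl_int_dbl, all_four, kept) and the shortened string "useR!" are made up for illustration and are not from the slides.

```r
# Atomic vectors hold one type; c() coerces to the most general type present
# (logical -> integer -> double -> character).
lgl_int     <- c(TRUE, 1L)
lgl_int_dbl <- c(TRUE, 1L, 1.43)
all_four    <- c(TRUE, 1L, 1.43, "useR!")

typeof(lgl_int)
#> [1] "integer"
typeof(lgl_int_dbl)
#> [1] "double"
typeof(all_four)
#> [1] "character"

# c() also flattens: four inputs, one of length 5, become 8 elements
length(c("useR!", TRUE, 1.43, 1L:5L))
#> [1] 8

# A list stores each input unchanged, so type and length survive
kept <- list("useR!", TRUE, 1.43, 1L:5L)
typeof(kept[[4]])
#> [1] "integer"
length(kept[[4]])
#> [1] 5
```

This is why the list version keeps four elements with their own types, while the c() version ends up with eight character strings.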
Slide 10

Let me introduce you to functionals

    vapply(xs_c, typeof, character(1), USE.NAMES = FALSE)
    #> [1] "character" "character" "character" "character" "character" "character"
    #> [7] "character" "character"
    vapply(xs_c, length, integer(1), USE.NAMES = FALSE)
    #> [1] 1 1 1 1 1 1 1 1

    vapply(xs_list, typeof, character(1), USE.NAMES = FALSE)
    #> [1] "character" "logical"   "double"    "integer"
    vapply(xs_list, length, integer(1), USE.NAMES = FALSE)
    #> [1] 1 1 1 5

Ok, it surely looks simpler but still...
    library(purrr)
    map_chr(xs_list, typeof)
    #> [1] "character" "logical"   "double"    "integer"
    map_int(xs_list, length)
    #> [1] 1 1 1 5

So much simpler and better, isn't it?

Slide 11

list resembles JSON very much! Have a look at the following comparison, which uses a subset of the billionaires data.

Slide 12

Raw JSON file

    cd data-raw
    cat billionaires_small.json
    #> [
    #>   {
    #>     "wealth": {
    #>       "worth in billions": [3.6],
    #>       "how": {
    #>         "category": ["Traded Sectors"],
    #>         "from emerging": [true],
    #>         "industry": ["Consumer"],
    #>         "was political": [false],
    #>         "inherited": [true],
    #>         "was founder": [true]
    #>       },
    #>       "type": ["founder non finance"]
    #>     },
    #>     "company": {
    #>       "sector": ["agricultural products"],
    #>       "founded": [1929],
    #>       "type": ["new"],
    #>       "name": ["J.R. Simplot Company"],
    #>       "relationship": ["founder"]
    #>     },
    #> …

When imported to R

    str(billionaires_small, max.level = 3)
    #> List of 3
    #>  $ :List of 7
    #>   ..$ wealth      :List of 3
    #>   .. ..$ worth in billions: num 3.6
    #>   .. ..$ how              :List of 6
    #>   .. ..$ type             : chr "founder…
    #>   ..$ company     :List of 5
    #>   .. ..$ sector      : chr "agricultural…
    #>   .. ..$ founded     : int 1929
    #>   .. ..$ type        : chr "new"
    #>   .. ..$ name        : chr "J.R. Simplot…
    #>   .. ..$ relationship: chr "founder"
    #>   ..$ rank        : int 115
    #>   ..$ location    :List of 4
    #>   .. ..$ gdp         : num 1.06e+13
    #>   .. ..$ region      : chr "North America…
    #>   .. ..$ citizenship : chr "United States…
    #>   .. ..$ country code: chr "USA"
    #>   ..$ year        : int 2001
    #>   ..$ demographics:List of 2
    #>   .. ..$ gender: chr "male"
    #>   .. ..$ age   : int 92

Slide 13

How to extract the element(s) of a list?

Slide 14

From a billionaire, extract info

    library(purrr)
    pluck(billionaires_small, 1) # you can also…
    #> $wealth
    #> $wealth$`worth in billions`
    #> [1] 3.6
    #>
    #> $wealth$how
    #> $wealth$how$category
    #> [1] "Traded Sectors"
    #>
    #> $wealth$how$`from emerging`
    #> [1] TRUE
    #>
    #> $wealth$how$industry
    #> [1] "Consumer"
    #>
    #> $wealth$how$`was political`
    #> [1] FALSE
    #>
    #> $wealth$how$inherited
    #> [1] TRUE

    pluck(billionaires_small, 1, "name") # you c…
    #> [1] "John Simplot"
    pluck(billionaires_small, 1, "rank")
    #> [1] 115
    pluck(billionaires_small, 1, "wealth", "wort…
    #> [1] 3.6

Slide 15

From some billionaires, extract info

    map(billionaires_small, pluck, "name")
    #> [[1]]
    #> [1] "John Simplot"
    #>
    #> [[2]]
    #> [1] "Banyong Lamsam"
    #>
    #> [[3]]
    #> [1] "Richard Farmer"

    map(billionaires_small, pluck, "wealth", "wo…
    #> [[1]]
    #> [1] 3.6
    #>
    #> [[2]]
    #> [1] 2.5
    #>
    #> [[3]]
    #> [1] 1.8

Awesome, map() provides a shortcut!
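The shortcut is that purrr's map() accepts a string, or a list of strings for a nested path, in place of a function and extracts by name in the same way as pluck(). Here is a small sketch of that equivalence. Because the real billionaires_small object is not available here, it uses a hypothetical two-record stand-in, billionaires_demo, built from field names and values printed on the slides; all object names in the block are illustrative.

```r
library(purrr)

# Hypothetical stand-in for billionaires_small (not the real data set):
# two records reusing field names and values shown on the slides.
billionaires_demo <- list(
  list(name = "John Simplot",   wealth = list(`worth in billions` = 3.6)),
  list(name = "Banyong Lamsam", wealth = list(`worth in billions` = 2.5))
)

# Passing pluck() and its arguments through map() ...
names_via_pluck <- map(billionaires_demo, pluck, "name")
# ... extracts the same elements as passing the name directly
names_via_shortcut <- map(billionaires_demo, "name")
identical(names_via_pluck, names_via_shortcut)

# For a nested path, wrap the steps in list()
worth_via_pluck <- map(billionaires_demo, pluck, "wealth", "worth in billions")
worth_via_shortcut <- map(billionaires_demo, list("wealth", "worth in billions"))
identical(worth_via_pluck, worth_via_shortcut)
```

The same shortcut works with the typed variants (map_chr(), map_int(), map_dbl()), which return atomic vectors instead of lists.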
Bye pluck()~

    (billionaire_names <- map_chr(billionaires_s…
    #> [1] "John Simplot"   "Banyong Lamsam" "Richard Farmer"
    (billionaire_ranks <- map_int(billionaires_s…
    #> [1] 115 143 272
    (billionaire_worth <- map_dbl(billionaires_s…
    #> [1] 3.6 2.5 1.8

Slide 16

Yay, we can extract some info! But now the pieces are scattered.

Slide 17

Of course you can combine them later using data.frame() or tibble(), but...

    data.frame(
      name = billionaire_names,
      rank = billionaire_ranks,
      worth_in_billions = billionaire_worth,
      stringsAsFactors = FALSE
    )
    #>             name rank worth_in_billions
    #> 1   John Simplot  115               3.6
    #> 2 Banyong Lamsam  143               2.5
    #> 3 Richard Farmer  272               1.8

    library(tibble)
    tibble(
      name = billionaire_names,
      rank = billionaire_ranks,
      worth_in_billions = billionaire_worth
    )
    #> # A tibble: 3 x 3
    #>   name            rank worth_in_billions
    #>   <chr>          <int>             <dbl>
    #> 1 John Simplot     115               3.6
    #> 2 Banyong Lamsam   143               2.5
    #> 3 Richard Farmer   272               1.8
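Since the talk's stated destination is CSV, the combined data frame can be written out with base R's write.csv(). The sketch below is self-contained, so it builds a hypothetical stand-in, billionaires_demo, from the names, ranks, and worth values printed on the slides. Its extraction calls follow the string and list() forms used in the deck's mutate() example, which may differ from the truncated calls above. The demo_* names and csv_path are illustrative only.

```r
library(purrr)

# Hypothetical stand-in built from values printed on the slides;
# the real billionaires_small has more fields per record.
billionaires_demo <- list(
  list(name = "John Simplot",   rank = 115L, wealth = list(`worth in billions` = 3.6)),
  list(name = "Banyong Lamsam", rank = 143L, wealth = list(`worth in billions` = 2.5)),
  list(name = "Richard Farmer", rank = 272L, wealth = list(`worth in billions` = 1.8))
)

# Typed extraction: each call returns an atomic vector of the stated type
demo_names <- map_chr(billionaires_demo, "name")
demo_ranks <- map_int(billionaires_demo, "rank")
demo_worth <- map_dbl(billionaires_demo, list("wealth", "worth in billions"))

# Combine the vectors into a rectangle
demo_df <- data.frame(
  name = demo_names,
  rank = demo_ranks,
  worth_in_billions = demo_worth,
  stringsAsFactors = FALSE
)

# Write the rectangle to a temporary CSV file and read it back as a check
csv_path <- tempfile(fileext = ".csv")
write.csv(demo_df, csv_path, row.names = FALSE)
round_trip <- read.csv(csv_path, stringsAsFactors = FALSE)
round_trip
```

Setting row.names = FALSE keeps write.csv() from adding an unnamed column of row numbers to the file.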
Slide 18

Why don't we contain the list in a data frame/tibble in the first place? Let's embrace the list column.

    library(tibble)
    billionaires_small_df <- billionaires_small %>% enframe()
    billionaires_small_df
    #> # A tibble: 3 x 2
    #>    name value
    #>   <int> <list>
    #> 1     1 …
    #> 2     2 …
    #> 3     3 …

Now we can make use of dplyr, ain't it cool?

    library(dplyr)
    billionaires_small_df %>%
      mutate(
        name = map_chr(value, "name"),
        rank = map_int(value, "rank"),
        worth_in_billions = map_dbl(
          value, list("wealth", "worth in billions"))
      ) %>%
      select(-value)
    #> # A tibble: 3 x 3
    #>   name            rank worth_in_billions
    #>   <chr>          <int>             <dbl>
    #> 1 John Simplot     115               3.6
    #> 2 Banyong Lamsam   143               2.5
    #> 3 Richard Farmer   272               1.8

Slide 19

Let's practice!

Open your RStudio, then install the usethis package.
Once that succeeds, run usethis::use_course("aswansyahputra/kpdr_jogja").
Follow the instructions; a new RStudio session will open automatically.
Open hands-on.Rmd and read the instructions thoroughly.

Slide 20

Thank you!

t.me/GNURIndonesia
r-indonesia.id

Slide 1

Data rectangling: a journey from JSON to CSV
Muhammad Aswan Syahputra
R Indonesia, www.r-indonesia.id
