Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Webscraping with rvest and purrr
Search
Max Humber
July 06, 2017
Programming
4
1.3k
Webscraping with rvest and purrr
useR!2017 / July 6, 2017 at 11:36am – 11:54am
Max Humber
July 06, 2017
Tweet
Share
More Decks by Max Humber
See All by Max Humber
Building Better Budgets
maxhumber
7
61
Accessible Algorithms
maxhumber
7
91
Amusing Algorithms
maxhumber
3
240
Data Creationism
maxhumber
4
610
Data Engineering for Data Scientists
maxhumber
6
1.1k
Personal Pynance
maxhumber
3
460
Visualizing Models
maxhumber
2
490
Patsy (PyData Berlin)
maxhumber
4
280
Data Driven Deviations
maxhumber
3
230
Other Decks in Programming
See All in Programming
subpath importsで始めるモック生活
10tera
0
300
Jakarta EE meets AI
ivargrimstad
0
120
【Kaigi on Rails 2024】YOUTRUST スポンサーLT
krpk1900
1
330
どうして僕の作ったクラスが手続き型と言われなきゃいけないんですか
akikogoto
1
120
TypeScriptでライブラリとの依存を限定的にする方法
tutinoko
2
660
Creating a Free Video Ad Network on the Edge
mizoguchicoji
0
110
Jakarta EE meets AI
ivargrimstad
0
580
Jakarta EE meets AI
ivargrimstad
0
510
GitHub Actionsのキャッシュと手を挙げることの大切さとそれに必要なこと
satoshi256kbyte
5
430
聞き手から登壇者へ: RubyKaigi2024 LTでの初挑戦が 教えてくれた、可能性の星
mikik0
1
130
3 Effective Rules for Using Signals in Angular
manfredsteyer
PRO
0
110
詳細解説! ArrayListの仕組みと実装
yujisoftware
0
580
Featured
See All Featured
Facilitating Awesome Meetings
lara
50
6.1k
RailsConf & Balkan Ruby 2019: The Past, Present, and Future of Rails at GitHub
eileencodes
131
33k
Helping Users Find Their Own Way: Creating Modern Search Experiences
danielanewman
29
2.3k
CoffeeScript is Beautiful & I Never Want to Write Plain JavaScript Again
sstephenson
159
15k
Six Lessons from altMBA
skipperchong
27
3.5k
The Art of Delivering Value - GDevCon NA Keynote
reverentgeek
8
740
Music & Morning Musume
bryan
46
6.2k
Agile that works and the tools we love
rasmusluckow
327
21k
The Invisible Side of Design
smashingmag
298
50k
Statistics for Hackers
jakevdp
796
220k
GraphQLとの向き合い方2022年版
quramy
43
13k
Code Reviewing Like a Champion
maltzj
520
39k
Transcript
webscraping with rvest & purrr @maxhumber useR!2017 2017-07-06
how to make money* with rvest & purrr**
how to make money* with rvest & purrr**
* *probably not. and i’m not liable if you lose
money. disclaimer. disclaimer. disclaimer.
* *probably not. and i’m not liable if you lose
money. disclaimer. disclaimer. disclaimer.
whoami
None
None
None
None
None
https://github.com/hadley/rvest https://github.com/tidyverse/purrr
fake money
None
None
$143.09 $334.64
None
$143.09 $334.64 H$59.91 H$709.32
H$709.32
None
moviestock ticker starbond price (in $m) that market thinks movie
will make in first 4 weeks @ B/O
H$720.33 H$812.73
None
None
starbond ticker
starbond ticker
arbitrage
1 2 3 4 5
1 2 3 4 5 6.82 1.98 3.71 812.73 6.91
1 2 3 4 5 6.82 1.98 3.71 812.73 6.91
1 2 3 4 5 46.35 6.82 1.98 3.71 812.73
6.91
1 2 3 4 5 6.82 1.98 3.71 812.73 6.91
46.35
automate
chrome extension: http://selectorgadget.com/
library(tidyverse) library(rvest) library(stringr) url <- "https://www.hsx.com/security/view/ADRIV" movie <- read_html(url) %>%
html_nodes(".credit a") %>% html_text() link <- read_html(url) %>% html_nodes(".credit a") %>% html_attr("href") date <- read_html(url) %>% html_nodes("strong") %>% html_text() %>% .[1:length(link)] df <- tibble(date, movie, link)
library(tidyverse) library(rvest) library(stringr) url <- "https://www.hsx.com/security/view/ADRIV" movie <- read_html(url) %>%
html_nodes(".credit a") %>% html_text() link <- read_html(url) %>% html_nodes(".credit a") %>% html_attr("href") date <- read_html(url) %>% html_nodes("strong") %>% html_text() %>% .[1:length(link)] df <- tibble(date, movie, link) text value selector
library(tidyverse) library(rvest) library(stringr) url <- "https://www.hsx.com/security/view/ADRIV" movie <- read_html(url) %>%
html_nodes(".credit a") %>% html_text() link <- read_html(url) %>% html_nodes(".credit a") %>% html_attr("href") date <- read_html(url) %>% html_nodes("strong") %>% html_text() %>% .[1:length(link)] df <- tibble(date, movie, link) text value underlying link ticker garbage selector
library(tidyverse) library(rvest) library(stringr) url <- "https://www.hsx.com/security/view/ADRIV" movie <- read_html(url) %>%
html_nodes(".credit a") %>% html_text() link <- read_html(url) %>% html_nodes(".credit a") %>% html_attr("href") date <- read_html(url) %>% html_nodes("strong") %>% html_text() %>% .[1:length(link)] df <- tibble(date, movie, link) text value underlying link ticker garbage selector
get_movies <- function(ticker) { url <- str_c( "https://www.hsx.com/security/view/", toupper(ticker)) movie
<- read_html(url) %>% html_nodes(".credit a") %>% html_text() link <- read_html(url) %>% html_nodes(".credit a") %>% html_attr("href") date <- read_html(url) %>% html_nodes("strong") %>% html_text() %>% .[1:length(link)] return(df) } wrap in function
View(adam_driver) adam_driver <- get_movies('adriv')
None
None
None
get_price <- function(link) { if (substr(link, 0, 1) == '/')
{ url <- str_c('https://www.hsx.com', link) } else { url <- str_c('https://www.hsx.com/security/view/',toupper(link)) } df <- read_html(url) %>% html_nodes(".value") %>% html_text() %>% as_tibble() %>% mutate(value = parse_number( str_extract(., "(?<=\\$)[^\\s]+"))) %>% rename(price = value) %>% bind_cols(link = tibble(link)) return(df) } get_price('/security/view/STAR8') get_price('adriv')
get_price <- function(link) { if (substr(link, 0, 1) == '/')
{ url <- str_c('https://www.hsx.com', link) } else { url <- str_c('https://www.hsx.com/security/view/',toupper(link)) } df <- read_html(url) %>% html_nodes(".value") %>% html_text() %>% as_tibble() %>% mutate(value = parse_number( str_extract(., "(?<=\\$)[^\\s]+"))) %>% rename(price = value) %>% bind_cols(link = tibble(link)) return(df) } get_price('/security/view/STAR8') get_price('adriv')
get_price <- function(link) { if (substr(link, 0, 1) == '/')
{ url <- str_c('https://www.hsx.com', link) } else { url <- str_c('https://www.hsx.com/security/view/',toupper(link)) } df <- read_html(url) %>% html_nodes(".value") %>% html_text() %>% as_tibble() %>% mutate(value = parse_number( str_extract(., "(?<=\\$)[^\\s]+"))) %>% rename(price = value) %>% bind_cols(link = tibble(link)) return(df) } get_price('/security/view/STAR8') get_price('adriv')
get_price <- function(link) { if (substr(link, 0, 1) == '/')
{ url <- str_c('https://www.hsx.com', link) } else { url <- str_c('https://www.hsx.com/security/view/',toupper(link)) } df <- read_html(url) %>% html_nodes(".value") %>% html_text() %>% as_tibble() %>% mutate(value = parse_number( str_extract(., "(?<=\\$)[^\\s]+"))) %>% rename(price = value) %>% bind_cols(link = tibble(link)) return(df) } get_price('/security/view/STAR8') get_price('adriv')
> get_price(‘/security/view/STAR8') # A tibble: 1 x 2 price link
<dbl> <chr> 1 720.33 /security/view/STAR8 > get_price(‘adriv') # A tibble: 1 x 2 price link <dbl> <chr> 1 59.91 adriv
> get_price(‘/security/view/STAR8') # A tibble: 1 x 2 price link
<dbl> <chr> 1 720.33 /security/view/STAR8 > get_price(‘adriv') # A tibble: 1 x 2 price link <dbl> <chr> 1 59.91 adriv
None
get_prices <- function(df) { params <- df %>% select(link) prices
<- pmap(params, get_price) %>% bind_rows() df <- prices %>% left_join(df, by = "link") return(df) } adam_driver <- get_movies('adriv') %>% get_prices(.)
get_prices <- function(df) { params <- df %>% select(link) prices
<- pmap(params, get_price) %>% bind_rows() df <- prices %>% left_join(df, by = "link") return(df) } adam_driver <- get_movies('adriv') %>% get_prices(.)
get_prices <- function(df) { params <- df %>% select(link) prices
<- pmap(params, get_price) %>% bind_rows() df <- prices %>% left_join(df, by = "link") return(df) } adam_driver <- get_movies('adriv') %>% get_prices(.)
clean
forward_tag <- function(df) { tag <- df %>% mutate(date =
as.Date(date, "%b %d, %Y")) %>% drop_na() %>% mutate(days = date - Sys.Date()) %>% mutate(future = ifelse(days >= 0, 'Yes', 'No')) %>% mutate(days = abs(days)) %>% group_by(future) %>% arrange(desc(future), days) %>% mutate(idx = row_number()) %>% ungroup() %>% filter((idx == 1 & future == 'Yes') | (future == "No" & idx <= 5)) forward <- tag %>% filter(idx <= 4) %>% pull(price) %>% mean(.) %>% round(., 2) return(forward) }
adam_driver <- get_movies('adriv') %>% get_prices(.) get_price('adriv') %>% pull(price) [1] 59.91
forward_tag(adam_driver) [1] 173.89
adam_driver <- get_movies('adriv') %>% get_prices(.) get_price('adriv') %>% pull(price) [1] 59.91
forward_tag(adam_driver) [1] 173.89
real money**
None
None
run the ball throw the ball catch the ball hybrid
+kick the ball +entire defence catch the ball
data
None
None
fetch_espn <- function(position = 0, offset = 0) { url
<- str_c(sep = "", "http://games.espn.com/ffl/tools/projections?", "&slotCategoryId=", position, "&startIndex=", offset) page <- read_html(url) df <- page %>% html_node("#playertable_0") %>% html_table() return(df) } df <- fetch_espn(position = 16, offset = 0)
fetch_proj <- function(position = 0, offset = 0) { url
<- str_c(sep = "", "http://games.espn.com/ffl/tools/projections?", "&slotCategoryId=", position, "&startIndex=", offset) page <- read_html(url) df <- page %>% html_node("#playertable_0") %>% html_table() return(df) } df <- fetch_proj(position = 0, offset = 0)
df <- fetch_proj(position = 0, offset = 0) View(df)
params <- expand.grid( position = c(0, 2, 4, 6, 16,
17), offset = seq(0, 320, 40)) espn_raw <- pmap(params, fetch_espn) %>% bind_rows()
params <- expand.grid( position = c(0, 2, 4, 6, 16,
17), offset = seq(0, 320, 40)) espn_raw <- pmap(params, fetch_espn) %>% bind_rows()
params <- expand.grid( position = c(0, 2, 4, 6, 16,
17), offset = seq(0, 320, 40)) espn_raw <- pmap(params, fetch_espn) %>% bind_rows() fetch_espn(position = __, offset = __ )
None
params <- expand.grid( position = c(0, 2, 4, 6, 16,
17), offset = seq(0, 320, 40)) raw <- pmap(params, fetch_proj) %>% bind_rows()
View(raw)
clean
proj <- raw %>% mutate(PLAYERS = ifelse(is.na(PLAYERS), `DEFENSIVE PLAYERS`, PLAYERS))
%>% select(name = PLAYERS, points = TOTAL) %>% mutate(points = parse_number(points)) %>% drop_na() %>% separate(name, into = c("name", "metadata"), sep = ", ", fill = "right") %>% mutate(name = str_replace(name, "D\\/ST\\sD\\/ST|\\*$","")) %>% mutate(metadata = str_replace(metadata, "^[^\\s]*\\s","")) %>% mutate(metadata = str_replace(metadata, "IR|Q$|O|SSPD|D", "")) %>% mutate(metadata = str_trim(metadata)) %>% mutate(position = ifelse(is.na(metadata), "DEF", metadata)) %>% select(position, name, points)
View(proj)
bad <- proj %>% group_by(position) %>% mutate(avg = mean(points)) %>%
mutate(vorp = points - avg) %>% select(position, name, points, vorp) %>% arrange(desc(vorp))
x2 x1 x3 x1 K x1 DEF x1 10 people
params <- tribble( ~pos, ~slots, "QB", 1, "WR", 3, "RB",
2, "TE", 1, "K", 1, "DEF", 1 ) %>% mutate(slots = slots * 10) replacement_player <- function(pos, slots) { rp <- proj %>% filter(position == pos) %>% arrange(desc(points)) %>% filter(row_number() <= slots) %>% group_by(position) %>% summarise(rp = mean(points)) return(rp) } rp <- pmap(params, replacement_player) %>% bind_rows()
rp <- pmap(params, replacement_player) %>% bind_rows() proj_vorp <- proj %>%
left_join(rp, by = "position") %>% mutate(vorp = points - rp) %>% select(position, name, points, vorp) %>% arrange(desc(vorp))
None
* *probably not. and i’m not liable if you lose
money. disclaimer. disclaimer. disclaimer.
preferably with rvest & purrr
obey robots.txt
thanks!!
maxhumber