Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Webscraping with rvest and purrr
Search
Max Humber
July 06, 2017
Programming
4
1.4k
Webscraping with rvest and purrr
useR!2017 / July 6, 2017 at 11:36am – 11:54am
Max Humber
July 06, 2017
Tweet
Share
More Decks by Max Humber
See All by Max Humber
Building Better Budgets
maxhumber
7
62
Accessible Algorithms
maxhumber
7
95
Amusing Algorithms
maxhumber
3
240
Data Creationism
maxhumber
4
620
Data Engineering for Data Scientists
maxhumber
6
1.1k
Personal Pynance
maxhumber
3
490
Visualizing Models
maxhumber
2
490
Patsy (PyData Berlin)
maxhumber
4
280
Data Driven Deviations
maxhumber
3
230
Other Decks in Programming
See All in Programming
WordPress Playground for Developers
iambherulal
0
120
新卒から4年間、20年もののWebサービスと 向き合って学んだソフトウェア考古学
oguri
8
7.1k
本当だってば!俺もTRICK 2022に入賞してたんだってば!
jinroq
0
280
Java 24まとめ / Java 24 summary
kishida
3
220
ノーコードツールの裏側につきまとう「20分岐」との戦い
oguemon
0
170
List とは何か? / PHPerKaigi 2025
meihei3
0
590
gen_statem - OTP's Unsung Hero
whatyouhide
1
190
RubyKaigiで手に入れた HHKB Studioのための HIDRawドライバ
iberianpig
0
1.1k
Preact、HooksとSignalsの両立 / Preact: Harmonizing Hooks and Signals
ssssota
1
1.1k
Vibe Codingをせずに Clineを使っている
watany
13
4.7k
Agentic Applications with Symfony
el_stoffel
1
180
Unlock the Potential of Swift Code Generation
rockname
0
130
Featured
See All Featured
ReactJS: Keep Simple. Everything can be a component!
pedronauck
666
120k
"I'm Feeling Lucky" - Building Great Search Experiences for Today's Users (#IAC19)
danielanewman
227
22k
Statistics for Hackers
jakevdp
798
220k
Code Review Best Practice
trishagee
67
18k
Optimizing for Happiness
mojombo
377
70k
Fashionably flexible responsive web design (full day workshop)
malarkey
407
66k
Building Applications with DynamoDB
mza
94
6.3k
RailsConf 2023
tenderlove
29
1k
Thoughts on Productivity
jonyablonski
69
4.6k
実際に使うSQLの書き方 徹底解説 / pgcon21j-tutorial
soudai
177
52k
The Illustrated Children's Guide to Kubernetes
chrisshort
48
49k
Unsuck your backbone
ammeep
670
57k
Transcript
webscraping with rvest & purrr @maxhumber useR!2017 2017-07-06
how to make money* with rvest & purrr**
how to make money* with rvest & purrr**
* *probably not. and i’m not liable if you lose
money. disclaimer. disclaimer. disclaimer.
* *probably not. and i’m not liable if you lose
money. disclaimer. disclaimer. disclaimer.
whoami
None
None
None
None
None
https://github.com/hadley/rvest https://github.com/tidyverse/purrr
fake money
None
None
$143.09 $334.64
None
$143.09 $334.64 H$59.91 H$709.32
H$709.32
None
moviestock ticker starbond price (in $m) that market thinks movie
will make in first 4 weeks @ B/O
H$720.33 H$812.73
None
None
starbond ticker
starbond ticker
arbitrage
1 2 3 4 5
1 2 3 4 5 6.82 1.98 3.71 812.73 6.91
1 2 3 4 5 6.82 1.98 3.71 812.73 6.91
1 2 3 4 5 46.35 6.82 1.98 3.71 812.73
6.91
1 2 3 4 5 6.82 1.98 3.71 812.73 6.91
46.35
automate
chrome extension: http://selectorgadget.com/
library(tidyverse) library(rvest) library(stringr) url <- "https://www.hsx.com/security/view/ADRIV" movie <- read_html(url) %>%
html_nodes(".credit a") %>% html_text() link <- read_html(url) %>% html_nodes(".credit a") %>% html_attr("href") date <- read_html(url) %>% html_nodes("strong") %>% html_text() %>% .[1:length(link)] df <- tibble(date, movie, link)
library(tidyverse) library(rvest) library(stringr) url <- "https://www.hsx.com/security/view/ADRIV" movie <- read_html(url) %>%
html_nodes(".credit a") %>% html_text() link <- read_html(url) %>% html_nodes(".credit a") %>% html_attr("href") date <- read_html(url) %>% html_nodes("strong") %>% html_text() %>% .[1:length(link)] df <- tibble(date, movie, link) text value selector
library(tidyverse) library(rvest) library(stringr) url <- "https://www.hsx.com/security/view/ADRIV" movie <- read_html(url) %>%
html_nodes(".credit a") %>% html_text() link <- read_html(url) %>% html_nodes(".credit a") %>% html_attr("href") date <- read_html(url) %>% html_nodes("strong") %>% html_text() %>% .[1:length(link)] df <- tibble(date, movie, link) text value underlying link ticker garbage selector
library(tidyverse) library(rvest) library(stringr) url <- "https://www.hsx.com/security/view/ADRIV" movie <- read_html(url) %>%
html_nodes(".credit a") %>% html_text() link <- read_html(url) %>% html_nodes(".credit a") %>% html_attr("href") date <- read_html(url) %>% html_nodes("strong") %>% html_text() %>% .[1:length(link)] df <- tibble(date, movie, link) text value underlying link ticker garbage selector
get_movies <- function(ticker) { url <- str_c( "https://www.hsx.com/security/view/", toupper(ticker)) movie
<- read_html(url) %>% html_nodes(".credit a") %>% html_text() link <- read_html(url) %>% html_nodes(".credit a") %>% html_attr("href") date <- read_html(url) %>% html_nodes("strong") %>% html_text() %>% .[1:length(link)] return(df) } wrap in function
View(adam_driver) adam_driver <- get_movies('adriv')
None
None
None
get_price <- function(link) { if (substr(link, 0, 1) == '/')
{ url <- str_c('https://www.hsx.com', link) } else { url <- str_c('https://www.hsx.com/security/view/',toupper(link)) } df <- read_html(url) %>% html_nodes(".value") %>% html_text() %>% as_tibble() %>% mutate(value = parse_number( str_extract(., "(?<=\\$)[^\\s]+"))) %>% rename(price = value) %>% bind_cols(link = tibble(link)) return(df) } get_price('/security/view/STAR8') get_price('adriv')
get_price <- function(link) { if (substr(link, 0, 1) == '/')
{ url <- str_c('https://www.hsx.com', link) } else { url <- str_c('https://www.hsx.com/security/view/',toupper(link)) } df <- read_html(url) %>% html_nodes(".value") %>% html_text() %>% as_tibble() %>% mutate(value = parse_number( str_extract(., "(?<=\\$)[^\\s]+"))) %>% rename(price = value) %>% bind_cols(link = tibble(link)) return(df) } get_price('/security/view/STAR8') get_price('adriv')
get_price <- function(link) { if (substr(link, 0, 1) == '/')
{ url <- str_c('https://www.hsx.com', link) } else { url <- str_c('https://www.hsx.com/security/view/',toupper(link)) } df <- read_html(url) %>% html_nodes(".value") %>% html_text() %>% as_tibble() %>% mutate(value = parse_number( str_extract(., "(?<=\\$)[^\\s]+"))) %>% rename(price = value) %>% bind_cols(link = tibble(link)) return(df) } get_price('/security/view/STAR8') get_price('adriv')
get_price <- function(link) { if (substr(link, 0, 1) == '/')
{ url <- str_c('https://www.hsx.com', link) } else { url <- str_c('https://www.hsx.com/security/view/',toupper(link)) } df <- read_html(url) %>% html_nodes(".value") %>% html_text() %>% as_tibble() %>% mutate(value = parse_number( str_extract(., "(?<=\\$)[^\\s]+"))) %>% rename(price = value) %>% bind_cols(link = tibble(link)) return(df) } get_price('/security/view/STAR8') get_price('adriv')
> get_price(‘/security/view/STAR8') # A tibble: 1 x 2 price link
<dbl> <chr> 1 720.33 /security/view/STAR8 > get_price(‘adriv') # A tibble: 1 x 2 price link <dbl> <chr> 1 59.91 adriv
> get_price(‘/security/view/STAR8') # A tibble: 1 x 2 price link
<dbl> <chr> 1 720.33 /security/view/STAR8 > get_price(‘adriv') # A tibble: 1 x 2 price link <dbl> <chr> 1 59.91 adriv
None
get_prices <- function(df) { params <- df %>% select(link) prices
<- pmap(params, get_price) %>% bind_rows() df <- prices %>% left_join(df, by = "link") return(df) } adam_driver <- get_movies('adriv') %>% get_prices(.)
get_prices <- function(df) { params <- df %>% select(link) prices
<- pmap(params, get_price) %>% bind_rows() df <- prices %>% left_join(df, by = "link") return(df) } adam_driver <- get_movies('adriv') %>% get_prices(.)
get_prices <- function(df) { params <- df %>% select(link) prices
<- pmap(params, get_price) %>% bind_rows() df <- prices %>% left_join(df, by = "link") return(df) } adam_driver <- get_movies('adriv') %>% get_prices(.)
clean
forward_tag <- function(df) { tag <- df %>% mutate(date =
as.Date(date, "%b %d, %Y")) %>% drop_na() %>% mutate(days = date - Sys.Date()) %>% mutate(future = ifelse(days >= 0, 'Yes', 'No')) %>% mutate(days = abs(days)) %>% group_by(future) %>% arrange(desc(future), days) %>% mutate(idx = row_number()) %>% ungroup() %>% filter((idx == 1 & future == 'Yes') | (future == "No" & idx <= 5)) forward <- tag %>% filter(idx <= 4) %>% pull(price) %>% mean(.) %>% round(., 2) return(forward) }
adam_driver <- get_movies('adriv') %>% get_prices(.) get_price('adriv') %>% pull(price) [1] 59.91
forward_tag(adam_driver) [1] 173.89
adam_driver <- get_movies('adriv') %>% get_prices(.) get_price('adriv') %>% pull(price) [1] 59.91
forward_tag(adam_driver) [1] 173.89
real money**
None
None
run the ball throw the ball catch the ball hybrid
+kick the ball +entire defence catch the ball
data
None
None
fetch_espn <- function(position = 0, offset = 0) { url
<- str_c(sep = "", "http://games.espn.com/ffl/tools/projections?", "&slotCategoryId=", position, "&startIndex=", offset) page <- read_html(url) df <- page %>% html_node("#playertable_0") %>% html_table() return(df) } df <- fetch_espn(position = 16, offset = 0)
fetch_proj <- function(position = 0, offset = 0) { url
<- str_c(sep = "", "http://games.espn.com/ffl/tools/projections?", "&slotCategoryId=", position, "&startIndex=", offset) page <- read_html(url) df <- page %>% html_node("#playertable_0") %>% html_table() return(df) } df <- fetch_proj(position = 0, offset = 0)
df <- fetch_proj(position = 0, offset = 0) View(df)
params <- expand.grid( position = c(0, 2, 4, 6, 16,
17), offset = seq(0, 320, 40)) espn_raw <- pmap(params, fetch_espn) %>% bind_rows()
params <- expand.grid( position = c(0, 2, 4, 6, 16,
17), offset = seq(0, 320, 40)) espn_raw <- pmap(params, fetch_espn) %>% bind_rows()
params <- expand.grid( position = c(0, 2, 4, 6, 16,
17), offset = seq(0, 320, 40)) espn_raw <- pmap(params, fetch_espn) %>% bind_rows() fetch_espn(position = __, offset = __ )
None
params <- expand.grid( position = c(0, 2, 4, 6, 16,
17), offset = seq(0, 320, 40)) raw <- pmap(params, fetch_proj) %>% bind_rows()
View(raw)
clean
proj <- raw %>% mutate(PLAYERS = ifelse(is.na(PLAYERS), `DEFENSIVE PLAYERS`, PLAYERS))
%>% select(name = PLAYERS, points = TOTAL) %>% mutate(points = parse_number(points)) %>% drop_na() %>% separate(name, into = c("name", "metadata"), sep = ", ", fill = "right") %>% mutate(name = str_replace(name, "D\\/ST\\sD\\/ST|\\*$","")) %>% mutate(metadata = str_replace(metadata, "^[^\\s]*\\s","")) %>% mutate(metadata = str_replace(metadata, "IR|Q$|O|SSPD|D", "")) %>% mutate(metadata = str_trim(metadata)) %>% mutate(position = ifelse(is.na(metadata), "DEF", metadata)) %>% select(position, name, points)
View(proj)
bad <- proj %>% group_by(position) %>% mutate(avg = mean(points)) %>%
mutate(vorp = points - avg) %>% select(position, name, points, vorp) %>% arrange(desc(vorp))
x2 x1 x3 x1 K x1 DEF x1 10 people
params <- tribble( ~pos, ~slots, "QB", 1, "WR", 3, "RB",
2, "TE", 1, "K", 1, "DEF", 1 ) %>% mutate(slots = slots * 10) replacement_player <- function(pos, slots) { rp <- proj %>% filter(position == pos) %>% arrange(desc(points)) %>% filter(row_number() <= slots) %>% group_by(position) %>% summarise(rp = mean(points)) return(rp) } rp <- pmap(params, replacement_player) %>% bind_rows()
rp <- pmap(params, replacement_player) %>% bind_rows() proj_vorp <- proj %>%
left_join(rp, by = "position") %>% mutate(vorp = points - rp) %>% select(position, name, points, vorp) %>% arrange(desc(vorp))
None
* *probably not. and i’m not liable if you lose
money. disclaimer. disclaimer. disclaimer.
preferably with rvest & purrr
obey robots.txt
thanks!!
maxhumber