$30 off During Our Annual Pro Sale. View Details »
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Webscraping with rvest and purrr
Search
Max Humber
July 06, 2017
Programming
4
1.5k
Webscraping with rvest and purrr
useR!2017 / July 6, 2017 at 11:36am – 11:54am
Max Humber
July 06, 2017
Tweet
Share
More Decks by Max Humber
See All by Max Humber
Building Better Budgets
maxhumber
7
69
Accessible Algorithms
maxhumber
7
110
Amusing Algorithms
maxhumber
3
260
Data Creationism
maxhumber
4
640
Data Engineering for Data Scientists
maxhumber
6
1.2k
Personal Pynance
maxhumber
3
520
Visualizing Models
maxhumber
2
510
Patsy (PyData Berlin)
maxhumber
4
290
Data Driven Deviations
maxhumber
3
240
Other Decks in Programming
See All in Programming
ハイパーメディア駆動アプリケーションとIslandアーキテクチャ: htmxによるWebアプリケーション開発と動的UIの局所的適用
nowaki28
0
390
手が足りない!兼業データエンジニアに必要だったアーキテクチャと立ち回り
zinkosuke
0
600
社内オペレーション改善のためのTypeScript / TSKaigi Hokuriku 2025
dachi023
1
570
Developing static sites with Ruby
okuramasafumi
0
260
AIコーディングエージェント(skywork)
kondai24
0
150
チームをチームにするEM
hitode909
0
300
【CA.ai #3】ワークフローから見直すAIエージェント — 必要な場面と“選ばない”判断
satoaoaka
0
230
251126 TestState APIってなんだっけ?Step Functionsテストどう変わる?
east_takumi
0
310
Why Kotlin? 電子カルテを Kotlin で開発する理由 / Why Kotlin? at Henry
agatan
2
6.9k
Level up your Gemini CLI - D&D Style!
palladius
1
180
TypeScript 5.9 で使えるようになった import defer でパフォーマンス最適化を実現する
bicstone
1
1.3k
Navigation 3: 적응형 UI를 위한 앱 탐색
fornewid
1
220
Featured
See All Featured
Site-Speed That Sticks
csswizardry
13
990
Java REST API Framework Comparison - PWX 2021
mraible
34
9k
Understanding Cognitive Biases in Performance Measurement
bluesmoon
32
2.7k
Thoughts on Productivity
jonyablonski
73
5k
StorybookのUI Testing Handbookを読んだ
zakiyama
31
6.4k
Refactoring Trust on Your Teams (GOTO; Chicago 2020)
rmw
35
3.3k
Measuring & Analyzing Core Web Vitals
bluesmoon
9
700
The Art of Programming - Codeland 2020
erikaheidi
56
14k
Intergalactic Javascript Robots from Outer Space
tanoku
273
27k
Large-scale JavaScript Application Architecture
addyosmani
515
110k
Fashionably flexible responsive web design (full day workshop)
malarkey
407
66k
Writing Fast Ruby
sferik
630
62k
Transcript
webscraping with rvest & purrr @maxhumber useR!2017 2017-07-06
how to make money* with rvest & purrr**
how to make money* with rvest & purrr**
* *probably not. and i’m not liable if you lose
money. disclaimer. disclaimer. disclaimer.
* *probably not. and i’m not liable if you lose
money. disclaimer. disclaimer. disclaimer.
whoami
None
None
None
None
None
https://github.com/hadley/rvest https://github.com/tidyverse/purrr
fake money
None
None
$143.09 $334.64
None
$143.09 $334.64 H$59.91 H$709.32
H$709.32
None
moviestock ticker starbond price (in $m) that market thinks movie
will make in first 4 weeks @ B/O
H$720.33 H$812.73
None
None
starbond ticker
starbond ticker
arbitrage
1 2 3 4 5
1 2 3 4 5 6.82 1.98 3.71 812.73 6.91
1 2 3 4 5 6.82 1.98 3.71 812.73 6.91
1 2 3 4 5 46.35 6.82 1.98 3.71 812.73
6.91
1 2 3 4 5 6.82 1.98 3.71 812.73 6.91
46.35
automate
chrome extension: http://selectorgadget.com/
library(tidyverse) library(rvest) library(stringr) url <- "https://www.hsx.com/security/view/ADRIV" movie <- read_html(url) %>%
html_nodes(".credit a") %>% html_text() link <- read_html(url) %>% html_nodes(".credit a") %>% html_attr("href") date <- read_html(url) %>% html_nodes("strong") %>% html_text() %>% .[1:length(link)] df <- tibble(date, movie, link)
library(tidyverse) library(rvest) library(stringr) url <- "https://www.hsx.com/security/view/ADRIV" movie <- read_html(url) %>%
html_nodes(".credit a") %>% html_text() link <- read_html(url) %>% html_nodes(".credit a") %>% html_attr("href") date <- read_html(url) %>% html_nodes("strong") %>% html_text() %>% .[1:length(link)] df <- tibble(date, movie, link) text value selector
library(tidyverse) library(rvest) library(stringr) url <- "https://www.hsx.com/security/view/ADRIV" movie <- read_html(url) %>%
html_nodes(".credit a") %>% html_text() link <- read_html(url) %>% html_nodes(".credit a") %>% html_attr("href") date <- read_html(url) %>% html_nodes("strong") %>% html_text() %>% .[1:length(link)] df <- tibble(date, movie, link) text value underlying link ticker garbage selector
library(tidyverse) library(rvest) library(stringr) url <- "https://www.hsx.com/security/view/ADRIV" movie <- read_html(url) %>%
html_nodes(".credit a") %>% html_text() link <- read_html(url) %>% html_nodes(".credit a") %>% html_attr("href") date <- read_html(url) %>% html_nodes("strong") %>% html_text() %>% .[1:length(link)] df <- tibble(date, movie, link) text value underlying link ticker garbage selector
get_movies <- function(ticker) { url <- str_c( "https://www.hsx.com/security/view/", toupper(ticker)) movie
<- read_html(url) %>% html_nodes(".credit a") %>% html_text() link <- read_html(url) %>% html_nodes(".credit a") %>% html_attr("href") date <- read_html(url) %>% html_nodes("strong") %>% html_text() %>% .[1:length(link)] return(df) } wrap in function
View(adam_driver) adam_driver <- get_movies('adriv')
None
None
None
get_price <- function(link) { if (substr(link, 0, 1) == '/')
{ url <- str_c('https://www.hsx.com', link) } else { url <- str_c('https://www.hsx.com/security/view/',toupper(link)) } df <- read_html(url) %>% html_nodes(".value") %>% html_text() %>% as_tibble() %>% mutate(value = parse_number( str_extract(., "(?<=\\$)[^\\s]+"))) %>% rename(price = value) %>% bind_cols(link = tibble(link)) return(df) } get_price('/security/view/STAR8') get_price('adriv')
get_price <- function(link) { if (substr(link, 0, 1) == '/')
{ url <- str_c('https://www.hsx.com', link) } else { url <- str_c('https://www.hsx.com/security/view/',toupper(link)) } df <- read_html(url) %>% html_nodes(".value") %>% html_text() %>% as_tibble() %>% mutate(value = parse_number( str_extract(., "(?<=\\$)[^\\s]+"))) %>% rename(price = value) %>% bind_cols(link = tibble(link)) return(df) } get_price('/security/view/STAR8') get_price('adriv')
get_price <- function(link) { if (substr(link, 0, 1) == '/')
{ url <- str_c('https://www.hsx.com', link) } else { url <- str_c('https://www.hsx.com/security/view/',toupper(link)) } df <- read_html(url) %>% html_nodes(".value") %>% html_text() %>% as_tibble() %>% mutate(value = parse_number( str_extract(., "(?<=\\$)[^\\s]+"))) %>% rename(price = value) %>% bind_cols(link = tibble(link)) return(df) } get_price('/security/view/STAR8') get_price('adriv')
get_price <- function(link) { if (substr(link, 0, 1) == '/')
{ url <- str_c('https://www.hsx.com', link) } else { url <- str_c('https://www.hsx.com/security/view/',toupper(link)) } df <- read_html(url) %>% html_nodes(".value") %>% html_text() %>% as_tibble() %>% mutate(value = parse_number( str_extract(., "(?<=\\$)[^\\s]+"))) %>% rename(price = value) %>% bind_cols(link = tibble(link)) return(df) } get_price('/security/view/STAR8') get_price('adriv')
> get_price(‘/security/view/STAR8') # A tibble: 1 x 2 price link
<dbl> <chr> 1 720.33 /security/view/STAR8 > get_price(‘adriv') # A tibble: 1 x 2 price link <dbl> <chr> 1 59.91 adriv
> get_price(‘/security/view/STAR8') # A tibble: 1 x 2 price link
<dbl> <chr> 1 720.33 /security/view/STAR8 > get_price(‘adriv') # A tibble: 1 x 2 price link <dbl> <chr> 1 59.91 adriv
None
get_prices <- function(df) { params <- df %>% select(link) prices
<- pmap(params, get_price) %>% bind_rows() df <- prices %>% left_join(df, by = "link") return(df) } adam_driver <- get_movies('adriv') %>% get_prices(.)
get_prices <- function(df) { params <- df %>% select(link) prices
<- pmap(params, get_price) %>% bind_rows() df <- prices %>% left_join(df, by = "link") return(df) } adam_driver <- get_movies('adriv') %>% get_prices(.)
get_prices <- function(df) { params <- df %>% select(link) prices
<- pmap(params, get_price) %>% bind_rows() df <- prices %>% left_join(df, by = "link") return(df) } adam_driver <- get_movies('adriv') %>% get_prices(.)
clean
forward_tag <- function(df) { tag <- df %>% mutate(date =
as.Date(date, "%b %d, %Y")) %>% drop_na() %>% mutate(days = date - Sys.Date()) %>% mutate(future = ifelse(days >= 0, 'Yes', 'No')) %>% mutate(days = abs(days)) %>% group_by(future) %>% arrange(desc(future), days) %>% mutate(idx = row_number()) %>% ungroup() %>% filter((idx == 1 & future == 'Yes') | (future == "No" & idx <= 5)) forward <- tag %>% filter(idx <= 4) %>% pull(price) %>% mean(.) %>% round(., 2) return(forward) }
adam_driver <- get_movies('adriv') %>% get_prices(.) get_price('adriv') %>% pull(price) [1] 59.91
forward_tag(adam_driver) [1] 173.89
adam_driver <- get_movies('adriv') %>% get_prices(.) get_price('adriv') %>% pull(price) [1] 59.91
forward_tag(adam_driver) [1] 173.89
real money**
None
None
run the ball throw the ball catch the ball hybrid
+kick the ball +entire defence catch the ball
data
None
None
fetch_espn <- function(position = 0, offset = 0) { url
<- str_c(sep = "", "http://games.espn.com/ffl/tools/projections?", "&slotCategoryId=", position, "&startIndex=", offset) page <- read_html(url) df <- page %>% html_node("#playertable_0") %>% html_table() return(df) } df <- fetch_espn(position = 16, offset = 0)
fetch_proj <- function(position = 0, offset = 0) { url
<- str_c(sep = "", "http://games.espn.com/ffl/tools/projections?", "&slotCategoryId=", position, "&startIndex=", offset) page <- read_html(url) df <- page %>% html_node("#playertable_0") %>% html_table() return(df) } df <- fetch_proj(position = 0, offset = 0)
df <- fetch_proj(position = 0, offset = 0) View(df)
params <- expand.grid( position = c(0, 2, 4, 6, 16,
17), offset = seq(0, 320, 40)) espn_raw <- pmap(params, fetch_espn) %>% bind_rows()
params <- expand.grid( position = c(0, 2, 4, 6, 16,
17), offset = seq(0, 320, 40)) espn_raw <- pmap(params, fetch_espn) %>% bind_rows()
params <- expand.grid( position = c(0, 2, 4, 6, 16,
17), offset = seq(0, 320, 40)) espn_raw <- pmap(params, fetch_espn) %>% bind_rows() fetch_espn(position = __, offset = __ )
None
params <- expand.grid( position = c(0, 2, 4, 6, 16,
17), offset = seq(0, 320, 40)) raw <- pmap(params, fetch_proj) %>% bind_rows()
View(raw)
clean
proj <- raw %>% mutate(PLAYERS = ifelse(is.na(PLAYERS), `DEFENSIVE PLAYERS`, PLAYERS))
%>% select(name = PLAYERS, points = TOTAL) %>% mutate(points = parse_number(points)) %>% drop_na() %>% separate(name, into = c("name", "metadata"), sep = ", ", fill = "right") %>% mutate(name = str_replace(name, "D\\/ST\\sD\\/ST|\\*$","")) %>% mutate(metadata = str_replace(metadata, "^[^\\s]*\\s","")) %>% mutate(metadata = str_replace(metadata, "IR|Q$|O|SSPD|D", "")) %>% mutate(metadata = str_trim(metadata)) %>% mutate(position = ifelse(is.na(metadata), "DEF", metadata)) %>% select(position, name, points)
View(proj)
bad <- proj %>% group_by(position) %>% mutate(avg = mean(points)) %>%
mutate(vorp = points - avg) %>% select(position, name, points, vorp) %>% arrange(desc(vorp))
x2 x1 x3 x1 K x1 DEF x1 10 people
params <- tribble( ~pos, ~slots, "QB", 1, "WR", 3, "RB",
2, "TE", 1, "K", 1, "DEF", 1 ) %>% mutate(slots = slots * 10) replacement_player <- function(pos, slots) { rp <- proj %>% filter(position == pos) %>% arrange(desc(points)) %>% filter(row_number() <= slots) %>% group_by(position) %>% summarise(rp = mean(points)) return(rp) } rp <- pmap(params, replacement_player) %>% bind_rows()
rp <- pmap(params, replacement_player) %>% bind_rows() proj_vorp <- proj %>%
left_join(rp, by = "position") %>% mutate(vorp = points - rp) %>% select(position, name, points, vorp) %>% arrange(desc(vorp))
None
* *probably not. and i’m not liable if you lose
money. disclaimer. disclaimer. disclaimer.
preferably with rvest & purrr
obey robots.txt
thanks!!
maxhumber