Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Webscraping with rvest and purrr
Search
Max Humber
July 06, 2017
Programming
4
1.5k
Webscraping with rvest and purrr
useR!2017 / July 6, 2017 at 11:36am – 11:54am
Max Humber
July 06, 2017
Tweet
Share
More Decks by Max Humber
See All by Max Humber
Building Better Budgets
maxhumber
7
67
Accessible Algorithms
maxhumber
7
110
Amusing Algorithms
maxhumber
3
260
Data Creationism
maxhumber
4
640
Data Engineering for Data Scientists
maxhumber
6
1.2k
Personal Pynance
maxhumber
3
520
Visualizing Models
maxhumber
2
510
Patsy (PyData Berlin)
maxhumber
4
290
Data Driven Deviations
maxhumber
3
240
Other Decks in Programming
See All in Programming
エンジニアインターン「Treasure」とHonoの2年、そして未来へ / Our Journey with Hono Two Years at Treasure and Beyond
carta_engineering
0
360
あなたとKaigi on Rails / Kaigi on Rails + You
shimoju
0
170
CSC509 Lecture 04
javiergs
PRO
0
300
スキーマ駆動で、Zod OpenAPI Honoによる、API開発するために、Hono Takibiというライブラリを作っている
nakita628
0
230
登壇は dynamic! な営みである / speech is dynamic
da1chi
0
350
Cursorハンズオン実践!
eltociear
2
1.1k
TFLintカスタムプラグインで始める Terraformコード品質管理
bells17
2
330
CSC305 Lecture 04
javiergs
PRO
0
270
3年ぶりにコードを書いた元CTOが Claude Codeと30分でMVPを作った話
maikokojima
0
590
Swift Concurrency - 状態監視の罠
objectiveaudio
2
550
Range on Rails ―「多重範囲型」という新たな選択肢が、複雑ロジックを劇的にシンプルにしたワケ
rizap_tech
0
6.7k
bootcamp2025_バックエンド研修_WebAPIサーバ作成.pdf
geniee_inc
0
120
Featured
See All Featured
Creating an realtime collaboration tool: Agile Flush - .NET Oxford
marcduiker
34
2.3k
Distributed Sagas: A Protocol for Coordinating Microservices
caitiem20
333
22k
"I'm Feeling Lucky" - Building Great Search Experiences for Today's Users (#IAC19)
danielanewman
230
22k
Being A Developer After 40
akosma
91
590k
Practical Tips for Bootstrapping Information Extraction Pipelines
honnibal
PRO
23
1.5k
No one is an island. Learnings from fostering a developers community.
thoeni
21
3.5k
The Web Performance Landscape in 2024 [PerfNow 2024]
tammyeverts
10
880
Refactoring Trust on Your Teams (GOTO; Chicago 2020)
rmw
35
3.2k
Side Projects
sachag
455
43k
Building a Scalable Design System with Sketch
lauravandoore
463
33k
Rebuilding a faster, lazier Slack
samanthasiow
84
9.2k
Faster Mobile Websites
deanohume
310
31k
Transcript
webscraping with rvest & purrr @maxhumber useR!2017 2017-07-06
how to make money* with rvest & purrr**
how to make money* with rvest & purrr**
* *probably not. and i’m not liable if you lose
money. disclaimer. disclaimer. disclaimer.
* *probably not. and i’m not liable if you lose
money. disclaimer. disclaimer. disclaimer.
whoami
None
None
None
None
None
https://github.com/hadley/rvest https://github.com/tidyverse/purrr
fake money
None
None
$143.09 $334.64
None
$143.09 $334.64 H$59.91 H$709.32
H$709.32
None
moviestock ticker starbond price (in $m) that market thinks movie
will make in first 4 weeks @ B/O
H$720.33 H$812.73
None
None
starbond ticker
starbond ticker
arbitrage
1 2 3 4 5
1 2 3 4 5 6.82 1.98 3.71 812.73 6.91
1 2 3 4 5 6.82 1.98 3.71 812.73 6.91
1 2 3 4 5 46.35 6.82 1.98 3.71 812.73
6.91
1 2 3 4 5 6.82 1.98 3.71 812.73 6.91
46.35
automate
chrome extension: http://selectorgadget.com/
library(tidyverse) library(rvest) library(stringr) url <- "https://www.hsx.com/security/view/ADRIV" movie <- read_html(url) %>%
html_nodes(".credit a") %>% html_text() link <- read_html(url) %>% html_nodes(".credit a") %>% html_attr("href") date <- read_html(url) %>% html_nodes("strong") %>% html_text() %>% .[1:length(link)] df <- tibble(date, movie, link)
library(tidyverse) library(rvest) library(stringr) url <- "https://www.hsx.com/security/view/ADRIV" movie <- read_html(url) %>%
html_nodes(".credit a") %>% html_text() link <- read_html(url) %>% html_nodes(".credit a") %>% html_attr("href") date <- read_html(url) %>% html_nodes("strong") %>% html_text() %>% .[1:length(link)] df <- tibble(date, movie, link) text value selector
library(tidyverse) library(rvest) library(stringr) url <- "https://www.hsx.com/security/view/ADRIV" movie <- read_html(url) %>%
html_nodes(".credit a") %>% html_text() link <- read_html(url) %>% html_nodes(".credit a") %>% html_attr("href") date <- read_html(url) %>% html_nodes("strong") %>% html_text() %>% .[1:length(link)] df <- tibble(date, movie, link) text value underlying link ticker garbage selector
library(tidyverse) library(rvest) library(stringr) url <- "https://www.hsx.com/security/view/ADRIV" movie <- read_html(url) %>%
html_nodes(".credit a") %>% html_text() link <- read_html(url) %>% html_nodes(".credit a") %>% html_attr("href") date <- read_html(url) %>% html_nodes("strong") %>% html_text() %>% .[1:length(link)] df <- tibble(date, movie, link) text value underlying link ticker garbage selector
get_movies <- function(ticker) { url <- str_c( "https://www.hsx.com/security/view/", toupper(ticker)) movie
<- read_html(url) %>% html_nodes(".credit a") %>% html_text() link <- read_html(url) %>% html_nodes(".credit a") %>% html_attr("href") date <- read_html(url) %>% html_nodes("strong") %>% html_text() %>% .[1:length(link)] return(df) } wrap in function
View(adam_driver) adam_driver <- get_movies('adriv')
None
None
None
get_price <- function(link) { if (substr(link, 0, 1) == '/')
{ url <- str_c('https://www.hsx.com', link) } else { url <- str_c('https://www.hsx.com/security/view/',toupper(link)) } df <- read_html(url) %>% html_nodes(".value") %>% html_text() %>% as_tibble() %>% mutate(value = parse_number( str_extract(., "(?<=\\$)[^\\s]+"))) %>% rename(price = value) %>% bind_cols(link = tibble(link)) return(df) } get_price('/security/view/STAR8') get_price('adriv')
get_price <- function(link) { if (substr(link, 0, 1) == '/')
{ url <- str_c('https://www.hsx.com', link) } else { url <- str_c('https://www.hsx.com/security/view/',toupper(link)) } df <- read_html(url) %>% html_nodes(".value") %>% html_text() %>% as_tibble() %>% mutate(value = parse_number( str_extract(., "(?<=\\$)[^\\s]+"))) %>% rename(price = value) %>% bind_cols(link = tibble(link)) return(df) } get_price('/security/view/STAR8') get_price('adriv')
get_price <- function(link) { if (substr(link, 0, 1) == '/')
{ url <- str_c('https://www.hsx.com', link) } else { url <- str_c('https://www.hsx.com/security/view/',toupper(link)) } df <- read_html(url) %>% html_nodes(".value") %>% html_text() %>% as_tibble() %>% mutate(value = parse_number( str_extract(., "(?<=\\$)[^\\s]+"))) %>% rename(price = value) %>% bind_cols(link = tibble(link)) return(df) } get_price('/security/view/STAR8') get_price('adriv')
get_price <- function(link) { if (substr(link, 0, 1) == '/')
{ url <- str_c('https://www.hsx.com', link) } else { url <- str_c('https://www.hsx.com/security/view/',toupper(link)) } df <- read_html(url) %>% html_nodes(".value") %>% html_text() %>% as_tibble() %>% mutate(value = parse_number( str_extract(., "(?<=\\$)[^\\s]+"))) %>% rename(price = value) %>% bind_cols(link = tibble(link)) return(df) } get_price('/security/view/STAR8') get_price('adriv')
> get_price(‘/security/view/STAR8') # A tibble: 1 x 2 price link
<dbl> <chr> 1 720.33 /security/view/STAR8 > get_price(‘adriv') # A tibble: 1 x 2 price link <dbl> <chr> 1 59.91 adriv
> get_price(‘/security/view/STAR8') # A tibble: 1 x 2 price link
<dbl> <chr> 1 720.33 /security/view/STAR8 > get_price(‘adriv') # A tibble: 1 x 2 price link <dbl> <chr> 1 59.91 adriv
None
get_prices <- function(df) { params <- df %>% select(link) prices
<- pmap(params, get_price) %>% bind_rows() df <- prices %>% left_join(df, by = "link") return(df) } adam_driver <- get_movies('adriv') %>% get_prices(.)
get_prices <- function(df) { params <- df %>% select(link) prices
<- pmap(params, get_price) %>% bind_rows() df <- prices %>% left_join(df, by = "link") return(df) } adam_driver <- get_movies('adriv') %>% get_prices(.)
get_prices <- function(df) { params <- df %>% select(link) prices
<- pmap(params, get_price) %>% bind_rows() df <- prices %>% left_join(df, by = "link") return(df) } adam_driver <- get_movies('adriv') %>% get_prices(.)
clean
forward_tag <- function(df) { tag <- df %>% mutate(date =
as.Date(date, "%b %d, %Y")) %>% drop_na() %>% mutate(days = date - Sys.Date()) %>% mutate(future = ifelse(days >= 0, 'Yes', 'No')) %>% mutate(days = abs(days)) %>% group_by(future) %>% arrange(desc(future), days) %>% mutate(idx = row_number()) %>% ungroup() %>% filter((idx == 1 & future == 'Yes') | (future == "No" & idx <= 5)) forward <- tag %>% filter(idx <= 4) %>% pull(price) %>% mean(.) %>% round(., 2) return(forward) }
adam_driver <- get_movies('adriv') %>% get_prices(.) get_price('adriv') %>% pull(price) [1] 59.91
forward_tag(adam_driver) [1] 173.89
adam_driver <- get_movies('adriv') %>% get_prices(.) get_price('adriv') %>% pull(price) [1] 59.91
forward_tag(adam_driver) [1] 173.89
real money**
None
None
run the ball throw the ball catch the ball hybrid
+kick the ball +entire defence catch the ball
data
None
None
fetch_espn <- function(position = 0, offset = 0) { url
<- str_c(sep = "", "http://games.espn.com/ffl/tools/projections?", "&slotCategoryId=", position, "&startIndex=", offset) page <- read_html(url) df <- page %>% html_node("#playertable_0") %>% html_table() return(df) } df <- fetch_espn(position = 16, offset = 0)
fetch_proj <- function(position = 0, offset = 0) { url
<- str_c(sep = "", "http://games.espn.com/ffl/tools/projections?", "&slotCategoryId=", position, "&startIndex=", offset) page <- read_html(url) df <- page %>% html_node("#playertable_0") %>% html_table() return(df) } df <- fetch_proj(position = 0, offset = 0)
df <- fetch_proj(position = 0, offset = 0) View(df)
params <- expand.grid( position = c(0, 2, 4, 6, 16,
17), offset = seq(0, 320, 40)) espn_raw <- pmap(params, fetch_espn) %>% bind_rows()
params <- expand.grid( position = c(0, 2, 4, 6, 16,
17), offset = seq(0, 320, 40)) espn_raw <- pmap(params, fetch_espn) %>% bind_rows()
params <- expand.grid( position = c(0, 2, 4, 6, 16,
17), offset = seq(0, 320, 40)) espn_raw <- pmap(params, fetch_espn) %>% bind_rows() fetch_espn(position = __, offset = __ )
None
params <- expand.grid( position = c(0, 2, 4, 6, 16,
17), offset = seq(0, 320, 40)) raw <- pmap(params, fetch_proj) %>% bind_rows()
View(raw)
clean
proj <- raw %>% mutate(PLAYERS = ifelse(is.na(PLAYERS), `DEFENSIVE PLAYERS`, PLAYERS))
%>% select(name = PLAYERS, points = TOTAL) %>% mutate(points = parse_number(points)) %>% drop_na() %>% separate(name, into = c("name", "metadata"), sep = ", ", fill = "right") %>% mutate(name = str_replace(name, "D\\/ST\\sD\\/ST|\\*$","")) %>% mutate(metadata = str_replace(metadata, "^[^\\s]*\\s","")) %>% mutate(metadata = str_replace(metadata, "IR|Q$|O|SSPD|D", "")) %>% mutate(metadata = str_trim(metadata)) %>% mutate(position = ifelse(is.na(metadata), "DEF", metadata)) %>% select(position, name, points)
View(proj)
bad <- proj %>% group_by(position) %>% mutate(avg = mean(points)) %>%
mutate(vorp = points - avg) %>% select(position, name, points, vorp) %>% arrange(desc(vorp))
x2 x1 x3 x1 K x1 DEF x1 10 people
params <- tribble( ~pos, ~slots, "QB", 1, "WR", 3, "RB",
2, "TE", 1, "K", 1, "DEF", 1 ) %>% mutate(slots = slots * 10) replacement_player <- function(pos, slots) { rp <- proj %>% filter(position == pos) %>% arrange(desc(points)) %>% filter(row_number() <= slots) %>% group_by(position) %>% summarise(rp = mean(points)) return(rp) } rp <- pmap(params, replacement_player) %>% bind_rows()
rp <- pmap(params, replacement_player) %>% bind_rows() proj_vorp <- proj %>%
left_join(rp, by = "position") %>% mutate(vorp = points - rp) %>% select(position, name, points, vorp) %>% arrange(desc(vorp))
None
* *probably not. and i’m not liable if you lose
money. disclaimer. disclaimer. disclaimer.
preferably with rvest & purrr
obey robots.txt
thanks!!
maxhumber