Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Webscraping with rvest and purrr
Search
Max Humber
July 06, 2017
Programming
4
1.4k
Webscraping with rvest and purrr
useR!2017 / July 6, 2017 at 11:36am – 11:54am
Max Humber
July 06, 2017
Tweet
Share
More Decks by Max Humber
See All by Max Humber
Building Better Budgets
maxhumber
7
65
Accessible Algorithms
maxhumber
7
100
Amusing Algorithms
maxhumber
3
250
Data Creationism
maxhumber
4
630
Data Engineering for Data Scientists
maxhumber
6
1.2k
Personal Pynance
maxhumber
3
520
Visualizing Models
maxhumber
2
500
Patsy (PyData Berlin)
maxhumber
4
280
Data Driven Deviations
maxhumber
3
240
Other Decks in Programming
See All in Programming
코딩 에이전트 체크리스트: Claude Code ver.
nacyot
0
870
Goで作る、開発・CI環境
sin392
0
260
MCPを使ってイベントソーシングのAIコーディングを効率化する / Streamlining Event Sourcing AI Coding with MCP
tomohisa
0
160
PicoRuby on Rails
makicamel
2
140
Flutterで備える!Accessibility Nutrition Labels完全ガイド
yuukiw00w
0
170
20250704_教育事業におけるアジャイルなデータ基盤構築
hanon52_
5
910
生成AI時代のコンポーネントライブラリの作り方
touyou
1
260
はじめてのWeb API体験 ー 飲食店検索アプリを作ろうー
akinko_0915
0
130
Agentic Coding: The Future of Software Development with Agents
mitsuhiko
0
120
なぜ「共通化」を考え、失敗を繰り返すのか
rinchoku
1
670
AIプログラマーDevinは PHPerの夢を見るか?
shinyasaita
1
240
テストから始めるAgentic Coding 〜Claude Codeと共に行うTDD〜 / Agentic Coding starts with testing
rkaga
15
5.3k
Featured
See All Featured
Visualization
eitanlees
146
16k
The MySQL Ecosystem @ GitHub 2015
samlambert
251
13k
Measuring & Analyzing Core Web Vitals
bluesmoon
7
510
Building Flexible Design Systems
yeseniaperezcruz
328
39k
XXLCSS - How to scale CSS and keep your sanity
sugarenia
248
1.3M
Fantastic passwords and where to find them - at NoRuKo
philnash
51
3.3k
Creating an realtime collaboration tool: Agile Flush - .NET Oxford
marcduiker
30
2.1k
Speed Design
sergeychernyshev
32
1k
Bootstrapping a Software Product
garrettdimon
PRO
307
110k
Scaling GitHub
holman
460
140k
Practical Orchestrator
shlominoach
189
11k
The Straight Up "How To Draw Better" Workshop
denniskardys
235
140k
Transcript
webscraping with rvest & purrr @maxhumber useR!2017 2017-07-06
how to make money* with rvest & purrr**
how to make money* with rvest & purrr**
* *probably not. and i’m not liable if you lose
money. disclaimer. disclaimer. disclaimer.
* *probably not. and i’m not liable if you lose
money. disclaimer. disclaimer. disclaimer.
whoami
None
None
None
None
None
https://github.com/hadley/rvest https://github.com/tidyverse/purrr
fake money
None
None
$143.09 $334.64
None
$143.09 $334.64 H$59.91 H$709.32
H$709.32
None
moviestock ticker starbond price (in $m) that market thinks movie
will make in first 4 weeks @ B/O
H$720.33 H$812.73
None
None
starbond ticker
starbond ticker
arbitrage
1 2 3 4 5
1 2 3 4 5 6.82 1.98 3.71 812.73 6.91
1 2 3 4 5 6.82 1.98 3.71 812.73 6.91
1 2 3 4 5 46.35 6.82 1.98 3.71 812.73
6.91
1 2 3 4 5 6.82 1.98 3.71 812.73 6.91
46.35
automate
chrome extension: http://selectorgadget.com/
library(tidyverse) library(rvest) library(stringr) url <- "https://www.hsx.com/security/view/ADRIV" movie <- read_html(url) %>%
html_nodes(".credit a") %>% html_text() link <- read_html(url) %>% html_nodes(".credit a") %>% html_attr("href") date <- read_html(url) %>% html_nodes("strong") %>% html_text() %>% .[1:length(link)] df <- tibble(date, movie, link)
library(tidyverse) library(rvest) library(stringr) url <- "https://www.hsx.com/security/view/ADRIV" movie <- read_html(url) %>%
html_nodes(".credit a") %>% html_text() link <- read_html(url) %>% html_nodes(".credit a") %>% html_attr("href") date <- read_html(url) %>% html_nodes("strong") %>% html_text() %>% .[1:length(link)] df <- tibble(date, movie, link) text value selector
library(tidyverse) library(rvest) library(stringr) url <- "https://www.hsx.com/security/view/ADRIV" movie <- read_html(url) %>%
html_nodes(".credit a") %>% html_text() link <- read_html(url) %>% html_nodes(".credit a") %>% html_attr("href") date <- read_html(url) %>% html_nodes("strong") %>% html_text() %>% .[1:length(link)] df <- tibble(date, movie, link) text value underlying link ticker garbage selector
library(tidyverse) library(rvest) library(stringr) url <- "https://www.hsx.com/security/view/ADRIV" movie <- read_html(url) %>%
html_nodes(".credit a") %>% html_text() link <- read_html(url) %>% html_nodes(".credit a") %>% html_attr("href") date <- read_html(url) %>% html_nodes("strong") %>% html_text() %>% .[1:length(link)] df <- tibble(date, movie, link) text value underlying link ticker garbage selector
get_movies <- function(ticker) { url <- str_c( "https://www.hsx.com/security/view/", toupper(ticker)) movie
<- read_html(url) %>% html_nodes(".credit a") %>% html_text() link <- read_html(url) %>% html_nodes(".credit a") %>% html_attr("href") date <- read_html(url) %>% html_nodes("strong") %>% html_text() %>% .[1:length(link)] return(df) } wrap in function
View(adam_driver) adam_driver <- get_movies('adriv')
None
None
None
get_price <- function(link) { if (substr(link, 0, 1) == '/')
{ url <- str_c('https://www.hsx.com', link) } else { url <- str_c('https://www.hsx.com/security/view/',toupper(link)) } df <- read_html(url) %>% html_nodes(".value") %>% html_text() %>% as_tibble() %>% mutate(value = parse_number( str_extract(., "(?<=\\$)[^\\s]+"))) %>% rename(price = value) %>% bind_cols(link = tibble(link)) return(df) } get_price('/security/view/STAR8') get_price('adriv')
get_price <- function(link) { if (substr(link, 0, 1) == '/')
{ url <- str_c('https://www.hsx.com', link) } else { url <- str_c('https://www.hsx.com/security/view/',toupper(link)) } df <- read_html(url) %>% html_nodes(".value") %>% html_text() %>% as_tibble() %>% mutate(value = parse_number( str_extract(., "(?<=\\$)[^\\s]+"))) %>% rename(price = value) %>% bind_cols(link = tibble(link)) return(df) } get_price('/security/view/STAR8') get_price('adriv')
get_price <- function(link) { if (substr(link, 0, 1) == '/')
{ url <- str_c('https://www.hsx.com', link) } else { url <- str_c('https://www.hsx.com/security/view/',toupper(link)) } df <- read_html(url) %>% html_nodes(".value") %>% html_text() %>% as_tibble() %>% mutate(value = parse_number( str_extract(., "(?<=\\$)[^\\s]+"))) %>% rename(price = value) %>% bind_cols(link = tibble(link)) return(df) } get_price('/security/view/STAR8') get_price('adriv')
get_price <- function(link) { if (substr(link, 0, 1) == '/')
{ url <- str_c('https://www.hsx.com', link) } else { url <- str_c('https://www.hsx.com/security/view/',toupper(link)) } df <- read_html(url) %>% html_nodes(".value") %>% html_text() %>% as_tibble() %>% mutate(value = parse_number( str_extract(., "(?<=\\$)[^\\s]+"))) %>% rename(price = value) %>% bind_cols(link = tibble(link)) return(df) } get_price('/security/view/STAR8') get_price('adriv')
> get_price(‘/security/view/STAR8') # A tibble: 1 x 2 price link
<dbl> <chr> 1 720.33 /security/view/STAR8 > get_price(‘adriv') # A tibble: 1 x 2 price link <dbl> <chr> 1 59.91 adriv
> get_price(‘/security/view/STAR8') # A tibble: 1 x 2 price link
<dbl> <chr> 1 720.33 /security/view/STAR8 > get_price(‘adriv') # A tibble: 1 x 2 price link <dbl> <chr> 1 59.91 adriv
None
get_prices <- function(df) { params <- df %>% select(link) prices
<- pmap(params, get_price) %>% bind_rows() df <- prices %>% left_join(df, by = "link") return(df) } adam_driver <- get_movies('adriv') %>% get_prices(.)
get_prices <- function(df) { params <- df %>% select(link) prices
<- pmap(params, get_price) %>% bind_rows() df <- prices %>% left_join(df, by = "link") return(df) } adam_driver <- get_movies('adriv') %>% get_prices(.)
get_prices <- function(df) { params <- df %>% select(link) prices
<- pmap(params, get_price) %>% bind_rows() df <- prices %>% left_join(df, by = "link") return(df) } adam_driver <- get_movies('adriv') %>% get_prices(.)
clean
forward_tag <- function(df) { tag <- df %>% mutate(date =
as.Date(date, "%b %d, %Y")) %>% drop_na() %>% mutate(days = date - Sys.Date()) %>% mutate(future = ifelse(days >= 0, 'Yes', 'No')) %>% mutate(days = abs(days)) %>% group_by(future) %>% arrange(desc(future), days) %>% mutate(idx = row_number()) %>% ungroup() %>% filter((idx == 1 & future == 'Yes') | (future == "No" & idx <= 5)) forward <- tag %>% filter(idx <= 4) %>% pull(price) %>% mean(.) %>% round(., 2) return(forward) }
adam_driver <- get_movies('adriv') %>% get_prices(.) get_price('adriv') %>% pull(price) [1] 59.91
forward_tag(adam_driver) [1] 173.89
adam_driver <- get_movies('adriv') %>% get_prices(.) get_price('adriv') %>% pull(price) [1] 59.91
forward_tag(adam_driver) [1] 173.89
real money**
None
None
run the ball throw the ball catch the ball hybrid
+kick the ball +entire defence catch the ball
data
None
None
fetch_espn <- function(position = 0, offset = 0) { url
<- str_c(sep = "", "http://games.espn.com/ffl/tools/projections?", "&slotCategoryId=", position, "&startIndex=", offset) page <- read_html(url) df <- page %>% html_node("#playertable_0") %>% html_table() return(df) } df <- fetch_espn(position = 16, offset = 0)
fetch_proj <- function(position = 0, offset = 0) { url
<- str_c(sep = "", "http://games.espn.com/ffl/tools/projections?", "&slotCategoryId=", position, "&startIndex=", offset) page <- read_html(url) df <- page %>% html_node("#playertable_0") %>% html_table() return(df) } df <- fetch_proj(position = 0, offset = 0)
df <- fetch_proj(position = 0, offset = 0) View(df)
params <- expand.grid( position = c(0, 2, 4, 6, 16,
17), offset = seq(0, 320, 40)) espn_raw <- pmap(params, fetch_espn) %>% bind_rows()
params <- expand.grid( position = c(0, 2, 4, 6, 16,
17), offset = seq(0, 320, 40)) espn_raw <- pmap(params, fetch_espn) %>% bind_rows()
params <- expand.grid( position = c(0, 2, 4, 6, 16,
17), offset = seq(0, 320, 40)) espn_raw <- pmap(params, fetch_espn) %>% bind_rows() fetch_espn(position = __, offset = __ )
None
params <- expand.grid( position = c(0, 2, 4, 6, 16,
17), offset = seq(0, 320, 40)) raw <- pmap(params, fetch_proj) %>% bind_rows()
View(raw)
clean
proj <- raw %>% mutate(PLAYERS = ifelse(is.na(PLAYERS), `DEFENSIVE PLAYERS`, PLAYERS))
%>% select(name = PLAYERS, points = TOTAL) %>% mutate(points = parse_number(points)) %>% drop_na() %>% separate(name, into = c("name", "metadata"), sep = ", ", fill = "right") %>% mutate(name = str_replace(name, "D\\/ST\\sD\\/ST|\\*$","")) %>% mutate(metadata = str_replace(metadata, "^[^\\s]*\\s","")) %>% mutate(metadata = str_replace(metadata, "IR|Q$|O|SSPD|D", "")) %>% mutate(metadata = str_trim(metadata)) %>% mutate(position = ifelse(is.na(metadata), "DEF", metadata)) %>% select(position, name, points)
View(proj)
bad <- proj %>% group_by(position) %>% mutate(avg = mean(points)) %>%
mutate(vorp = points - avg) %>% select(position, name, points, vorp) %>% arrange(desc(vorp))
x2 x1 x3 x1 K x1 DEF x1 10 people
params <- tribble( ~pos, ~slots, "QB", 1, "WR", 3, "RB",
2, "TE", 1, "K", 1, "DEF", 1 ) %>% mutate(slots = slots * 10) replacement_player <- function(pos, slots) { rp <- proj %>% filter(position == pos) %>% arrange(desc(points)) %>% filter(row_number() <= slots) %>% group_by(position) %>% summarise(rp = mean(points)) return(rp) } rp <- pmap(params, replacement_player) %>% bind_rows()
rp <- pmap(params, replacement_player) %>% bind_rows() proj_vorp <- proj %>%
left_join(rp, by = "position") %>% mutate(vorp = points - rp) %>% select(position, name, points, vorp) %>% arrange(desc(vorp))
None
* *probably not. and i’m not liable if you lose
money. disclaimer. disclaimer. disclaimer.
preferably with rvest & purrr
obey robots.txt
thanks!!
maxhumber