Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Speaker Deck
PRO
Sign in
Sign up for free
人間じゃなくて野球のためのスクレイピングとしてのrequests-html / HTML Parsing for Baseball Player
Shinichi Nakagawa
June 26, 2021
Programming
1
180
人間じゃなくて野球のためのスクレイピングとしてのrequests-html / HTML Parsing for Baseball Player
kawasaki.rb #097 9年目突入LT大会 (オンライン) 記念LT
#Python #requests-html #Web #Baseball
Shinichi Nakagawa
June 26, 2021
Tweet
Share
More Decks by Shinichi Nakagawa
See All by Shinichi Nakagawa
BigQueryとPythonではじめるプロ野球選手の成績予測(もしくは成績占い) / Baseball Player Performance Prediction using BigQuery and Python
shinyorke
0
87
突然ですが「生涯成績」占ってもいいですか? - プロ野球選手成績予測2022
shinyorke
0
58
実践Streamlit & Flask - AIプロジェクトをいい感じにする技術 / Service development with Streamlit and Flask
shinyorke
3
8.3k
Flask + Google App Engine(GAE)でWeb APIをデプロイするまで - Github Actionsを添えて / Flask + App Engine + Github Actions
shinyorke
0
610
実践Streamlit & Flask - AIプロジェクトをいい感じにする技術(予告編) / Web application development starting with Streamlit
shinyorke
0
210
StreamlitとFlaskではじめる, 爆速プロトタイピングとTV砲対策 #jx_tech_talk / AI web application development starting with Streamlit and Flask
shinyorke
0
3.1k
坂本勇人さん改め山田哲人さんの成績予測をやってみた / Baseball Play Study 2020 Winter
shinyorke
0
2.3k
坂本勇人選手はいつ通算3,000安打を達成するか? AIに聞いてみました / Hayato Sakamoto Performance Prediction Using Feature Engineering with Machine Learning and Python
shinyorke
1
600
特徴量エンジニアリングと野球選手の成績予測 - 野球ではじめる機械学習 / Baseball Player Performance Prediction Using Feature Engineering with Machine Learning and Python
shinyorke
6
7.8k
Other Decks in Programming
See All in Programming
A technique to implement DSL in Ruby
okuramasafumi
0
830
LOWYAの信頼性向上とNew Relic
kazumax55
4
370
Learning DDD輪読会#4 / Learning DDD Book Club #4
suzushin54
1
160
あなたの会社の古いシステム、なんとかしませんか?~システム刷新から考えるDX化への道筋とバリエーション~/webinar20220420-grapecity
grapecity_dev
0
140
SRE bridge the gap: Feature development to Core API / 機能開発チームとコアAPIチームの架け橋としてのSRE
kenzan100
1
530
CIでAndroidUIテストの様子を録画してみた
mkeeda
0
190
WindowsコンテナDojo:第2回 Windowsコンテナアプリのビルド、公開、デプロイ
oniak3ibm
PRO
0
160
マイクロインタラクション入門〜ディテイルにこだわるエンジニアリング〜
swimmyxox
0
120
未経験QAの私が、よきQA(Question Asker) になっていく物語
atamaplus
0
380
スモールチームがAmazon Cognitoでコスパよく作るサービス間連携認証
tacke_jp
2
920
【Qiita Night】新卒エンジニアによるSwift6与太予想
eiji127
0
190
The Success of Rails: Ensuring Growth for the Next 100 Years
eileencodes
4
650
Featured
See All Featured
What's new in Ruby 2.0
geeforr
336
30k
Stop Working from a Prison Cell
hatefulcrawdad
261
17k
Support Driven Design
roundedbygravity
86
8.5k
Statistics for Hackers
jakevdp
781
210k
Easily Structure & Communicate Ideas using Wireframe
afnizarnur
181
15k
Writing Fast Ruby
sferik
612
57k
StorybookのUI Testing Handbookを読んだ
zakiyama
4
2k
Teambox: Starting and Learning
jrom
121
7.6k
RailsConf & Balkan Ruby 2019: The Past, Present, and Future of Rails at GitHub
eileencodes
119
28k
Learning to Love Humans: Emotional Interface Design
aarron
261
37k
jQuery: Nuts, Bolts and Bling
dougneiner
56
6.4k
Faster Mobile Websites
deanohume
294
28k
Transcript
ਓؒ͡Όͳͯ͘ ٿͷͨΊͷ εΫϨΠϐϯάͱͯ͠ͷ requests-html ͘͠ʮٿͰ͡ΊΔػցֶशୈೋষʯ Shinichi Nakagawa(@shinyorke)
ࠓͷ͓ͳ͠⽁ • ⚾AIͷ༧ଌσʔλΛಘΔͨΊͷΫϩʔϥʔΛ requests-htmlͰ։ൃ&ʢࡶͰ͕͢ʣެ։ͨ͠ • Cloud Functions + Pub/Sub
+ SchedulerͰ ͬ͘͞ΓͰ͖ͪΌ͏ऩूαʔϏε • Scrapyͱ͔৭ʑ͚ͬͨͲࠓͩͱrequests-html͔ͳ͋
Who am I?ʢ͓લ୭Αʣ • Shinichi Nakagawa(@shinyorke) • JX௨৴ࣾγχΞΤϯδχΞ • ٿσʔλαΠΤϯςΟετ
• #kwskrb Λ #kwskpy ͱ͔ݴͬͯ͠·͏ਓ • #kwskrb 9प͓ΊͰͱ͏͍͟͝·͢🎉
͜Εͷٕज़తͳωλ͕ࠓͷ ٿAI͕બͿTOKYO 2020ࣆJAPAN24໊ - ػցֶशͰແ͘બΜͰΈͨ. https://shinyorke.hatenablog.com/entry/tokyo2020-samurai-japan
ٿAIʹΑΔࣆδϟύϯબग़ 1.ϝδϟʔϦʔάͷΦʔϓϯσʔλΛͬͯ ٿબखͷ༧ଌϞσϧΛ։ൃ 2.1.ͷ༧ଌϞσϧʹ2021ϓϩٿʮ΄΅ʯશબखͷΛ ৯Θͤͯ2021ͷΛউखʹ༧ଌ 3.༧ଌͷOPSʢଧऀʣ, FIPʢखʣͰྑ͔ͬͨॱ
&ϙδγϣϯɾଧͷࠨӈΛௐͯ͠24໊Λબग़
None
༧ଌσʔλͷ݅ʢ=ಛྔूΊʣ • खɾଧऀͷجຊతͳʢଧ, ଧ, ޚ, ඃຊྥଧetc…ʣ • ग़ϙδγϣϯ. Ͱ͖Εελϝϯͱͯ͠ͷճ͕·͍͠. •
্هΛσʔλߏɾϥΠηϯεڞʹͳ͘ΕΔσʔλ͕ ΞϝϦΧʹ͋ͬͨ, Baseball Referenceͬͯͭ. • https://www.baseball-reference.com/register/league.cgi?id=16632292 https://www.baseball-reference.com/register/league.cgi?id=0549ac26
requests-htmlͰటष͘ΫϩʔϥʔΛ࡞Δ • ʢٿAIͷ݅ͱผͷͰʣࠓͲ͖ͷΫϩʔϥʔͬͯ🤔 ͱ, ࣗࣾSlackͷtimesνϟϯωϧͰᄁ͍ͨΒrequests-htmlΛ קΊΒΕͨ • ৮ͬͨΒ͔֬ʹ͍͍ײͩͬͨ͡
-> ؾ͕͚ͭΫϩʔϥʔ requests-htmlϝΠϯʹ • ઌड़ͷٿσʔλऩूrequests-htmlͰ࡞ͬͨ https://github.com/Shinichi-Nakagawa/br-scraping-npb
requests-htmlͷྑ͔ͬͨͱ͜Ζ • γϯϓϧʹ͍͍͢ʢࡶʣ • ٿͷϖʔδ͕JSΰϦΰϦͷهड़͕ͩͬͨ render()ҰൃͰHTMLͱͯ͠औΕͨ • ਓؒΒ͍͔͠Ͳ͏͔ո͍͚͠Ͳ
खஈͱͯ͠ྑ͍ͷͰͳ͍Ͱ͠ΐ͏͔
JS->HTML͕͜ΕͰࡁΜͩ # νʔϜ͝ͱ, खͱख, ͚ͯอଘ for team in teams :
response = session.get(team['url'] ) response.html.render(timeout=60) # ίίͰJS͕HTMLʹϨϯμϦϯά͞ΕΔ tbody = response.html.find('#team_batting > tbody', first=True ) batters = players(tbody ) write_csv(f'dataset/player_batter_{team["team"].replace(" ", "")}.csv', batters, fieldnames ) tbody = response.html.find('#team_pitching > tbody', first=True ) pitchers = players(tbody ) write_csv(f'dataset/player_pitcher_{team["team"].replace(" ", "")}.csv', pitchers, fieldnames ) https://github.com/Shinichi-Nakagawa/br-scraping-npb/blob/main/players.py#L28
ఆظతʹಈ͔͢Ϋϩʔϥʔͱͯ͠ӡ༻ • AIࣆJAPANҰճϙοΩϦͷϓϩδΣΫτͳͷͰ͍͍ͱͯ͠ • ݸਓతʹຖूΊͯΔσʔλ͕͋ͬͨΓ͢Δ αΠτऩूͯ͠SlackʹͭͿ͔ͤͨΓBigQueryʹอଘͨ͠Γ • requests-htmlΛͬͨίʔυΛ
GCF + Pub/Sub + SchedulerͰӡ༻
࣮ࡍӡ༻͍ͯ͠·͢ খ͍͞ϓϩμΫτ։ൃʹ͓͚ΔGCPར༻ͷצͲ͜Ζ - ݸਓతͳϓϩμΫτΛࡾͰϩʔϯνͨ͠ https://shinyorke.hatenablog.com/entry/gcp-slack-taida
݁ͼ • ࠓͲ͖ͷPythonͷΫϩʔϥʔ։ൃ, requests-html͕ͤ • ScrapyΈ͍ͨʹԿͰग़དྷΔΘ͚͡Όͳ͍͚Ͳ ॳखͷಋೖίετͱ͔͍͠Φεεϝ. • Google
Cloud Functionsʢͬͯͳ͍͚ͲʣAWS LambdaͰ ࡶʹӡ༻͢Δͷʹ߹ͬͯΔͱࢥΘΕ. ۩ମྫ͍ͣΕϒϩάʹ.
ήʔϜηοτ⽁