Shinichi Nakagawa
June 26, 2021
requests-html as scraping for baseball, not for humans / HTML Parsing for Baseball Player
Commemorative LT for kawasaki.rb #097, the lightning-talk meetup marking the start of its 9th year (online)
#Python #requests-html #Web #Baseball
Transcript
requests-html as scraping for baseball, not for humans
or: "Machine Learning Starting with Baseball, Chapter 2"
Shinichi Nakagawa (@shinyorke)
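Today's talk
• I built a crawler with requests-html to collect the data the ⚾AI needs for its predictions, and published it (rough as it is)
• A collection service that comes together easily with Cloud Functions + Pub/Sub + Scheduler
• I have used Scrapy and various other tools, but these days requests-html is probably my pick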
Who am I?
• Shinichi Nakagawa (@shinyorke)
• Senior Engineer at JX Press Corporation (JX通信社)
• Baseball data scientist
• The person who ends up calling #kwskrb "#kwskpy"
• Congratulations on 9 years of #kwskrb 🎉
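Today's topic is the technical side of this post
"The 24 players a baseball AI picks for TOKYO 2020 Samurai Japan - chosen by machine learning, with no favoritism"
https://shinyorke.hatenablog.com/entry/tokyo2020-samurai-japan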
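Selecting Samurai Japan with a baseball AI
1. Built a model that predicts player performance, using open Major League data
2. Fed the stats of "almost" every 2021 NPB player into the model from step 1 and predicted their 2021 results, entirely unofficially
3. Ranked players by predicted OPS (batters) and FIP (pitchers), then adjusted for position and for left/right throwing and batting to select 24 players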
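Step 3 ranks batters by OPS and pitchers by FIP. As a rough illustration of that ranking only, here is a minimal sketch using the standard public definitions of the two metrics. The player rows are made up for the example, the FIP constant of 3.10 is an assumed placeholder (the real constant varies by league and season), and the position and handedness adjustment is not shown because the slides do not describe how it was done.

```python
from dataclasses import dataclass


@dataclass
class Batter:
    name: str
    ab: int   # at-bats
    h: int    # hits
    bb: int   # walks
    hbp: int  # hit by pitch
    sf: int   # sacrifice flies
    tb: int   # total bases

    @property
    def ops(self) -> float:
        # OPS = on-base percentage + slugging percentage
        obp = (self.h + self.bb + self.hbp) / (self.ab + self.bb + self.hbp + self.sf)
        slg = self.tb / self.ab
        return obp + slg


@dataclass
class Pitcher:
    name: str
    ip: float  # innings pitched
    hr: int    # home runs allowed
    bb: int    # walks allowed
    hbp: int   # hit batters
    so: int    # strikeouts

    def fip(self, constant: float = 3.10) -> float:
        # FIP = (13*HR + 3*(BB + HBP) - 2*SO) / IP + league constant (assumed here)
        return (13 * self.hr + 3 * (self.bb + self.hbp) - 2 * self.so) / self.ip + constant


# Hypothetical predicted stat lines, not real players.
batters = [
    Batter("Batter B", ab=500, h=125, bb=30, hbp=0, sf=0, tb=180),
    Batter("Batter A", ab=400, h=120, bb=40, hbp=5, sf=5, tb=200),
]
pitchers = [
    Pitcher("Pitcher B", ip=100.0, hr=20, bb=50, hbp=0, so=60),
    Pitcher("Pitcher A", ip=150.0, hr=15, bb=40, hbp=5, so=150),
]

# Higher OPS is better for batters; lower FIP is better for pitchers.
ranked_batters = sorted(batters, key=lambda b: b.ops, reverse=True)
ranked_pitchers = sorted(pitchers, key=lambda p: p.fip())

for b in ranked_batters:
    print(f"{b.name}: OPS {b.ops:.3f}")
for p in ranked_pitchers:
    print(f"{p.name}: FIP {p.fip():.2f}")
```

Sorting the two groups separately keeps the "best first" direction explicit for each metric, since OPS and FIP point in opposite directions.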
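Requirements for the prediction data (= gathering features)
• Basic stats for pitchers and batters (batting figures, ERA, home runs allowed, etc.)
• Positions played. Ideally the number of appearances as a starter.
• Data that covers the above, with no problems in either data structure or license, turned out to exist in the United States: Baseball Reference.
• https://www.baseball-reference.com/register/league.cgi?id=16632292
  https://www.baseball-reference.com/register/league.cgi?id=0549ac26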
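Building a crawler the hard way with requests-html
• (Separately from the baseball AI work) I muttered "what do people use for crawlers these days? 🤔" in my times channel on the company Slack, and requests-html was recommended to me
• I tried it and it really did feel good -> before I knew it, requests-html had become my main crawler tool
• The baseball data collection mentioned earlier was also built with requests-html
  https://github.com/Shinichi-Nakagawa/br-scraping-npb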
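What was good about requests-html
• Simply easy to use (roughly speaking)
• The baseball pages were written with heavy JavaScript, but a single render() call gave me the content as HTML
• Whether it is really "for humans" is debatable, but as a means to an end I think it is a good one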
JS -> HTML took nothing more than this

    # `session` is presumably a requests_html.HTMLSession; `teams`, `players`,
    # `write_csv`, and `fieldnames` are defined in the linked players.py.
    # For each team, save position players and pitchers to separate CSV files.
    for team in teams:
        response = session.get(team['url'])
        response.html.render(timeout=60)  # the JavaScript is rendered into HTML here
        tbody = response.html.find('#team_batting > tbody', first=True)
        batters = players(tbody)
        write_csv(f'dataset/player_batter_{team["team"].replace(" ", "")}.csv', batters, fieldnames)
        tbody = response.html.find('#team_pitching > tbody', first=True)
        pitchers = players(tbody)
        write_csv(f'dataset/player_pitcher_{team["team"].replace(" ", "")}.csv', pitchers, fieldnames)

https://github.com/Shinichi-Nakagawa/br-scraping-npb/blob/main/players.py#L28
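The slide does not show what `players(tbody)` and `write_csv(...)` do internally. The sketch below is one plausible shape for that step, not the repository's actual code: it turns table rows into dicts and writes them out as CSV. To keep it runnable without a network or a headless browser, it swaps requests-html for the standard library's `html.parser`. It also assumes each cell carries a `data-stat` attribute naming its column. `StatTableParser`, `SAMPLE_TBODY`, and this version of `write_csv` are hypothetical names, and the sample rows are made up.

```python
import csv
import io
from html.parser import HTMLParser


class StatTableParser(HTMLParser):
    """Collect one dict per <tr>, keyed by each cell's data-stat attribute."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # dict for the row currently being read
        self._stat = None  # column name of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = {}
        elif tag in ("td", "th") and self._row is not None:
            self._stat = dict(attrs).get("data-stat")
            if self._stat:
                self._row[self._stat] = ""

    def handle_data(self, data):
        # Text nested inside links within a cell is accumulated as well.
        if self._row is not None and self._stat:
            self._row[self._stat] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._stat = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


def write_csv(fileobj, rows, fieldnames):
    """Write the row dicts as CSV, ignoring columns not listed in fieldnames."""
    writer = csv.DictWriter(fileobj, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)


# Made-up stand-in for a rendered '#team_batting > tbody' fragment.
SAMPLE_TBODY = """
<tbody>
  <tr><th data-stat="player"><a href="#">Player A</a></th>
      <td data-stat="AB">400</td><td data-stat="H">120</td></tr>
  <tr><th data-stat="player"><a href="#">Player B</a></th>
      <td data-stat="AB">500</td><td data-stat="H">125</td></tr>
</tbody>
"""

parser = StatTableParser()
parser.feed(SAMPLE_TBODY)
rows = parser.rows

buffer = io.StringIO()
write_csv(buffer, rows, fieldnames=["player", "AB", "H"])
print(buffer.getvalue())
```

With requests-html itself, the same idea would iterate over the rows found inside the selected `tbody` element and read each cell's text and attributes. The stdlib parser is used here only so the example runs offline.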
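Running it as a periodically scheduled crawler
• The AI Samurai Japan work was a one-off project, so that one is fine as it is
• Personally, I also have data that I collect on a regular basis: crawling sites and having the results posted to Slack or saved to BigQuery
• I run the requests-html code on GCF + Pub/Sub + Scheduler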
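The slides do not show the deployed function. As a minimal sketch of how the GCF + Pub/Sub + Scheduler combination typically fits together, here is a Pub/Sub-triggered background function entry point in the `(event, context)` style, where Cloud Scheduler would publish a small JSON message on a cron schedule. The message shape (`{"target": ...}`), the `crawl` stub, and the entry point name `main` are assumptions for illustration. The real crawl would call requests-html where the stub is.

```python
import base64
import json


def crawl(target: str) -> dict:
    """Hypothetical stand-in for the real requests-html crawl of `target`."""
    return {"target": target, "status": "crawled"}


def main(event: dict, context=None) -> dict:
    """Entry point for a Pub/Sub-triggered background function.

    Pub/Sub delivers the message body base64-encoded under event["data"].
    The platform ignores the return value; it is returned here only to make
    the sketch easy to exercise locally.
    """
    payload = {}
    if event.get("data"):
        payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    target = payload.get("target", "default")  # assumed message field
    return crawl(target)


# Local check with a hand-built event, mimicking what Scheduler would publish.
sample_event = {
    "data": base64.b64encode(json.dumps({"target": "npb"}).encode("utf-8")).decode("ascii")
}
result = main(sample_event)
print(result)
```

Keeping the crawl logic in a plain function, separate from the event decoding, lets the same code run from a local script or a test without any cloud setup.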
This is actually running in production
"Key points for using GCP in small product development - how I launched a personal product in three days"
https://shinyorke.hatenablog.com/entry/gcp-slack-taida
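Closing
• For building a Python crawler these days, I recommend requests-html
• It cannot do everything the way Scrapy can, but the initial cost of adopting it is low, which makes it easy to recommend.
• It seems well suited to running casually on Google Cloud Functions or (though I have not used it) AWS Lambda. Concrete examples will appear on the blog eventually.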
Game set