Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
人間じゃなくて野球のためのスクレイピングとしてのrequests-html / HTML Pa...
Search
Shinichi Nakagawa
PRO
June 26, 2021
Programming
350
1
Share
人間じゃなくて野球のためのスクレイピングとしてのrequests-html / HTML Parsing for Baseball Player
kawasaki.rb #097 9年目突入LT大会 (オンライン) 記念LT
#Python #requests-html #Web #Baseball
Shinichi Nakagawa
PRO
June 26, 2021
More Decks by Shinichi Nakagawa
See All by Shinichi Nakagawa
野球解説AI Agentを開発してみた - 2026/02/27 LayerX社内LT会資料
shinyorke
PRO
0
460
WBCの解説は生成AIにやらせよう - 生成AIで野球解説者AI Agentを実現する / Baseball Commentator AI Agent for Gemini
shinyorke
PRO
1
440
自らを強いエンジニアにするための3つの習慣 2025/ Fitter happier more productive
shinyorke
PRO
0
290
生成AI時代におけるSREの進化とキャリア戦略 / Building an Embedded SRE team and my career
shinyorke
PRO
0
160
生成AIを活用した野球データ分析 - メジャーリーグ編 / Baseball Analytics for Gen AI
shinyorke
PRO
1
6.3k
ゼロから始めるSREの事業貢献 - 生成AI時代のSRE成長戦略と実践 / Starting SRE from Day One
shinyorke
PRO
3
7.8k
AI・LLM事業部のSREとタスクの自動運転
shinyorke
PRO
0
550
実践Dash - 手を抜きながら本気で作るデータApplicationの基本と応用 / Dash for Python and Baseball
shinyorke
PRO
2
4.5k
Terraform, GitHub Actions, Cloud Buildでデータ基盤をProvisioningする / Data Platform provisioning for Google Cloud and Terraform
shinyorke
PRO
2
3.7k
Other Decks in Programming
See All in Programming
ECR拡張スキャンでSBOMを収集して サプライチェーン攻撃の影響調査を 爆速で終わらせてみた
akihisaikeda
2
210
Inspired By RubyKaigi (EN)
atzzcokek
0
380
TSKaigi2026-静的解析への投資がAI時代のコード品質を支える ── カスタムESLintルールの設計と運用
hayatokudou
6
1.3k
運用エージェントは "作る" から "育てる" へ - 記憶と自己進化の3層設計パターン / self-evolving-agents-three-layer-agent-design
gawa
12
3.2k
The Arts and Crafts of Work in the AI Era — Toward Mastery in Software Development
kuranuki
1
660
Modding RubyKaigi for Myself
yui_knk
0
750
CSC307 Lecture 17
javiergs
PRO
0
260
Spec-Driven Development with AI-Agents: From High-Level Requirements to Working Software
antonarhipov
2
380
分析エージェント精度向上における データアナリストの役割
oura_shoya
0
130
AIエージェントの隔離技術の徹底比較
kawayu
0
440
気づいたらRubyで100作品 ー クリエイティブコーディングが生活の一部になるまで / 100 Ruby Sketches Later: How Creative Coding Became Part of My Life
chobishiba
3
460
Augmenting AI with the Power of Jakarta EE
ivargrimstad
0
320
Featured
See All Featured
A Tale of Four Properties
chriscoyier
163
24k
YesSQL, Process and Tooling at Scale
rocio
174
15k
Gemini Prompt Engineering: Practical Techniques for Tangible AI Outcomes
mfonobong
2
410
SEOcharity - Dark patterns in SEO and UX: How to avoid them and build a more ethical web
sarafernandez
0
190
How People are Using Generative and Agentic AI to Supercharge Their Products, Projects, Services and Value Streams Today
helenjbeal
1
190
16th Malabo Montpellier Forum Presentation
akademiya2063
PRO
0
130
We Analyzed 250 Million AI Search Results: Here's What I Found
joshbly
1
1.3k
HTML-Aware ERB: The Path to Reactive Rendering @ RubyCon 2026, Rimini, Italy
marcoroth
1
120
How to Create Impact in a Changing Tech Landscape [PerfNow 2023]
tammyeverts
55
3.4k
Easily Structure & Communicate Ideas using Wireframe
afnizarnur
194
17k
Site-Speed That Sticks
csswizardry
13
1.2k
Done Done
chrislema
186
16k
Transcript
ਓؒ͡Όͳͯ͘ ٿͷͨΊͷ εΫϨΠϐϯάͱͯ͠ͷ requests-html ͘͠ʮٿͰ͡ΊΔػցֶशୈೋষʯ Shinichi Nakagawa(@shinyorke)
ࠓͷ͓ͳ͠⽁ • ⚾AIͷ༧ଌσʔλΛಘΔͨΊͷΫϩʔϥʔΛ requests-htmlͰ։ൃ&ʢࡶͰ͕͢ʣެ։ͨ͠ • Cloud Functions + Pub/Sub
+ SchedulerͰ ͬ͘͞ΓͰ͖ͪΌ͏ऩूαʔϏε • Scrapyͱ͔৭ʑ͚ͬͨͲࠓͩͱrequests-html͔ͳ͋
Who am I?ʢ͓લ୭Αʣ • Shinichi Nakagawa(@shinyorke) • JX௨৴ࣾγχΞΤϯδχΞ • ٿσʔλαΠΤϯςΟετ
• #kwskrb Λ #kwskpy ͱ͔ݴͬͯ͠·͏ਓ • #kwskrb 9प͓ΊͰͱ͏͍͟͝·͢🎉
͜Εͷٕज़తͳωλ͕ࠓͷ ٿAI͕બͿTOKYO 2020ࣆJAPAN24໊ - ػցֶशͰແ͘બΜͰΈͨ. https://shinyorke.hatenablog.com/entry/tokyo2020-samurai-japan
ٿAIʹΑΔࣆδϟύϯબग़ 1.ϝδϟʔϦʔάͷΦʔϓϯσʔλΛͬͯ ٿબखͷ༧ଌϞσϧΛ։ൃ 2.1.ͷ༧ଌϞσϧʹ2021ϓϩٿʮ΄΅ʯશબखͷΛ ৯Θͤͯ2021ͷΛউखʹ༧ଌ 3.༧ଌͷOPSʢଧऀʣ, FIPʢखʣͰྑ͔ͬͨॱ
&ϙδγϣϯɾଧͷࠨӈΛௐͯ͠24໊Λબग़
None
༧ଌσʔλͷ݅ʢ=ಛྔूΊʣ • खɾଧऀͷجຊతͳʢଧ, ଧ, ޚ, ඃຊྥଧetc…ʣ • ग़ϙδγϣϯ. Ͱ͖Εελϝϯͱͯ͠ͷճ͕·͍͠. •
্هΛσʔλߏɾϥΠηϯεڞʹͳ͘ΕΔσʔλ͕ ΞϝϦΧʹ͋ͬͨ, Baseball Referenceͬͯͭ. • https://www.baseball-reference.com/register/league.cgi?id=16632292 https://www.baseball-reference.com/register/league.cgi?id=0549ac26
requests-htmlͰటष͘ΫϩʔϥʔΛ࡞Δ • ʢٿAIͷ݅ͱผͷͰʣࠓͲ͖ͷΫϩʔϥʔͬͯ🤔 ͱ, ࣗࣾSlackͷtimesνϟϯωϧͰᄁ͍ͨΒrequests-htmlΛ קΊΒΕͨ • ৮ͬͨΒ͔֬ʹ͍͍ײͩͬͨ͡
-> ؾ͕͚ͭΫϩʔϥʔ requests-htmlϝΠϯʹ • ઌड़ͷٿσʔλऩूrequests-htmlͰ࡞ͬͨ https://github.com/Shinichi-Nakagawa/br-scraping-npb
requests-htmlͷྑ͔ͬͨͱ͜Ζ • γϯϓϧʹ͍͍͢ʢࡶʣ • ٿͷϖʔδ͕JSΰϦΰϦͷهड़͕ͩͬͨ render()ҰൃͰHTMLͱͯ͠औΕͨ • ਓؒΒ͍͔͠Ͳ͏͔ո͍͚͠Ͳ
खஈͱͯ͠ྑ͍ͷͰͳ͍Ͱ͠ΐ͏͔
JS->HTML͕͜ΕͰࡁΜͩ # νʔϜ͝ͱ, खͱख, ͚ͯอଘ for team in teams :
response = session.get(team['url'] ) response.html.render(timeout=60) # ίίͰJS͕HTMLʹϨϯμϦϯά͞ΕΔ tbody = response.html.find('#team_batting > tbody', first=True ) batters = players(tbody ) write_csv(f'dataset/player_batter_{team["team"].replace(" ", "")}.csv', batters, fieldnames ) tbody = response.html.find('#team_pitching > tbody', first=True ) pitchers = players(tbody ) write_csv(f'dataset/player_pitcher_{team["team"].replace(" ", "")}.csv', pitchers, fieldnames ) https://github.com/Shinichi-Nakagawa/br-scraping-npb/blob/main/players.py#L28
ఆظతʹಈ͔͢Ϋϩʔϥʔͱͯ͠ӡ༻ • AIࣆJAPANҰճϙοΩϦͷϓϩδΣΫτͳͷͰ͍͍ͱͯ͠ • ݸਓతʹຖूΊͯΔσʔλ͕͋ͬͨΓ͢Δ αΠτऩूͯ͠SlackʹͭͿ͔ͤͨΓBigQueryʹอଘͨ͠Γ • requests-htmlΛͬͨίʔυΛ
GCF + Pub/Sub + SchedulerͰӡ༻
࣮ࡍӡ༻͍ͯ͠·͢ খ͍͞ϓϩμΫτ։ൃʹ͓͚ΔGCPར༻ͷצͲ͜Ζ - ݸਓతͳϓϩμΫτΛࡾͰϩʔϯνͨ͠ https://shinyorke.hatenablog.com/entry/gcp-slack-taida
݁ͼ • ࠓͲ͖ͷPythonͷΫϩʔϥʔ։ൃ, requests-html͕ͤ • ScrapyΈ͍ͨʹԿͰग़དྷΔΘ͚͡Όͳ͍͚Ͳ ॳखͷಋೖίετͱ͔͍͠Φεεϝ. • Google
Cloud Functionsʢͬͯͳ͍͚ͲʣAWS LambdaͰ ࡶʹӡ༻͢Δͷʹ߹ͬͯΔͱࢥΘΕ. ۩ମྫ͍ͣΕϒϩάʹ.
ήʔϜηοτ⽁