Upgrade to Pro — share decks privately, control downloads, hide ads and more …

人間じゃなくて野球のためのスクレイピングとしてのrequests-html / HTML Parsing for Baseball Player

人間じゃなくて野球のためのスクレイピングとしてのrequests-html / HTML Parsing for Baseball Player

kawasaki.rb #097 9年目突入LT大会 (オンライン) 記念LT

#Python #requests-html #Web #Baseball

Shinichi Nakagawa

June 26, 2021
Tweet

More Decks by Shinichi Nakagawa

Other Decks in Programming

Transcript

  1. ࠓ೔ͷ͓͸ͳ͠⽁ • ⚾AIͷ༧ଌσʔλΛಘΔͨΊͷΫϩʔϥʔΛ 
 requests-htmlͰ։ൃ&ʢࡶͰ͕͢ʣެ։ͨ͠ • Cloud Functions + Pub/Sub

    + SchedulerͰ 
 ͬ͘͞ΓͰ͖ͪΌ͏ऩूαʔϏε • Scrapyͱ͔৭ʑ΍͚ͬͨͲࠓͩͱrequests-html͔ͳ͋
  2. Who am I?ʢ͓લ୭Αʣ • Shinichi Nakagawa(@shinyorke) • JX௨৴ࣾγχΞΤϯδχΞ • ໺ٿσʔλαΠΤϯςΟετ

    • #kwskrb Λ #kwskpy ͱ͔ݴͬͯ͠·͏ਓ • #kwskrb 9प೥͓ΊͰͱ͏͍͟͝·͢🎉
  3. ༧ଌσʔλͷ৚݅ʢ=ಛ௃ྔूΊʣ • ౤खɾଧऀͷجຊతͳ੒੷ʢଧ཰, ଧ఺, ๷ޚ཰, ඃຊྥଧetc…ʣ • ग़৔ϙδγϣϯ. Ͱ͖Ε͹ελϝϯͱͯ͠ͷճ਺͕๬·͍͠. •

    ্هΛσʔλߏ଄ɾϥΠηϯεڞʹ໰୊ͳ͘΍ΕΔσʔλ͕ 
 ΞϝϦΧʹ͋ͬͨ, Baseball Referenceͬͯ΍ͭ. • https://www.baseball-reference.com/register/league.cgi?id=16632292 
 https://www.baseball-reference.com/register/league.cgi?id=0549ac26
  4. requests-htmlͰటष͘ΫϩʔϥʔΛ࡞Δ • ʢ໺ٿAIͷ݅ͱ͸ผͷ࿩୊ͰʣࠓͲ͖ͷΫϩʔϥʔͬͯ🤔 
 ͱ, ࣗࣾSlackͷtimesνϟϯωϧͰᄁ͍ͨΒrequests-htmlΛ 
 קΊΒΕͨ • ৮ͬͨΒ͔֬ʹ͍͍ײͩͬͨ͡

    -> ؾ͕͚ͭ͹Ϋϩʔϥʔ͸ 
 requests-htmlϝΠϯʹ • ઌड़ͷ໺ٿσʔλऩू΋requests-htmlͰ࡞ͬͨ 
 https://github.com/Shinichi-Nakagawa/br-scraping-npb
  5. JS->HTML͕͜ΕͰࡁΜͩ # νʔϜ͝ͱ, ౤खͱ໺ख, ෼͚ͯอଘ for team in teams :

    response = session.get(team['url'] ) response.html.render(timeout=60) # ίίͰJS͕HTMLʹϨϯμϦϯά͞ΕΔ tbody = response.html.find('#team_batting > tbody', first=True ) batters = players(tbody ) write_csv(f'dataset/player_batter_{team["team"].replace(" ", "")}.csv', batters, fieldnames ) tbody = response.html.find('#team_pitching > tbody', first=True ) pitchers = players(tbody ) write_csv(f'dataset/player_pitcher_{team["team"].replace(" ", "")}.csv', pitchers, fieldnames ) https://github.com/Shinichi-Nakagawa/br-scraping-npb/blob/main/players.py#L28
  6. ݁ͼ • ࠓͲ͖ͷPythonͷΫϩʔϥʔ։ൃ, requests-html͕޾ͤ • ScrapyΈ͍ͨʹԿͰ΋ग़དྷΔΘ͚͡Όͳ͍͚Ͳ 
 ॳखͷಋೖίετͱ͔௿͍͠Φεεϝ. • Google

    Cloud Functions΍ʢ΍ͬͯͳ͍͚ͲʣAWS LambdaͰ 
 ࡶʹӡ༻͢Δͷʹ߹ͬͯΔͱࢥΘΕ. ۩ମྫ͸͍ͣΕϒϩάʹ.