requests-html as Scraping for Baseball, Not for Humans / HTML Parsing for Baseball Player

Lightning talk for kawasaki.rb #097, the LT meetup marking the start of its ninth year (online)

#Python #requests-html #Web #Baseball

Shinichi Nakagawa

June 26, 2021

Transcript

  1. requests-html as scraping for baseball, not for humans; or, "Machine Learning Starting with Baseball, Chapter 2". Shinichi Nakagawa (@shinyorke)

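  2. Today's talk
    • I built a crawler with requests-html to collect the data behind my baseball AI's predictions, and published it (rough as it is).
    • A collection service that comes together quickly with Cloud Functions + Pub/Sub + Scheduler.
    • I have tried Scrapy and various other tools, but these days I would probably reach for requests-html.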
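  3. Who am I? (Who are you, anyway?)
    • Shinichi Nakagawa (@shinyorke)
    • Senior engineer at JX Press Corporation (JX通信社)
    • Baseball data scientist
    • The person who ends up calling #kwskrb "#kwskpy"
    • Congratulations on the 9th anniversary of #kwskrb 🎉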
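  4. Today's talk covers the technical side of this post: "A baseball AI picks the 24 members of Samurai Japan for TOKYO 2020: chosen by machine learning, with no favoritism." https://shinyorke.hatenablog.com/entry/tokyo2020-samurai-japan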

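  5. Selecting Samurai Japan with a baseball AI
    1. Built a performance prediction model for baseball players from Major League open data.
    2. Fed the 2021 stats of "almost" every Japanese professional baseball (NPB) player into the model from step 1 and predicted their 2021 results on my own initiative.
    3. Ranked players by predicted OPS (batters) and FIP (pitchers), then adjusted for position and for left/right-handed batting and pitching to select 24 players.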
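The two ranking metrics in step 3 follow their standard sabermetric definitions. The sketch below shows one way to compute them. The `BattingLine` class, the field names, and the default FIP constant of 3.10 are illustrative assumptions and are not taken from the talk. The real constant varies by league and season.

```python
from dataclasses import dataclass


@dataclass
class BattingLine:
    """Counting stats for one batter (hypothetical container for this sketch)."""
    at_bats: int
    hits: int
    doubles: int
    triples: int
    home_runs: int
    walks: int
    hit_by_pitch: int
    sacrifice_flies: int


def ops(line: BattingLine) -> float:
    """OPS = on-base percentage + slugging percentage. Assumes at_bats > 0."""
    singles = line.hits - line.doubles - line.triples - line.home_runs
    total_bases = singles + 2 * line.doubles + 3 * line.triples + 4 * line.home_runs
    plate_chances = line.at_bats + line.walks + line.hit_by_pitch + line.sacrifice_flies
    obp = (line.hits + line.walks + line.hit_by_pitch) / plate_chances
    slg = total_bases / line.at_bats
    return obp + slg


def fip(home_runs: int, walks: int, hit_by_pitch: int, strikeouts: int,
        innings_pitched: float, constant: float = 3.10) -> float:
    """FIP = (13*HR + 3*(BB + HBP) - 2*K) / IP + constant.

    innings_pitched must be a true decimal (180 1/3 innings is 180.333...,
    not the box-score notation 180.1). The constant is an assumed placeholder.
    """
    return (13 * home_runs + 3 * (walks + hit_by_pitch) - 2 * strikeouts) / innings_pitched + constant
```

Higher OPS is better for batters and lower FIP is better for pitchers, so the two rankings sort in opposite directions.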
  6. None
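  7. Requirements for the prediction data (= gathering features)
    • Basic stats for pitchers and batters (batting average, RBI, ERA, home runs allowed, etc.).
    • Position played, ideally as the number of starts at each position.
    • Data that covers the above with no problems in either data structure or licensing turned out to exist in the United States: Baseball Reference.
    • https://www.baseball-reference.com/register/league.cgi?id=16632292
      https://www.baseball-reference.com/register/league.cgi?id=0549ac26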
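  8. Building a crawler the down-and-dirty way with requests-html
    • (On a topic unrelated to the baseball AI) I wondered aloud "what do people use for crawlers these days? 🤔" in my times channel on the company Slack, and was recommended requests-html.
    • I tried it and it did feel good. -> Before I knew it, requests-html had become my main crawler tool.
    • The baseball data collection mentioned earlier was also built with requests-html.
      https://github.com/Shinichi-Nakagawa/br-scraping-npb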
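  9. What was good about requests-html
    • Simple and easy to use (roughly speaking).
    • The baseball pages were heavily JavaScript-driven, but a single render() call returned them as HTML.
    • Whether it is really "for humans" is debatable, but it works well as a means to an end.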
  10. JS -> HTML took only this much
    # Excerpt: session, teams, players, write_csv and fieldnames are defined outside this snippet.
    # Save pitchers and position players separately for each team
    for team in teams:
        response = session.get(team['url'])
        response.html.render(timeout=60)  # the JS is rendered into HTML here
        tbody = response.html.find('#team_batting > tbody', first=True)
        batters = players(tbody)
        write_csv(f'dataset/player_batter_{team["team"].replace(" ", "")}.csv', batters, fieldnames)
        tbody = response.html.find('#team_pitching > tbody', first=True)
        pitchers = players(tbody)
        write_csv(f'dataset/player_pitcher_{team["team"].replace(" ", "")}.csv', pitchers, fieldnames)
    https://github.com/Shinichi-Nakagawa/br-scraping-npb/blob/main/players.py#L28
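The slide calls two helpers, `players` and `write_csv`, without showing their bodies. The sketch below is a guess at what such helpers could look like and is not the code from the linked repository. It assumes each table cell carries a `data-stat` attribute naming its column. It relies only on the `find()`, `attrs`, and `text` members of requests-html elements.

```python
import csv


def players(tbody):
    """Turn each <tr> under tbody into a dict keyed by the cell's data-stat name.

    Hypothetical helper: assumes cells look like <td data-stat="HR">10</td>.
    """
    rows = []
    for tr in tbody.find('tr'):
        row = {}
        for cell in tr.find('th, td'):
            stat = cell.attrs.get('data-stat')
            if stat:
                row[stat] = cell.text
        if row:  # skip rows with no labelled cells
            rows.append(row)
    return rows


def write_csv(path, rows, fieldnames):
    """Write rows to path as CSV, keeping only the columns in fieldnames."""
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction='ignore')
        writer.writeheader()
        writer.writerows(rows)
```

Because `players` only needs objects with `find()`, `attrs`, and `text`, it can be exercised with small stand-in objects, with no network access or browser rendering.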
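  11. Running it as a periodically scheduled crawler
    • The AI Samurai Japan work was a one-off project, so that aside:
    • I also have data that I personally collect every day: crawling sites, then posting to Slack or saving to BigQuery.
    • The requests-html code runs on GCF + Pub/Sub + Scheduler.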
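As a rough illustration of the GCF + Pub/Sub + Scheduler arrangement, the sketch below shows a Pub/Sub-triggered entry point in the `(event, context)` style used by 1st gen Python Cloud Functions. It is not the author's actual setup. It assumes Scheduler publishes a JSON message with `url` and `selector` keys. The function names and the message shape are hypothetical. Whether the headless browser that `render()` needs is available in a given runtime is not addressed here.

```python
import base64
import json


def fetch_rendered_text(url, selector, timeout=60):
    """Fetch a page, render its JavaScript, and return the text of one element."""
    # Imported lazily so this module still loads where requests-html is not installed.
    from requests_html import HTMLSession

    session = HTMLSession()
    response = session.get(url)
    response.html.render(timeout=timeout)
    element = response.html.find(selector, first=True)
    return element.text if element is not None else None


def handle_pubsub(event, context, fetch=fetch_rendered_text):
    """Decode the Pub/Sub message and crawl the page it names.

    event['data'] is the base64-encoded message body; the JSON keys
    'url' and 'selector' are an assumed convention for this sketch.
    """
    payload = json.loads(base64.b64decode(event['data']).decode('utf-8'))
    text = fetch(payload['url'], payload['selector'])
    return {'url': payload['url'], 'text': text}
```

Passing the fetcher as a parameter keeps the handler testable without network access. A follow-up step such as posting to Slack or writing to BigQuery would consume the returned dict.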
  12. It is actually running in production: "Key points for using GCP in small product development: how I launched a personal product in three days."
    https://shinyorke.hatenablog.com/entry/gcp-slack-taida

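  13. Wrap-up
    • For building a Python crawler these days, requests-html is a happy choice.
    • It cannot do everything the way Scrapy can, but the up-front cost of adopting it is low, so I recommend it.
    • It seems well suited to casual operation on Google Cloud Functions or (though I have not tried it) AWS Lambda. Concrete examples will go on the blog eventually.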
  14. Game set (that's the ballgame)