Slide 1

Slide 1 text

requests-html as scraping for baseball, not for humans, or "Machine Learning Starting with Baseball, Chapter 2"
Shinichi Nakagawa (@shinyorke)

Slide 2

Slide 2 text

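Today's talk
• I built a crawler with requests-html to collect the data for the baseball AI's predictions, and published it (rough as it is)
• A collection service that comes together quickly with Cloud Functions + Pub/Sub + Scheduler
• I have tried Scrapy and various other tools, but these days I would probably reach for requests-html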

Slide 3

Slide 3 text

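Who am I? ("Who are you, anyway?")
• Shinichi Nakagawa (@shinyorke)
• Senior engineer at JX Press (JX通信社)
• Baseball data scientist
• The person who ends up calling #kwskrb "#kwskpy"
• Congratulations to #kwskrb on its 9th anniversary 🎉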

Slide 4

Slide 4 text

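Today's talk is the technical story behind this post
"The 24 members of TOKYO 2020 Samurai JAPAN as chosen by a baseball AI - picked by machine learning, with no favoritism."
https://shinyorke.hatenablog.com/entry/tokyo2020-samurai-japan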

Slide 5

Slide 5 text

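Selecting Samurai Japan with a baseball AI
1. Use Major League open data to develop a model that predicts baseball players' performance
2. Feed the model from step 1 the stats of "almost" all 2021 Japanese pro baseball players and predict their 2021 performance on my own initiative
3. Rank players by predicted OPS (batters) and FIP (pitchers), then adjust for position and for left/right-handedness in throwing and batting to select 24 players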
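
The ranking part of step 3 can be sketched roughly as below. This is only an illustration, not the author's model or selection code: the player names and stat lines are made-up placeholders, the FIP constant of 3.10 is an assumed typical value (it varies by league and season), and the position and handedness balancing is left out.

```python
from dataclasses import dataclass


@dataclass
class Batter:
    name: str
    obp: float  # predicted on-base percentage
    slg: float  # predicted slugging percentage

    @property
    def ops(self) -> float:
        # OPS is on-base percentage plus slugging percentage (higher is better).
        return self.obp + self.slg


@dataclass
class Pitcher:
    name: str
    hr: int     # predicted home runs allowed
    bb: int     # predicted walks
    hbp: int    # predicted hit batters
    so: int     # predicted strikeouts
    ip: float   # predicted innings pitched, as a plain decimal number

    def fip(self, constant: float = 3.10) -> float:
        # FIP = (13*HR + 3*(BB+HBP) - 2*K) / IP + constant (lower is better).
        # The constant here is an assumption for illustration only.
        return (13 * self.hr + 3 * (self.bb + self.hbp) - 2 * self.so) / self.ip + constant


def rank_batters(batters):
    """Best predicted OPS first."""
    return sorted(batters, key=lambda b: b.ops, reverse=True)


def rank_pitchers(pitchers):
    """Best (lowest) predicted FIP first."""
    return sorted(pitchers, key=lambda p: p.fip())


# Hypothetical placeholder players, not real predictions.
batters = [Batter("Batter A", 0.350, 0.450), Batter("Batter B", 0.400, 0.550)]
pitchers = [Pitcher("Pitcher Y", 20, 50, 5, 100, 150.0), Pitcher("Pitcher X", 10, 30, 5, 150, 150.0)]

for b in rank_batters(batters):
    print(f"{b.name}: OPS {b.ops:.3f}")
for p in rank_pitchers(pitchers):
    print(f"{p.name}: FIP {p.fip():.2f}")
```

A real selection would then walk down these two lists while filling each position and keeping a mix of left- and right-handed players, as the slide describes.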

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

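Requirements for the prediction data (= gathering features)
• Basic stats for pitchers and batters (batting average, RBI, ERA, home runs allowed, etc.)
• Positions played. Ideally, the number of appearances as a starter.
• Data that covers the above with no problems in either data structure or license turned out to exist in the United States: Baseball Reference.
• https://www.baseball-reference.com/register/league.cgi?id=16632292
  https://www.baseball-reference.com/register/league.cgi?id=0549ac26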

Slide 8

Slide 8 text

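Building a crawler the hard, hands-on way with requests-html
• (On a topic unrelated to the baseball AI) I mused "what do crawlers look like these days? 🤔" in my times channel on the company Slack, and requests-html was recommended to me
• I tried it and it did feel good -> before I knew it, requests-html had become my main tool for crawlers
• The baseball data collection mentioned earlier was also built with requests-html
  https://github.com/Shinichi-Nakagawa/br-scraping-npb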

Slide 9

Slide 9 text

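What was good about requests-html
• Simple and easy to use (roughly speaking)
• The baseball pages were heavily JavaScript-driven, but a single render() call got them as HTML
• Whether it is "for humans" is debatable, but as a means to an end I think it works well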

Slide 10

Slide 10 text

JS -> HTML was taken care of with just this

    # Excerpt: session, teams, players(), write_csv() and fieldnames
    # come from the rest of the linked script.
    # For each team, save pitchers and position players separately
    for team in teams:
        response = session.get(team['url'])
        response.html.render(timeout=60)  # the JS is rendered into HTML here
        tbody = response.html.find('#team_batting > tbody', first=True)
        batters = players(tbody)
        write_csv(f'dataset/player_batter_{team["team"].replace(" ", "")}.csv', batters, fieldnames)
        tbody = response.html.find('#team_pitching > tbody', first=True)
        pitchers = players(tbody)
        write_csv(f'dataset/player_pitcher_{team["team"].replace(" ", "")}.csv', pitchers, fieldnames)

https://github.com/Shinichi-Nakagawa/br-scraping-npb/blob/main/players.py#L28
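
The `write_csv` helper called above is not shown on the slide. A minimal sketch of what such a helper could look like, using only the standard library `csv` module, is below. The signature is inferred from the call on the slide, and `team_csv_path` and the field names `name` and `avg` are hypothetical, so the actual implementation in the linked repository may well differ.

```python
import csv
from pathlib import Path


def team_csv_path(kind: str, team_name: str, directory: str = "dataset") -> str:
    """Build the output path as on the slide: spaces are stripped from the team name.

    `kind` is "batter" or "pitcher". This helper is hypothetical.
    """
    return f'{directory}/player_{kind}_{team_name.replace(" ", "")}.csv'


def write_csv(filename, rows, fieldnames):
    """Write a list of dicts to a CSV file with a header row (assumed behavior)."""
    path = Path(filename)
    # Create the output directory if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as f:
        # extrasaction="ignore" drops keys that are not listed in fieldnames.
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
```

For example, `write_csv(team_csv_path("batter", "Some Team"), rows, ["name", "avg"])` would write to `dataset/player_batter_SomeTeam.csv`.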

Slide 11

Slide 11 text

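Running it as a crawler on a regular schedule
• The AI Samurai JAPAN project was a one-off, so it does not need this
• But I also have data that I personally collect every day: scraping sites, then having the results posted to Slack or saved to BigQuery
• The code that uses requests-html runs on GCF + Pub/Sub + Scheduler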
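
As a rough illustration of this setup, a Pub/Sub-triggered Cloud Function entry point could look like the sketch below. It assumes the first-generation Python background-function signature `(event, context)`, where the Pub/Sub message body arrives base64-encoded in `event["data"]`, with Cloud Scheduler publishing to the topic on a schedule. The message format (`{"target": ...}`) and the `crawl` stub are hypothetical, and the actual requests-html, Slack, and BigQuery calls are left out.

```python
import base64
import json


def crawl(target: str) -> dict:
    """Placeholder for the real crawler (hypothetical).

    In practice this is where the requests-html code would fetch and parse
    pages, then post to Slack or write to BigQuery.
    """
    return {"target": target, "status": "ok"}


def decode_message(event: dict) -> dict:
    """Decode the JSON payload of a Pub/Sub message; empty dict if there is none."""
    data = event.get("data")
    if not data:
        return {}
    return json.loads(base64.b64decode(data).decode("utf-8"))


def main(event, context):
    """Entry point invoked each time a message is published to the topic."""
    message = decode_message(event)
    result = crawl(message.get("target", "default"))
    # Anything printed here ends up in the function's logs.
    print(json.dumps(result))
    return result
```

Keeping the handler this thin means the crawler itself can be run and tested locally without any cloud services.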

Slide 12

Slide 12 text

It is actually running in production
"Key points for using GCP in small product development - how I launched a personal product in three days"
https://shinyorke.hatenablog.com/entry/gcp-slack-taida

Slide 13

Slide 13 text

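Wrap-up
• For crawler development in Python these days, requests-html makes life easy
• It cannot do everything the way Scrapy can, but the initial cost of adopting it is low, so I recommend it.
• It seems well suited to casual operation on Google Cloud Functions or (though I have not tried it) AWS Lambda. Concrete examples will come in a blog post at some point.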

Slide 14

Slide 14 text

Game set!