
Getting Started with Baseball Statistics Using Scrapy and Redash #PyConKuma


An analysis of the Fukuoka SoftBank Hawks and the Hokkaido Nippon-Ham Fighters, done with Scrapy and Redash.

Presentation slides for PyCon mini Kumamoto 2017: http://kumamoto.pycon.jp/

#Python #Baseball

Shinichi Nakagawa

April 23, 2017

Transcript

  1. Who am I?
    • Shinichi Nakagawa
    • @shinyorke / the baseball guy
    • Retty Inc. engineer / in charge of fish dishes
    • Python / baseball / sabermetrics / Agile
    • Someone who blogs about baseball data analysis in Python and about Agile
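  2. Blog: "Lean Baseball"
    • http://shinyorke.hatenablog.com/
    • A blog centered on baseball, Python, and Agile
    • Frequently found through searches for "Python books", "baseball statistics", and "sabermetrics" (thank you)
    • "I tried systematizing how beginners should choose books for learning Python (2017 edition)"
    • "I made a learning flow for baseball fans who want to learn statistics"
    • "[Baseball] Sabermetrics in 30 minutes"
    • Supplementary notes and commentary on today's talk will be posted on the blog at some point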
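  3. Starting members (agenda)
    • Building a baseball analytics platform with Scrapy and Redash
        • Collecting data with Scrapy
        • Building and using a package that computes sabermetrics metrics
        • Visualizing with Redash
    • Fukuoka SoftBank Hawks vs. Hokkaido Nippon-Ham Fighters, with a side of sabermetrics
        • Run-creation ability (RC/RC27)
        • Run value (wOBA/wRAA)
    • Summary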
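  4. Scrapy: a crawler framework
    • A crawler framework that handles website crawling, scraping, and data storage end to end
    • Arguably the Django / Ruby on Rails of the crawler world
    • Scheduler, User-Agent, HTTP headers, download timing, caching, etc… the things you need are provided out of the box and are easy to configure through settings parameters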
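As a rough illustration of the "configure through parameters" point on slide 4, a project's settings.py might look like the sketch below. The setting names are Scrapy's built-in settings; the values, the bot name in the User-Agent string, and the pipeline module path are assumptions made for this example and are not taken from the talk.

```python
# settings.py (sketch): values are illustrative assumptions, not the speaker's.

# Identify the crawler to the site being scraped (hypothetical bot name).
USER_AGENT = 'baseball-stats-bot (+http://example.com/contact)'

# Extra HTTP headers sent with every request.
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml',
    'Accept-Language': 'ja',
}

# Download timing: wait between requests and limit parallelism per domain.
DOWNLOAD_DELAY = 3.0
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Respect robots.txt.
ROBOTSTXT_OBEY = True

# Cache responses on disk so repeated runs do not hit the site again.
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 60 * 60 * 24  # one day

# Enable an item pipeline (hypothetical module path); lower numbers run first.
ITEM_PIPELINES = {
    'baseball.pipelines.BaseballLabPipeline': 300,
}
```

A generous DOWNLOAD_DELAY together with the HTTP cache keeps the load on the scraped site low while a spider is being developed.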
  5. Spider (fetch and save batter data)

        from scrapy import Spider, Request
        from ..items import BaseballNpbPlayer as Player


        class BatterStatsSpider(Spider):
            name = 'player_stats'
            allowed_domains = ['example.com']
            start_urls = _get_start_urls()  # helper not shown on the slide
            BASE_URL = 'http://example.com'

            def parse(self, response):
                # Rows that have a second cell link to an individual player page.
                for row in response.xpath('//table[@class="NpbPlSt mb10"]/tr'):
                    if len(row.xpath('td[2]').extract()) > 0:
                        name = row.xpath('td[2]/a/text()').extract()[0]
                        url = ''.join([self.BASE_URL, row.xpath('td[2]/a/@href').extract()[0]])
                        yield Request(url, callback=self.parse_player)

            def parse_player(self, response):
                """
                Callback function
                """
                stats = Player()
                # (omitted)
                yield stats
  6. Item (stores batter data)

        import scrapy


        class BaseballNpbPlayer(scrapy.Item):
            # profile
            name = scrapy.Field()       # name
            url = scrapy.Field()        # URL
            born = scrapy.Field()       # date of birth
            throw_bat = scrapy.Field()  # throwing arm / batting side
            team = scrapy.Field()       # team
            # batting
            ba = scrapy.Field()         # batting average
            hr = scrapy.Field()         # home runs
            rbi = scrapy.Field()        # runs batted in
            # (omitted)
            obp = scrapy.Field()        # on-base percentage
            slg = scrapy.Field()        # slugging percentage
            ops = scrapy.Field()        # OPS
  7. Item Pipeline (SQLAlchemy example)

        from configparser import ConfigParser
        import logging

        from sqlalchemy import create_engine
        from sqlalchemy.orm import sessionmaker

        from .models import PlayerBatting, PlayerPitching  # think of these as SQLAlchemy model classes


        class BaseballLabPipeline(object):
            URL = '{dialect}+{driver}://{user}:{password}@{host}:{port}/{database}?charset={encoding}'

            def open_spider(self, spider):
                """
                Initialization when called from the Spider
                """
                config = ConfigParser()
                config.read('./config.ini')
                params = dict(config['mysql'])
                # URL is a class attribute, so it is reached through self here.
                engine = create_engine(self.URL.format(**params), encoding=params.get('encoding'))
                Session = sessionmaker()
                Session.configure(bind=engine)
                self.session = Session()

            def close_spider(self, spider):
                """
                Teardown
                """
                try:
                    logging.info('close spider')
                finally:
                    self.session.close()
  8. Item Pipeline (SQLAlchemy example)

            # Methods of BaseballLabPipeline, continued from the previous slide.

            @classmethod
            def create_model(cls, item, spider):
                """
                Return an instance of the batter/pitcher data model
                """
                if spider.name == 'player_batting':
                    return PlayerBatting(**item)
                elif spider.name == 'player_pitching':
                    return PlayerPitching(**item)
                else:
                    raise Exception('Spider not found: {}'.format(spider.name))

            def process_item(self, item, spider):
                """
                Save the Item to the DB
                """
                try:
                    self.session.add(BaseballLabPipeline.create_model(item, spider))
                    self.session.commit()
                except Exception as e:
                    self.session.rollback()
                    logging.error(e)
                    logging.error(item)
                # Scrapy expects process_item to return the item (or raise DropItem).
                return item
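The pipeline's open_spider builds its database URL from the [mysql] section of config.ini, which the slides do not show. The sketch below reproduces just that step with the standard library so the expected keys are visible. Every value in the INI text (including the pymysql driver name) is a placeholder assumed for illustration, and build_database_url is a hypothetical helper, not part of the talk's code.

```python
from configparser import ConfigParser

# Hypothetical config.ini contents; every value here is a placeholder.
CONFIG_TEXT = """
[mysql]
dialect = mysql
driver = pymysql
user = scott
password = secret
host = localhost
port = 3306
database = baseball
encoding = utf8
"""

# Same template as the URL class attribute of the pipeline on slide 7.
URL = '{dialect}+{driver}://{user}:{password}@{host}:{port}/{database}?charset={encoding}'


def build_database_url(config_text):
    """Fill the URL template from the [mysql] section of an INI document."""
    config = ConfigParser()
    config.read_string(config_text)
    params = dict(config['mysql'])  # section proxy -> plain dict of strings
    return URL.format(**params)


database_url = build_database_url(CONFIG_TEXT)
print(database_url)
```

Keeping credentials in an INI file rather than in the pipeline class means the same code can point at a different database without edits.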
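  9. What I did with Scrapy
    • Spider
        • parse() fetches the list of baseball player pages (you can guess which site)
        • The callback function implements the scraping of each individual player
    • Item
        • One Item class each for batters, pitchers, and teams
        • Fields are named after the abbreviations of baseball metrics (short and easy to read)
    • Item Pipeline
        • Brought in SQLAlchemy (O/R mapping) to make the code easy to write
        • The destination model (SQLAlchemy) is switched according to the Spider being processed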
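  10. What I worked out beyond Scrapy
    • [Problem] Metric calculations inside the Spider became messy
        • RC/RC27 (run-creation ability)
        • wOBA (weighted on-base average)
        • wRAA (a batter's offensive contribution)
        • This is not something to write directly in a Spider, and I want to reuse it in other analyses!
    • [Try] Extracted the sabermetrics metrics into a Python package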
  11. SABR (example)

        $ pip install sabr
        $ python
        >>> import sabr
        >>> from sabr.stats import Stats
        >>> Stats.hr9(26, 209.7)  # Yu Darvish (2013) HR/9
        1.1
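The result on slide 11 is consistent with the usual definition of HR/9, home runs allowed scaled to nine innings. The function below is a minimal sketch under that assumption. It is a standalone stand-in written for this note, not the sabr package's actual implementation.

```python
def hr9(hr, ip):
    """Home runs allowed per nine innings, rounded to one decimal place.

    hr: home runs allowed
    ip: innings pitched
    """
    return round(9 * hr / ip, 1)


# Same inputs as the interactive session on slide 11.
print(hr9(26, 209.7))
```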
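  12. Redash
    • A tool for visualizing the data used by web services as graphs
    • Supports a variety of data sources: RDBMS, KVS, Google Analytics, etc…
    • A cloud service is available (paid), and the source itself is OSS
    • Images are also published for AWS (EC2), GCE, and Docker
    • The main application is Flask (Python) + AngularJS
    • Also supports notifications to email, Slack, and similar channels
    • Handy for personal use when you want to tinker with SQL and sketch a few graphs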
  13. RC (excerpt from SABR)

        def rc(cls, tb, h, bb, hbp, cs, gidp, sf, sh, sb, so, ab, ibb):
            """
            Runs Created
            :param tb: total bases
            :param h: hits
            :param bb: base on balls
            :param hbp: hit by pitch
            :param cs: caught stealing
            :param gidp: ground into double play
            :param sf: sacrifice fly
            :param sh: sacrifice hit
            :param sb: stolen base
            :param so: strike out
            :param ab: at bat
            :param ibb: intentional base on balls
            :return: (float) runs created
            """
            # (on-base ability A * advancement ability B) / opportunities C
            a = float(h + bb + hbp - cs - gidp)
            b = float(tb) + round(0.24 * float(bb + hbp - ibb), 1) + round(0.62 * float(sb), 1)\
                + round(0.5 * float(sh + sf), 1) - round(0.03 * float(so), 1)
            c = float(ab + bb + hbp + sf + sh)
            a_b = round(a + 2.4 * c) * (b + 3.0 * c)
            _9c = round(9.0 * c, 1)
            _09c = round(0.9 * c, 1)
            rc = round(a_b / _9c - _09c, 2)
            return rc
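  14. RC27 (Runs Created per 27 outs)
    • A metric that normalizes RC (Runs Created) to 27 outs (roughly one game)
    • Ideal for questions like "which is stronger: a lineup of nine Ohtanis or a lineup of nine Gitas (Yuki Yanagita)?"
    • RC27 = ((on-base ability A × advancement ability B) / opportunities C) / 27 outs
      ※ A rough outline of the formula's structure
    • For the detailed reasoning, see the Japanese Wikipedia article "RC (野球)"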
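Written out in full, the rc and rc27 excerpts from the sabr package shown in this deck correspond to the formulas below. The symbols follow the parameter names used in that code, and the intermediate rounding steps are left out, so this is a reading of the code rather than an independent definition.

```latex
\begin{aligned}
A &= H + BB + HBP - CS - GIDP \\
B &= TB + 0.24\,(BB + HBP - IBB) + 0.62\,SB + 0.5\,(SH + SF) - 0.03\,SO \\
C &= AB + BB + HBP + SF + SH \\
RC &= \frac{(A + 2.4\,C)\,(B + 3\,C)}{9\,C} - 0.9\,C \\
RC27 &= \frac{27 \cdot RC}{AB - H + SH + SF + CS + GIDP}
\end{aligned}
```

The denominator of RC27 is the batter's total outs, so the "/ 27 outs" in the rough outline is shorthand for rescaling RC from the outs actually consumed to 27 outs.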
  15. RC27 (excerpt from SABR)

        def rc27(cls, rc, ab, h, sh, sf, cs, gidp):
            """
            Runs created 27
            :param rc: runs created
            :param ab: at bat
            :param h: hits
            :param sh: sacrifice hit
            :param sf: sacrifice fly
            :param cs: caught stealing
            :param gidp: ground into double play
            :return: (float) runs created 27
            """
            # to = total outs consumed by the batter
            to = ab - h + sh + sf + cs + gidp
            rc27 = round(27 * rc / to, 2)
            return rc27
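  16. Strengths and weaknesses of RC/RC27
    • [Strengths] Captures baseball's run-scoring structure accurately (and is easy to understand)
        • Evaluates on-base ability and advancement ability separately
        • Dividing their product by the number of opportunities expresses the run-scoring structure simply
    • [Weaknesses] Ignores the scoring potential that each game situation carries
        • Situation = out count × runners on base (24 combinations)
        • Cannot answer a question such as "with no outs and the bases empty, which leads to more runs: a single or a double?" (because individual plays are not evaluated)
    • To address this weakness, I decided to use run expectancy and run value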
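The figure of 24 situations comes from combining three possible out counts with eight possible base states. A minimal sketch that enumerates them, where the tuple layout (first, second, third) is an assumption chosen for this example:

```python
from itertools import product

OUTS = (0, 1, 2)
# Each base is either empty (False) or occupied (True): first, second, third.
BASE_STATES = list(product((False, True), repeat=3))

# Every (outs, bases) pair is one game situation.
situations = [(outs, bases) for outs, bases in product(OUTS, BASE_STATES)]
print(len(situations))
```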
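  17. Run expectancy and run value
    • Run Expectancy
        • Quantifies how likely runs are to be scored in each game situation
        • Computed as "total runs scored / number of occurrences of the situation" for each of the 24 situations
    • Run Value
        • Based on run expectancy, evaluates how much a play is worth in terms of runs
        • [Example] A single with no outs and a runner on first → run expectancy rises from 0.4 to 1.0 → the difference of 0.6 is the run value!
        • Of course, it can also go down (by piling up outs, striking out, and so on)
    • A detailed explanation can be found at https://ja.wikipedia.org/wiki/Linear_Weights and similar pages.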
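The two definitions on slide 17 can be sketched in a few lines. The observations below are made up for illustration, the function names are hypothetical, and the runs_on_play term (runs that score on the play itself) is the common extension of the simple difference described on the slide; it defaults to zero so the slide's example is reproduced as is.

```python
from collections import defaultdict


def run_expectancy(observations):
    """Average runs scored through the end of the inning for each situation.

    observations: iterable of (situation, runs_scored_until_inning_end).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for situation, runs in observations:
        totals[situation] += runs
        counts[situation] += 1
    return {s: totals[s] / counts[s] for s in totals}


def run_value(re_before, re_after, runs_on_play=0.0):
    """Change in run expectancy caused by a play, plus runs scored on it."""
    return re_after - re_before + runs_on_play


# Made-up observations: (situation label, runs scored in the rest of the inning).
sample = [('0 out, 1B', 0), ('0 out, 1B', 1), ('0 out, 1B', 0), ('0 out, 1B', 2)]
table = run_expectancy(sample)

# The slide's example: run expectancy rises from 0.4 to 1.0.
value = run_value(0.4, 1.0)
print(table, round(value, 1))
```

Averaging the run values of every single, double, walk, and so on over a season is what produces the per-event weights used by wOBA.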
  18. wOBA (excerpt from SABR)

        def woba_npb(cls, bb, hbp, _1b, _2b, _3b, hr, ab, sf, ibb=0, e_bb=0):
            """
            Weighted on-base average for NPB (wOBA)
            http://1point02.jp/
            :param bb: base on balls
            :param hbp: hit by pitch
            :param _1b: single
            :param _2b: double
            :param _3b: triple
            :param hr: home run
            :param ab: at bat
            :param sf: sacrifice fly
            :param ibb: intentional base on balls (default: 0)
            :param e_bb: reached base on error (default: 0)
            :return: (float) wOBA
            """
            u_bb = round(0.692 * float(bb - ibb), 3)
            u_hbp = round(float(0.73 * hbp), 3)
            u_e_bb = round(0.966 * float(e_bb), 3)
            u_h = round(0.865 * float(_1b), 3) + round(1.334 * float(_2b), 3)\
                + round(1.725 * (_3b), 3) + round(2.065 * float(hr), 3)
            u_pa = round(float(ab + bb - ibb + hbp + sf), 3)
            return round((u_bb + u_hbp + u_e_bb + u_h) / u_pa, 3)
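  19. wRAA
    • Weighted Runs Above Average (a batter's offensive contribution)
    • A metric expressing "the runs added compared with an average batter given the same number of plate appearances"
    • wRAA = (batter's wOBA - league wOBA) / wOBAscale * plate appearances
    • Converting wOBA back into units of runs yields the batter's run production
    • The batting component of WAR, the comprehensive player metric, is derived from wRAA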
  20. wRAA (excerpt from SABR)

        def wraa(cls, woba, lg_woba, pa, woba_scale=1.24):
            """
            Weighted Runs Above Average (wRAA)
            http://1point02.jp/
            :param woba: weighted on-base average
            :param lg_woba: weighted on-base average (league average)
            :param pa: plate appearances
            :param woba_scale: weighted on-base average scale (default: 1.24)
            :return: (float) wRAA
            """
            return round(((woba - lg_woba) / woba_scale) * float(pa), 1)
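To see how the two metrics chain together, the sketch below restates the wOBA and wRAA excerpts as plain functions and applies them to one batting line. The stat line and the league wOBA of 0.320 are invented for illustration, and the per-term rounding of the original wOBA code is dropped for brevity, so treat this as a simplified reading of the slides rather than the sabr package itself.

```python
def woba_npb(bb, hbp, _1b, _2b, _3b, hr, ab, sf, ibb=0, e_bb=0):
    """wOBA using the NPB weights shown on slide 18 (simplified rounding)."""
    numerator = (0.692 * (bb - ibb) + 0.73 * hbp + 0.966 * e_bb
                 + 0.865 * _1b + 1.334 * _2b + 1.725 * _3b + 2.065 * hr)
    denominator = ab + bb - ibb + hbp + sf
    return round(numerator / denominator, 3)


def wraa(woba, lg_woba, pa, woba_scale=1.24):
    """wRAA as on slide 20: wOBA above league average, converted to runs."""
    return round(((woba - lg_woba) / woba_scale) * float(pa), 1)


# Hypothetical season line (not a real player).
line = dict(bb=60, ibb=5, hbp=5, _1b=100, _2b=30, _3b=3, hr=25, ab=500, sf=5)
plate_appearances = line['ab'] + line['bb'] + line['hbp'] + line['sf']  # 570

batter_woba = woba_npb(**line)
batter_wraa = wraa(batter_woba, lg_woba=0.320, pa=plate_appearances)
print(batter_woba, batter_wraa)
```

A batter whose wOBA equals the league wOBA gets a wRAA of exactly zero, which is what "above average" means here.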
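  21. [Reference] wOBA/wRAA quick-reference table
    • Columns: rating, wOBA, wRAA
    • Rating rows: excellent, very good, above average, average, below average, poor, very poor (the numeric values are not preserved in this transcript)
    • Excerpted from the Japanese Wikipedia article "wOBA (野球)"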
  22. Python × Retty
    • We will keep energizing the community through event sponsorship and more!
    • Not only data science: the product team is also moving to Python
    • If you are interested, come visit Retty when you are in Tokyo!
    • http://corp.retty.me/
    • https://www.wantedly.com/companies/retty