PythonではじめるGISとデータ分析そして⚾ / Baseball Analytics and GIS with Python

PyCon mini Hiroshima 2019 Talk Session #pyconhiro

Shinichi Nakagawa

October 12, 2019

Transcript

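  1. Who am I? ("Who are you?")
    • Shinichi Nakagawa (@shinyorke)
    • Someone who used to work as a baseball engineer
    • Until last month: a "pro" baseball engineer
    • From this month: a "wild" (independent) baseball engineer (a return to it)
    • Organizer of the Python mokumoku study room (#rettypy)
    • Someone who does web, data science, ops, and ⚾ in Python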
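  2. Types of Japanese castles (note: classifications vary)
    Columns: type, representative examples, purpose and features

    Yamajiro (mountain castle): Takeda Castle, Gifu Castle
    • Built on a mountain, mainly for warfare
    • Lookout forts, umadashi outworks that shelter troops, etc.
    • As a rule nobody lives there; defenders hole up in it in wartime

    Hirayamajiro (hilltop castle): Himeji Castle, Kumamoto Castle
    • A military and political center built on a hill or low mountain
    • Military facilities, a castle town, and living quarters
    • Takes the best of both the mountain castle and the flatland castle

    Hirajiro (flatland castle): Osaka Castle, Hiroshima Castle
    • A political and economic center, built mainly on flat ground
    • Prioritizes its function as a government seat and residence
    • Less a castle than a palace or official residence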
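  3. What is a park factor?
    • The park factor (PF) is a metric that shows whether home runs (or runs) come more easily at a given ballpark than at other ballparks.
    • Home runs are the usual target, but it can also be computed for the following metrics:
    • Runs (used about as often as home runs)
    • Extra-base hits (doubles, triples)
    • Singles, or total hits
    • An average ballpark scores 1.0; a park is rated by how far above or below that it falls.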
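  4. Formula
    PF = {(A+B) / Home} / {(C+D) / Away}

    A: home runs hit at the home park
    B: home runs allowed at the home park
    C: home runs hit at other parks
    D: home runs allowed at other parks
    Home: number of games at the home park
    Away: number of games at other parks
    Note: swapping home runs for another metric (runs, etc.) applies the same formula to that metric.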
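As a worked illustration of the formula, here is a minimal Python sketch that uses the slide's symbols as parameter names. The function name `park_factor` and the season totals are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def park_factor(a, b, c, d, home_games, away_games):
    """PF = ((A + B) / Home) / ((C + D) / Away), using the slide's symbols."""
    home_rate = (a + b) / home_games  # home runs per game at the home park
    away_rate = (c + d) / away_games  # home runs per game at other parks
    return home_rate / away_rate


# Hypothetical season totals (illustrative only, not real data).
pf = park_factor(a=100, b=110, c=80, d=95, home_games=81, away_games=81)
print(round(pf, 3))  # 1.2
```

A value above 1.0, as here, would mean home runs were more frequent at the home park than in the same team's road games.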
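  5. [sample] Park Factor (rough idea)

    # Code that assumes the required data is already loaded into a SQL database
    # and that `connection` is an open database connection.
    import pandas as pd

    # Ballpark list (the PF values are added to this at the end and saved as CSV)
    df_parks = pd.read_sql(
        "select park_id, home_team_id, name, lat, lng "
        "from parks where home_team_id != ''",
        connection,
    )
    parks = {}
    # 'records' is the full orient name; the original 'rows' shorthand relied on
    # prefix matching that newer pandas versions no longer accept.
    for r in df_parks.to_dict('records'):
        parks[r['home_team_id']] = r['park_id']

    # (omitted)

    # Attach ballpark codes to the 2018 event data
    query = """
    -- join the ballpark code onto the event data
    select
        g.game_id, g.park_id, g.home_team_id, g.away_team_id,
        e.inn_ct, e.bat_id, e.pit_id, e.origin
    from events as e
    inner join games g on e.game_id = g.game_id
    where g.game_dt between '{start}' and '{end}'
    """
    df_event2018 = pd.read_sql(
        query.format(start='2018-03-01', end='2018-11-30'), connection)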
  6. [sample] Park Factor (rough idea)

    # Assumes home runs, triples, doubles and singles have already been counted
    # in preprocessing (the contents of that code are a secret)
    teams = []
    COLUMNS_HOME = ['home_team_id', 'park_id', 'hr', 'triple', 'double', 'single', 'hits']
    COLUMNS_AWAY = ['away_team_id', 'hr', 'triple', 'double', 'single', 'hits']
    for team in df_event2018['home_team_id'].unique():
        home_stadium = parks[team]
        # Events from this team's home games played at its home park
        _df_home = df_event2018.query(
            f'home_team_id == "{team}" and park_id == "{home_stadium}"')
        home_games = _df_home['game_id'].nunique()
        _df_home = _df_home[COLUMNS_HOME].groupby(
            ['home_team_id', 'park_id'], as_index=False).sum()
        # Events from this team's games at other parks
        _df_away = df_event2018.query(f'away_team_id == "{team}"')
        away_games = _df_away['game_id'].nunique()
        _df_away = _df_away[COLUMNS_AWAY].groupby(
            ['away_team_id'], as_index=False).sum()
        teams.append({
            'park_id': home_stadium,
            'team': team,
            'home_games': home_games,
            'home_hr': _df_home['hr'][0],
            'home_triple': _df_home['triple'][0],
            'home_double': _df_home['double'][0],
            'home_single': _df_home['single'][0],
            'home_hits': _df_home['hits'][0],
            'away_games': away_games,
            'away_hr': _df_away['hr'][0],
            'away_triple': _df_away['triple'][0],
            'away_double': _df_away['double'][0],
            'away_single': _df_away['single'][0],
            'away_hits': _df_away['hits'][0],
        })
    df_pf = pd.DataFrame(teams)
  7. [sample] Park Factor (rough idea)

    # The calculation is simple column-by-column arithmetic
    # Home runs
    df_pf['pf_hr'] = (df_pf['home_hr'] / df_pf['home_games']) \
        / (df_pf['away_hr'] / df_pf['away_games'])
    # Extra-base hits
    df_pf['pf_triple'] = (df_pf['home_triple'] / df_pf['home_games']) \
        / (df_pf['away_triple'] / df_pf['away_games'])
    df_pf['pf_double'] = (df_pf['home_double'] / df_pf['home_games']) \
        / (df_pf['away_double'] / df_pf['away_games'])
    # Singles
    df_pf['pf_single'] = (df_pf['home_single'] / df_pf['home_games']) \
        / (df_pf['away_single'] / df_pf['away_games'])

    Including preprocessing, with some effort about 100 lines of code are enough to
    compute major league PFs (home runs, triples, doubles, singles).
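The same per-metric arithmetic can be sketched without pandas. This is a minimal illustration, not the speaker's code: the helper name `park_factors` and the numbers are hypothetical, and the dictionary keys only mirror the `home_*` / `away_*` naming used on the slides. It also guards against a zero away rate, which can plausibly occur for rare events such as triples.

```python
METRICS = ("hr", "triple", "double", "single")


def park_factors(row, metrics=METRICS):
    """Return {'pf_<metric>': value}, or None where the away rate is zero."""
    result = {}
    for m in metrics:
        home_rate = row[f"home_{m}"] / row["home_games"]
        away_rate = row[f"away_{m}"] / row["away_games"]
        # Avoid dividing by zero when the event never happened on the road.
        result[f"pf_{m}"] = home_rate / away_rate if away_rate else None
    return result


# Hypothetical totals for one team (illustrative only, not real data).
team_row = {
    "home_games": 81, "away_games": 81,
    "home_hr": 220, "away_hr": 200,
    "home_triple": 30, "away_triple": 0,
    "home_double": 270, "away_double": 300,
    "home_single": 900, "away_single": 900,
}
pfs = park_factors(team_row)
print(pfs)
```

In pandas, a zero denominator yields `inf` or `NaN` rather than raising an error, so a similar check may be worth adding before ranking parks.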
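  8. Top 5 ballparks where home runs come easily
    Columns: rank, ballpark, home run PF, notes (the rank and PF figures are missing from this transcript)
    • Great American Ball Park: home of the Reds; small outfield and foul territory
    • Coors Field: home of the Rockies; high elevation, so air pressure is low
    • Globe Life Park in Arlington: home of the Rangers; right field juts in, which favors hitters
    • Citizens Bank Park: home of the Phillies; short right-center and left-center
    • Nationals Park: home of the Nationals; left-center is short here too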
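  9. Top 5 ballparks where home runs are hard to hit
    Columns: rank, ballpark, home run PF, notes (the rank and PF figures are missing from this transcript)
    • Marlins Park: home of the Marlins; spacious and close to the sea
    • AT&T Park: home of the Giants; so close to the sea that home runs land in the water
    • Oakland-Alameda County Coliseum: home of the Athletics; mercilessly large foul territory
    • SunTrust Park: home of the Braves; a newly opened ballpark
    • PNC Park: home of the Pirates; a distinctive park with a deep left field and left-center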
  10. [sample] Geocoding with GeoPy

    import csv
    import re
    import time

    from geopy.geocoders import Nominatim  # choose the geocoder (which service to use)
    from geopy.exc import GeocoderTimedOut
    from retry import retry

    # This time, use the OSM-based service
    geoLocator = Nominatim(user_agent='Baseball Radar24 / 0.1 [email protected]')


    # The actual geocoding. Retry as seems appropriate
    @retry((GeocoderTimedOut, ), delay=5, backoff=2, max_delay=4)
    def get_location(name, alias):
        loc = geoLocator.geocode(name)
        if not loc:
            loc = geoLocator.geocode(alias)
        return loc


    # Cleanse the ballpark name just a little so that it can be geocoded
    def park_name(name):
        # Strip a trailing Roman-numeral suffix (I to IV). The original chain of
        # str.replace calls removed every capital "I" anywhere in the name.
        return re.sub(r'\s+(?:IV|III|II|I)$', '', name.strip())

    Geocoding uses Nominatim, an API over OSM data.
    Ballpark names are cleansed just a little so that the geocoders will (hopefully) accept them.
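To make the `delay`, `backoff` and `max_delay` arguments easier to reason about, here is a generic sketch of capped exponential backoff. It is not the `retry` package's implementation, which may apply the cap at a slightly different step; `backoff_delays` and its `retries` parameter are hypothetical names used only for this illustration.

```python
def backoff_delays(delay, backoff, max_delay=None, retries=4):
    """Return the wait in seconds before each retry: delay * backoff**k, capped."""
    waits = []
    current = delay
    for _ in range(retries):
        if max_delay is not None:
            current = min(current, max_delay)  # never wait longer than the cap
        waits.append(current)
        current *= backoff  # grow the wait for the next attempt
    return waits


print(backoff_delays(5, 2, max_delay=4))   # [4, 4, 4, 4]
print(backoff_delays(5, 2, max_delay=60))  # [5, 10, 20, 40]
```

Under this model, a cap smaller than the initial delay, as in the slide's `delay=5, max_delay=4`, flattens the schedule to a constant wait, so a larger `max_delay` may be what was intended.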
  11. [sample] Geocoding with GeoPy

    # Execution starts here
    # Ballpark list
    values = []
    with open('./datasets/baseballdatabank/Parks.csv', 'r', newline='') as f:
        reader = csv.DictReader(f)
        for r in reader:
            values.append(r)

    # Run geocoding over and over
    locations = []
    for park in values:
        loc = get_location(park_name(park['park.name']), park_name(park['park.alias']))
        if loc:
            locations.append(
                {
                    'id': park['park.key'],
                    'name': park['park.name'],
                    'lat': loc.latitude,
                    'lng': loc.longitude,
                    'address': loc.address,
                    'state': park['state'],
                    'country': park['country']
                }
            )
        else:
            print('geo not found: ', park['park.name'], park['park.key'])

    # Write to CSV
    fields = ['id', 'name', 'lat', 'lng', 'address', 'state', 'country']
    with open('./datasets/parklist.csv', 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        for loc in locations:
            writer.writerow(loc)

    Read the CSV and call the geocoding function over and over.
    This part is a fairly ordinary script.
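The read, geocode, write flow can be tried offline with a stand-in for the geocoding service. The sketch below is only an illustration under that assumption: `FAKE_GEOCODER`, `lookup`, the park names, keys and coordinates are all hypothetical, and in-memory buffers replace the CSV files.

```python
import csv
import io

# Stand-in for a real geocoding service (hypothetical names and coordinates).
FAKE_GEOCODER = {
    "Example Park": (35.0, 135.0),
    "Example Field": (36.0, 136.0),
}


def lookup(name, alias, geocode=FAKE_GEOCODER.get):
    """Try the primary name first, then fall back to the alias."""
    return geocode(name) or (geocode(alias) if alias else None)


# In-memory input using the same column names as the slide's Parks.csv code.
SOURCE = io.StringIO(
    "park.key,park.name,park.alias\n"
    "EXA01,Example Park,\n"
    "EXA02,Old Name Stadium,Example Field\n"
    "EXA03,Nowhere Park,\n"
)

found, missing = [], []
for row in csv.DictReader(SOURCE):
    loc = lookup(row["park.name"], row["park.alias"])
    if loc:
        found.append({"id": row["park.key"], "name": row["park.name"],
                      "lat": loc[0], "lng": loc[1]})
    else:
        missing.append(row["park.key"])  # collect misses instead of printing

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["id", "name", "lat", "lng"])
writer.writeheader()
writer.writerows(found)
print(out.getvalue())
print("not found:", missing)
```

Collecting the misses in a list, rather than only printing them, makes it easier to retry or hand-correct those parks later.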
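  12. "Wild" research = curiosity and ISSUE
    • [Curiosity] I want to know more about technology and Python
    • Play all-out with things I will use at work and things I cannot do at all yet
    • Chase interesting-looking technology and topics for the fun of it
    • [ISSUE] More interesting things with sports x statistics (note: in my case)
    • Hypotheses that emerge from all kinds of data and viewpoints
    • Whether or not it makes money, the urge to do something once I notice it

    Make up your own theme, research it, and have fun! Your skills improve as a result.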
  13. [Appendix] List of things used
    • Data analysis
    • Jupyter notebook / Jupyter Lab https://jupyter.org/
    • Pandas https://pandas.pydata.org/
    • Plotly https://plot.ly/python/
    • GIS
    • GeoPy (Geocoding) https://geopy.readthedocs.io/en/stable/
    • Kepler.gl (Visualization) https://kepler.gl/
    • ⚾ Baseball (note: all MLB)
    • Baseball Databank https://github.com/chadwickbureau/baseballdatabank
    • Retrosheet https://github.com/chadwickbureau/retrosheet
    • Analyzing Baseball Data with R (book, in English) https://www.amazon.co.jp/dp/B07KRNP2BB