
PyConJP 2016: pandas による 時系列データ処理 (Time Series Data Processing with pandas)

Sinhrks
September 22, 2016

Transcript

  1. Self-introduction
    • @sinhrks
    • Work: data analysis
    • OSS activities: PyData Development Team (pandas), Dask Development Team (Dask)
    • GitHub: https://github.com/sinhrks
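  2. What is pandas?
    • Provides data structures for data analysis, plus convenient functions and methods for preprocessing and aggregating data
    • R's "data.frame" + α
    • Author: Wes McKinney
    • License: BSD
    • Name: PANel DAta System
    • GitHub: more than 7000 stars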
  3. Reading a CSV file
    import pandas as pd
    df = pd.read_csv('adult.csv')
    df
    (the result is displayed as a DataFrame)
    Adult Dataset taken from the UCI ML Repository: Lichman, M. UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
  4. pandas features
    • Vectorized computation
    • Grouping → aggregation (split-apply-combine)
    • Reshaping (merge, join, concat…)
    • Diverse input and output (SQL, CSV, Excel, …)
    • Flexible time series processing
    • Visualization
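As a minimal sketch of the first two features on this slide (vectorized computation and split-apply-combine), assuming a toy table whose column names and values are made up for illustration:

```python
import pandas as pd

# Hypothetical toy data: column names and values are illustrative only.
df = pd.DataFrame({"group": ["a", "b", "a", "b"],
                   "value": [1, 2, 3, 4]})

# Vectorized computation: the whole column is processed in one expression.
df["doubled"] = df["value"] * 2

# Split the rows by "group", apply sum to each piece, combine the results.
totals = df.groupby("group")["value"].sum()
print(totals)
```

The same `groupby` pattern carries over to time series once a datetime-derived key is used instead of a plain column.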
  5. Environment
    • Versions
      • Python 3.5.2
      • pandas 0.19.0rc1
      • statsmodels 0.8.0rc1
    • Namespaces
      import numpy as np
      import pandas as pd
      import matplotlib.pyplot as plt
      import statsmodels.api as sm
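  6. Real-world data
    • The frequency you need differs from the frequency of the data
      • You want to analyze daily data on a monthly basis
    • The data is not periodic
      • You want to analyze logs that are recorded each time an event occurs
    • The data is not labeled by time
      • You want to take raw data that holds datetimes in a column, aggregate it at some frequency, and analyze it
    Raw data → time series data: apply some preprocessing to bring the data into a form that is easy to work with.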
  7. Preparing time series data
    A list of datetimes:
    import datetime
    values = [datetime.datetime(2001, 1, 1), datetime.datetime(2001, 2, 1), datetime.datetime(2001, 3, 1)]
    One-dimensional data labeled by datetime (Series):
    s = pd.Series(np.arange(3), index=values)
    s
    2001-01-01    0
    2001-02-01    1
    2001-03-01    2
    dtype: int64
    Two-dimensional data labeled by datetime (DataFrame); the column names '商品A' and '商品B' mean "Product A" and "Product B":
    df = pd.DataFrame({'商品A': [25, 27, 30], '商品B': [10, 15, 17]}, index=values)
    df
  8. Labels of time series data
    The labels (index) have a datetime type (DatetimeIndex):
    df.index
    DatetimeIndex(['2001-01-01', '2001-02-01', '2001-03-01'], dtype='datetime64[ns]', freq=None)
    Selecting a column:
    df
    df['商品A']
    2001-01-01    25
    2001-02-01    27
    2001-03-01    30
    Name: 商品A, dtype: int64
  9. Generating sequences (pd.date_range)
    Data without datetime labels:
    s = pd.Series(np.arange(3))
    s
    0    0
    1    1
    2    2
    dtype: int64
    Create monthly timestamps starting from a given date, with the count set by periods:
    pd.date_range('2001-01-01', freq='M', periods=3)
    DatetimeIndex(['2001-01-31', '2001-02-28', '2001-03-31'], dtype='datetime64[ns]', freq='M')
    Overwrite the labels (index):
    s.index = pd.date_range('2001-01-01', freq='M', periods=3)
    s
    2001-01-31    0
    2001-02-28    1
    2001-03-31    2
    Freq: M, dtype: int64
  10. Frequency strings
    M: month end
    pd.date_range('2016-01-01', freq='M', periods=3)
    DatetimeIndex(['2016-01-31', '2016-02-29', '2016-03-31'], dtype='datetime64[ns]', freq='M')
    MS: month start
    pd.date_range('2016-01-01', freq='MS', periods=3)
    DatetimeIndex(['2016-01-01', '2016-02-01', '2016-03-01'], dtype='datetime64[ns]', freq='MS')
    W: weekly
    pd.date_range('2016-01-01', freq='W', periods=3)
    DatetimeIndex(['2016-01-03', '2016-01-10', '2016-01-17'], dtype='datetime64[ns]', freq='W-SUN')
    W-TUE: weekly, anchored on Tuesday
    pd.date_range('2016-01-01', freq='W-TUE', periods=3)
    DatetimeIndex(['2016-01-05', '2016-01-12', '2016-01-19'], dtype='datetime64[ns]', freq='W-TUE')
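The frequency strings on this slide correspond to offset objects in `pandas.offsets`. The sketch below builds the same three kinds of ranges from explicit offsets; it is an illustration rather than part of the talk, and it uses offset objects because some later pandas releases deprecate the `'M'` alias.

```python
import pandas as pd

start = pd.Timestamp("2016-01-01")  # a Friday

# Offset objects playing the role of the 'M', 'MS' and 'W-TUE' strings.
month_end = pd.date_range(start, freq=pd.offsets.MonthEnd(), periods=3)
month_start = pd.date_range(start, freq=pd.offsets.MonthBegin(), periods=3)
weekly_tue = pd.date_range(start, freq=pd.offsets.Week(weekday=1), periods=3)  # weekday=1 is Tuesday

print(month_end)
print(month_start)
print(weekly_tue)
```

When the start date does not fall on the requested frequency, `pd.date_range` begins at the first matching date on or after it, which is why the month-end range starts on 2016-01-31.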
  11. Parsing datetimes (pd.to_datetime)
    • Parses datetime strings quickly
    • C parser → regular expressions → dateutil
    pd.to_datetime(['2016-09-22', '2016-09-23'])
    DatetimeIndex(['2016-09-22', '2016-09-23'], dtype='datetime64[ns]', freq=None)
    pd.to_datetime(['September 22nd, 2016', 'September 22nd, 2016'])
    DatetimeIndex(['2016-09-22', '2016-09-22'], dtype='datetime64[ns]', freq=None)
    pd.to_datetime(['22 Sep 2016', '23 Sep 2016'])
    DatetimeIndex(['2016-09-22', '2016-09-23'], dtype='datetime64[ns]', freq=None)
  12. Parsing datetimes (pd.to_datetime)
    • Flexible parsing with an explicit format is also possible
    pd.to_datetime(['2016年9月22日', '2016年9月23日'])
    ValueError: Unknown string format
    pd.to_datetime(['2016年9月22日', '2016年9月23日'], format='%Y年%m月%d日')
    DatetimeIndex(['2016-09-22', '2016-09-23'], dtype='datetime64[ns]', freq=None)
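A small sketch of format-based parsing, extended with the `errors="coerce"` option of `pd.to_datetime`. The second input list, with one deliberately invalid entry, is made up for illustration.

```python
import pandas as pd

dates = ["2016年9月22日", "2016年9月23日"]

# An explicit format lets pandas parse strings it cannot infer by itself.
parsed = pd.to_datetime(dates, format="%Y年%m月%d日")

# errors="coerce" turns entries that do not match into NaT instead of raising.
mixed = pd.to_datetime(["2016年9月22日", "not a date"],
                       format="%Y年%m月%d日", errors="coerce")

print(parsed)
print(mixed)
```

Coercing to `NaT` can be convenient for raw logs where a few malformed timestamps should not abort the whole load.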
  13. Selecting data
    idx = pd.date_range('2016-01-01', freq='D', periods=366)
    df = pd.DataFrame({'商品A': np.random.randint(100, size=366), '商品B': np.random.randint(100, size=366)}, index=idx)
    df
    Select the row for a given datetime; the result is a Series:
    df.loc[datetime.datetime(2016, 1, 2)]
    商品A    12
    商品B    64
    Name: 2016-01-02 00:00:00, dtype: int64
    Strings are treated as datetimes:
    df.loc['2016-01-02']
    商品A    12
    商品B    64
    Name: 2016-01-02 00:00:00, dtype: int64
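Because strings are treated as datetimes, `.loc` also accepts partial date strings and string slices on a `DatetimeIndex`. A minimal sketch, using a hypothetical column `A` with deterministic values in place of the random data on the slide:

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2016-01-01", freq="D", periods=366)
df = pd.DataFrame({"A": np.arange(366)}, index=idx)  # "A" is an illustrative column

# A partial string selects every row that falls inside that period.
feb = df.loc["2016-02"]

# String slices on a DatetimeIndex include both endpoints.
window = df.loc["2016-01-30":"2016-02-02"]

print(len(feb), len(window))
```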
  14. Selecting by condition
    Vectorized property access:
    df.index.month
    array([ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... 12, 12, 12, 12, 12, 12, 12, 12, 12], dtype=int32)
    (df.index.month == 1) | (df.index.month == 3)
    array([ True, True, True, True, True, True, True, ... False, False, False, False, False, False], dtype=bool)
    df.loc[(df.index.month == 1) | (df.index.month == 3)]
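When several values have to be matched, the chained `|` comparisons can be replaced by a membership test. The sketch below uses `np.isin` for that; the column `A` and its values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2016-01-01", freq="D", periods=366)
df = pd.DataFrame({"A": np.arange(366)}, index=idx)  # illustrative data

# Same rows as (month == 1) | (month == 3), written as a membership test.
mask = np.isin(df.index.month, [1, 3])
selected = df.loc[mask]

print(len(selected))
```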
  15. Property access
    • Processing that depends on datetime attributes is easy to write
    Properties: year, month, day, hour, minute, second, microsecond, nanosecond, quarter, date, time, dayofyear, weekofyear, week, dayofweek, weekday, weekday_name
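A short sketch of a few of these properties on a `DatetimeIndex`. It calls `day_name()` for the weekday label, on the assumption that the installed pandas is newer than the release used in the talk, where the slide's `weekday_name` property applies instead.

```python
import pandas as pd

idx = pd.date_range("2016-09-22", freq="D", periods=3)  # Thursday to Saturday

day_of_week = idx.dayofweek   # Monday=0 ... Sunday=6
quarter = idx.quarter
# Assumption: day_name() is available in the installed pandas version.
names = idx.day_name()

print(list(day_of_week), list(quarter), list(names))
```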
  16. Resampling (.resample)
    • Sample data
      • Cumulative sum of normally distributed random numbers (a random walk)
    idx = pd.date_range('2016-09-22', freq='H', periods=50)
    df = pd.DataFrame({'val': np.random.randn(50)}, index=idx)
    df = df.cumsum()
    df
  17. Resampling (.resample)
    • Many kinds of aggregation are available
    Aggregation methods: ffill, backfill, pad, fillna, interpolate, count, nunique, first, last, max, median, min, ohlc, prod, size, sem, std, sum, var
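The transcript lists the aggregation methods without showing a call, so here is a minimal sketch of downsampling hourly data to daily values. The data is a deterministic ramp rather than the random walk from the previous slide, and the lowercase `'h'` alias is an assumption chosen for compatibility with newer pandas releases (the slides use `'H'`).

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2016-09-22", freq="h", periods=48)
df = pd.DataFrame({"val": np.arange(48.0)}, index=idx)

# Downsample hourly values to one row per calendar day.
daily_mean = df.resample("D").mean()

# ohlc gives the first, max, min and last value within each day.
daily_ohlc = df["val"].resample("D").ohlc()

print(daily_mean)
print(daily_ohlc)
```

`.resample` only defines the bins; the method called afterwards decides how each bin is reduced.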
  18. Interpolation (.interpolate)
    • Sample data
      • The values include missing entries (NaN)
    indexer = np.random.randint(4, size=50) == 1
    df.loc[indexer] = np.nan
    df
    (figure annotations mark the missing values)
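The slide only prepares data with gaps, so the following sketch shows what filling them might look like. The five-point series is made up so that the result is easy to check by eye.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2016-09-22", freq="h", periods=5)
s = pd.Series([0.0, np.nan, np.nan, 3.0, 4.0], index=idx)

# Default behaviour: linear interpolation between the neighbouring values.
linear = s.interpolate()

# Alternative: carry the last observed value forward.
filled = s.ffill()

print(linear.tolist())
print(filled.tolist())
```

Linear interpolation suits smoothly varying measurements, while forward filling is closer to "the state stays the same until a new reading arrives".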
  19. Using shift (example)
    idx = pd.date_range('2016-09-22 10:00', freq='T', periods=50)
    df = pd.DataFrame({'val': np.repeat([0, 1, 0, 1], [10, 20, 10, 10])}, index=idx)
    df
    Timestamps where the value differs from the previous row:
    df.index[df['val'] != df['val'].shift()]
    DatetimeIndex(['2016-09-22 10:00:00', '2016-09-22 10:10:00', '2016-09-22 10:30:00', '2016-09-22 10:40:00'], dtype='datetime64[ns]', freq=None)
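  20. Aggregating datetime data
    • Sample data
      • Product order data; the columns '数量', '商品名' and '発注日' mean "quantity", "product name" and "order date"
    df = pd.DataFrame({'数量': np.random.randint(100, size=1000), '商品名': np.random.choice(list('ABC'), 1000), '発注日': np.random.choice(idx, 1000)})
    df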
  21. Property access (.dt)
    Datetime properties of a column can be accessed through the .dt accessor:
    df['発注日'].dt.weekday
    0      6
    1      4
    2      6
    ..
    997    6
    998    5
    999    2
    Name: 発注日, dtype: int64
    df.groupby(df['発注日'].dt.weekday).sum()
  22. Cross tabulation (pd.pivot_table)
    • pd.pivot_table combined with either
      • pd.Grouper
      • property access
    pd.pivot_table(df, index=pd.Grouper(key='発注日', freq='M'), columns='商品名', values='数量', aggfunc='sum')
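A self-contained sketch of both variants named on this slide, `pd.Grouper` and property access. The English column names and the four order rows are hypothetical, and `'MS'` (month start) is used for the grouping frequency as an assumption to stay compatible with newer pandas releases.

```python
import pandas as pd

# Hypothetical order data; names and values are illustrative only.
orders = pd.DataFrame({
    "ordered_on": pd.to_datetime(["2016-01-05", "2016-01-20",
                                  "2016-02-03", "2016-02-10"]),
    "product": ["A", "B", "A", "A"],
    "quantity": [10, 5, 7, 3],
})

# Variant 1: group the datetime column into monthly bins with pd.Grouper.
by_month = pd.pivot_table(orders,
                          index=pd.Grouper(key="ordered_on", freq="MS"),
                          columns="product", values="quantity", aggfunc="sum")

# Variant 2: derive a key through the .dt accessor and pivot on it.
orders = orders.assign(weekday=orders["ordered_on"].dt.dayofweek)
by_weekday = pd.pivot_table(orders, index="weekday",
                            columns="product", values="quantity", aggfunc="sum")

print(by_month)
print(by_weekday)
```

`pd.Grouper` keeps the result on a datetime axis, whereas a property such as the weekday pools all weeks together.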
  23. Calendars
    • pandas.tseries.offsets.CustomBusinessDay
      • Processing that takes holidays into account
    • japandas (https://github.com/sinhrks/japandas)
      • JapaneseHolidayCalendar
    from pandas.tseries.offsets import CustomBusinessDay
    import japandas
    cal = japandas.JapaneseHolidayCalendar()
    cbd = CustomBusinessDay(calendar=cal)
    idx = pd.DatetimeIndex(['2016-09-20', '2016-09-21', '2016-09-22'])
    idx + cbd
    DatetimeIndex(['2016-09-21', '2016-09-23', '2016-09-23'], dtype='datetime64[ns]', freq=None)
    2016-09-22 is a holiday, so it is skipped.
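For readers without japandas installed, a rough equivalent can be sketched by listing the holiday by hand through the `holidays` argument of `CustomBusinessDay`. This is a stand-in for the calendar used in the talk, not a replacement for it.

```python
import pandas as pd
from pandas.tseries.offsets import CustomBusinessDay

# Assumption: the single holiday is written out manually instead of
# being taken from japandas.JapaneseHolidayCalendar.
cbd = CustomBusinessDay(holidays=["2016-09-22"])

idx = pd.DatetimeIndex(["2016-09-20", "2016-09-21", "2016-09-22"])

# Add the offset element by element to move each date to the next business day.
shifted = pd.DatetimeIndex([ts + cbd for ts in idx])

print(shifted)
```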
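  24. Statistical models for time series data
    • Goals
      • Examine relationships between time series
      • Forecast the future
      • Detect change points and anomalies
      • …
    • Points to keep in mind with time series data
      • Is the data influenced by data from earlier points in time?
      • Is there a trend or seasonality?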
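  25. Python packages that include time series models
    • Choose the package according to the model you want to use
    • Use R where necessary (rpy2, pypeR)
    Package comparison (columns: StatsModels, PyFlux; marks listed as extracted):
    • Statistics / tests: ✅
    • ARIMA: ✅ ✅
    • VAR: ✅ ✅
    • GARCH: ✅ (sandbox) ✅
    • GAS: ✅
    • StateSpace: ✅ (rc) ✅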
  26. Statistics
    • Sample autocorrelation (ACF)
      • The covariance between different points in time, standardized
    • Sample partial autocorrelation (PACF)
    fig, axes = plt.subplots(1, 2)
    sm.tsa.graphics.plot_acf(df, ax=axes[0]);
    sm.tsa.graphics.plot_pacf(df, ax=axes[1]);
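To make "standardized covariance between different points in time" concrete, here is a NumPy-only sketch of the sample autocorrelation. The helper `sample_acf` is a name introduced for illustration and is not a statsmodels function; it uses the common estimator that divides the lag-k autocovariance by the lag-0 autocovariance.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation for lags 0..nlags (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    centered = x - x.mean()
    denom = np.sum(centered ** 2)  # lag-0 autocovariance times n
    return np.array([np.sum(centered[k:] * centered[:n - k]) / denom
                     for k in range(nlags + 1)])

rng = np.random.RandomState(0)
walk = rng.randn(200).cumsum()  # random walk: neighbouring values are similar
noise = rng.randn(200)          # white noise: no dependence on the past

print(sample_acf(walk, 3))
print(sample_acf(noise, 3))
```

The random walk should show autocorrelations close to 1 at small lags, while the white noise values should stay near 0.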
  27. Statistics
    • The case of normally distributed random numbers (white noise)
      • No correlation with past values
    wn = pd.Series(np.random.randn(100))
    fig, axes = plt.subplots(1, 2)
    sm.tsa.graphics.plot_acf(wn, ax=axes[0]);
    sm.tsa.graphics.plot_pacf(wn, ax=axes[1]);
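  28. SARIMA model
    • Seasonal autoregressive integrated moving average model
      • Autoregressive integrated moving average model (ARIMA)
      • + seasonal variation (ARIMA)
    • ARIMA (p, d, q)
      • The series $y_t$ obtained by differencing d times
        • is (weakly) stationary (mean and autocovariance are constant over time)
        • follows the process below
      • $y_t = c + \phi_1 y_{t-1} + \dots + \phi_p y_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q}$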
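The recursion in the ARMA equation can be written out directly. The sketch below simulates it with NumPy; `simulate_arma` is a hypothetical helper, and treating all values before the start of the series as zero is a simplifying assumption.

```python
import numpy as np

def simulate_arma(c, phi, theta, eps):
    """Generate y_t = c + sum_i phi_i*y_{t-i} + eps_t + sum_j theta_j*eps_{t-j}.

    Values before the start of the series are treated as zero
    (an assumption made only to keep the sketch short).
    """
    y = np.zeros(len(eps))
    for t in range(len(eps)):
        ar = sum(p * y[t - i] for i, p in enumerate(phi, start=1) if t - i >= 0)
        ma = sum(q * eps[t - j] for j, q in enumerate(theta, start=1) if t - j >= 0)
        y[t] = c + ar + eps[t] + ma
    return y

rng = np.random.RandomState(0)
y = simulate_arma(c=0.0, phi=[0.5], theta=[0.3], eps=rng.randn(100))
print(y[:5])
```

Taking the cumulative sum of such a series d times would give a series that needs d rounds of differencing before it follows this process, which is the "I" in ARIMA.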
  29. Unit root test
    • Augmented Dickey-Fuller test
    Original data:
    sm.tsa.adfuller(df['Air passengers'])[1]
    0.99188024343764114
    Log-transformed:
    sm.tsa.adfuller(ldf['Air passengers'])[1]
    0.42236677477038415
    Log-transformed + differenced:
    sm.tsa.adfuller(ldf['Air passengers'].diff().dropna())[1]
    0.071120548150854057
    Log-transformed + seasonality removed + differenced:
    sm.tsa.adfuller(seasonal_adjust['Air passengers'].diff().dropna())[1]
    8.0990048658604878e-09
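The transcript does not show how `ldf` and `seasonal_adjust` were built. The sketch below shows one plausible set of pandas transforms on a synthetic monthly series: a log transform, a first difference, and a lag-12 difference. The lag-12 difference is only an assumed stand-in for the seasonal adjustment used in the talk.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series with exponential growth and a 12-month cycle.
idx = pd.date_range("1949-01-01", freq="MS", periods=48)
t = np.arange(48)
series = pd.Series(100 * np.exp(0.01 * t) * (1 + 0.1 * np.sin(2 * np.pi * t / 12)),
                   index=idx)

log_series = np.log(series)                   # log transform stabilizes the variance
log_diff = log_series.diff().dropna()         # first difference removes the trend
seasonal_diff = log_series.diff(12).dropna()  # lag-12 difference removes the yearly cycle

print(seasonal_diff.head())
```

In this synthetic case the lag-12 difference is constant at 0.12, because the seasonal factor cancels and only twelve months of growth remain.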
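  30. Estimating a SARIMA model
    mod_seasonal = sm.tsa.SARIMAX(ldf, trend='c', order=(1, 1, 1), seasonal_order=(0, 1, 2, 12))
    res_seasonal = mod_seasonal.fit()
    res_seasonal.summary()
    … …
    Annotations: order and seasonal_order are the parameters of the ARIMA and seasonal components; a release candidate (rc) version is required.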
  31. Forecasting from the model
    Forecast the periods ahead:
    pred = res_seasonal.forecast(36)
    pred
    1961-01-01    6.110548
    1961-02-01    6.052912
    1961-03-01    6.174690
    ...
    1963-10-01    6.388955
    1963-11-01    6.242262
    1963-12-01    6.345214
    Freq: MS, dtype: float64
    Plot the original data and the forecasts:
    ax = ldf.plot()
    pred.plot(ax=ax)
  32. Development roadmap
    • Plan
      • Release 0.19 (currently rc) → 0.20 → 1.0
    • pandas 1.0
      • API freeze
      • Long-term support
    • pandas 2.0 (under discussion)
      • Support Python 3.x only
      • Specialize in data with two or fewer dimensions
      • Move the backend to C++ (Apache Arrow)