PyConJP 2016: Time Series Data Processing with pandas

Sinhrks
September 22, 2016


Transcript

  1. Time series data processing with pandas @PyConJP 2016 sinhrks

  2. About me
     • @sinhrks
     • Work: data analysis
     • OSS activities: PyData Development Team (pandas), Dask Development Team (Dask)
     • GitHub: https://github.com/sinhrks
  3. Goals
     • Learn efficient processing techniques for time series data analysis
     • Get an introduction to time series models

  4. Contents
     • What is pandas?
     • Processing time series data
     • Statistical models for time series data
     • Bonus: development roadmap
  5. What is pandas?

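  6. What is pandas?
     • Provides data structures for data analysis, plus convenient functions and methods for preprocessing and aggregating data
     • R's "data.frame" plus extras
     • Author: Wes McKinney
     • License: BSD
     • Name origin: PANel DAta System
     • GitHub: more than 7,000 stars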
  7. Benefits of using pandas
     • Copes with real-world (messy) data
     • Intuitive operations
     • Fast
     • Reference: "pandas internals" @PyConJP 2015
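  8. pandas data structures
     • Defined per number of dimensions
       Series: 1 dimension
       DataFrame: 2 dimensions
       Panel: 3 dimensions
     • Colored cells in the figure are labels
     • Data structures with three or more dimensions are slated for deprecation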
  9. DataFrame
     • Two-dimensional data structure
     • Rows (index) and columns each carry labels
     • Each column has its own dtype
     Figure labels: Columns, Index, int dtype, object dtype
  10. DataFrame: reading a CSV file
          import pandas as pd
          df = pd.read_csv('adult.csv')
          df
      Adult Data Set taken from the UCI ML Repository:
      Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
  11. DataFrame
      Selecting columns:
          df[['age', 'marital-status']]
      Grouping, column selection, then aggregation (mean):
          df.groupby('income')['hours-per-week'].mean()
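The column selection and group-wise mean on slide 11 can be tried without downloading adult.csv. This is a minimal sketch, assuming a tiny hand-made frame whose column names mirror the Adult data set; the rows are invented for illustration and are not taken from the real file.

```python
import pandas as pd

# Hypothetical stand-in for adult.csv: same column names, made-up values.
df = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "marital-status": ["Never-married", "Married", "Divorced", "Married"],
    "hours-per-week": [40, 13, 40, 60],
    "income": ["<=50K", "<=50K", "<=50K", ">50K"],
})

# Column selection: a list of labels returns a DataFrame.
subset = df[["age", "marital-status"]]

# Split by income, pick one column, aggregate each group with the mean.
mean_hours = df.groupby("income")["hours-per-week"].mean()
print(mean_hours)
```

The result is a Series indexed by the group key, which is the "combine" step of split-apply-combine.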

  12. pandas features
      • Vectorized computation
      • Grouping → aggregation (split-apply-combine)
      • Reshaping (merge, join, concat…)
      • Varied input/output (SQL, CSV, Excel, …)
      • Flexible time series processing
      • Visualization
  13. Environment
      • Versions
        • Python 3.5.2
        • pandas 0.19.0rc1
        • statsmodels 0.8.0rc1
      • Namespaces
          import numpy as np
          import pandas as pd
          import matplotlib.pyplot as plt
          import statsmodels.api as sm
  14. Time series data processing with pandas

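  15. What is time series data?
      "A series of values obtained by observing the change of some phenomenon over time, either continuously or discontinuously at fixed intervals" (from Wikipedia)
      • But what does real-world data look like?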

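  16. Real-world data
      • The required frequency differs
        • e.g. analyzing daily data on a monthly basis
      • It is not periodic
        • e.g. analyzing logs recorded each time an event occurs
      • It is not labeled by time
        • e.g. aggregating raw data that holds datetimes in a column at some frequency before analysis
      Some preprocessing is needed to bring the data into a form that is easy to handle: raw data → time series data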
  17. Preparing time series data
      A list of datetimes:
          import datetime
          values = [datetime.datetime(2001, 1, 1),
                    datetime.datetime(2001, 2, 1),
                    datetime.datetime(2001, 3, 1)]
      One-dimensional data labeled by datetime (Series):
          s = pd.Series(np.arange(3), index=values)
          s
          2001-01-01    0
          2001-02-01    1
          2001-03-01    2
          dtype: int64
      Two-dimensional data labeled by datetime (DataFrame); 商品A and 商品B mean "Product A" and "Product B":
          df = pd.DataFrame({'商品A': [25, 27, 30],
                             '商品B': [10, 15, 17]}, index=values)
          df
  18. Labels of time series data
      The labels (index) have a datetime type, DatetimeIndex:
          df.index
          DatetimeIndex(['2001-01-01', '2001-02-01', '2001-03-01'], dtype='datetime64[ns]', freq=None)
      Selecting a column:
          df['商品A']
          2001-01-01    25
          2001-02-01    27
          2001-03-01    30
          Name: 商品A, dtype: int64
  19. Preparing data
      • What we want to do
        • 1. Attach datetime labels to data
        • 2. Parse arbitrary datetime formats

  20. Generating sequences (pd.date_range)
      Data without datetime labels:
          s = pd.Series(np.arange(3))
          s
          0    0
          1    1
          2    2
          dtype: int64
      Create 3 monthly points starting from 2001-01-01:
          pd.date_range('2001-01-01', freq='M', periods=3)
          DatetimeIndex(['2001-01-31', '2001-02-28', '2001-03-31'], dtype='datetime64[ns]', freq='M')
      Overwrite the labels (index):
          s.index = pd.date_range('2001-01-01', freq='M', periods=3)
          s
          2001-01-31    0
          2001-02-28    1
          2001-03-31    2
          Freq: M, dtype: int64
  21. Frequency String
      • Specifies the frequency of the generated time series
      • 25 kinds in total, including:
      Frequency String   Meaning
      A                  year end
      M                  month end
      W                  week
      D                  day
      H                  hour
      T                  minute
      S                  second
  22. Frequency String
      M: month end
          pd.date_range('2016-01-01', freq='M', periods=3)
          DatetimeIndex(['2016-01-31', '2016-02-29', '2016-03-31'], dtype='datetime64[ns]', freq='M')
      MS: month start
          pd.date_range('2016-01-01', freq='MS', periods=3)
          DatetimeIndex(['2016-01-01', '2016-02-01', '2016-03-01'], dtype='datetime64[ns]', freq='MS')
      W: week
          pd.date_range('2016-01-01', freq='W', periods=3)
          DatetimeIndex(['2016-01-03', '2016-01-10', '2016-01-17'], dtype='datetime64[ns]', freq='W-SUN')
      W-TUE: week, anchored on Tuesday
          pd.date_range('2016-01-01', freq='W-TUE', periods=3)
          DatetimeIndex(['2016-01-05', '2016-01-12', '2016-01-19'], dtype='datetime64[ns]', freq='W-TUE')
  23. Parsing datetimes (pd.to_datetime)
      • Parses datetime strings quickly
      • C parser → regular expressions → dateutil
          pd.to_datetime(['2016-09-22', '2016-09-23'])
          DatetimeIndex(['2016-09-22', '2016-09-23'], dtype='datetime64[ns]', freq=None)
          pd.to_datetime(['September 22nd, 2016', 'September 22nd, 2016'])
          DatetimeIndex(['2016-09-22', '2016-09-22'], dtype='datetime64[ns]', freq=None)
          pd.to_datetime(['22 Sep 2016', '23 Sep 2016'])
          DatetimeIndex(['2016-09-22', '2016-09-23'], dtype='datetime64[ns]', freq=None)
  24. Parsing datetimes (pd.to_datetime)
      • Flexible parsing is also possible by specifying a format
          pd.to_datetime(['2016年9月22日', '2016年9月23日'])
          ValueError: Unknown string format
          pd.to_datetime(['2016年9月22日', '2016年9月23日'], format='%Y年%m月%d日')
          DatetimeIndex(['2016-09-22', '2016-09-23'], dtype='datetime64[ns]', freq=None)
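Real inputs often mix well-formed dates with junk. The sketch below combines the explicit format from slide 24 with errors="coerce", a pd.to_datetime option that the slides do not use and that is added here as an assumption about what the reader may need; the third input value is invented.

```python
import pandas as pd

raw = ["2016年9月22日", "2016年9月23日", "not a date"]

# The explicit format handles the Japanese date layout; errors="coerce"
# turns values that do not match the format into NaT instead of raising.
parsed = pd.to_datetime(raw, format="%Y年%m月%d日", errors="coerce")
print(parsed)
```

Rows that became NaT can then be inspected or dropped before the values are used as an index.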
  25. Selecting data
      • What we want to do
        • 1. Select a particular datetime
        • 2. Select a particular period
        • 3. Select datetimes that satisfy a condition
  26. Selecting data
          idx = pd.date_range('2016-01-01', freq='D', periods=366)
          df = pd.DataFrame({'商品A': np.random.randint(100, size=366),
                             '商品B': np.random.randint(100, size=366)}, index=idx)
          df
      Select the row for 2016-01-02; the result is a Series:
          df.loc[datetime.datetime(2016, 1, 2)]
          商品A    12
          商品B    64
          Name: 2016-01-02 00:00:00, dtype: int64
      Strings are treated as datetimes:
          df.loc['2016-01-02']
          商品A    12
          商品B    64
          Name: 2016-01-02 00:00:00, dtype: int64
  27. Selecting by slice
      Select 2016-09-22 onward:
          df.loc['2016-09-22':]
      Select from 2016-09-01 through 2016-09-30, every other day:
          df.loc['2016-09-01':'2016-09-30':2]

  28. Selecting by partial string
      The string contains no day, so all of March is selected:
          df['2016-03']
      Select the data for March through May:
          df['2016-03':'2016-05']
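The bare df['2016-03'] form reflects the pandas version used in the talk. Newer releases expect partial-string row selection to go through .loc, which also works on older versions. A minimal sketch, assuming a deterministic daily frame with a single hypothetical column val:

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2016-01-01", freq="D", periods=366)
df = pd.DataFrame({"val": np.arange(366)}, index=idx)

march = df.loc["2016-03"]               # every row in March 2016
spring = df.loc["2016-03":"2016-05"]    # March through the end of May

print(len(march), len(spring))
```

Both endpoints of a partial-string slice are inclusive, so the second selection runs through 2016-05-31.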

  29. Selecting by condition
      Vectorized property access:
          df.index.month
          array([ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, ...
                 12, 12, 12, 12, 12, 12, 12, 12, 12], dtype=int32)
          (df.index.month == 1) | (df.index.month == 3)
          array([ True,  True,  True,  True,  True,  True,  True, ...
                 False, False, False, False, False, False], dtype=bool)
          df.loc[(df.index.month == 1) | (df.index.month == 3)]
  30. Property access
      • Processing that depends on datetime attributes is easy to write
      Property       Property
      year           date
      month          time
      day            dayofyear
      hour           weekofyear
      minute         week
      second         dayofweek
      microsecond    weekday
      nanosecond     weekday_name
      quarter
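The properties listed on slide 30 return one value per index entry, so they combine naturally with boolean selection. A minimal sketch, assuming a hypothetical daily frame for September 2016, that keeps only weekends:

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2016-09-01", freq="D", periods=30)
df = pd.DataFrame({"val": np.arange(30)}, index=idx)

# dayofweek is vectorized over the index: Monday=0 ... Sunday=6.
is_weekend = df.index.dayofweek >= 5
weekends = df.loc[is_weekend]

print(weekends.index)
```

The same pattern works with month, hour, quarter, and the other properties in the table.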
  31. Preprocessing datetime data
      • What we want to do
        • 1. Change the frequency
        • 2. Fill in missing values
        • 3. Compare or compute against preceding and following values
  32. Resampling (.resample)
      • Sample data
        • Cumulative sum of normal random numbers (a random walk)
          idx = pd.date_range('2016-09-22', freq='H', periods=50)
          df = pd.DataFrame({'val': np.random.randn(50)}, index=idx)
          df = df.cumsum()
          df
  33. Resampling (.resample)
      Downsampling:
          df.resample('6H').mean()
      Upsampling:
          df.resample('30T').interpolate()
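Because the slides use a random walk, the effect of each call is hard to check by eye. The sketch below repeats both operations on a deterministic series, assuming values 0..11 at hourly steps. Frequencies are written as offset objects (pd.offsets.Hour, pd.offsets.Minute) rather than the 'H' and 'T' strings from the slides, since the string spelling varies between pandas versions; that substitution is a choice made here, not something from the talk.

```python
import numpy as np
import pandas as pd

# 12 hourly points with values 0..11 (deterministic, unlike the random walk).
idx = pd.date_range("2016-09-22", periods=12, freq=pd.offsets.Hour())
df = pd.DataFrame({"val": np.arange(12.0)}, index=idx)

# Downsampling: each 6-hour bin is reduced to its mean.
down = df.resample(pd.offsets.Hour(6)).mean()

# Upsampling: 30-minute slots are created, the new slots start as NaN,
# and interpolate() fills them linearly between the hourly values.
up = df.resample(pd.offsets.Minute(30)).interpolate()

print(down)
print(up.head())
```

Downsampling needs an aggregation because several rows collapse into one; upsampling needs a fill rule because new rows have no observed value.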

  34. Resampling (.resample)
      • Many kinds of aggregation are available
      Aggregation method   Aggregation method
      ffill                median
      backfill             min
      pad                  ohlc
      fillna               prod
      interpolate          size
      count                sem
      nunique              std
      first                sum
      last                 var
      max
  35. Interpolation (.interpolate)
      • Sample data
        • Values contain missing entries (NaN)
          indexer = np.random.randint(4, size=50) == 1
          df.loc[indexer] = np.nan
          df
  36. Interpolation (.interpolate)
      • Fills in missing values
      • Uses scipy.interpolate internally
          df.interpolate()
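A small worked example makes the default behaviour of interpolate() concrete. This sketch assumes a short hand-written Series with two gaps and relies only on the default linear method:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, 8.0])

# Default method is linear: each gap is filled on a straight line
# between the nearest valid neighbours.
filled = s.interpolate()

print(filled.tolist())
```

The first gap is split into equal steps between 1 and 4, and the second lands halfway between 4 and 8.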

  37. Window functions (.rolling)
      • As with .resample, aggregation methods can be chained
          df.rolling(3).mean()
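The rolling mean from slide 37 can be checked by hand on a short series. A minimal sketch, assuming five hand-picked values:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Window of 3 rows; the first two positions lack a full window and are NaN.
rolled = s.rolling(3).mean()

print(rolled)
```

Each non-missing entry is the mean of the current row and the two rows before it.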

  38. Shift (.shift)
      • Shifts values by the given number of periods
      • Handy for comparing or computing against preceding and following values
          df.shift(periods=1)
  39. Difference (.diff)
      • Takes the difference from the value the given number of periods away
      • Same as df - df.shift()
          df.diff(periods=1)
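The equivalence between .diff and subtracting a shifted copy can be confirmed directly. A minimal sketch, assuming a four-element Series with made-up values:

```python
import pandas as pd

s = pd.Series([10, 13, 11, 18])

shifted = s.shift(periods=1)     # values move down one row; the first becomes NaN
by_hand = s - shifted            # difference from the previous row
built_in = s.diff(periods=1)

print(built_in)
```

Both results start with NaN, because the first row has no predecessor to compare against.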
  40. Using shift (example)
          idx = pd.date_range('2016-09-22 10:00', freq='T', periods=50)
          df = pd.DataFrame({'val': np.repeat([0, 1, 0, 1], [10, 20, 10, 10])}, index=idx)
          df
          # timestamps where val differs from the previous row (the first row always qualifies)
          df.index[df['val'] != df['val'].shift()]
          DatetimeIndex(['2016-09-22 10:00:00', '2016-09-22 10:10:00',
                         '2016-09-22 10:30:00', '2016-09-22 10:40:00'],
                        dtype='datetime64[ns]', freq=None)
  41. Aggregation
      • What we want to do
        • Aggregate raw data that contains datetimes

  42. Aggregating datetime data
      • Sample data
        • Product order data (数量 = quantity, 商品名 = product name, 発注日 = order date)
          df = pd.DataFrame({'数量': np.random.randint(100, size=1000),
                             '商品名': np.random.choice(list('ABC'), 1000),
                             '発注日': np.random.choice(idx, 1000)})
          df
  43. Aggregating datetime data
      • pd.Grouper
        • Grouping by a column name and a frequency
          df.groupby([pd.Grouper(key='発注日', freq='M'), '商品名']).sum()

  44. Property access (.dt)
      Datetime properties are accessible through the dt property:
          df['発注日'].dt.weekday
          0      6
          1      4
          2      6
                ..
          997    6
          998    5
          999    2
          Name: 発注日, dtype: int64
          df.groupby(df['発注日'].dt.weekday).sum()
  45. Cross tabulation (pd.pivot_table)
      • pd.pivot_table combined with
        • pd.Grouper
        • property access
          pd.pivot_table(df, index=pd.Grouper(key='発注日', freq='M'),
                         columns='商品名', values='数量', aggfunc='sum')
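With random sample data the pivot result cannot be verified by hand, so here is the same call on five fixed rows. This is a sketch under assumptions: the frame, its English column names (ordered_at, item, quantity), and its values are hypothetical stand-ins for the order data in the slides, and the bins use month start ('MS') instead of the slides' month end.

```python
import pandas as pd

# Hypothetical order log; column names are English stand-ins for the slides'.
orders = pd.DataFrame({
    "ordered_at": pd.to_datetime(
        ["2016-01-05", "2016-01-20", "2016-01-25", "2016-02-03", "2016-02-10"]),
    "item": ["A", "B", "A", "A", "B"],
    "quantity": [10, 5, 7, 3, 8],
})

# Month-start bins: rows are labeled with the first day of their month.
monthly = pd.Grouper(key="ordered_at", freq="MS")

# Rows: month of the order date. Columns: item. Cells: summed quantity.
table = pd.pivot_table(orders, index=monthly, columns="item",
                       values="quantity", aggfunc="sum")
print(table)
```

Passing a pd.Grouper as the index lets the datetime column stay an ordinary column; no DatetimeIndex has to be set beforehand.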
  46. Calendars
      • pandas.tseries.offsets.CustomBusinessDay
        • Holiday-aware processing
      • japandas (https://github.com/sinhrks/japandas)
        • JapaneseHolidayCalendar
          from pandas.tseries.offsets import CustomBusinessDay
          import japandas
          cal = japandas.JapaneseHolidayCalendar()
          cbd = CustomBusinessDay(calendar=cal)
          idx = pd.DatetimeIndex(['2016-09-20', '2016-09-21', '2016-09-22'])
          idx + cbd
          DatetimeIndex(['2016-09-21', '2016-09-23', '2016-09-23'], dtype='datetime64[ns]', freq=None)
      2016-09-22 is a public holiday
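The same business-day arithmetic can be reproduced without installing japandas by handing the holiday to CustomBusinessDay directly. A minimal sketch, assuming a single hard-coded holiday in place of the full Japanese calendar:

```python
import pandas as pd
from pandas.tseries.offsets import CustomBusinessDay

# The holiday is passed explicitly, so japandas is not required for this sketch.
cbd = CustomBusinessDay(holidays=["2016-09-22"])

idx = pd.DatetimeIndex(["2016-09-20", "2016-09-21", "2016-09-22"])

# Move each date forward by one business day, skipping the holiday.
shifted = idx + cbd
print(shifted)
```

A calendar object such as the one from japandas is the better choice once more than a handful of holidays are involved; depending on the pandas version, the addition may emit a performance warning because it is applied element by element.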
  47. Visualization
      • Plots with the time series frequency adjusted automatically
          idx1 = pd.date_range('2016-09-01', freq='D', periods=50)
          df1 = pd.DataFrame({'val1': np.random.randn(50)}, index=idx1)
          df1.plot()
  48. Visualization
      • Plots with the time series frequency adjusted automatically
      • Adjusts automatically even when the frequencies differ
          idx2 = pd.date_range('2016-09-01', freq='M', periods=3)
          df2 = pd.DataFrame({'val1': np.random.randn(3)}, index=idx2)
          ax = df1.plot()
          df2.plot(ax=ax)
  49. Statistical models for time series data

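  50. Statistical models for time series data
      • Goals
        • Examine relationships between time series
        • Forecast the future
        • Detect change points and anomalies
        • …
      • Points to keep in mind with time series data
        • Is the series influenced by data from earlier points in time?
        • Is there a trend or seasonality?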
  51. Python packages that include time series models
      • Choose a package according to the model you want to use
      • Use R where necessary (rpy2, pypeR)
      Model                    StatsModels     PyFlux
      Statistics and tests     ✅
      ARIMA                    ✅              ✅
      VAR                      ✅              ✅
      GARCH                    ✅ (sandbox)    ✅
      GAS                                      ✅
      StateSpace               ✅ (rc)         ✅
  52. Sample data
      • AirPassengers
        • Monthly international airline passengers (thousands)
        • Univariate, with trend and seasonality
          df = pd.read_csv('airpassengers.csv', index_col=0, parse_dates=[0])
          df
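  53. Decomposing a time series into components
      • Decomposes the series into trend, seasonality, and residual
          res = sm.tsa.seasonal_decompose(df)
          fig = res.plot();
      Plot panels: original data, trend, seasonality, residual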
  54. Statistics
      • Sample autocorrelation (ACF)
        • The covariance between different points in time, standardized
      • Sample partial autocorrelation (PACF)
          fig, axes = plt.subplots(1, 2)
          sm.tsa.graphics.plot_acf(df, ax=axes[0]);
          sm.tsa.graphics.plot_pacf(df, ax=axes[1]);
  55. Statistics
      • For normal random numbers (white noise)
        • No correlation with past values
          wn = pd.Series(np.random.randn(100))
          fig, axes = plt.subplots(1, 2)
          sm.tsa.graphics.plot_acf(wn, ax=axes[0]);
          sm.tsa.graphics.plot_pacf(wn, ax=axes[1]);
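Slides 54 and 55 describe the sample autocorrelation as a standardized covariance between time points. The sketch below computes it by hand with NumPy so the definition is visible; sample_acf is a hypothetical helper written for this illustration, not a statsmodels function, and it uses the common estimator that divides by n at every lag.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation for lags 0..nlags (estimator dividing by n)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    d = x - x.mean()
    gamma0 = np.dot(d, d) / n                  # lag-0 autocovariance (variance)
    # Lag-k autocovariance divided by the lag-0 autocovariance.
    acf = [np.dot(d[: n - k], d[k:]) / n / gamma0 for k in range(nlags + 1)]
    return np.array(acf)

rng = np.random.default_rng(0)
white = rng.standard_normal(1000)    # white noise: no dependence on the past
walk = np.cumsum(white)              # random walk: strongly dependent on the past

acf_white = sample_acf(white, 5)
acf_walk = sample_acf(walk, 5)
print(acf_white.round(3))
print(acf_walk.round(3))
```

For white noise every lag beyond 0 stays close to zero, while the random walk keeps values near 1, which is the contrast the two plots in the slides illustrate.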
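  56. SARIMA model
      • Seasonal autoregressive integrated moving average model
        • Autoregressive integrated moving average model (ARIMA)
        • + seasonal variation (also ARIMA)
      • ARIMA (p, d, q)
        • The d-th order differenced series $y_t$
          • is (weakly) stationary (mean and autocovariance are constant over time)
          • follows the process below
          $y_t = c + \phi_1 y_{t-1} + \dots + \phi_p y_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q}$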
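The process on slide 56 can be simulated term by term. This is a sketch under assumptions: simulate_arma is a hypothetical helper, the noise is standard normal, values before the start of the series are treated as absent, and the coefficients are arbitrary choices that keep the process stationary.

```python
import numpy as np

def simulate_arma(c, phi, theta, n, seed=0):
    """Simulate y_t = c + sum_i phi_i*y_{t-i} + eps_t + sum_j theta_j*eps_{t-j}."""
    rng = np.random.default_rng(seed)
    p, q = len(phi), len(theta)
    eps = rng.standard_normal(n)
    y = np.zeros(n)
    for t in range(n):
        # Autoregressive part: weighted past values of the series itself.
        ar = sum(phi[i] * y[t - 1 - i] for i in range(p) if t - 1 - i >= 0)
        # Moving-average part: weighted past noise terms.
        ma = sum(theta[j] * eps[t - 1 - j] for j in range(q) if t - 1 - j >= 0)
        y[t] = c + ar + eps[t] + ma
    return y

# p = 1, q = 1 with |phi_1| < 1, so the simulated process is stationary.
y = simulate_arma(c=1.0, phi=[0.5], theta=[0.3], n=5000)
print(y.mean())
```

For these coefficients the long-run mean is c / (1 - phi_1) = 2, and the sample mean lands near it. Applying np.cumsum to y d times would give a series whose d-th difference follows this process, which is the "integrated" part of ARIMA.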
  57. Removing the trend
          df.plot()
      First difference; the variance grows over time:
          df.diff().plot()

  58. Log transformation
      Log transform:
          ldf = np.log(df)
          ldf.plot()
          ldf.diff().plot()
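Why the log transform helps can be seen on a series with a known growth rate. A minimal sketch, assuming a hypothetical series that grows 5% per month rather than the AirPassengers data:

```python
import numpy as np
import pandas as pd

# Hypothetical series growing 5% per month: raw differences widen over time.
idx = pd.date_range("2001-01-01", periods=48, freq="MS")
s = pd.Series(100 * 1.05 ** np.arange(48), index=idx)

raw_diff = s.diff().dropna()            # grows with the level of the series
log_diff = np.log(s).diff().dropna()    # constant: log(1.05) at every step

print(raw_diff.iloc[[0, -1]].tolist())
print(log_diff.iloc[[0, -1]].tolist())
```

Differencing the log turns multiplicative growth into a constant step, which is why the spread of the differenced series stops widening after the transform.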

  59. Removing seasonality
          res = sm.tsa.seasonal_decompose(ldf)
      Subtract the seasonal component:
          seasonal_adjust = (ldf - res.seasonal)
          seasonal_adjust.plot()
  60. Unit root test
      • Augmented Dickey-Fuller test (element [1] of the result is the p-value)
      Original data:
          sm.tsa.adfuller(df['Air passengers'])[1]
          0.99188024343764114
      Log-transformed:
          sm.tsa.adfuller(ldf['Air passengers'])[1]
          0.42236677477038415
      Log-transformed + first difference:
          sm.tsa.adfuller(ldf['Air passengers'].diff().dropna())[1]
          0.071120548150854057
      Log-transformed + seasonality removed + first difference:
          sm.tsa.adfuller(seasonal_adjust['Air passengers'].diff().dropna())[1]
          8.0990048658604878e-09
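  61. Estimating a SARIMA model
          mod_seasonal = sm.tsa.SARIMAX(ldf, trend='c', order=(1, 1, 1),
                                        seasonal_order=(0, 1, 2, 12))
          res_seasonal = mod_seasonal.fit()
          res_seasonal.summary()
      order holds the ARIMA parameters and seasonal_order holds the parameters of the seasonal component; the statsmodels rc release is required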
  62. Forecasting from the model
      Forecast 36 periods ahead:
          pred = res_seasonal.forecast(36)
          pred
          1961-01-01    6.110548
          1961-02-01    6.052912
          1961-03-01    6.174690
          ...
          1963-10-01    6.388955
          1963-11-01    6.242262
          1963-12-01    6.345214
          Freq: MS, dtype: float64
      Plot the forecast on top of the original data:
          ax = ldf.plot()
          pred.plot(ax=ax)
  63. Summary
      • How to handle time-series-related processing with pandas
      • How to work with time series models in Python (a first taste)

  64. Development roadmap
      • Plan
        • Release 0.19 (currently rc) → 0.20 → 1.0
      • pandas 1.0
        • API freeze
        • Long Time Support
      • pandas 2.0 (under discussion)
        • Support Python 3.x only
        • Specialize in data of two dimensions or fewer
        • Move the backend to C++ (Apache Arrow)