Exploring data with Pandas

Python Day Natal 2017

Felipe Pontes

December 02, 2017
Transcript

  1. pandas is an open source, BSD-licensed library providing high-performance,

    easy-to-use data structures and data analysis tools for the Python programming language
  2. numpy: ndarray and matrix. Access by index. A single data type.

    index:   0 1 2 3
    element: 1 3 5 7
    In [117]: import numpy as np
              arr = np.array([1, 3, 5, 7], dtype=np.int64)
              arr
    Out[117]: array([1, 3, 5, 7])
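The two properties named on this slide (access by index, one data type per array) can be checked with a minimal sketch. The variable names `first`, `last_two`, and `mixed` are illustrative and do not come from the talk.

```python
import numpy as np

# A homogeneous array: every element shares the same dtype.
arr = np.array([1, 3, 5, 7], dtype=np.int64)

first = arr[0]        # positional access, as with a list
last_two = arr[-2:]   # slicing returns another ndarray with the same dtype

# Mixing ints and floats does not give a mixed array:
# NumPy upcasts everything to one common dtype (float64 here).
mixed = np.array([1, 2.5, 3])

print(first, last_two, mixed.dtype)
```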
  3. Why? Applicable to the real world. Intuitive data structures.

    Batteries included for data preparation, analysis, and exploration.
  4. Series: a labeled one-dimensional list. Holds any data type:

    integers, strings, decimals, objects. s = pd.Series(data, index=index)
  5. Creating Series: from an np.ndarray

    In [119]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
              s
    Out[119]:
    a   -0.363051
    b    0.657138
    c   -1.482879
    d    0.019643
    e   -0.009886
    dtype: float64
    In [120]: s.index
    Out[120]: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
    In [121]: pd.Series(np.random.randn(5))
    Out[121]:
    0    0.158455
    1    0.758866
    2   -0.933195
    3    0.785129
    4    0.420429
    dtype: float64
  6. Creating Series: from a dict

    In [122]: d = {'a' : 0., 'b' : 1., 'c' : 2.}
              pd.Series(d)
    Out[122]:
    a    0.0
    b    1.0
    c    2.0
    dtype: float64
    In [123]: pd.Series(d, index=['b', 'c', 'd', 'a'])
    Out[123]:
    b    1.0
    c    2.0
    d    NaN
    a    0.0
    dtype: float64
    NaN means "Not a Number" and is the standard marker for missing values.
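A small sketch of what happens when the requested index contains a label the dict does not have. The `missing` variable and the use of `isna()` are additions for illustration; the dict and index are the ones from the slide.

```python
import pandas as pd

d = {'a': 0., 'b': 1., 'c': 2.}

# The explicit index decides both the order and which labels exist.
# 'd' is not a key of the dict, so its value becomes NaN.
s = pd.Series(d, index=['b', 'c', 'd', 'a'])

# isna() marks the missing entries with True.
missing = s.isna()

print(s)
print(missing)
```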
  7. Creating Series: from a scalar value

    In [124]: pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])
    Out[124]:
    a    5.0
    b    5.0
    c    5.0
    d    5.0
    e    5.0
    dtype: float64
  8. Series are like np.ndarray

    In [125]: s[0]
    Out[125]: -0.3630508545612619
    In [126]: s[:3]
    Out[126]:
    a   -0.363051
    b    0.657138
    c   -1.482879
    dtype: float64
    In [127]: s[[4, 3, 1]]
    Out[127]:
    e   -0.009886
    d    0.019643
    b    0.657138
    dtype: float64
  9. Series are like dict

    In [128]: s['a']
    Out[128]: -0.3630508545612619
    In [129]: s['e'] = 12.
    In [130]: s
    Out[130]:
    a    -0.363051
    b     0.657138
    c    -1.482879
    d     0.019643
    e    12.000000
    dtype: float64
    In [131]: 'e' in s
    Out[131]: True
    In [132]: 'f' in s
    Out[132]: False
  10. Operations with Series

    In [133]: s + s
    Out[133]:
    a    -0.726102
    b     1.314275
    c    -2.965758
    d     0.039287
    e    24.000000
    dtype: float64
    In [134]: s * 3
    Out[134]:
    a    -1.089153
    b     1.971413
    c    -4.448637
    d     0.058930
    e    36.000000
    dtype: float64
    In [135]: s[1:] + s[:-1]
    Out[135]:
    a         NaN
    b    1.314275
    c   -2.965758
    d    0.039287
    e         NaN
    dtype: float64
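The NaN values at both ends of Out[135] come from label alignment: pandas matches operands by index label, not by position, and a label present on only one side has nothing to be added to. The sketch below uses made-up values and `iloc` for the positional slices; `Series.add(..., fill_value=0)` is shown as one possible way to treat the absent side as zero.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0], index=['a', 'b', 'c', 'd'])

# Left operand has labels b, c, d; right operand has labels a, b, c.
# Only b and c exist on both sides, so a and d come out as NaN.
total = s.iloc[1:] + s.iloc[:-1]

# fill_value substitutes 0 for a label missing on one side.
filled = s.iloc[1:].add(s.iloc[:-1], fill_value=0)

print(total)
print(filled)
```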
  11. Creating DataFrames: from a dict

    In [136]: d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
                   'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
              df = pd.DataFrame(d)
              df
    Out[136]:
       one  two
    a  1.0  1.0
    b  2.0  2.0
    c  3.0  3.0
    d  NaN  4.0
  12. Creating DataFrames: from a dict

    In [137]: pd.DataFrame(d, index=['d', 'b', 'a'])
    Out[137]:
       one  two
    d  NaN  4.0
    b  2.0  2.0
    a  1.0  1.0
    In [138]: pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
    Out[138]:
       two three
    d  4.0   NaN
    b  2.0   NaN
    a  1.0   NaN
  13. Creating DataFrames: from a list of dicts

    In [139]: data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
              pd.DataFrame(data2)
    Out[139]:
       a   b     c
    0  1   2   NaN
    1  5  10  20.0
    In [140]: pd.DataFrame(data2, index=['first', 'second'])
    Out[140]:
            a   b     c
    first   1   2   NaN
    second  5  10  20.0
  14. Creating DataFrames: from a Series. Keeps the index.

    One column, named after the Series or after the argument passed.
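This slide has no code, so here is a minimal sketch of both naming cases. The Series name `price` and the column name `cost` are hypothetical, and `Series.to_frame(name=...)` is one way (not necessarily the speaker's) to pass the column name explicitly.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'], name='price')

# The single column takes the name of the Series; the index is kept.
df_named = pd.DataFrame(s)

# Alternatively, pass the column name explicitly.
df_renamed = s.to_frame(name='cost')

print(df_named)
print(df_renamed)
```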
  15. Operations with DataFrame: selection and addition

    Selection
    In [141]: df['one']
    Out[141]:
    a    1.0
    b    2.0
    c    3.0
    d    NaN
    Name: one, dtype: float64
    Addition
    In [142]: df['three'] = df['one'] * df['two']
              df['flag'] = df['one'] > 2
              df
    Out[142]:
       one  two  three   flag
    a  1.0  1.0    1.0  False
    b  2.0  2.0    4.0  False
    c  3.0  3.0    9.0   True
    d  NaN  4.0    NaN  False
  16. In [143]: df['foo'] = 'bar'
                df

    Out[143]:
       one  two  three   flag  foo
    a  1.0  1.0    1.0  False  bar
    b  2.0  2.0    4.0  False  bar
    c  3.0  3.0    9.0   True  bar
    d  NaN  4.0    NaN  False  bar
  17. Operations with DataFrame: deletion

    In [144]: del df['two']
    In [145]: three = df.pop('three')
              df
    Out[145]:
       one   flag  foo
    a  1.0  False  bar
    b  2.0  False  bar
    c  3.0   True  bar
    d  NaN  False  bar
  18. Creating objects

    In [146]: s = pd.Series([1,3,5,np.nan,6,8])
              s
    Out[146]:
    0    1.0
    1    3.0
    2    5.0
    3    NaN
    4    6.0
    5    8.0
    dtype: float64
  19. Creating objects In [147]: In [148]: dates = pd.date_range('20130101', periods=6)

    dates df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD')) df Out[147]: DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', '2013-01-05', '2013-01-06'], dtype='datetime64[ns]', freq='D') Out[148]: A B C D 2013-01-01 0.942514 0.457859 1.796804 -0.167335 2013-01-02 1.350054 0.372804 1.179525 -1.497362 2013-01-03 1.396315 1.669783 0.014493 0.069263 2013-01-04 -1.512057 -0.942973 -0.142368 -0.042194 2013-01-05 -0.302740 -1.613105 2.650310 1.237520 2013-01-06 1.338176 -0.302371 0.908058 2.669694
  20. Creating objects In [149]: df2 = pd.DataFrame({ 'A' : 1.,

    'B' : pd.Timestamp('20130102'), 'C' : pd.Series(1,index=list(range(4)),dtype='float32'), 'D' : np.array([3] * 4,dtype='int32'), 'E' : pd.Categorical(["test","train","test","train"]), 'F' : 'foo' }) df2 Out[149]: A B C D E F 0 1.0 2013-01-02 1.0 3 test foo 1 1.0 2013-01-02 1.0 3 train foo 2 1.0 2013-01-02 1.0 3 test foo 3 1.0 2013-01-02 1.0 3 train foo
  21. Understanding the data In [150]: In [151]: df.head() df.tail(3) Out[150]:

    A B C D 2013-01-01 0.942514 0.457859 1.796804 -0.167335 2013-01-02 1.350054 0.372804 1.179525 -1.497362 2013-01-03 1.396315 1.669783 0.014493 0.069263 2013-01-04 -1.512057 -0.942973 -0.142368 -0.042194 2013-01-05 -0.302740 -1.613105 2.650310 1.237520 Out[151]: A B C D 2013-01-04 -1.512057 -0.942973 -0.142368 -0.042194 2013-01-05 -0.302740 -1.613105 2.650310 1.237520 2013-01-06 1.338176 -0.302371 0.908058 2.669694
  22. Understanding the data In [152]: In [153]: In [154]: df.index

    df.columns df.values Out[152]: DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', '2013-01-05', '2013-01-06'], dtype='datetime64[ns]', freq='D') Out[153]: Index(['A', 'B', 'C', 'D'], dtype='object') Out[154]: array([[ 0.94251405, 0.45785949, 1.79680407, -0.1673346 ], [ 1.350054 , 0.37280432, 1.17952484, -1.49736159], [ 1.39631522, 1.66978288, 0.01449329, 0.0692628 ], [-1.51205691, -0.94297288, -0.14236803, -0.0421941 ], [-0.30273957, -1.61310536, 2.65031003, 1.2375204 ], [ 1.33817644, -0.3023708 , 0.90805789, 2.66969448]])
  23. Understanding the data In [155]: df.describe() Out[155]: A B C

    D count 6.000000 6.000000 6.000000 6.000000 mean 0.535377 -0.059667 1.067804 0.378265 std 1.192442 1.157425 1.062803 1.419640 min -1.512057 -1.613105 -0.142368 -1.497362 25% 0.008574 -0.782822 0.237884 -0.136049 50% 1.140345 0.035217 1.043791 0.013534 75% 1.347085 0.436596 1.642484 0.945456 max 1.396315 1.669783 2.650310 2.669694
  24. Understanding the data In [156]: df.info() <class 'pandas.core.frame.DataFrame'> DatetimeIndex: 6

    entries, 2013-01-01 to 2013-01-06 Freq: D Data columns (total 4 columns): A 6 non-null float64 B 6 non-null float64 C 6 non-null float64 D 6 non-null float64 dtypes: float64(4) memory usage: 240.0 bytes
  25. Manipulating the data: getting the transpose In [157]: df.T Out[157]:

    2013-01-01 00:00:00 2013-01-02 00:00:00 2013-01-03 00:00:00 2013-01-04 00:00:00 2013-01-05 00:00:00 2013-01-06 00:00:00 A 0.942514 1.350054 1.396315 -1.512057 -0.302740 1.338176 B 0.457859 0.372804 1.669783 -0.942973 -1.613105 -0.302371 C 1.796804 1.179525 0.014493 -0.142368 2.650310 0.908058 D -0.167335 -1.497362 0.069263 -0.042194 1.237520 2.669694
  26. Manipulating the data: sorting by index In [158]: df.sort_index(axis=1, ascending=False) Out[158]:

    D C B A 2013-01-01 -0.167335 1.796804 0.457859 0.942514 2013-01-02 -1.497362 1.179525 0.372804 1.350054 2013-01-03 0.069263 0.014493 1.669783 1.396315 2013-01-04 -0.042194 -0.142368 -0.942973 -1.512057 2013-01-05 1.237520 2.650310 -1.613105 -0.302740 2013-01-06 2.669694 0.908058 -0.302371 1.338176
  27. Manipulating the data: sorting by values In [159]: df.sort_values(by='B') Out[159]: A

    B C D 2013-01-05 -0.302740 -1.613105 2.650310 1.237520 2013-01-04 -1.512057 -0.942973 -0.142368 -0.042194 2013-01-06 1.338176 -0.302371 0.908058 2.669694 2013-01-02 1.350054 0.372804 1.179525 -1.497362 2013-01-01 0.942514 0.457859 1.796804 -0.167335 2013-01-03 1.396315 1.669783 0.014493 0.069263
  28. Selecting data In [160]: df['A'] Out[160]: 2013-01-01 0.942514 2013-01-02

    1.350054 2013-01-03 1.396315 2013-01-04 -1.512057 2013-01-05 -0.302740 2013-01-06 1.338176 Freq: D, Name: A, dtype: float64
  29. Selecting data In [161]: In [162]: df[0:3] df['20130102':'20130104'] Out[161]:

    A B C D 2013-01-01 0.942514 0.457859 1.796804 -0.167335 2013-01-02 1.350054 0.372804 1.179525 -1.497362 2013-01-03 1.396315 1.669783 0.014493 0.069263 Out[162]: A B C D 2013-01-02 1.350054 0.372804 1.179525 -1.497362 2013-01-03 1.396315 1.669783 0.014493 0.069263 2013-01-04 -1.512057 -0.942973 -0.142368 -0.042194
  30. Selecting data: by label In [163]: df.loc[dates[0]] Out[163]: A

    0.942514 B 0.457859 C 1.796804 D -0.167335 Name: 2013-01-01 00:00:00, dtype: float64
  31. Selecting data: by label In [164]: df.loc[:,['A','B']] Out[164]: A

    B 2013-01-01 0.942514 0.457859 2013-01-02 1.350054 0.372804 2013-01-03 1.396315 1.669783 2013-01-04 -1.512057 -0.942973 2013-01-05 -0.302740 -1.613105 2013-01-06 1.338176 -0.302371
  32. Selecting data: by label In [165]: df.loc['20130102':'20130104',['A','B']] Out[165]: A

    B 2013-01-02 1.350054 0.372804 2013-01-03 1.396315 1.669783 2013-01-04 -1.512057 -0.942973
  33. Selecting data: by label In [166]: In [167]: df.loc['20130102',['A','B']]

    df.loc[dates[0],'A'] Out[166]: A 1.350054 B 0.372804 Name: 2013-01-02 00:00:00, dtype: float64 Out[167]: 0.94251404675651695
  34. Selecting data: by position In [168]: In [169]: df.iloc[3]

    df.iloc[3:5,0:2] Out[168]: A -1.512057 B -0.942973 C -0.142368 D -0.042194 Name: 2013-01-04 00:00:00, dtype: float64 Out[169]: A B 2013-01-04 -1.512057 -0.942973 2013-01-05 -0.302740 -1.613105
  35. Selecting data: by position In [170]: In [171]: df.iloc[[1,2,4],[0,2]]

    df.iloc[1:3,:] Out[170]: A C 2013-01-02 1.350054 1.179525 2013-01-03 1.396315 0.014493 2013-01-05 -0.302740 2.650310 Out[171]: A B C D 2013-01-02 1.350054 0.372804 1.179525 -1.497362 2013-01-03 1.396315 1.669783 0.014493 0.069263
  36. Selecting data: by position In [172]: In [173]: df.iloc[:,1:3]

    df.iloc[1,1] Out[172]: B C 2013-01-01 0.457859 1.796804 2013-01-02 0.372804 1.179525 2013-01-03 1.669783 0.014493 2013-01-04 -0.942973 -0.142368 2013-01-05 -1.613105 2.650310 2013-01-06 -0.302371 0.908058 Out[173]: 0.37280432445655604
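One difference between the label-based and position-based slides is easy to miss: a `loc` slice includes both endpoints, while an `iloc` slice excludes the stop position, as Python slices do. The sketch below uses a small frame with made-up integer values rather than the random data from the talk.

```python
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame({'A': range(6)}, index=dates)

# Label slice: 2013-01-02, 2013-01-03 and 2013-01-04 are all returned.
by_label = df.loc['20130102':'20130104']

# Position slice: rows 1 and 2 only; position 3 is excluded.
by_pos = df.iloc[1:3]

print(by_label)
print(by_pos)
```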
  37. Selecting data: by condition In [174]: In [175]: df[df.A

    > 0] df[df > 0] Out[174]: A B C D 2013-01-01 0.942514 0.457859 1.796804 -0.167335 2013-01-02 1.350054 0.372804 1.179525 -1.497362 2013-01-03 1.396315 1.669783 0.014493 0.069263 2013-01-06 1.338176 -0.302371 0.908058 2.669694 Out[175]: A B C D 2013-01-01 0.942514 0.457859 1.796804 NaN 2013-01-02 1.350054 0.372804 1.179525 NaN 2013-01-03 1.396315 1.669783 0.014493 0.069263 2013-01-04 NaN NaN NaN NaN 2013-01-05 NaN NaN 2.650310 1.237520 2013-01-06 1.338176 NaN 0.908058 2.669694
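The two expressions on this slide behave differently: a boolean Series filters whole rows, while a boolean DataFrame keeps the shape and blanks out the non-matching cells. A minimal sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, -2.0, 3.0], 'B': [-1.0, 5.0, -6.0]})

# Boolean Series: keep only the rows where column A is positive.
rows = df[df['A'] > 0]

# Boolean DataFrame: same shape as df, non-positive cells become NaN.
cells = df[df > 0]

print(rows)
print(cells)
```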
  38. Selecting data: isin() In [176]: In [177]: df2 =

    df.copy() df2['E'] = ['one', 'one','two','three','four','three'] df2 df2[df2['E'].isin(['two','four'])] Out[176]: A B C D E 2013-01-01 0.942514 0.457859 1.796804 -0.167335 one 2013-01-02 1.350054 0.372804 1.179525 -1.497362 one 2013-01-03 1.396315 1.669783 0.014493 0.069263 two 2013-01-04 -1.512057 -0.942973 -0.142368 -0.042194 three 2013-01-05 -0.302740 -1.613105 2.650310 1.237520 four 2013-01-06 1.338176 -0.302371 0.908058 2.669694 three Out[177]: A B C D E 2013-01-03 1.396315 1.669783 0.014493 0.069263 two 2013-01-05 -0.302740 -1.613105 2.650310 1.237520 four
  39. Modifying the data: adding columns In [178]: In [179]:

    s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6)) s1 df['F'] = s1 Out[178]: 2013-01-02 1 2013-01-03 2 2013-01-04 3 2013-01-05 4 2013-01-06 5 2013-01-07 6 Freq: D, dtype: int64
  40. Modifying the data: changing values In [180]: df Out[180]:

    A B C D F 2013-01-01 0.942514 0.457859 1.796804 -0.167335 NaN 2013-01-02 1.350054 0.372804 1.179525 -1.497362 1.0 2013-01-03 1.396315 1.669783 0.014493 0.069263 2.0 2013-01-04 -1.512057 -0.942973 -0.142368 -0.042194 3.0 2013-01-05 -0.302740 -1.613105 2.650310 1.237520 4.0 2013-01-06 1.338176 -0.302371 0.908058 2.669694 5.0
  41. Modifying the data: changing values In [181]: df.at[dates[0],'A'] =

    0 df.iat[0,1] = 0 df.loc[:,'D'] = np.array([5] * len(df)) df Out[181]: A B C D F 2013-01-01 0.000000 0.000000 1.796804 5 NaN 2013-01-02 1.350054 0.372804 1.179525 5 1.0 2013-01-03 1.396315 1.669783 0.014493 5 2.0 2013-01-04 -1.512057 -0.942973 -0.142368 5 3.0 2013-01-05 -0.302740 -1.613105 2.650310 5 4.0 2013-01-06 1.338176 -0.302371 0.908058 5 5.0
  42. Modifying the data: changing values conditionally In [182]:

    df2 = df.copy() df2 Out[182]: A B C D F 2013-01-01 0.000000 0.000000 1.796804 5 NaN 2013-01-02 1.350054 0.372804 1.179525 5 1.0 2013-01-03 1.396315 1.669783 0.014493 5 2.0 2013-01-04 -1.512057 -0.942973 -0.142368 5 3.0 2013-01-05 -0.302740 -1.613105 2.650310 5 4.0 2013-01-06 1.338176 -0.302371 0.908058 5 5.0
  43. Modifying the data: changing values conditionally In [183]:

    df2[df2 > 0] = -df2 df2 Out[183]: A B C D F 2013-01-01 0.000000 0.000000 -1.796804 -5 NaN 2013-01-02 -1.350054 -0.372804 -1.179525 -5 -1.0 2013-01-03 -1.396315 -1.669783 -0.014493 -5 -2.0 2013-01-04 -1.512057 -0.942973 -0.142368 -5 -3.0 2013-01-05 -0.302740 -1.613105 -2.650310 -5 -4.0 2013-01-06 -1.338176 -0.302371 -0.908058 -5 -5.0
  44. Cleaning the data In [184]: df1 = df.reindex(index=dates[0:4], columns=list(df.columns) +

    ['E']) df1.loc[dates[0]:dates[1],'E'] = 1 df1 Out[184]: A B C D F E 2013-01-01 0.000000 0.000000 1.796804 5 NaN 1.0 2013-01-02 1.350054 0.372804 1.179525 5 1.0 1.0 2013-01-03 1.396315 1.669783 0.014493 5 2.0 NaN 2013-01-04 -1.512057 -0.942973 -0.142368 5 3.0 NaN
  45. Cleaning the data In [185]: In [186]: df1 pd.isnull(df1) Out[185]:

    A B C D F E 2013-01-01 0.000000 0.000000 1.796804 5 NaN 1.0 2013-01-02 1.350054 0.372804 1.179525 5 1.0 1.0 2013-01-03 1.396315 1.669783 0.014493 5 2.0 NaN 2013-01-04 -1.512057 -0.942973 -0.142368 5 3.0 NaN Out[186]: A B C D F E 2013-01-01 False False False False True False 2013-01-02 False False False False False False 2013-01-03 False False False False False True 2013-01-04 False False False False False True
  46. Cleaning the data In [187]: In [188]: df1.dropna(how='any') df1.fillna(value=5) Out[187]:

    A B C D F E 2013-01-02 1.350054 0.372804 1.179525 5 1.0 1.0 Out[188]: A B C D F E 2013-01-01 0.000000 0.000000 1.796804 5 5.0 1.0 2013-01-02 1.350054 0.372804 1.179525 5 1.0 1.0 2013-01-03 1.396315 1.669783 0.014493 5 2.0 5.0 2013-01-04 -1.512057 -0.942973 -0.142368 5 3.0 5.0
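A minimal sketch of the two options shown on this slide, on a tiny made-up frame. Both `dropna` and `fillna` return new objects here, so the original frame keeps its missing values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.nan, 3.0], 'y': [np.nan, 5.0, 6.0]})

# how='any' drops every row that has at least one missing value.
complete = df.dropna(how='any')

# fillna replaces each missing value with the given constant.
filled = df.fillna(value=0)

# isna() gives a boolean mask of the missing cells.
mask = df.isna()

print(complete)
print(filled)
```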
  47. Cleaning the data: apply() In [189]: In [190]: df.apply(np.cumsum) df.apply(lambda

    x: x.max() - x.min()) Out[189]: A B C D F 2013-01-01 0.000000 0.000000 1.796804 5 NaN 2013-01-02 1.350054 0.372804 2.976329 10 1.0 2013-01-03 2.746369 2.042587 2.990822 15 3.0 2013-01-04 1.234312 1.099614 2.848454 20 6.0 2013-01-05 0.931573 -0.513491 5.498764 25 10.0 2013-01-06 2.269749 -0.815862 6.406822 30 15.0 Out[190]: A 2.908372 B 3.282888 C 2.792678 D 0.000000 F 4.000000 dtype: float64
  48. Cleaning the data: apply() In [191]: df.apply(lambda x: [max(1, y)

    for y in x]) Out[191]: A B C D F 2013-01-01 1.000000 1.000000 1.796804 5 1.0 2013-01-02 1.350054 1.000000 1.179525 5 1.0 2013-01-03 1.396315 1.669783 1.000000 5 2.0 2013-01-04 1.000000 1.000000 1.000000 5 3.0 2013-01-05 1.000000 1.000000 2.650310 5 4.0 2013-01-06 1.338176 1.000000 1.000000 5 5.0
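For a simple lower bound like the one above, `DataFrame.clip` is a vectorized alternative to a Python-level loop inside `apply`. This is a sketch of that alternative, not the approach used in the talk, and it differs in one respect: `clip` leaves NaN as NaN, whereas Out[191] shows the `max(1, y)` version turning the NaN in column F into 1.0. The data below is made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0.5, 1.5, -2.0], 'B': [3.0, 0.0, 1.0]})

# Every value below 1 is raised to 1; other values are untouched.
clipped = df.clip(lower=1)

# Missing values pass through clip unchanged.
with_nan = pd.Series([np.nan, 0.2]).clip(lower=1)

print(clipped)
print(with_nan)
```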
  49. Operating on the data: statistics In [192]: df.mean() Out[192]: A

    0.378292 B -0.135977 C 1.067804 D 5.000000 F 3.000000 dtype: float64
  50. Operating on the data: statistics In [193]: df.mean(1) Out[193]: 2013-01-01

    1.699201 2013-01-02 1.780477 2013-01-03 2.016118 2013-01-04 1.080520 2013-01-05 1.946893 2013-01-06 2.388773 Freq: D, dtype: float64
  51. Operating on the data: statistics In [194]: In [195]: s

    = pd.Series(np.random.randint(0, 7, size=10)) s s.value_counts() Out[194]: 0 4 1 4 2 5 3 5 4 4 5 1 6 1 7 5 8 5 9 5 dtype: int64 Out[195]: 5 5 4 3 1 2 dtype: int64
  52. Operating on the data: strings In [196]: In [197]: s

    = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat']) s s.str.lower() Out[196]: 0 A 1 B 2 C 3 Aaba 4 Baca 5 NaN 6 CABA 7 dog 8 cat dtype: object Out[197]: 0 a 1 b 2 c 3 aaba 4 baca 5 NaN 6 caba 7 dog 8 cat dtype: object
  53. Merging data: Merge In [198]: df = pd.DataFrame(np.random.randn(10, 4)) df

    Out[198]: 0 1 2 3 0 -0.616159 0.441333 0.644670 2.336805 1 -0.474553 -0.398493 1.424062 -0.877346 2 -0.253474 0.265718 -2.457648 -0.231814 3 0.079574 1.797657 -0.527753 1.333898 4 -0.098189 1.196968 0.249564 -1.927953 5 -1.621114 0.080282 -0.089616 2.160123 6 0.181398 0.837641 -0.553422 -0.998054 7 0.527065 0.750490 0.248069 -0.480715 8 1.673566 0.408616 -0.297147 0.400781 9 0.711653 1.887566 -0.340432 0.284253
  54. Merging data: Merge In [199]: pieces = [df[:3], df[3:7], df[7:]]

    pd.concat(pieces) Out[199]: 0 1 2 3 0 -0.616159 0.441333 0.644670 2.336805 1 -0.474553 -0.398493 1.424062 -0.877346 2 -0.253474 0.265718 -2.457648 -0.231814 3 0.079574 1.797657 -0.527753 1.333898 4 -0.098189 1.196968 0.249564 -1.927953 5 -1.621114 0.080282 -0.089616 2.160123 6 0.181398 0.837641 -0.553422 -0.998054 7 0.527065 0.750490 0.248069 -0.480715 8 1.673566 0.408616 -0.297147 0.400781 9 0.711653 1.887566 -0.340432 0.284253
  55. Merging data: Join In [200]: In [201]: In [202]: left

    = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]}) right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]}) left right Out[201]: key lval 0 foo 1 1 foo 2 Out[202]: key rval 0 foo 4 1 foo 5
  56. Merging data: Join In [203]: pd.merge(left, right, on='key') Out[203]: key

    lval rval 0 foo 1 4 1 foo 1 5 2 foo 2 4 3 foo 2 5
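The four rows in Out[203] appear because the key `foo` occurs twice on each side, so every left row pairs with every right row. The sketch below reproduces that with the slide's frames and then uses the `validate` argument of `pd.merge` to detect the duplicated keys; the `unique_keys` flag is an illustrative name, not something from the talk.

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})

# 2 matching rows on the left x 2 on the right = 4 result rows.
merged = pd.merge(left, right, on='key')

# validate='one_to_one' raises MergeError when keys repeat on either side.
try:
    pd.merge(left, right, on='key', validate='one_to_one')
    unique_keys = True
except pd.errors.MergeError:
    unique_keys = False

print(merged)
print(unique_keys)
```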
  57. Merging data: Append In [204]: df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])

    df Out[204]: A B C D 0 -0.309317 0.244977 0.881038 0.138781 1 -1.930083 -0.680669 -1.452549 -0.150607 2 2.116961 -0.544347 -0.604566 -0.793241 3 -0.740517 0.088858 -0.119330 0.475188 4 1.724797 0.682759 0.165117 0.298184 5 0.505510 -1.446364 -1.630013 0.015845 6 0.416764 0.448405 0.746967 -0.059918 7 0.950208 -0.010430 0.167761 -0.325940
  58. Merging data: Append In [205]: s = df.iloc[3] s Out[205]:

    A -0.740517 B 0.088858 C -0.119330 D 0.475188 Name: 3, dtype: float64
  59. Merging data: Append In [206]: df.append(s, ignore_index=True) Out[206]: A B

    C D 0 -0.309317 0.244977 0.881038 0.138781 1 -1.930083 -0.680669 -1.452549 -0.150607 2 2.116961 -0.544347 -0.604566 -0.793241 3 -0.740517 0.088858 -0.119330 0.475188 4 1.724797 0.682759 0.165117 0.298184 5 0.505510 -1.446364 -1.630013 0.015845 6 0.416764 0.448405 0.746967 -0.059918 7 0.950208 -0.010430 0.167761 -0.325940 8 -0.740517 0.088858 -0.119330 0.475188
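`DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so the call above only runs on older versions. A minimal sketch of the same operation with `pd.concat`, using made-up values and the illustrative names `row` and `appended`:

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0]})
row = df.iloc[1]  # a Series whose index holds the column names

# Turn the Series into a one-row frame (to_frame gives one column,
# .T flips it into one row), then concatenate and renumber the index.
appended = pd.concat([df, row.to_frame().T], ignore_index=True)

print(appended)
```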
  60. Grouping data In [207]: df = pd.DataFrame({'A' : ['foo', 'bar',

    'foo', 'bar', 'foo', 'bar', 'foo', 'foo'], 'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'], 'C' : np.random.randn(8), 'D' : np.random.randn(8)}) df Out[207]: A B C D 0 foo one 0.329304 -1.481057 1 bar one 1.123981 -1.507624 2 foo two -0.277149 -0.110525 3 bar three 1.013107 1.072455 4 foo two -0.263508 0.635959 5 bar two -0.210274 -0.015596 6 foo one -1.456826 1.875040 7 foo three -1.748910 -0.674866
  61. Grouping data In [209]: df.groupby(['A','B']).sum() Out[209]: C D A B

    bar one 1.123981 -1.507624 three 1.013107 1.072455 two -0.210274 -0.015596 foo one -1.127522 0.393983 three -1.748910 -0.674866 two -0.540658 0.525433
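Grouping by two columns produces a two-level index, which is why Out[209] shows A and B stacked on the left. The sketch below uses small integer data instead of the random values from the talk, and adds `as_index=False` as one way to get the group keys back as ordinary columns.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'B': ['one', 'one', 'one', 'two'],
    'C': [1, 2, 3, 4],
})

# Group keys become a two-level (MultiIndex) row index.
totals = df.groupby(['A', 'B']).sum()

# as_index=False keeps A and B as regular columns instead.
flat = df.groupby(['A', 'B'], as_index=False)['C'].sum()

print(totals)
print(flat)
```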
  62. Matplotlib In [212]: ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000)) ts

    = ts.cumsum() ts.plot() Out[212]: <matplotlib.axes._subplots.AxesSubplot at 0x7f184c87aa58>
  63. Matplotlib In [213]: df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['A', 'B',

    'C', 'D']) df = df.cumsum() plt.figure(); df.plot(); plt.legend(loc='best') Out[213]: <matplotlib.legend.Legend at 0x7f184c71e898> <matplotlib.figure.Figure at 0x7f184f51fa58>
  64. CSV In [214]: In [215]: df.to_csv('data/foo.csv', index_label='date') df = pd.read_csv('data/foo.csv',

    index_col='date') df.head() Out[215]: A B C D date 2000-01-01 -0.184042 0.174102 0.940474 -0.233452 2000-01-02 -0.126755 -0.524672 1.917568 0.571711 2000-01-03 -1.624824 -1.536377 2.817070 0.980146 2000-01-04 -1.535097 -1.367084 2.275337 0.119631 2000-01-05 -0.450131 -1.138307 2.793651 1.284019
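One detail of the CSV round trip: dates are written as text, and `read_csv` with only `index_col='date'` reads them back as strings rather than timestamps. The sketch below writes to an in-memory `io.StringIO` buffer instead of the `data/foo.csv` path from the slide, and passes `parse_dates=True` to recover a `DatetimeIndex`. The three-row frame is made up.

```python
import io

import pandas as pd

idx = pd.date_range('2000-01-01', periods=3)
df = pd.DataFrame({'A': [1.0, 2.0, 3.0]}, index=idx)

# Write to an in-memory buffer so the example touches no files.
buf = io.StringIO()
df.to_csv(buf, index_label='date')

buf.seek(0)
as_text = pd.read_csv(buf, index_col='date')  # index holds strings

buf.seek(0)
as_dates = pd.read_csv(buf, index_col='date', parse_dates=True)  # index parsed

print(type(as_text.index).__name__, type(as_dates.index).__name__)
```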
  65. HDF5 In [216]: In [217]: df.to_hdf('data/foo.h5','df') df = pd.read_hdf('data/foo.h5','df') df.head()

    Out[217]: A B C D date 2000-01-01 -0.184042 0.174102 0.940474 -0.233452 2000-01-02 -0.126755 -0.524672 1.917568 0.571711 2000-01-03 -1.624824 -1.536377 2.817070 0.980146 2000-01-04 -1.535097 -1.367084 2.275337 0.119631 2000-01-05 -0.450131 -1.138307 2.793651 1.284019
  66. Excel In [218]: In [219]: df.to_excel('data/foo.xlsx', index_label='date', sheet_name='Sheet1') df =

    pd.read_excel('data/foo.xlsx', 'Sheet1', index_col='date', na_values=['NA']) df.head() Out[219]: A B C D date 2000-01-01 -0.184042 0.174102 0.940474 -0.233452 2000-01-02 -0.126755 -0.524672 1.917568 0.571711 2000-01-03 -1.624824 -1.536377 2.817070 0.980146 2000-01-04 -1.535097 -1.367084 2.275337 0.119631 2000-01-05 -0.450131 -1.138307 2.793651 1.284019
  67. Links

    Intro to Data Structures (http://pandas.pydata.org/pandas-docs/stable/dsintro.html)
    10 Minutes to pandas (http://pandas.pydata.org/pandas-docs/stable/10min.html)
    PyConJP 2015: pandas internals by Sinhrks (https://speakerdeck.com/sinhrks/pyconjp-2015-pandas-internals)
    Pandas for Data Analysis by phanhoang17 (https://speakerdeck.com/huyhoang17/pandas-for-data-analysis)