Exploring data with Pandas

Felipe Pontes

Agenda
- What is pandas?
- Data structures
- Features
- Integrations

github.com/felipemfp/python-day-natal-2017 (https://github.com/felipemfp/python-day-natal-2017)

What is pandas?

"pandas is an open source, BSD-licensed library providing high-performance,
easy-to-use data structures and data analysis tools for the Python programming language"

What is pandas?
- Open-source library
- PANel DAta System
- +11k on GitHub

numpy
- ndarray and matrix
- Access by index
- A single data type

index    0  1  2  3
element  1  3  5  7

In [117]: import numpy as np
          arr = np.array([1, 3, 5, 7], dtype=np.int64)
          arr
Out[117]: array([1, 3, 5, 7])
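As a small sketch of the "single data type" point above: an ndarray is accessed by integer position, and mixing integers with floats upcasts the whole array to one dtype. The variable names below are illustrative only.

```python
import numpy as np

# Same array as on the slide: every element shares one dtype.
arr = np.array([1, 3, 5, 7], dtype=np.int64)

# Access is by integer position only.
first = arr[0]

# Mixing ints and floats does not give a mixed array:
# NumPy upcasts everything to a common dtype (float64 here).
mixed = np.array([1, 2.5])
```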

pandas
- Built on top of numpy
- Access by label (index or column)
- Varied data types

Why?
- Applicable to the real world
- Intuitive data structures
- Batteries included for data preparation, analysis, and exploration

pandas in the ecosystem

Data Structures

In [118]: import pandas as pd

Series
- A labeled, one-dimensional list
- Any data type: integers, strings, decimals, objects

s = pd.Series(data, index=index)

Creating Series
From an np.ndarray

In [119]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
          s
Out[119]:
a   -0.363051
b    0.657138
c   -1.482879
d    0.019643
e   -0.009886
dtype: float64

In [120]: s.index
Out[120]: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [121]: pd.Series(np.random.randn(5))
Out[121]:
0    0.158455
1    0.758866
2   -0.933195
3    0.785129
4    0.420429
dtype: float64

Creating Series
From a dict

In [122]: d = {'a' : 0., 'b' : 1., 'c' : 2.}
          pd.Series(d)
Out[122]:
a    0.0
b    1.0
c    2.0
dtype: float64

In [123]: pd.Series(d, index=['b', 'c', 'd', 'a'])
Out[123]:
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64

NaN means "Not a Number" and is the standard marker for missing values.
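A minimal sketch of how the missing label from the dict example can be detected afterwards. It reuses the same dict and index as the slide; `missing` and `n_missing` are names introduced here for illustration.

```python
import pandas as pd

d = {'a': 0., 'b': 1., 'c': 2.}

# 'd' is requested in the index but absent from the dict,
# so pandas fills that position with NaN.
s = pd.Series(d, index=['b', 'c', 'd', 'a'])

# isna() returns a boolean Series marking the missing entries.
missing = s.isna()
n_missing = int(missing.sum())
```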

Creating Series
From a scalar value

In [124]: pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])
Out[124]:
a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

Series are like np.ndarray

In [125]: s[0]
Out[125]: -0.3630508545612619

In [126]: s[:3]
Out[126]:
a   -0.363051
b    0.657138
c   -1.482879
dtype: float64

In [127]: s[[4, 3, 1]]
Out[127]:
e   -0.009886
d    0.019643
b    0.657138
dtype: float64
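The slide uses plain brackets for positional access. Newer pandas releases steer positional access toward `.iloc` and label access toward `.loc`, so a hedged, more explicit version of the same three lookups could look like this. The values are fixed here instead of random so the result is reproducible.

```python
import pandas as pd

s = pd.Series([10., 20., 30., 40., 50.], index=['a', 'b', 'c', 'd', 'e'])

# Explicitly positional, like indexing an ndarray.
first = s.iloc[0]
head = s.iloc[:3]
picked = s.iloc[[4, 3, 1]]

# Explicitly by label, like indexing a dict.
by_label = s.loc['a']
```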

Series are like dict

In [128]: s['a']
Out[128]: -0.3630508545612619

In [129]: s['e'] = 12.

In [130]: s
Out[130]:
a    -0.363051
b     0.657138
c    -1.482879
d     0.019643
e    12.000000
dtype: float64

In [131]: 'e' in s
Out[131]: True

In [132]: 'f' in s
Out[132]: False

Operations with Series

In [133]: s + s
Out[133]:
a    -0.726102
b     1.314275
c    -2.965758
d     0.039287
e    24.000000
dtype: float64

In [134]: s * 3
Out[134]:
a    -1.089153
b     1.971413
c    -4.448637
d     0.058930
e    36.000000
dtype: float64

In [135]: s[1:] + s[:-1]
Out[135]:
a         NaN
b    1.314275
c   -2.965758
d    0.039287
e         NaN
dtype: float64
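The NaN values in the last result come from label alignment: arithmetic between two Series matches elements by label, not by position, and labels present on only one side produce NaN. A small sketch with fixed values, where `fill_value=0` is one possible way to treat a missing side as zero:

```python
import pandas as pd

s = pd.Series([1., 2., 3.], index=['a', 'b', 'c'])

# Left side has labels b, c; right side has labels a, b.
# Only 'b' exists on both sides, so 'a' and 'c' become NaN.
aligned = s.iloc[1:] + s.iloc[:-1]

# Series.add with fill_value substitutes 0 for the missing side.
filled = s.iloc[1:].add(s.iloc[:-1], fill_value=0)
```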

DataFrame
- A composition of Series
- Columns with different types

df = pd.DataFrame(data, index=index)

Creating DataFrames
From a dict

In [136]: d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
               'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
          df = pd.DataFrame(d)
          df
Out[136]:
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0

Creating DataFrames
From a dict

In [137]: pd.DataFrame(d, index=['d', 'b', 'a'])
Out[137]:
   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0

In [138]: pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
Out[138]:
   two three
d  4.0   NaN
b  2.0   NaN
a  1.0   NaN

Creating DataFrames
From a list of dicts

In [139]: data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
          pd.DataFrame(data2)
Out[139]:
   a   b     c
0  1   2   NaN
1  5  10  20.0

In [140]: pd.DataFrame(data2, index=['first', 'second'])
Out[140]:
        a   b     c
first   1   2   NaN
second  5  10  20.0

Creating DataFrames
From a Series
- Keeps the index
- One column, named after the Series or after the argument passed
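This slide has no code, so here is a hedged sketch of what building a DataFrame from a Series can look like. The Series name `'one'` and the override name `'uno'` are made up for the example.

```python
import pandas as pd

s = pd.Series([1., 2., 3.], index=['a', 'b', 'c'], name='one')

# The index is kept and the single column takes the Series name.
df_named = pd.DataFrame(s)

# to_frame() does the same and lets the column name be overridden.
df_renamed = s.to_frame(name='uno')
```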

Operations with DataFrame

Projection
In [141]: df['one']
Out[141]:
a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

Addition
In [142]: df['three'] = df['one'] * df['two']
          df['flag'] = df['one'] > 2
          df
Out[142]:
   one  two  three   flag
a  1.0  1.0    1.0  False
b  2.0  2.0    4.0  False
c  3.0  3.0    9.0   True
d  NaN  4.0    NaN  False

In [143]: df['foo'] = 'bar'
          df
Out[143]:
   one  two  three   flag  foo
a  1.0  1.0    1.0  False  bar
b  2.0  2.0    4.0  False  bar
c  3.0  3.0    9.0   True  bar
d  NaN  4.0    NaN  False  bar

Operations with DataFrame

Deletion
In [144]: del df['two']
          three = df.pop('three')

In [145]: df
Out[145]:
   one   flag  foo
a  1.0  False  bar
b  2.0  False  bar
c  3.0   True  bar
d  NaN  False  bar

Panel
- A composition of DataFrames
- Deprecated
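Since Panel is marked as deprecated (and was later removed from pandas), one commonly used substitute for "a collection of DataFrames" is a single DataFrame with a MultiIndex. The sketch below is only one possible approach; the keys `item_a` and `item_b` and the level names are hypothetical.

```python
import pandas as pd

df_a = pd.DataFrame({'v': [1, 2]}, index=['r1', 'r2'])
df_b = pd.DataFrame({'v': [3, 4]}, index=['r1', 'r2'])

# Concatenating a dict of frames uses the keys as an outer index level,
# giving a two-level (item, row) index.
stacked = pd.concat({'item_a': df_a, 'item_b': df_b}, names=['item', 'row'])

# Selecting one outer key returns that original frame's rows.
one = stacked.loc['item_b']
```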

Features

Creating objects

In [146]: s = pd.Series([1,3,5,np.nan,6,8])
          s
Out[146]:
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

Creating objects

In [147]: dates = pd.date_range('20130101', periods=6)
          dates
Out[147]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [148]: df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
          df
Out[148]:
                   A         B         C         D
2013-01-01  0.942514  0.457859  1.796804 -0.167335
2013-01-02  1.350054  0.372804  1.179525 -1.497362
2013-01-03  1.396315  1.669783  0.014493  0.069263
2013-01-04 -1.512057 -0.942973 -0.142368 -0.042194
2013-01-05 -0.302740 -1.613105  2.650310  1.237520
2013-01-06  1.338176 -0.302371  0.908058  2.669694

Creating objects

In [149]: df2 = pd.DataFrame({ 'A' : 1.,
                               'B' : pd.Timestamp('20130102'),
                               'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                               'D' : np.array([3] * 4,dtype='int32'),
                               'E' : pd.Categorical(["test","train","test","train"]),
                               'F' : 'foo' })
          df2
Out[149]:
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo

Understanding the data

In [150]: df.head()
Out[150]:
                   A         B         C         D
2013-01-01  0.942514  0.457859  1.796804 -0.167335
2013-01-02  1.350054  0.372804  1.179525 -1.497362
2013-01-03  1.396315  1.669783  0.014493  0.069263
2013-01-04 -1.512057 -0.942973 -0.142368 -0.042194
2013-01-05 -0.302740 -1.613105  2.650310  1.237520

In [151]: df.tail(3)
Out[151]:
                   A         B         C         D
2013-01-04 -1.512057 -0.942973 -0.142368 -0.042194
2013-01-05 -0.302740 -1.613105  2.650310  1.237520
2013-01-06  1.338176 -0.302371  0.908058  2.669694

Understanding the data

In [152]: df.index
Out[152]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [153]: df.columns
Out[153]: Index(['A', 'B', 'C', 'D'], dtype='object')

In [154]: df.values
Out[154]:
array([[ 0.94251405,  0.45785949,  1.79680407, -0.1673346 ],
       [ 1.350054  ,  0.37280432,  1.17952484, -1.49736159],
       [ 1.39631522,  1.66978288,  0.01449329,  0.0692628 ],
       [-1.51205691, -0.94297288, -0.14236803, -0.0421941 ],
       [-0.30273957, -1.61310536,  2.65031003,  1.2375204 ],
       [ 1.33817644, -0.3023708 ,  0.90805789,  2.66969448]])

Understanding the data

In [155]: df.describe()
Out[155]:
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.535377 -0.059667  1.067804  0.378265
std    1.192442  1.157425  1.062803  1.419640
min   -1.512057 -1.613105 -0.142368 -1.497362
25%    0.008574 -0.782822  0.237884 -0.136049
50%    1.140345  0.035217  1.043791  0.013534
75%    1.347085  0.436596  1.642484  0.945456
max    1.396315  1.669783  2.650310  2.669694

Understanding the data

In [156]: df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6 entries, 2013-01-01 to 2013-01-06
Freq: D
Data columns (total 4 columns):
A    6 non-null float64
B    6 non-null float64
C    6 non-null float64
D    6 non-null float64
dtypes: float64(4)
memory usage: 240.0 bytes

Manipulating the data
Getting the transpose

In [157]: df.T
Out[157]:
   2013-01-01 00:00:00  2013-01-02 00:00:00  2013-01-03 00:00:00  2013-01-04 00:00:00  2013-01-05 00:00:00  2013-01-06 00:00:00
A             0.942514             1.350054             1.396315            -1.512057            -0.302740             1.338176
B             0.457859             0.372804             1.669783            -0.942973            -1.613105            -0.302371
C             1.796804             1.179525             0.014493            -0.142368             2.650310             0.908058
D            -0.167335            -1.497362             0.069263            -0.042194             1.237520             2.669694

Manipulating the data
Sorting by index

In [158]: df.sort_index(axis=1, ascending=False)
Out[158]:
                   D         C         B         A
2013-01-01 -0.167335  1.796804  0.457859  0.942514
2013-01-02 -1.497362  1.179525  0.372804  1.350054
2013-01-03  0.069263  0.014493  1.669783  1.396315
2013-01-04 -0.042194 -0.142368 -0.942973 -1.512057
2013-01-05  1.237520  2.650310 -1.613105 -0.302740
2013-01-06  2.669694  0.908058 -0.302371  1.338176

Manipulating the data
Sorting by values

In [159]: df.sort_values(by='B')
Out[159]:
                   A         B         C         D
2013-01-05 -0.302740 -1.613105  2.650310  1.237520
2013-01-04 -1.512057 -0.942973 -0.142368 -0.042194
2013-01-06  1.338176 -0.302371  0.908058  2.669694
2013-01-02  1.350054  0.372804  1.179525 -1.497362
2013-01-01  0.942514  0.457859  1.796804 -0.167335
2013-01-03  1.396315  1.669783  0.014493  0.069263

Selecting data

In [160]: df['A']
Out[160]:
2013-01-01    0.942514
2013-01-02    1.350054
2013-01-03    1.396315
2013-01-04   -1.512057
2013-01-05   -0.302740
2013-01-06    1.338176
Freq: D, Name: A, dtype: float64

Selecting data

In [161]: df[0:3]
Out[161]:
                   A         B         C         D
2013-01-01  0.942514  0.457859  1.796804 -0.167335
2013-01-02  1.350054  0.372804  1.179525 -1.497362
2013-01-03  1.396315  1.669783  0.014493  0.069263

In [162]: df['20130102':'20130104']
Out[162]:
                   A         B         C         D
2013-01-02  1.350054  0.372804  1.179525 -1.497362
2013-01-03  1.396315  1.669783  0.014493  0.069263
2013-01-04 -1.512057 -0.942973 -0.142368 -0.042194

Selecting data
By label

In [163]: df.loc[dates[0]]
Out[163]:
A    0.942514
B    0.457859
C    1.796804
D   -0.167335
Name: 2013-01-01 00:00:00, dtype: float64

Selecting data
By label

In [164]: df.loc[:,['A','B']]
Out[164]:
                   A         B
2013-01-01  0.942514  0.457859
2013-01-02  1.350054  0.372804
2013-01-03  1.396315  1.669783
2013-01-04 -1.512057 -0.942973
2013-01-05 -0.302740 -1.613105
2013-01-06  1.338176 -0.302371

Selecting data
By label

In [165]: df.loc['20130102':'20130104',['A','B']]
Out[165]:
                   A         B
2013-01-02  1.350054  0.372804
2013-01-03  1.396315  1.669783
2013-01-04 -1.512057 -0.942973

Selecting data
By label

In [166]: df.loc['20130102',['A','B']]
Out[166]:
A    1.350054
B    0.372804
Name: 2013-01-02 00:00:00, dtype: float64

In [167]: df.loc[dates[0],'A']
Out[167]: 0.94251404675651695

Selecting data
By position

In [168]: df.iloc[3]
Out[168]:
A   -1.512057
B   -0.942973
C   -0.142368
D   -0.042194
Name: 2013-01-04 00:00:00, dtype: float64

In [169]: df.iloc[3:5,0:2]
Out[169]:
                   A         B
2013-01-04 -1.512057 -0.942973
2013-01-05 -0.302740 -1.613105

Selecting data
By position

In [170]: df.iloc[[1,2,4],[0,2]]
Out[170]:
                   A         C
2013-01-02  1.350054  1.179525
2013-01-03  1.396315  0.014493
2013-01-05 -0.302740  2.650310

In [171]: df.iloc[1:3,:]
Out[171]:
                   A         B         C         D
2013-01-02  1.350054  0.372804  1.179525 -1.497362
2013-01-03  1.396315  1.669783  0.014493  0.069263

Selecting data
By position

In [172]: df.iloc[:,1:3]
Out[172]:
                   B         C
2013-01-01  0.457859  1.796804
2013-01-02  0.372804  1.179525
2013-01-03  1.669783  0.014493
2013-01-04 -0.942973 -0.142368
2013-01-05 -1.613105  2.650310
2013-01-06 -0.302371  0.908058

In [173]: df.iloc[1,1]
Out[173]: 0.37280432445655604
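One detail worth making explicit about the label-based and position-based examples: slices in `.loc` include the end label, while slices in `.iloc` exclude the end position, as in ordinary Python slicing. A small sketch with deterministic values (an `arange` grid instead of the random data used in the talk):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

# Label slice: both endpoints are included, so this yields 3 rows.
by_label = df.loc['20130102':'20130104', ['A', 'B']]

# Position slice: the end is excluded, so this yields 2 rows (positions 1 and 2).
by_pos = df.iloc[1:3, 0:2]
```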

Selecting data
By condition

In [174]: df[df.A > 0]
Out[174]:
                   A         B         C         D
2013-01-01  0.942514  0.457859  1.796804 -0.167335
2013-01-02  1.350054  0.372804  1.179525 -1.497362
2013-01-03  1.396315  1.669783  0.014493  0.069263
2013-01-06  1.338176 -0.302371  0.908058  2.669694

In [175]: df[df > 0]
Out[175]:
                   A         B         C         D
2013-01-01  0.942514  0.457859  1.796804       NaN
2013-01-02  1.350054  0.372804  1.179525       NaN
2013-01-03  1.396315  1.669783  0.014493  0.069263
2013-01-04       NaN       NaN       NaN       NaN
2013-01-05       NaN       NaN  2.650310  1.237520
2013-01-06  1.338176       NaN  0.908058  2.669694

Selecting data
isin()

In [176]: df2 = df.copy()
          df2['E'] = ['one', 'one','two','three','four','three']
          df2
Out[176]:
                   A         B         C         D      E
2013-01-01  0.942514  0.457859  1.796804 -0.167335    one
2013-01-02  1.350054  0.372804  1.179525 -1.497362    one
2013-01-03  1.396315  1.669783  0.014493  0.069263    two
2013-01-04 -1.512057 -0.942973 -0.142368 -0.042194  three
2013-01-05 -0.302740 -1.613105  2.650310  1.237520   four
2013-01-06  1.338176 -0.302371  0.908058  2.669694  three

In [177]: df2[df2['E'].isin(['two','four'])]
Out[177]:
                   A         B         C         D     E
2013-01-03  1.396315  1.669783  0.014493  0.069263   two
2013-01-05 -0.302740 -1.613105  2.650310  1.237520  four

Modifying the data
Adding columns

In [178]: s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))
          s1
Out[178]:
2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64

In [179]: df['F'] = s1

Modifying the data
Changing values

In [180]: df
Out[180]:
                   A         B         C         D    F
2013-01-01  0.942514  0.457859  1.796804 -0.167335  NaN
2013-01-02  1.350054  0.372804  1.179525 -1.497362  1.0
2013-01-03  1.396315  1.669783  0.014493  0.069263  2.0
2013-01-04 -1.512057 -0.942973 -0.142368 -0.042194  3.0
2013-01-05 -0.302740 -1.613105  2.650310  1.237520  4.0
2013-01-06  1.338176 -0.302371  0.908058  2.669694  5.0

Modifying the data
Changing values

In [181]: df.at[dates[0],'A'] = 0
          df.iat[0,1] = 0
          df.loc[:,'D'] = np.array([5] * len(df))
          df
Out[181]:
                   A         B         C  D    F
2013-01-01  0.000000  0.000000  1.796804  5  NaN
2013-01-02  1.350054  0.372804  1.179525  5  1.0
2013-01-03  1.396315  1.669783  0.014493  5  2.0
2013-01-04 -1.512057 -0.942973 -0.142368  5  3.0
2013-01-05 -0.302740 -1.613105  2.650310  5  4.0
2013-01-06  1.338176 -0.302371  0.908058  5  5.0

Modifying the data
Changing values conditionally

In [182]: df2 = df.copy()
          df2
Out[182]:
                   A         B         C  D    F
2013-01-01  0.000000  0.000000  1.796804  5  NaN
2013-01-02  1.350054  0.372804  1.179525  5  1.0
2013-01-03  1.396315  1.669783  0.014493  5  2.0
2013-01-04 -1.512057 -0.942973 -0.142368  5  3.0
2013-01-05 -0.302740 -1.613105  2.650310  5  4.0
2013-01-06  1.338176 -0.302371  0.908058  5  5.0

Modifying the data
Changing values conditionally

In [183]: df2[df2 > 0] = -df2
          df2
Out[183]:
                   A         B         C  D    F
2013-01-01  0.000000  0.000000 -1.796804 -5  NaN
2013-01-02 -1.350054 -0.372804 -1.179525 -5 -1.0
2013-01-03 -1.396315 -1.669783 -0.014493 -5 -2.0
2013-01-04 -1.512057 -0.942973 -0.142368 -5 -3.0
2013-01-05 -0.302740 -1.613105 -2.650310 -5 -4.0
2013-01-06 -1.338176 -0.302371 -0.908058 -5 -5.0
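The in-place boolean assignment above can also be written without mutating the frame, using `DataFrame.mask`, which replaces values where a condition is true. This is offered as an alternative sketch, not as the approach used in the talk; the small frame is made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0.0, 1.5, -2.0], 'F': [np.nan, 1.0, 2.0]})

# Where df > 0, take the value from -df; elsewhere keep the original.
# NaN > 0 is False, so missing values stay missing.
flipped = df.mask(df > 0, -df)
```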

Handling the data

In [184]: df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
          df1.loc[dates[0]:dates[1],'E'] = 1
          df1
Out[184]:
                   A         B         C  D    F    E
2013-01-01  0.000000  0.000000  1.796804  5  NaN  1.0
2013-01-02  1.350054  0.372804  1.179525  5  1.0  1.0
2013-01-03  1.396315  1.669783  0.014493  5  2.0  NaN
2013-01-04 -1.512057 -0.942973 -0.142368  5  3.0  NaN

Handling the data

In [185]: df1
Out[185]:
                   A         B         C  D    F    E
2013-01-01  0.000000  0.000000  1.796804  5  NaN  1.0
2013-01-02  1.350054  0.372804  1.179525  5  1.0  1.0
2013-01-03  1.396315  1.669783  0.014493  5  2.0  NaN
2013-01-04 -1.512057 -0.942973 -0.142368  5  3.0  NaN

In [186]: pd.isnull(df1)
Out[186]:
                A      B      C      D      F      E
2013-01-01  False  False  False  False   True  False
2013-01-02  False  False  False  False  False  False
2013-01-03  False  False  False  False  False   True
2013-01-04  False  False  False  False  False   True

Handling the data

In [187]: df1.dropna(how='any')
Out[187]:
                   A         B         C  D    F    E
2013-01-02  1.350054  0.372804  1.179525  5  1.0  1.0

In [188]: df1.fillna(value=5)
Out[188]:
                   A         B         C  D    F    E
2013-01-01  0.000000  0.000000  1.796804  5  5.0  1.0
2013-01-02  1.350054  0.372804  1.179525  5  1.0  1.0
2013-01-03  1.396315  1.669783  0.014493  5  2.0  5.0
2013-01-04 -1.512057 -0.942973 -0.142368  5  3.0  5.0
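Both calls above treat every column the same way. As a hedged extension, `dropna` accepts a `subset` of columns to check and `fillna` accepts a per-column dict. The frame below is a reduced stand-in for `df1` with only the two columns that contain missing values.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'F': [np.nan, 1.0, 2.0, 3.0],
                    'E': [1.0, 1.0, np.nan, np.nan]})

# Drop rows with a missing value in any column: only one row survives.
complete = df1.dropna(how='any')

# Drop rows only when column F is missing.
has_f = df1.dropna(subset=['F'])

# Fill each column with its own replacement value.
filled = df1.fillna({'F': 0.0, 'E': df1['E'].mean()})
```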

Handling the data
apply()

In [189]: df.apply(np.cumsum)
Out[189]:
                   A         B         C   D     F
2013-01-01  0.000000  0.000000  1.796804   5   NaN
2013-01-02  1.350054  0.372804  2.976329  10   1.0
2013-01-03  2.746369  2.042587  2.990822  15   3.0
2013-01-04  1.234312  1.099614  2.848454  20   6.0
2013-01-05  0.931573 -0.513491  5.498764  25  10.0
2013-01-06  2.269749 -0.815862  6.406822  30  15.0

In [190]: df.apply(lambda x: x.max() - x.min())
Out[190]:
A    2.908372
B    3.282888
C    2.792678
D    0.000000
F    4.000000
dtype: float64

Handling the data
apply()

In [191]: df.apply(lambda x: [max(1, y) for y in x])
Out[191]:
                   A         B         C  D    F
2013-01-01  1.000000  1.000000  1.796804  5  1.0
2013-01-02  1.350054  1.000000  1.179525  5  1.0
2013-01-03  1.396315  1.669783  1.000000  5  2.0
2013-01-04  1.000000  1.000000  1.000000  5  3.0
2013-01-05  1.000000  1.000000  2.650310  5  4.0
2013-01-06  1.338176  1.000000  1.000000  5  5.0
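The element-wise `max(1, y)` above has a vectorized near-equivalent in `DataFrame.clip(lower=1)`. The two differ on missing values: Python's `max(1, nan)` returns 1, which is why the NaN in column F became 1.0 on the slide, whereas `clip` leaves NaN untouched. A sketch on a small made-up frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0.0, 1.35, -1.5], 'F': [np.nan, 1.0, 2.0]})

# Vectorized lower bound; NaN is preserved.
clipped = df.clip(lower=1)

# Element-wise Python max, as in the talk; NaN is replaced by 1.
via_apply = df.apply(lambda col: col.map(lambda y: max(1, y)))
```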

Operating on the data
Statistics

In [192]: df.mean()
Out[192]:
A    0.378292
B   -0.135977
C    1.067804
D    5.000000
F    3.000000
dtype: float64

Operating on the data
Statistics

In [193]: df.mean(1)
Out[193]:
2013-01-01    1.699201
2013-01-02    1.780477
2013-01-03    2.016118
2013-01-04    1.080520
2013-01-05    1.946893
2013-01-06    2.388773
Freq: D, dtype: float64

Operating on the data
Statistics

In [194]: s = pd.Series(np.random.randint(0, 7, size=10))
          s
Out[194]:
0    4
1    4
2    5
3    5
4    4
5    1
6    1
7    5
8    5
9    5
dtype: int64

In [195]: s.value_counts()
Out[195]:
5    5
4    3
1    2
dtype: int64

Operating on the data
Strings

In [196]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
          s
Out[196]:
0       A
1       B
2       C
3    Aaba
4    Baca
5     NaN
6    CABA
7     dog
8     cat
dtype: object

In [197]: s.str.lower()
Out[197]:
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object

Combining data
Merge

In [198]: df = pd.DataFrame(np.random.randn(10, 4))
          df
Out[198]:
          0         1         2         3
0 -0.616159  0.441333  0.644670  2.336805
1 -0.474553 -0.398493  1.424062 -0.877346
2 -0.253474  0.265718 -2.457648 -0.231814
3  0.079574  1.797657 -0.527753  1.333898
4 -0.098189  1.196968  0.249564 -1.927953
5 -1.621114  0.080282 -0.089616  2.160123
6  0.181398  0.837641 -0.553422 -0.998054
7  0.527065  0.750490  0.248069 -0.480715
8  1.673566  0.408616 -0.297147  0.400781
9  0.711653  1.887566 -0.340432  0.284253

Combining data
Merge

In [199]: pieces = [df[:3], df[3:7], df[7:]]
          pd.concat(pieces)
Out[199]:
          0         1         2         3
0 -0.616159  0.441333  0.644670  2.336805
1 -0.474553 -0.398493  1.424062 -0.877346
2 -0.253474  0.265718 -2.457648 -0.231814
3  0.079574  1.797657 -0.527753  1.333898
4 -0.098189  1.196968  0.249564 -1.927953
5 -1.621114  0.080282 -0.089616  2.160123
6  0.181398  0.837641 -0.553422 -0.998054
7  0.527065  0.750490  0.248069 -0.480715
8  1.673566  0.408616 -0.297147  0.400781
9  0.711653  1.887566 -0.340432  0.284253

Combining data
Join

In [200]: left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
          right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})

In [201]: left
Out[201]:
   key  lval
0  foo     1
1  foo     2

In [202]: right
Out[202]:
   key  rval
0  foo     4
1  foo     5

Combining data
Join

In [203]: pd.merge(left, right, on='key')
Out[203]:
   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5
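The four rows in the result come from a many-to-many match: each of the two `foo` rows on the left pairs with each of the two on the right. When duplicate keys are not expected, the `validate` argument of `pd.merge` can be used to fail early instead of silently multiplying rows. A sketch using the same frames; the `raised` flag is introduced here only to show the outcome.

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})

# 2 matching rows on each side give 2 x 2 = 4 rows.
merged = pd.merge(left, right, on='key')

# Asking pandas to check that keys are unique on both sides
# raises MergeError for this data.
try:
    pd.merge(left, right, on='key', validate='one_to_one')
    raised = False
except pd.errors.MergeError:
    raised = True
```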

Combining data
Append

In [204]: df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
          df
Out[204]:
          A         B         C         D
0 -0.309317  0.244977  0.881038  0.138781
1 -1.930083 -0.680669 -1.452549 -0.150607
2  2.116961 -0.544347 -0.604566 -0.793241
3 -0.740517  0.088858 -0.119330  0.475188
4  1.724797  0.682759  0.165117  0.298184
5  0.505510 -1.446364 -1.630013  0.015845
6  0.416764  0.448405  0.746967 -0.059918
7  0.950208 -0.010430  0.167761 -0.325940

Combining data
Append

In [205]: s = df.iloc[3]
          s
Out[205]:
A   -0.740517
B    0.088858
C   -0.119330
D    0.475188
Name: 3, dtype: float64

Combining data
Append

In [206]: df.append(s, ignore_index=True)
Out[206]:
          A         B         C         D
0 -0.309317  0.244977  0.881038  0.138781
1 -1.930083 -0.680669 -1.452549 -0.150607
2  2.116961 -0.544347 -0.604566 -0.793241
3 -0.740517  0.088858 -0.119330  0.475188
4  1.724797  0.682759  0.165117  0.298184
5  0.505510 -1.446364 -1.630013  0.015845
6  0.416764  0.448405  0.746967 -0.059918
7  0.950208 -0.010430  0.167761 -0.325940
8 -0.740517  0.088858 -0.119330  0.475188
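`DataFrame.append` worked when this talk was given but was removed in pandas 2.0. A hedged equivalent for current versions is to turn the row into a one-row frame and use `pd.concat`. The data below is a small deterministic stand-in for the random frame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12.0).reshape(3, 4), columns=['A', 'B', 'C', 'D'])
s = df.iloc[1]

# to_frame() gives a single column; transposing turns it into a single row
# whose columns line up with df. ignore_index renumbers the result 0..n-1.
appended = pd.concat([df, s.to_frame().T], ignore_index=True)
```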

Grouping data

In [207]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                             'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                             'C' : np.random.randn(8),
                             'D' : np.random.randn(8)})
          df
Out[207]:
     A      B         C         D
0  foo    one  0.329304 -1.481057
1  bar    one  1.123981 -1.507624
2  foo    two -0.277149 -0.110525
3  bar  three  1.013107  1.072455
4  foo    two -0.263508  0.635959
5  bar    two -0.210274 -0.015596
6  foo    one -1.456826  1.875040
7  foo  three -1.748910 -0.674866

Grouping data

In [208]: df.groupby('A').sum()
Out[208]:
            C         D
A
bar  1.926814 -0.450765
foo -3.417090  0.244550

Grouping data

In [209]: df.groupby(['A','B']).sum()
Out[209]:
                  C         D
A   B
bar one    1.123981 -1.507624
    three  1.013107  1.072455
    two   -0.210274 -0.015596
foo one   -1.127522  0.393983
    three -1.748910 -0.674866
    two   -0.540658  0.525433
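In the outputs above, the string column B was dropped from the single-key sum automatically. Depending on the pandas version, non-numeric columns may instead be included in the aggregation, so selecting the columns explicitly is a safer sketch. The `agg` call shows one way to compute several statistics per group. Values here are fixed for reproducibility.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'B': ['one', 'one', 'two', 'two'],
                   'C': [1.0, 2.0, 3.0, 4.0],
                   'D': [10.0, 20.0, 30.0, 40.0]})

# Pick the numeric columns explicitly before aggregating.
sums = df.groupby('A')[['C', 'D']].sum()

# Several aggregations at once over a two-level grouping.
stats = df.groupby(['A', 'B'])['C'].agg(['sum', 'mean'])
```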

Integrations

In [211]: import matplotlib.pyplot as plt

Matplotlib

In [212]: ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
          ts = ts.cumsum()
          ts.plot()
Out[212]: <matplotlib.axes._subplots.AxesSubplot at 0x7f184c87aa58>

Matplotlib

In [213]: df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['A', 'B', 'C', 'D'])
          df = df.cumsum()
          plt.figure(); df.plot(); plt.legend(loc='best')
Out[213]: <matplotlib.legend.Legend at 0x7f184c71e898>
<matplotlib.figure.Figure at 0x7f184f51fa58>

CSV

In [214]: df.to_csv('data/foo.csv', index_label='date')

In [215]: df = pd.read_csv('data/foo.csv', index_col='date')
          df.head()
Out[215]:
                   A         B         C         D
date
2000-01-01 -0.184042  0.174102  0.940474 -0.233452
2000-01-02 -0.126755 -0.524672  1.917568  0.571711
2000-01-03 -1.624824 -1.536377  2.817070  0.980146
2000-01-04 -1.535097 -1.367084  2.275337  0.119631
2000-01-05 -0.450131 -1.138307  2.793651  1.284019
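One thing the round trip above does not show: CSV stores dates as text, so reading the file back with only `index_col` gives a plain index of strings. Passing `parse_dates=True` asks `read_csv` to parse the index as dates. The sketch below uses an in-memory buffer instead of the `data/foo.csv` file so it runs without touching disk.

```python
import io

import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 3.0]},
                  index=pd.date_range('2000-01-01', periods=3))

# Write to an in-memory text buffer rather than a file.
buf = io.StringIO()
df.to_csv(buf, index_label='date')

# Without parse_dates the index comes back as strings.
buf.seek(0)
plain = pd.read_csv(buf, index_col='date')

# With parse_dates=True the index is parsed into a DatetimeIndex.
buf.seek(0)
parsed = pd.read_csv(buf, index_col='date', parse_dates=True)
```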

HDF5

In [216]: df.to_hdf('data/foo.h5','df')

In [217]: df = pd.read_hdf('data/foo.h5','df')
          df.head()
Out[217]:
                   A         B         C         D
date
2000-01-01 -0.184042  0.174102  0.940474 -0.233452
2000-01-02 -0.126755 -0.524672  1.917568  0.571711
2000-01-03 -1.624824 -1.536377  2.817070  0.980146
2000-01-04 -1.535097 -1.367084  2.275337  0.119631
2000-01-05 -0.450131 -1.138307  2.793651  1.284019

Excel

In [218]: df.to_excel('data/foo.xlsx', index_label='date', sheet_name='Sheet1')

In [219]: df = pd.read_excel('data/foo.xlsx', 'Sheet1', index_col='date', na_values=['NA'])
          df.head()
Out[219]:
                   A         B         C         D
date
2000-01-01 -0.184042  0.174102  0.940474 -0.233452
2000-01-02 -0.126755 -0.524672  1.917568  0.571711
2000-01-03 -1.624824 -1.536377  2.817070  0.980146
2000-01-04 -1.535097 -1.367084  2.275337  0.119631
2000-01-05 -0.450131 -1.138307  2.793651  1.284019

Thank you! Questions?

Felipe Pontes
@felipemfp
[email protected]

Links
- Intro to Data Structures (http://pandas.pydata.org/pandas-docs/stable/dsintro.html)
- 10 Minutes to pandas (http://pandas.pydata.org/pandas-docs/stable/10min.html)
- PyConJP 2015: pandas internals by Sinhrks (https://speakerdeck.com/sinhrks/pyconjp-2015-pandas-internals)
- Pandas for Data Analysis by phanhoang17 (https://speakerdeck.com/huyhoang17/pandas-for-data-analysis)
