Slide 1

Slide 1 text

No content

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

In [3]: import pandas as pd
        import numpy as np

Slide 4

Slide 4 text

NaN Gotchas

Slide 5

Slide 5 text

Why are NaN values important?

Slide 6

Slide 6 text

In [4]: s = pd.Series([1, 2, 3, 4, 5, 6], index=list('abcdef'))
        s
Out[4]: a    1
        b    2
        c    3
        d    4
        e    5
        f    6
        dtype: int64

Slide 7

Slide 7 text

In [5]: s.dtype
Out[5]: dtype('int64')

Slide 8

Slide 8 text

In [6]: #3
        s2 = s.reindex(['a', 'b', 'c', 'h', 'e', 'r'])
        s2
Out[6]: a    1.0
        b    2.0
        c    3.0
        h    NaN
        e    5.0
        r    NaN
        dtype: float64

Slide 9

Slide 9 text

In [7]: s.dtype
Out[7]: dtype('int64')

Slide 10

Slide 10 text

In [8]: s2.dtype
Out[8]: dtype('float64')
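The upcast from int64 to float64 seen above can be avoided in a couple of ways. The following is a minimal sketch, not part of the original talk: the variable names `labels`, `upcast`, `filled` and `nullable` are made up for illustration, and the nullable `Int64` dtype is assumed to be available (it arrived in pandas releases newer than the one used on these slides).

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6], index=list("abcdef"))
labels = ["a", "b", "c", "h", "e", "r"]  # 'h' and 'r' are not in s

# Default behaviour: missing labels become NaN, so int64 is upcast to float64.
upcast = s.reindex(labels)

# Option 1: supply an integer fill value, so no NaN is needed.
filled = s.reindex(labels, fill_value=0)

# Option 2 (assumes a pandas version with nullable integers):
# the 'Int64' extension dtype stores missing values as <NA>.
nullable = s.astype("Int64").reindex(labels)

print(upcast.dtype, filled.dtype, nullable.dtype)
```

Which option fits depends on whether a placeholder such as 0 is meaningful for the data; if it is not, the nullable dtype keeps "missing" distinct from any real value.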

Slide 11

Slide 11 text

In [9]: series = pd.Series([1, 2, 3, 4])
        df = pd.DataFrame(index=[3, 4, 5, 6])
        df['col'] = series
        df
Out[9]:    col
        3  4.0
        4  NaN
        5  NaN
        6  NaN

Slide 12

Slide 12 text

In [10]: # Solution
         df['col'] = series.values
         df
Out[10]:    col
         3    1
         4    2
         5    3
         6    4
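The two cells above differ only in whether pandas aligns on index labels during assignment. Here is a small side-by-side sketch under the same setup; the column names `aligned` and `positional` are illustrative, and `to_numpy()` is used as the newer spelling of `.values`.

```python
import pandas as pd

series = pd.Series([1, 2, 3, 4])        # index labels 0, 1, 2, 3
df = pd.DataFrame(index=[3, 4, 5, 6])   # index labels 3, 4, 5, 6

# Assigning a Series aligns on labels: only label 3 exists in both,
# so the other rows are filled with NaN and the column becomes float64.
df["aligned"] = series

# Assigning a plain array ignores labels and fills by position.
df["positional"] = series.to_numpy()

print(df)
```

Positional assignment requires the array length to match the number of rows, which is the case here.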

Slide 13

Slide 13 text

In [11]: df = pd.DataFrame({'col': [1, np.nan]})
         df == np.nan
Out[11]:      col
         0  False
         1  False

Slide 14

Slide 14 text

In [12]: df = pd.DataFrame({'col': [1, np.nan]})
         df.isnull()
Out[12]:      col
         0  False
         1   True
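The comparison with `==` fails because NaN is defined to be unequal to everything, including itself. A short sketch of the usual detection idioms follows; the names `nan_equals_itself`, `mask`, `n_missing` and `cleaned` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col": [1, np.nan]})

# NaN never compares equal, not even to itself.
nan_equals_itself = np.nan == np.nan

# Use isna() (an alias of isnull()) to build a mask of missing values.
mask = df["col"].isna()
n_missing = int(mask.sum())

# Drop the rows that contain missing values.
cleaned = df.dropna()

print(nan_equals_itself, n_missing, len(cleaned))
```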

Slide 15

Slide 15 text

Why?

Slide 16

Slide 16 text

Lack of NA value support in NumPy.
Then why not make it like R?

Slide 17

Slide 17 text

NumPy has far more data types than R.
Pandas replaces every NA value with NaN, which then changes the data type to either float or object.

Slide 18

Slide 18 text

In [13]: Tab = pd.DataFrame(['Promotion dtype for storing NAs', 'No Change', 'No Change',
                             'Cast to float64', 'Cast to object'],
                            index=['Typeclass', 'floating', 'object', 'integer', 'boolean'])
         Tab
Out[13]:                                          0
         Typeclass  Promotion dtype for storing NAs
         floating                         No Change
         object                           No Change
         integer                    Cast to float64
         boolean                     Cast to object
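The promotion table can be checked directly by reindexing one small Series per typeclass so that a missing label is introduced. This is only a sketch: the dictionary names `samples` and `promoted` are made up, and the object Series is created with an explicit `dtype=object` so the result does not depend on how a given pandas version infers string dtypes.

```python
import pandas as pd

# One single-element Series per typeclass from the table.
samples = {
    "floating": pd.Series([1.5]),
    "object": pd.Series(["a"], dtype=object),
    "integer": pd.Series([1]),
    "boolean": pd.Series([True]),
}

# Reindexing to labels [0, 1] introduces one missing value (label 1),
# which forces pandas to pick a dtype that can hold NaN.
promoted = {name: str(s.reindex([0, 1]).dtype) for name, s in samples.items()}

for name, dtype in promoted.items():
    print(f"{name:10s} -> {dtype}")
```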

Slide 19

Slide 19 text

Reindexing Gotchas

Slide 20

Slide 20 text

In [14]: df = pd.DataFrame(np.random.randn(6, 4),
                           columns=['one', 'two', 'three', 'four'],
                           index=list('abcdef'))
         df
Out[14]:         one       two     three      four
         a -0.491977 -0.148958  0.302686 -0.126440
         b -0.481760  0.657904 -0.105482 -1.085279
         c  1.815573 -0.588777 -1.677981 -0.981797
         d -0.061162 -1.439965  0.146472 -0.955991
         e -1.525030  1.510161  0.002977 -1.066108
         f -1.620642 -0.827015 -1.296184  0.230389

Slide 21

Slide 21 text

In [15]: # Note: .ix was removed in pandas 1.0; df.loc[['b', 'c', 'e']] is the label-based equivalent.
         df.ix[['b', 'c', 'e']]
Out[15]:         one       two     three      four
         b -0.481760  0.657904 -0.105482 -1.085279
         c  1.815573 -0.588777 -1.677981 -0.981797
         e -1.525030  1.510161  0.002977 -1.066108

Slide 22

Slide 22 text

In [16]: df.reindex(['b', 'c', 'e'])
Out[16]:         one       two     three      four
         b -0.481760  0.657904 -0.105482 -1.085279
         c  1.815573 -0.588777 -1.677981 -0.981797
         e -1.525030  1.510161  0.002977 -1.066108

Slide 23

Slide 23 text

In [17]: # Note: .ix was removed in pandas 1.0; df.iloc[[1, 2, 4]] is the position-based equivalent.
         df.ix[[1, 2, 4]]
Out[17]:         one       two     three      four
         b -0.481760  0.657904 -0.105482 -1.085279
         c  1.815573 -0.588777 -1.677981 -0.981797
         e -1.525030  1.510161  0.002977 -1.066108

Slide 24

Slide 24 text

In [18]: df.reindex([1, 2, 4])
Out[18]:    one  two  three  four
         1  NaN  NaN    NaN   NaN
         2  NaN  NaN    NaN   NaN
         4  NaN  NaN    NaN   NaN

Slide 25

Slide 25 text

reindex is strictly label-based, so it cannot select rows by integer position.
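On pandas versions where `.ix` is no longer available, the same contrast can be shown with `.loc` (labels) and `.iloc` (positions). The sketch below uses deterministic values from `np.arange` instead of random numbers so the comparison is reproducible; the names `by_label`, `by_position` and `reindexed` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(24).reshape(6, 4),
    columns=["one", "two", "three", "four"],
    index=list("abcdef"),
)

# Label-based selection: rows named 'b', 'c', 'e'.
by_label = df.loc[["b", "c", "e"]]

# Position-based selection: rows at positions 1, 2, 4 (the same rows).
by_position = df.iloc[[1, 2, 4]]

# reindex treats 1, 2, 4 as labels; none exist, so every cell is NaN.
reindexed = df.reindex([1, 2, 4])

print(by_label.equals(by_position))
print(reindexed)
```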

Slide 26

Slide 26 text

In [19]: series = pd.Series([1, 2, 3, 4, 5, 6])
         series
Out[19]: 0    1
         1    2
         2    3
         3    4
         4    5
         5    6
         dtype: int64

Slide 27

Slide 27 text

In [20]: true = pd.Series([True])
         true.dtype
Out[20]: dtype('bool')

Slide 28

Slide 28 text

In [22]: true = pd.Series([True]).reindex_like(series)
         true.dtype
Out[22]: dtype('O')

Slide 29

Slide 29 text

The dtype changes to Python object because reindex_like silently inserts NaNs, and the dtype changes accordingly.
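If the boolean dtype needs to survive, one possibility is to reindex with an explicit fill value, or to use the nullable `boolean` dtype. This is a hedged sketch rather than something shown in the talk: the names `flag`, `as_object`, `kept_bool` and `nullable` are made up, and the nullable dtype is assumed to be available (it was added in pandas releases newer than the one used on these slides).

```python
import pandas as pd

series = pd.Series([1, 2, 3, 4, 5, 6])
flag = pd.Series([True])

# reindex_like inserts NaN for labels 1..5, so bool becomes object.
as_object = flag.reindex_like(series)

# reindex with a boolean fill value needs no NaN, so bool is kept.
kept_bool = flag.reindex(series.index, fill_value=False)

# Assumption: a pandas version with the nullable 'boolean' dtype,
# which stores missing entries as <NA> without falling back to object.
nullable = flag.astype("boolean").reindex_like(series)

print(as_object.dtype, kept_bool.dtype, nullable.dtype)
```

Filling with `False` is only appropriate when "missing" and "False" mean the same thing for the data at hand.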

Slide 30

Slide 30 text

Some other tips
Avoid multithreading while using pandas: it may drop some frames, and DataFrame.copy might skip some frames when handling very large data sets.
Avoid using pandas with BS4 (BeautifulSoup4).

Slide 31

Slide 31 text

Links
http://pandas.pydata.org/pandas-docs/stable/gotchas.html
http://docs.python-guide.org/en/latest/writing/gotchas/#late-binding-closures
https://gist.github.com/manojpandey/41b90cba1fd62095e247d1b2448ef85b
http://pandas.pydata.org/pandas-docs/version/0.19.2/gotchas.html
