Mali Akmanalp - Other people's messy data (and how not to hate it!)

Mali Akmanalp - Other people's messy data (and how not to hate it!)

Have you ever viscerally hated a dataset? Do you want to just get data cleaning out of the way? Are you always left wondering how it consumes most of your time? Whether you work in the sciences, work with government data or scrape websites, data cleaning is a necessary evil. We'll share our woes and check out state of the art in day to day data cleaning tools and strategies.

https://us.pycon.org/2015/schedule/presentation/469/

D5710b3bca38f1233274b4cbc523dc4b?s=128

PyCon 2015

April 18, 2015
Tweet

Transcript

  1. Hell is Other Peopleʹs Data / Mali Akmanalp @makmanalp

  2. Mali Akmanalp / @makmanalp

  3. None
  4. None
  5. i m p o r t p a n d

    a s a s p d
  6. Military surplus gear given to law enforcement https://github.com/TheUpshot/Military‑Surplus‑Gear

  7. Reading Data In

  8. I n [ 1 ] : p d . r

    e a d _ < t a b > p d . r e a d _ c l i p b o a r d p d . r e a d _ f w f p d . r e a d _ h t m l p d . r e a d _ p i c k l e p d . r e a d _ s q l _ t a b l e p d . r e a d _ c s v p d . r e a d _ g b q p d . r e a d _ j s o n p d . r e a d _ s q l p d . r e a d _ s t a t a p d . r e a d _ e x c e l p d . r e a d _ h d f p d . r e a d _ m s g p a c k p d . r e a d _ s q l _ q u e r y p d . r e a d _ t a b l e
  9. Encoding Issues

  10. I n [ 5 ] : p d . r

    e a d _ s t a t a ( " s t a t e _ n a m e s . d t a " ) O u t [ 5 ] : r r _ n a m e 0 1 A g u a s c a l i e n t e s . . . 1 4 1 5 M à © x i c o 1 5 1 6 M i c h o a c à ¡ n d e O c a m p o
  11. I n [ 6 ] : p d . r

    e a d _ s t a t a ( " s t a t e _ n a m e s . d t a " , e n c o d i n g = " u t f - 8 " ) O u t [ 6 ] : r r _ n a m e 0 1 A g u a s c a l i e n t e s . . . 1 4 1 5 M é x i c o 1 5 1 6 M i c h o a c á n d e O c a m p o
  12. > > > i m p o r t c

    h a r d e t > > > c h a r d e t . d e t e c t ( r a w d a t a ) { ' e n c o d i n g ' : ' E U C - J P ' , ' c o n f i d e n c e ' : 0 . 9 9 }
  13. $ f i l e b o s t o

    n _ p y t h o n _ i s _ a w e s o m e . t s v b o s t o n _ p y t h o n _ i s _ a w e s o m e . t s v : U T F - 8 U n i c o d e E n g l i s h t e x t
  14. > > > f r o m u n i

    d e c o d e i m p o r t u n i d e c o d e > > > p r i n t u ' H \ x e b \ x e 4 v y M \ x e b t \ x e 4 l ' H ë ä v y M ë t ä l > > > u n i d e c o d e ( u ' H \ x e b \ x e 4 v y M \ x e b t \ x e 4 l ' ) ' H e a v y M e t a l '
  15. Taking a peek

  16. I n [ 8 9 ] : d f =

    p d . r e a d _ e x c e l ( " 1 0 3 3 - p r o g r a m - f o i a - m a y - 2 0 1 4 . x l s x " )
  17. I n [ 2 4 ] : d f .

    c o l u m n s O u t [ 2 4 ] : I n d e x ( [ u ' S t a t e ' , u ' C o u n t y ' , u ' N S N ' , u ' I t e m N a m e ' , u ' Q u a n t i t y ' , u ' U I ' , u ' A c q u i s i t i o n C o s t ' , u ' S h i p D a t e ' ] , d t y p e = ' o b j e c t ' )
  18. I n [ 6 ] : d f . d

    e s c r i b e ( ) O u t [ 6 ] : Q u a n t i t y A c q u i s i t i o n C o s t c o u n t 7 3 0 2 8 . 0 0 0 0 0 0 7 3 0 2 8 . 0 0 0 0 0 0 m e a n 1 5 . 0 0 6 7 9 2 7 9 6 7 . 5 7 5 4 9 0 s t d 3 8 4 . 6 2 3 9 3 0 1 9 7 2 9 3 . 2 4 3 3 5 6 m i n 1 . 0 0 0 0 0 0 0 . 0 0 0 0 0 0 2 5 % 1 . 0 0 0 0 0 0 5 8 . 7 1 0 0 0 0 5 0 % 1 . 0 0 0 0 0 0 2 0 0 . 0 0 0 0 0 0 7 5 % 5 . 0 0 0 0 0 0 4 9 9 . 0 0 0 0 0 0 m a x 9 1 0 0 0 . 0 0 0 0 0 0 1 8 0 0 0 0 0 0 . 0 0 0 0 0 0
  19. I n [ 9 0 ] : d f .

    i n f o ( ) < c l a s s ' p a n d a s . c o r e . f r a m e . D a t a F r a m e ' > I n t 6 4 I n d e x : 7 3 0 2 8 e n t r i e s , 0 t o 7 3 0 2 7 D a t a c o l u m n s ( t o t a l 8 c o l u m n s ) : S t a t e 7 3 0 2 8 n o n - n u l l o b j e c t C o u n t y 7 3 0 2 8 n o n - n u l l o b j e c t N S N 7 2 9 8 3 n o n - n u l l o b j e c t I t e m N a m e 7 1 7 3 2 n o n - n u l l o b j e c t Q u a n t i t y 7 3 0 2 8 n o n - n u l l i n t 6 4 U I 7 3 0 2 8 n o n - n u l l o b j e c t A c q u i s i t i o n C o s t 7 3 0 2 8 n o n - n u l l f l o a t 6 4 S h i p D a t e 7 3 0 2 8 n o n - n u l l o b j e c t d t y p e s : f l o a t 6 4 ( 1 ) , i n t 6 4 ( 1 ) , o b j e c t ( 6 )
  20. Types are important!

  21. 5 != ʹ5ʹ

  22. 5 != 5.0

  23. Adding nulls make int columns floats! I n [ 3

    2 ] : x = p d . r e a d _ c s v ( S t r i n g I O ( " a , b \ n 5 , 6 . 0 \ n , " ) ) I n [ 3 3 ] : x O u t [ 3 3 ] : a b 0 5 6 1 N a N N a N I n [ 3 4 ] : x . i n f o ( ) < c l a s s ' p a n d a s . c o r e . f r a m e . D a t a F r a m e ' > I n t 6 4 I n d e x : 2 e n t r i e s , 0 t o 1 D a t a c o l u m n s ( t o t a l 2 c o l u m n s ) : a 1 n o n - n u l l f l o a t 6 4 b 1 n o n - n u l l f l o a t 6 4 d t y p e s : f l o a t 6 4 ( 2 ) m e m o r y u s a g e : 4 8 . 0 b y t e s
  24. Merging on different types makes data disappear

  25. Converting types p d . r e a d _

    c s v ( . . . , d t y p e = { " c o l u m n _ 1 " : i n t , " c o l u m n _ 2 " : o b j e c t } ) # o r d f . c o l u m n = d f . c o l u m n . a s t y p e ( i n t )
  26. Kinds of nothing No data available: None Nully values: ʺʺ,

    0
  27. The Semantics of Types

  28. Identifiers (Not numbers!) 617‑555‑1234 721‑07‑1426 01605

  29. Categoricals (Not strings!) red, green, blue, orange A, B, AB,

    0 noun, verb, adjective, adverb
  30. Ordinals (Also not strings!) low, medium, high ★, ★★, ★★★,

    ★★★★ high school, undergrad, graduate
  31. Dealing with missing data

  32. Converting custom N/A values p d . r e a

    d _ c s v ( . . . , n a _ v a l u e s = [ " N / A " , " U n k n o w n " ] ) # o r d f . r e p l a c e ( " N / A " , N o n e )
  33. Dropping nulls O u t [ 6 5 ] :

    a b c d 0 1 2 N a N N a N 1 N a N 3 N a N N a N 2 4 5 N a N N a N
  34. I n [ 6 8 ] : d f .

    d r o p n a ( a x i s = 1 ) O u t [ 6 8 ] : b 0 2 1 3 2 5
  35. I n [ 6 9 ] : d f .

    d r o p n a ( a x i s = 1 , h o w = " a l l " ) O u t [ 6 9 ] : a b 0 1 2 1 N a N 3 2 4 5
  36. I n [ 8 6 ] : d f .

    f i l l n a ( m e t h o d = " b f i l l " ) O u t [ 8 6 ] : a b c d 0 1 2 N a N N a N 1 4 3 N a N N a N 2 4 5 N a N N a N
  37. I n [ 9 0 ] : d f .

    i n t e r p o l a t e ( ) O u t [ 9 0 ] : a b c d 0 1 . 0 2 N a N N a N 1 2 . 5 3 N a N N a N 2 4 . 0 5 N a N N a N
  38. Playing with data

  39. Selecting columns I n [ 5 0 ] : d

    f . S t a t e O u t [ 5 0 ] : 0 A K 1 A K 2 A K . . . I n [ 5 1 ] : d f [ " I t e m N a m e " ] O u t [ 5 1 ] : 0 R I F L E , 5 . 5 6 M I L L I M E T E R 1 R I F L E , 5 . 5 6 M I L L I M E T E R 2 R I F L E , 5 . 5 6 M I L L I M E T E R 3 R I F L E , 5 . 5 6 M I L L I M E T E R 4 R I F L E , 5 . 5 6 M I L L I M E T E R . . .
  40. Selecting multiple columns I n [ 5 4 ] :

    d f [ [ " S t a t e " , " C o u n t y " ] ] O u t [ 5 4 ] : S t a t e C o u n t y 0 A K A N C H O R A G E 1 A K A N C H O R A G E 2 A K A N C H O R A G E
  41. Queries I n [ 5 8 ] : d f

    [ " A c q u i s i t i o n C o s t " ] > 1 0 0 0 0 0 O u t [ 5 8 ] : 0 F a l s e 1 F a l s e 2 F a l s e 3 F a l s e 4 F a l s e 5 F a l s e . . .
  42. Queries I n [ 5 9 ] : d f

    [ d f [ " A c q u i s i t i o n C o s t " ] > 1 0 0 0 0 0 ] O u t [ 5 9 ] : S t a t e C o u n t y N S N \ 1 4 6 A K A N C H O R A G E 2 3 5 5 - D S - C O M - B T V 2 5 1 5 A L B A L D W I N 2 3 2 0 - 0 1 - 0 4 7 - 8 7 5 4 6 6 3 A L B L O U N T 2 3 2 0 - 0 1 - 0 4 7 - 8 7 5 4 6 7 4 A L B L O U N T 2 3 2 0 - 0 1 - 2 3 0 - 0 3 0 4 6 9 3 A L B L O U N T 2 3 5 5 - 0 1 - 5 5 5 - 0 9 0 8 . . . I n [ 6 0 ] : d f [ d f [ " A c q u i s i t i o n C o s t " ] > 1 0 0 0 0 0 ] . c o u n t ( ) O u t [ 6 0 ] : S t a t e 6 4 6 C o u n t y 6 4 6 N S N 6 4 5 I t e m N a m e 6 4 1 . . .
  43. What were the highest cost items? I n [ 3

    8 ] : d f . s o r t ( " A c q u i s i t i o n C o s t " , a s c e n d i n g = F a l s e ) \ [ [ " I t e m N a m e " , " A c q u i s i t i o n C o s t " ] ] O u t [ 3 8 ] : I t e m N a m e A c q u i s i t i o n C o s t 6 5 7 5 4 A I R C R A F T , R O T A R Y W I N G 1 8 0 0 0 0 0 0 6 5 7 5 3 A I R C R A F T , R O T A R Y W I N G 1 8 0 0 0 0 0 0 6 5 7 6 0 A I R C R A F T , R O T A R Y W I N G 1 8 0 0 0 0 0 0 6 5 7 5 9 A I R C R A F T , R O T A R Y W I N G 1 8 0 0 0 0 0 0 6 5 7 5 8 A I R C R A F T , R O T A R Y W I N G 1 8 0 0 0 0 0 0 6 5 7 5 7 A I R C R A F T , R O T A R Y W I N G 1 8 0 0 0 0 0 0 6 5 7 5 6 A I R C R A F T , R O T A R Y W I N G 1 8 0 0 0 0 0 0 6 5 7 5 5 A I R C R A F T , R O T A R Y W I N G 1 8 0 0 0 0 0 0 4 8 0 7 4 A I R C R A F T , R O T A R Y W I N G 6 5 0 0 0 0 0 1 9 5 2 4 A I R P L A N E , C A R G O - T R A N S P O R T 5 3 4 0 0 0 0 6 5 6 7 8 A I R C R A F T , F I X E D W I N G 4 3 1 7 0 6 7 2 1 5 0 R A D A R S U R V E I L L A N C E C E N T R A L 1 5 2 6 7 0 3
  44. What were the largest quantities of items? I n [

    1 5 ] : d f . s o r t ( " Q u a n t i t y " , a s c e n d i n g = F a l s e ) \ [ [ " I t e m N a m e " , " Q u a n t i t y " , " U I " ] ] O u t [ 1 5 ] : I t e m N a m e Q u a n t i t y U I 5 2 5 3 0 W I R E , E L E C T R I C A L 9 1 0 0 0 F o o t 3 6 2 2 1 S C R E W , C A P , S O C K E T H E A D 4 3 8 2 2 E a c h 5 2 3 9 9 S T R A P , T I E D O W N , E L E C T R I C A L C O M P O N E N T S 6 0 0 0 E a c h 5 2 5 3 6 C A B L E C O A X 6 0 0 0 F T 3 9 1 8 9 R U B B E R S H E E T , S O L I D 6 0 0 0 E a c h
  45. value_counts() is awesome I n [ 1 6 ] :

    d f . U I . v a l u e _ c o u n t s ( ) O u t [ 1 6 ] : E a c h 5 1 5 8 1 E A 1 3 3 7 0 P a i r 1 3 8 1 P R 1 2 8 5 U n k n o w n 9 8 2 K i t 7 3 2 S e t 4 6 3 K T 4 4 1 S E 3 6 0
  46. I n [ 1 7 ] : d f =

    d f . r e p l a c e ( { " U I " : [ " E A " , " E A C H " , " P R " . . . ] } , { " U I " : [ " E a c h " , " E a c h " , " P a i r " . . . ] } ) I n [ 1 8 ] : d f = d f . r e p l a c e ( { " U I " : " U n k n o w n " } , { " U I " : n p . N a N } )
  47. I n [ 2 2 ] : d f .

    U I . v a l u e _ c o u n t s ( ) [ 2 0 : ] O u t [ 2 2 ] : A s s o r t m e n t 1 0 G L 9 C a n 8 O T 7 B D 7 . . . J R 1 E 4 1 S k e i n 1 U U 1 S P 1 L e n g t h : 5 4 , d t y p e : i n t 6 4
  48. d o n t c a r e = d

    f . U I . v a l u e _ c o u n t s ( ) [ 2 0 : ] d f = d f [ ~ d f . U I . i s i n ( d o n t c a r e . i n d e x ) ]
  49. I n [ 4 6 ] : d f .

    s o r t ( " Q u a n t i t y " , a s c e n d i n g = F a l s e ) \ . . . . : [ [ " I t e m N a m e " , " Q u a n t i t y " , " U I " ] ] O u t [ 4 6 ] : I t e m N a m e Q u a n t i t y U I 5 2 5 3 0 W I R E , E L E C T R I C A L 9 1 0 0 0 F o o t 3 6 2 2 1 S C R E W , C A P , S O C K E T H E A D 4 3 8 2 2 E a c h 5 2 3 9 9 S T R A P , T I E D O W N , E L E C T R I C A L C O M P O N E N T S 6 0 0 0 E a c h 5 2 5 3 6 C A B L E C O A X 6 0 0 0 F o o t
  50. Regexes and other string functions > > > d f

    . c o l u m n [ d f . c o l u m n . s t r . c o n t a i n s ( " ( \ d { 4 } - \ d { 2 } - \ d { 2 } ) " ) ] > > > d f [ " I t e m N a m e " ] . s t r . l o w e r ( )
  51. Pivoting I n [ 1 ] : d f O

    u t [ 1 ] : d a t e v a r i a b l e v a l u e 0 2 0 0 0 - 0 1 - 0 3 A 0 . 4 6 9 1 1 2 1 2 0 0 0 - 0 1 - 0 4 A - 0 . 2 8 2 8 6 3 2 2 0 0 0 - 0 1 - 0 5 A - 1 . 5 0 9 0 5 9 3 2 0 0 0 - 0 1 - 0 3 B - 1 . 1 3 5 6 3 2 4 2 0 0 0 - 0 1 - 0 4 B 1 . 2 1 2 1 1 2 5 2 0 0 0 - 0 1 - 0 5 B - 0 . 1 7 3 2 1 5 . . . .
  52. Pivoting I n [ 3 ] : d f .

    p i v o t ( i n d e x = ' d a t e ' , c o l u m n s = ' v a r i a b l e ' , v a l u e s = ' v a l u e ' ) O u t [ 3 ] : v a r i a b l e A B C D d a t e 2 0 0 0 - 0 1 - 0 3 0 . 4 6 9 1 1 2 - 1 . 1 3 5 6 3 2 0 . 1 1 9 2 0 9 - 2 . 1 0 4 5 6 9 2 0 0 0 - 0 1 - 0 4 - 0 . 2 8 2 8 6 3 1 . 2 1 2 1 1 2 - 1 . 0 4 4 2 3 6 - 0 . 4 9 4 9 2 9 2 0 0 0 - 0 1 - 0 5 - 1 . 5 0 9 0 5 9 - 0 . 1 7 3 2 1 5 - 0 . 8 6 1 8 4 9 1 . 0 7 1 8 0 4
  53. Transforming with functions I n [ 1 2 8 ]

    : d f 4 O u t [ 1 2 8 ] : o n e t h r e e t w o a - 0 . 6 2 6 5 4 4 N a N - 0 . 3 5 1 5 8 7 b - 0 . 1 3 8 8 9 4 - 0 . 1 7 7 2 8 9 1 . 1 3 6 2 4 9 c 0 . 0 1 1 6 1 7 0 . 4 6 2 2 1 5 - 0 . 4 4 8 7 8 9 d N a N 1 . 1 2 4 4 7 2 - 1 . 1 0 1 5 5 8 I n [ 1 2 9 ] : f = l a m b d a x : l e n ( s t r ( x ) ) I n [ 1 3 0 ] : d f 4 [ ' o n e ' ] . m a p ( f ) O u t [ 1 3 0 ] : a 1 4 b 1 5 c 1 5 d 3 N a m e : o n e , d t y p e : i n t 6 4
  54. Merging Datasets

  55. Bane of everyoneʹs existence Make sure the types match Check

    dataframe size before and after Try with how=inner and how=outer
  56. I n [ 8 3 ] : a g e

    s O u t [ 8 3 ] : a g e c a t s 0 1 s i m o n 1 2 p h o e b e 2 3 n o r m a n I n [ 8 4 ] : w e i g h t s O u t [ 8 4 ] : c a t s w e i g h t 0 s i m o n 1 1 n o r m a n 2
  57. I n [ 8 6 ] : p d .

    m e r g e ( a g e s , w e i g h t s , l e f t _ o n = " c a t s " , r i g h t _ o n = " c a t s " ) O u t [ 8 6 ] : a g e c a t s w e i g h t 0 1 s i m o n 1 1 3 n o r m a n 2
  58. I n [ 8 8 ] : p d .

    m e r g e ( a g e s , w e i g h t s , l e f t _ o n = " c a t s " , r i g h t _ o n = " c a t s " , h o w = " o u t e r " ) O u t [ 8 8 ] : a g e c a t s w e i g h t 0 1 s i m o n 1 1 2 p h o e b e N a N 2 3 n o r m a n 2
  59. fuzzywuzzy and jellyfish

  60. > > > f r o m f u z

    z y w u z z y i m p o r t f u z z > > > f u z z . r a t i o ( " t h i s i s a t e s t " , " t h i s i s a t e s t ! " ) 9 6
  61. Graphs!

  62. ggplot bokeh seaborn matplotlib glue

  63. PDFs suck!

  64. Scaling up

  65. Donʹt let notebooks get in the way of reusable code

  66. Sometimes repeatability matters Build tools: Make, tup OKFN Bubbles, ETLs,

    Hadoop, Storm, etc
  67. Be conscious about what you load into memory p d

    . r e a d _ c s v ( . . . , u s e c o l s = [ " b l a h " ] ) p d . r e a d _ c s v ( . . . , i t e r a t o r = T r u e , c h u n k s i z e = 1 0 0 0 0 0 )
  68. Push things down into the pandas / numpy layer

  69. Takeaways

  70. Use the proper types for things

  71. Data has a tendency to be used in unanticipated ways

  72. Documentation matters

  73. Fix data before you need it fixed!

  74. Data cleaning is a necessary evil

  75. Thank you! @makmanalp