NLTK vs Twitter [Max Thayer]

A voyage into linguistic frontiers

PyCon Canada

August 10, 2013

Transcript

  1. NLTK VS TWITTER: A VOYAGE INTO LINGUISTIC FRONTIERS
    Created by Max Thayer / @garbados
    Project: https://github.com/garbados/flask_listen/tree/pycon
    Slides: http://nltk-pyconca.maxthayer.org/
  2. FINDINGS
    Different parts of speech evolve together.
    Different parts of speech react to different events.
    In the last fifteen years, English went bonkers.
  3. TOOLS
    Tweepy: A Twitter library for Python.
    Requests: The best HTTP library for Python.
    Flask: A puny web framework.
    Cloudant: A CouchDB-like NoSQL database service.
    Geonames: Web service for manipulating geographic data.
  4. WORKER PROCESS
    Tweepy listens to the public Twitter stream.
    Geonames adds geo detail.
    Insert tweet into Cloudant.

        # Tweepy code sample from `listen.py`
        def listen():
            # custom listener for getting geonames data and inserting into cloudant
            l = CloudantListener()
            # oauth!
            auth = OAuthHandler(Config.consumer_key, Config.consumer_secret)
            auth.set_access_token(Config.access_token, Config.access_token_secret)
            # give twitter plenty of time before we timeout
            stream = Stream(auth, l, timeout=36000000)
            # begin listening!
            stream.sample()

        if __name__ == '__main__':
            listen()
  5. APP PROCESS
    Requests gets the number of tweets from Cloudant.
    Client JavaScript updates count every few seconds.

        # contents of `app.py`
        import flask
        import requests
        import os
        from config import Config  # custom settings

        app = flask.Flask(__name__)

        def get_count():
            # get the current number of tweets in the database
            url = '/'.join([Config.db_url, '_all_docs']) + '?limit=0'
            r = requests.get(url)
            return r.json()['total_rows']

        @app.route('/count')
        def count():
            return flask.jsonify({"count": get_count()})

        @app.route('/')
        def index():
            return flask.render_template('index.html', count=get_count())

        if __name__ == '__main__':
            port = int(os.environ.get('PORT', 5000))
            app.run(port=port)
  6. TOKENIZING
    Given this:

        meaning while I'm just here like waiting for this stuff to come off http://t.co/HRout5G3Uo

    Let's do this:

        import cloudant  # helper for interacting with Cloudant
        from nltk import word_tokenize  # word tokenizer

        r = cloudant.view('nltk', 'language', **dict(
            stale="ok",
            reduce="false",
            key='"en"',
            limit=1
        ))
        tweets = map(word_tokenize, [row['value'] for row in r['rows']])
        print tweets[0]
  7. RESULT

        [u'meaning', u'while', u'I', u"'m", u'just', u'here', u'like',
         u'waiting', u'for', u'this', u'stuff', u'to', u'come', u'off',
         u'http', u':', u'//t.co/HRout5G3Uo']
  8. PART OF SPEECH TAGGING

        # continuing from the last sample...
        from nltk import pos_tag

        tagged_tweets = map(pos_tag, tweets)
        print tagged_tweets[0]
  9. RESULT

        [(u'meaning', 'VBG'), (u'while', 'IN'), (u'I', 'PRP'), (u"'m", 'VBP'),
         (u'just', 'RB'), (u'here', 'RB'), (u'like', 'IN'), (u'waiting', 'VBG'),
         (u'for', 'IN'), (u'this', 'DT'), (u'stuff', 'NN'), (u'to', 'TO'),
         (u'come', 'VB'), (u'off', 'RP'), (u'http', 'NN'), (u':', ':'),
         (u'//t.co/HRout5G3Uo', 'JJ')]

    Penn Treebank Part of Speech Tags
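The tag codes in this result come from the Penn Treebank tag set. As a minimal sketch (Python 3, standard library only), the snippet below tallies the tags from that one tagged tweet and prints a short gloss for each. The `TAGGED` list is copied from the slide; `GLOSS` and `tag_counts` are illustrative names and are not part of the talk's project.

```python
from collections import Counter

# Tagged tokens copied from the slide's output (u'' prefixes dropped for Python 3).
TAGGED = [
    ("meaning", "VBG"), ("while", "IN"), ("I", "PRP"), ("'m", "VBP"),
    ("just", "RB"), ("here", "RB"), ("like", "IN"), ("waiting", "VBG"),
    ("for", "IN"), ("this", "DT"), ("stuff", "NN"), ("to", "TO"),
    ("come", "VB"), ("off", "RP"), ("http", "NN"), (":", ":"),
    ("//t.co/HRout5G3Uo", "JJ"),
]

# Short glosses for the Penn Treebank tags that appear above.
GLOSS = {
    "VBG": "verb, gerund or present participle",
    "IN": "preposition or subordinating conjunction",
    "PRP": "personal pronoun",
    "VBP": "verb, non-3rd person singular present",
    "RB": "adverb",
    "DT": "determiner",
    "NN": "noun, singular or mass",
    "TO": "to",
    "VB": "verb, base form",
    "RP": "particle",
    "JJ": "adjective",
    ":": "punctuation (colon, dash, ellipsis)",
}


def tag_counts(tagged):
    """Count how often each part-of-speech tag occurs in a tagged sentence."""
    return Counter(tag for _, tag in tagged)


if __name__ == "__main__":
    for tag, n in tag_counts(TAGGED).most_common():
        print("%-4s %2d  %s" % (tag, n, GLOSS.get(tag, "?")))
```

Note that the URL fragment `//t.co/HRout5G3Uo` is tagged `JJ` (adjective), an early sign of how tweet-specific tokens trip up the tagger.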
  10. TAGGING VS HASHTAGS
    Given this tweet...

        Most Extreme Elimination Challenge=Best show ever #rightyouareken

    It tokenizes to this:

        [u'Most', u'Extreme', u'Elimination', u'Challenge=Best', u'show',
         u'ever', u'#', u'rightyouareken']

    ...and tags to this:

        [(u'Most', 'JJS'), (u'Extreme', 'NNP'), (u'Elimination', 'NNP'),
         (u'Challenge=Best', 'NNP'), (u'show', 'NN'), (u'ever', 'RB'),
         (u'#', '#'), (u'rightyouareken', 'VBN')]
  11. WORD_TOKENIZE
    Alias for TreebankWordTokenizer
  12. WORDPUNCT_TOKENIZE

        [u'meaning', u'while', u'I', u"'", u'm', u'just', u'here', u'like',
         u'waiting', u'for', u'this', u'stuff', u'to', u'come', u'off',
         u'http', u'://', u't', u'.', u'co', u'/', u'HRout5G3Uo']
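Why does this tokenizer shatter the URL and the contraction? NLTK documents `wordpunct_tokenize` as splitting text into runs of word characters and runs of punctuation, using the pattern `\w+|[^\w\s]+`. Assuming that pattern, the behaviour can be approximated with the standard library alone. This is a sketch in Python 3, not NLTK's own code, and `wordpunct_like` is a made-up name.

```python
import re

# Pattern assumed from NLTK's documentation of WordPunctTokenizer:
# a run of word characters, or a run of non-word, non-space characters.
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")

TWEET = ("meaning while I'm just here like waiting for this stuff "
         "to come off http://t.co/HRout5G3Uo")


def wordpunct_like(text):
    """Split text into alternating word and punctuation tokens."""
    return WORDPUNCT_PATTERN.findall(text)


if __name__ == "__main__":
    print(wordpunct_like(TWEET))
```

Because every punctuation run becomes its own token, `I'm` splits into three pieces and the URL into seven.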
  13. PUNKT

        [u'meaning', u'while', u'I', u"'m", u'just', u'here', u'like',
         u'waiting', u'for', u'this', u'stuff', u'to', u'come', u'off',
         u'http', u':', u'//t.co/HRout5G3Uo']
  14. REGEXPTOKENIZER
    Pattern: \w+|[=]|\S+

        [u'meaning', u'while', u'I', u"'m", u'just', u'here', u'like',
         u'waiting', u'for', u'this', u'shit', u'to', u'come', u'off',
         u'http', u'://t.co/HRout5G3Uo']
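The pattern on this slide reads as three alternatives tried in order: a run of word characters, a lone `=`, or any run of non-space characters. The sketch below applies it with Python 3's `re.findall` rather than NLTK's `RegexpTokenizer`, on the assumption that the two agree for a simple non-gap pattern like this one; `regexp_like` is an illustrative name.

```python
import re

# The pattern from the slide: words, a lone "=", or any other non-space run.
SLIDE_PATTERN = re.compile(r"\w+|[=]|\S+")

HASHTAG_TWEET = "Most Extreme Elimination Challenge=Best show ever #rightyouareken"
URL_TWEET = ("meaning while I'm just here like waiting for this stuff "
             "to come off http://t.co/HRout5G3Uo")


def regexp_like(text):
    """Tokenize text with the slide's pattern using re.findall."""
    return SLIDE_PATTERN.findall(text)


if __name__ == "__main__":
    print(regexp_like(HASHTAG_TWEET))
    print(regexp_like(URL_TWEET))
```

On the hashtag tweet from earlier in the deck, this keeps `#rightyouareken` as one token and splits `Challenge=Best` into three, which is presumably why `[=]` appears in the pattern.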
  15. BACK TO TAGGING OR NOT

        Tagged 10,000 tweets in 199.993138075 seconds
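Put another way, that timing works out to roughly 50 tweets per second, or about 20 ms per tweet. The sketch below (Python 3) does the arithmetic and shows one way such a measurement could be taken. The `timed` helper and the `stand_in_tagger` are hypothetical; the stand-in tags every token as `NN` so the example runs without NLTK, and its speed says nothing about `pos_tag`.

```python
import time

# Figures copied from the slide.
TWEETS_TAGGED = 10000
SECONDS_TAKEN = 199.993138075


def throughput(n_items, seconds):
    """Return (items per second, milliseconds per item)."""
    return n_items / seconds, 1000.0 * seconds / n_items


def timed(func, items):
    """Apply func to every item and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = [func(item) for item in items]
    return results, time.perf_counter() - start


def stand_in_tagger(text):
    # Hypothetical placeholder for a real tagger: labels every token "NN".
    return [(token, "NN") for token in text.split()]


if __name__ == "__main__":
    per_second, ms_each = throughput(TWEETS_TAGGED, SECONDS_TAKEN)
    print("%.1f tweets/s, %.1f ms per tweet" % (per_second, ms_each))
    _, elapsed = timed(stand_in_tagger, ["just here like waiting"] * 1000)
    print("stand-in tagger: %.4f s for 1000 tweets" % elapsed)
```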
  16. COLLECTIONS

        import collections

        # given 1000 tweets...
        tokens = map(RegexpTokenizer("\w+|[=]|\S+").tokenize, tweets)
        all_tokens = reduce(lambda x, y: x + y, tokens)
        untags = collections.Counter(all_tokens)
        print untags.most_common(10)

        >>> [(u'.', 3540), (u'I', 3384), (u'RT', 2875), (u'to', 2084), (u'the', 1995), (u'y
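One detail in this sample matters once the corpus grows: `reduce(lambda x, y: x + y, tokens)` builds a new list at every step, so the total copying grows roughly quadratically with the number of tweets. A possible alternative, sketched here in Python 3 with the standard library only, streams tokens straight into the counter. `count_tokens` is an illustrative name, and `re.findall` stands in for `RegexpTokenizer`.

```python
import re
from collections import Counter
from itertools import chain

# Same pattern as the slide's RegexpTokenizer.
TOKEN_PATTERN = re.compile(r"\w+|[=]|\S+")


def count_tokens(tweets):
    """Count tokens across all tweets without building one big list."""
    token_lists = (TOKEN_PATTERN.findall(text) for text in tweets)
    return Counter(chain.from_iterable(token_lists))


if __name__ == "__main__":
    sample = ["RT I like it .", "I do ."]
    print(count_tokens(sample).most_common(3))
```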
  17. HTML

        <div id="map"></div>

    Plus dependencies:
    d3.v3.min.js
    queue.v1.min.js
    topojson.js
  18. JS SETUP

        // our world map and its dimensions / scale
        var projection = d3.geo.mercator()
            .translate([480, 300])
            .scale(970);
        // object for handling series of coordinates
        var path = d3.geo.path()
            .projection(projection);

    SVG

        // mapping our map to the #map DOM element
        var svg = d3.select("#map").append("svg")
            .attr("width", width)
            .attr("height", height);
        // setup our tooltips
        var tooltip = d3.select("#map").append("div")
            .attr("class", "tooltip");
  19. QUEUE

        // load our resources in order
        queue()
            // topojson of the world
            .defer(d3.json, "static/maps/world-110m.json")
            // TSV of nation names and their IDs
            .defer(d3.tsv, "static/maps/world-country-names.tsv")
            // as long as it's json, you can grab dynamic content too :O
            .defer(d3.json, "view/geo?group_level=1&stale=ok")
            .await(ready);
  20. READY

        function ready(error, world, names, counts_rows) { ... }
  21. DATA

        var country = svg.selectAll(".country").data(countries);
        country
            .enter()
            .insert("path")
            .attr("class", "country")
            .attr("title", function(d, i) { return d.name; })
            .attr("d", path)
            .style("fill", function(d, i) { return color(d.count); })
  22. COLOR SCALES

        var color = d3.scale.log()
            .domain([1, most])
            .range(['black', 'blue']);
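A log scale is used here so that countries with modest tweet counts do not all render as black next to the largest one. The Python 3 sketch below shows the idea under some assumptions: it interpolates only the blue channel between black (`#000000`) and blue (`#0000ff`), and it clamps counts into `[1, most]`, which is a choice made for this example rather than something the slide specifies. `log_color` is a hypothetical name.

```python
import math


def log_color(count, most):
    """Map a count in [1, most] to a hex colour from black to blue, log-scaled."""
    if most <= 1:
        raise ValueError("most must be greater than 1")
    # Clamp so that log() is defined and the result stays inside the range.
    clamped = min(max(count, 1), most)
    # log(1) == 0, so t runs from 0.0 at count=1 to 1.0 at count=most.
    t = math.log(clamped) / math.log(most)
    blue = int(round(255 * t))
    return "#0000{:02x}".format(blue)


if __name__ == "__main__":
    for count in (1, 10, 100, 1000):
        print(count, log_color(count, 1000))
```

With `most` set to 1000, each tenfold increase in count adds about a third of the full blue intensity.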
  23. WRAPPING UP
    Language is complicated.
    Gathering data from Twitter is painless.
    ... but NLTK performance becomes an issue quickly.
    d3.js makes pretty graphs easy.