Machine Learning as a Service: making sentiment predictions in realtime with ZMQ and NLTK

I am a Machine Learning (ML) and Natural Language Processing enthusiast. For my university dissertation I built a realtime sentiment analysis classifier for Twitter. My talk is about the experience and the lessons learned. I will explain how to build a scalable machine learning software-as-a-service, consumable through a REST API. The purpose of this talk is not to dig into the mathematics behind machine learning (I do not have that background), but to show how easy it can be to build an ML SaaS using some of the amazing libraries, such as NLTK, ZMQ and MrJob, that helped me throughout development. The talk offers several benefits: attendees with no ML background will get a solid introduction to the subject and will be able to replicate my project at home, while more experienced attendees will pick up new ideas to put into practice and most probably build a better system than mine! Finally, I will publish a GitHub project with the slides and a finished product.


Daniel Pyrathon

May 24, 2014

Transcript

  1. MACHINE LEARNING AS A SERVICE: MAKING SENTIMENT PREDICTIONS IN REALTIME WITH ZMQ AND NLTK
  2. ABOUT ME

  3. DISSERTATION Let's make something cool!

  4. SOCIAL MEDIA + MACHINE LEARNING + API

  5. SENTIMENT ANALYSIS AS A SERVICE A STEP-BY-STEP GUIDE

  6. FUNDAMENTAL TOPICS: Machine Learning, Natural Language Processing, overview of the platform. THE PROCESS: Prepare, Analyze, Train, Use, Scale.
  7. MACHINE LEARNING. WHAT IS MACHINE LEARNING? A method of teaching computers to make and improve predictions or behaviors based on data. It allows computers to evolve behaviors based on empirical data. The data can be anything: stock market prices, sensors and motors, email metadata.
  8. SUPERVISED MACHINE LEARNING SPAM OR HAM

  9. SUPERVISED MACHINE LEARNING SPAM OR HAM

  10. SUPERVISED MACHINE LEARNING SPAM OR HAM

  11. SUPERVISED MACHINE LEARNING SPAM OR HAM

  12. SUPERVISED MACHINE LEARNING SPAM OR HAM

  13. SUPERVISED MACHINE LEARNING SPAM OR HAM

  14. SUPERVISED MACHINE LEARNING SPAM OR HAM

  15. SUPERVISED MACHINE LEARNING SPAM OR HAM
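
    The spam-or-ham slides show the supervised idea: give the algorithm labelled examples, let it learn the pattern, then ask it to label new, unseen messages. Below is a minimal sketch with NLTK's Naive Bayes; the tiny dataset and word-presence features are invented for illustration, and a real classifier needs far more examples:

    from nltk.classify import NaiveBayesClassifier

    def features(text):
        # Bag of words: every word present in the text becomes a True-valued feature
        return dict((word, True) for word in text.lower().split())

    train = [
        (features("win a free prize now"), "spam"),
        (features("cheap meds buy now"), "spam"),
        (features("lunch tomorrow with the team"), "ham"),
        (features("see you at the meeting"), "ham"),
    ]

    classifier = NaiveBayesClassifier.train(train)
    print(classifier.classify(features("free prize waiting")))  # -> 'spam'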

  16. NATURAL LANGUAGE PROCESSING. WHAT IS NATURAL LANGUAGE PROCESSING? Interactions between computers and human languages; extracting information from text. Some NLTK features: bigrams, part-of-speech tagging, tokenization, stemming, WordNet lookup.
  17. NATURAL LANGUAGE PROCESSING. SOME NLTK FEATURES: tokenization and stopword removal.

    >>> phrase = "I wish to buy specified products or service"
    >>> phrase = nlp.tokenize(phrase)
    >>> phrase
    ['I', 'wish', 'to', 'buy', 'specified', 'products', 'or', 'service']
    >>> phrase = nlp.remove_stopwords(phrase)
    >>> phrase
    ['I', 'wish', 'buy', 'specified', 'products', 'service']
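
    The nlp.tokenize and nlp.remove_stopwords helpers above are project-specific wrappers. Roughly the same steps with stock NLTK look like this (a sketch; the punkt and stopwords corpora must be downloaded once, and the membership test is case-sensitive, which is why 'I' survives as on the slide):

    import nltk
    from nltk.corpus import stopwords

    phrase = "I wish to buy specified products or service"
    tokens = nltk.word_tokenize(phrase)
    english_stopwords = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in english_stopwords]
    print(tokens)  # ['I', 'wish', 'buy', 'specified', 'products', 'service']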
  18. SENTIMENT ANALYSIS

  19. CLASSIFYING TWITTER SENTIMENT IS HARD. Improper language use, spelling mistakes, only 140 characters to express sentiment, different types of English (US, UK, Pidgin). Example (Donnie McClurkin, @Donnieradio, 21 Apr 2014): "Gr8 picutre..God bless u RT @WhatsNextInGosp: Resurrection Sunday Service @PFCNY with @Donnieradio pic.twitter.com/nOgz65cpY5"
  20. BACK TO BUILDING OUR API... FINALLY!

  21. CLASSIFIER 3 STEPS

  22. THE DATASET: SENTIMENT140. 1,600,000 labelled tweets in CSV format. Each row contains the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive) and the text of the tweet ("Lyx is cool").
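
    A minimal sketch of loading the dataset with the csv module. Sentiment140's rows have six columns (polarity, id, date, query, user, text); the file name here is an assumption:

    import csv

    LABELS = {'0': 'neg', '2': 'neutral', '4': 'pos'}

    tweets = []
    with open('sentiment140.csv') as f:
        for polarity, tweet_id, date, query, user, text in csv.reader(f):
            tweets.append((text, LABELS[polarity]))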
  23. FEATURE EXTRACTION. How are we going to find features from a phrase? With a "bag of words" representation:

    my_phrase = "Today was such a rainy and horrible day"

    In [12]: from nltk import word_tokenize
    In [13]: word_tokenize(my_phrase)
    Out[13]: ['Today', 'was', 'such', 'a', 'rainy', 'and', 'horrible', 'day']
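
    Continuing the session above, the token list becomes the bag-of-words feature dictionary that NLTK classifiers expect: every observed word maps to True, and word order is discarded:

    tokens = word_tokenize(my_phrase)
    bag_of_words = dict((word, True) for word in tokens)
    # {'Today': True, 'was': True, 'such': True, 'a': True, 'rainy': True, ...}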
  24. FEATURE EXTRACTION. CREATE A PIPELINE OF FEATURE EXTRACTORS:

    FORMATTER = formatting.FormatterPipeline(
        formatting.make_lowercase,
        formatting.strip_urls,
        formatting.strip_hashtags,
        formatting.strip_names,
        formatting.remove_repetitons,
        formatting.replace_html_entities,
        formatting.strip_nonchars,
        functools.partial(formatting.remove_noise,
                          stopwords=stopwords.words('english') + ['rt']),
        functools.partial(formatting.stem_words,
                          stemmer=nltk.stem.porter.PorterStemmer())
    )
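
    formatting.FormatterPipeline is part of my project rather than NLTK; a minimal sketch of such a pipeline is plain function composition, applying each formatting step to the output of the previous one:

    class FormatterPipeline(object):
        def __init__(self, *steps):
            self.steps = steps

        def __call__(self, phrase):
            # Apply each formatting step in order
            for step in self.steps:
                phrase = step(phrase)
            return phrase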
  25. FEATURE EXTRACTION. PASS THE REPRESENTATION DOWN THE PIPELINE:

    In [11]: feature_extractor.extract("Today was such a rainy and horrible day")
    Out[11]: {'day': True, 'horribl': True, 'raini': True, 'today': True}

    The result is a dictionary of variable length, with features as keys and every value set to True.
  26. DIMENSIONALITY REDUCTION. Remove features that are common across all classes (noise). This increases the performance of the classifier and decreases the size of the model: less memory usage and more speed.
  27. DIMENSIONALITY REDUCTION CHI-SQUARE TEST

  28. DIMENSIONALITY REDUCTION CHI-SQUARE TEST

  29. DIMENSIONALITY REDUCTION CHI-SQUARE TEST

  30. DIMENSIONALITY REDUCTION CHI-SQUARE TEST

  31. DIMENSIONALITY REDUCTION CHI-SQUARE TEST

  32. DIMENSIONALITY REDUCTION. CHI-SQUARE TEST. NLTK gives us BigramAssocMeasures.chi_sq:

    # Calculate the number of words for each class
    pos_word_count = label_word_fd['pos'].N()
    neg_word_count = label_word_fd['neg'].N()
    total_word_count = pos_word_count + neg_word_count

    # For each word and its total occurrence
    for word, freq in word_fd.iteritems():
        # Calculate a score for the positive class
        pos_score = BigramAssocMeasures.chi_sq(
            label_word_fd['pos'][word],
            (freq, pos_word_count),
            total_word_count)

        # Calculate a score for the negative class
        neg_score = BigramAssocMeasures.chi_sq(
            label_word_fd['neg'][word],
            (freq, neg_word_count),
            total_word_count)

        # The sum of the two gives the word's total score
        word_scores[word] = pos_score + neg_score
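
    With word_scores computed, keep only the highest-scoring words. A sketch of how the reduce_features helper used in the later slides might work; the cutoff of 10000 words is a tuning choice, not a fixed rule:

    # Keep the 10000 words with the highest chi-square scores
    best = sorted(word_scores.items(), key=lambda item: item[1], reverse=True)[:10000]
    best_features = set(word for word, score in best)

    def reduce_features(feature_vector, best_features):
        # Drop every feature that is not in the chi-square-selected set
        return dict((word, value) for word, value in feature_vector.items()
                    if word in best_features)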
  33. TRAINING. Now that we can extract features from text, we can train a classifier. The simplest and most flexible learning algorithm for text classification is Naive Bayes:

    P(label | features) = P(label) * P(features | label) / P(features)

    Simple to compute = fast. Assumes feature independence = easy to update. Supports multiclass = scalable.
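
    NaiveBayesClassifier.train expects a list of (feature_dictionary, label) pairs. A sketch of assembling the training set, with the names carried over from the surrounding slides:

    train_feats = [
        (reduce_features(feature_extractor.extract(text), best_features), label)
        for text, label in tweets
    ]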
  34. TRAINING. NLTK provides built-in components: 1. train the classifier; 2. serialize the classifier for later use; 3. train once, use as much as you want.

    >>> from nltk.classify import NaiveBayesClassifier
    >>> nb_classifier = NaiveBayesClassifier.train(train_feats)
    ... wait a lot of time
    >>> nb_classifier.labels()
    ['neg', 'pos']
    >>> serializer.dump(nb_classifier, file_handle)
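
    The serializer on the slide is left unspecified; plain pickle is enough for an NLTK classifier, for example:

    import pickle

    with open('classifier.pickle', 'wb') as file_handle:
        pickle.dump(nb_classifier, file_handle)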
  35. USING THE CLASSIFIER

    # Load the classifier from the serialized file
    classifier = pickle.loads(classifier_file.read())

    # Pick a new phrase
    new_phrase = "At Pycon Italy! Love the food and this speaker is so amazing"

    # 1) Preprocessing
    feature_vector = feature_extractor.extract(new_phrase)

    # 2) Dimensionality reduction, best_features is our set of best words
    reduced_feature_vector = reduce_features(feature_vector, best_features)

    # 3) Classify!
    print classifier.classify(reduced_feature_vector)
    >>> "pos"
  36. BUILDING A CLASSIFICATION API. The classifier is slow, no matter how much optimization is done. Classification is a blocking process, so the API must be event-driven.
  37. BUILDING A CLASSIFICATION API SCALING TOWARDS INFINITY AND BEYOND

  38. BUILDING A CLASSIFICATION API. ZEROMQ: fast (uses native sockets); promotes horizontal scalability; language-agnostic framework.
  39. BUILDING A CLASSIFICATION API. ZEROMQ:

    ...
    socket = context.socket(zmq.REP)
    ...
    while True:
        message = socket.recv()
        phrase = json.loads(message)["text"]

        # 1) Feature extraction
        feature_vector = feature_extractor.extract(phrase)

        # 2) Dimensionality reduction, best_features is our set of best words
        reduced_feature_vector = reduce_features(feature_vector, best_features)

        # 3) Classify!
        result = classifier.classify(reduced_feature_vector)
        socket.send(json.dumps(result))
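
    The matching client side is a REQ socket that sends the phrase as JSON and blocks until the worker replies with the label. A sketch; the endpoint address is an assumption:

    import json
    import zmq

    context = zmq.Context()
    socket = context.socket(zmq.REQ)
    socket.connect("tcp://localhost:5555")

    # Send one phrase and wait for the classification result
    socket.send_string(json.dumps({"text": "At Pycon Italy! Love the food"}))
    print(json.loads(socket.recv_string()))  # -> 'pos' or 'neg'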
  40. DEMO

  41. POST-MORTEM. Real-time sentiment analysis APIs can be implemented, and they can be made scalable. What if we used Redis instead of serialized classifiers? Deep learning is giving very good results in NLP; let's try it!
  42. FIN. QUESTIONS?