Machine Learning as a Service: making sentiment predictions in realtime with ZMQ and NLTK

I am a Machine Learning (ML) and Natural Language Processing enthusiast. For my university dissertation I created a realtime sentiment analysis classifier for Twitter. My talk will be about the experience and the lessons learned. I will explain how to build a scalable machine learning software as a service, consumable through a REST API. The purpose of this talk is not to dig into the mathematics behind machine learning (as I do not have that background), but to show how easy it can be to build an ML SaaS using some of the amazing libraries, such as NLTK, ZMQ and MrJob, that helped me throughout development. The talk offers several benefits: listeners with no ML background will get a solid introduction to the subject and will be able to replicate my project at home, while more experienced users will pick up new ideas to put into practice and (most) probably build a better system than mine! Finally, I will share a GitHub project with the slides and a finished product.

Daniel Pyrathon

May 24, 2014

Transcript

  1. Fundamental Topics
     Machine Learning; Natural Language Processing; overview of the platform.
     The process: Prepare, Analyze, Train, Use, Scale.
  2. MACHINE LEARNING
     WHAT IS MACHINE LEARNING?
     A method of teaching computers to make and improve predictions or
     behaviors based on some data. It allows computers to evolve behaviors
     based on empirical data. Data can be anything: stock market prices,
     sensors and motors, email metadata.
  3. NATURAL LANGUAGE PROCESSING
     WHAT IS NATURAL LANGUAGE PROCESSING?
     Interactions between computers and human languages; extracting
     information from text.
     Some NLTK features: bigrams, part-of-speech tagging, tokenization,
     stemming, WordNet lookup. (A quick tour of these calls follows.)
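     All of these features ship with NLTK itself. A quick sketch of the
     real calls (NLTK 3 syntax; the sample sentence is mine, and the
     relevant corpora must first be fetched with nltk.download()):

     import nltk
     from nltk.corpus import wordnet

     tokens = nltk.word_tokenize("The quick brown fox jumps")   # tokenization
     print(list(nltk.bigrams(tokens)))            # bigrams: [('The', 'quick'), ...]
     print(nltk.pos_tag(tokens))                  # part-of-speech tagging
     print(nltk.stem.porter.PorterStemmer().stem("jumping"))    # stemming: 'jump'
     print(wordnet.synsets("fox")[0].definition())              # WordNet lookup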
  4. NATURAL LANGUAGE PROCESSING
     SOME NLTK FEATURES: TOKENIZATION, STOPWORD REMOVAL

     >>> phrase = "I wish to buy specified products or service"
     >>> phrase = nlp.tokenize(phrase)
     >>> phrase
     ['I', 'wish', 'to', 'buy', 'specified', 'products', 'or', 'service']
     >>> phrase = nlp.remove_stopwords(phrase)
     >>> phrase
     ['I', 'wish', 'buy', 'specified', 'products', 'service']
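     The nlp module above is the project's own helper; with plain NLTK the
     same two steps look roughly like this (a sketch, assuming the punkt
     and stopwords data have been downloaded):

     from nltk import word_tokenize
     from nltk.corpus import stopwords

     phrase = "I wish to buy specified products or service"
     tokens = word_tokenize(phrase)
     # drop tokens that appear in NLTK's English stopword list ('to', 'or')
     english_stopwords = set(stopwords.words('english'))
     tokens = [t for t in tokens if t not in english_stopwords]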
  5. CLASSIFYING TWITTER SENTIMENT IS HARD
     Improper language use; spelling mistakes; only 140 characters to
     express sentiment; different types of English (US, UK, Pidgin).
     Example tweet (Donnie McClurkin, @Donnieradio, 21 Apr 2014):
     "Gr8 picutre..God bless u RT @WhatsNextInGosp: Resurrection Sunday
     Service @PFCNY with @Donnieradio pic.twitter.com/nOgz65cpY5"
  6. THE DATASET: SENTIMENT140
     160,000 labelled tweets in CSV format.
     Each record holds the polarity of the tweet (0 = negative,
     2 = neutral, 4 = positive) and the text of the tweet ("Lyx is cool").
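     Since the file is plain CSV, loading it needs nothing beyond the
     standard library. A sketch with a hypothetical filename, using
     Sentiment140's column layout (polarity, id, date, query, user, text):

     import csv

     labelled_tweets = []
     with open('sentiment140.csv') as f:      # hypothetical filename
         for row in csv.reader(f):
             polarity, text = row[0], row[5]  # first and last columns
             if polarity == '2':              # skip neutral tweets here
                 continue
             label = 'pos' if polarity == '4' else 'neg'
             labelled_tweets.append((text, label))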
  7. FEATURE EXTRACTION
     How are we going to find features from a phrase? The "Bag of Words"
     representation.

     my_phrase = "Today was such a rainy and horrible day"

     In [12]: from nltk import word_tokenize
     In [13]: word_tokenize(my_phrase)
     Out[13]: ['Today', 'was', 'such', 'a', 'rainy', 'and', 'horrible', 'day']
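     NLTK's classifiers expect each example as a dict of features, so
     turning the token list into a bag of words is a one-liner (word order
     and word counts are deliberately thrown away):

     tokens = word_tokenize(my_phrase)
     bag_of_words = dict((token, True) for token in tokens)
     # {'Today': True, 'was': True, 'such': True, 'a': True, ...}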
  8. FEATURE EXTRACTION
     CREATE A PIPELINE OF FEATURE EXTRACTORS

     FORMATTER = formatting.FormatterPipeline(
         formatting.make_lowercase,
         formatting.strip_urls,
         formatting.strip_hashtags,
         formatting.strip_names,
         formatting.remove_repetitons,
         formatting.replace_html_entities,
         formatting.strip_nonchars,
         functools.partial(formatting.remove_noise,
                           stopwords=stopwords.words('english') + ['rt']),
         functools.partial(formatting.stem_words,
                           stemmer=nltk.stem.porter.PorterStemmer())
     )
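     FormatterPipeline lives in the project's own formatting module; a
     minimal sketch of the idea, assuming each formatter is a callable
     that takes a phrase and returns the transformed phrase:

     class FormatterPipeline(object):
         """Apply a sequence of text-transforming callables in order."""
         def __init__(self, *formatters):
             self.formatters = formatters

         def format(self, phrase):
             for formatter in self.formatters:
                 phrase = formatter(phrase)
             return phrase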
  9. FEATURE EXTRACTION
     PASS THE REPRESENTATION DOWN THE PIPELINE

     In [11]: feature_extractor.extract("Today was such a rainy and horrible day")
     Out[11]: {'day': True, 'horribl': True, 'raini': True, 'today': True}

     The result is a dictionary of variable length whose keys are the
     features and whose values are always True.
 10. DIMENSIONALITY REDUCTION
     Remove features that are common across all classes (noise).
     Increase the performance of the classifier.
     Decrease the size of the model: less memory usage and more speed.
 11. DIMENSIONALITY REDUCTION: CHI-SQUARE TEST
     NLTK gives us BigramAssocMeasures.chi_sq

     # Calculate the number of words for each class
     pos_word_count = label_word_fd['pos'].N()
     neg_word_count = label_word_fd['neg'].N()
     total_word_count = pos_word_count + neg_word_count

     # For each word and its total occurrence
     for word, freq in word_fd.iteritems():
         # Calculate a score for the positive class
         pos_score = BigramAssocMeasures.chi_sq(
             label_word_fd['pos'][word],
             (freq, pos_word_count),
             total_word_count)
         # Calculate a score for the negative class
         neg_score = BigramAssocMeasures.chi_sq(
             label_word_fd['neg'][word],
             (freq, neg_word_count),
             total_word_count)
         # The sum of the two gives the word's total score
         word_scores[word] = pos_score + neg_score
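     Once every word has a score, the reduction itself is just keeping the
     highest-scoring words. A sketch, assuming we keep the 10,000 best:

     # sort words by chi-square score, highest first, keep the top 10,000
     best = sorted(word_scores.iteritems(), key=lambda item: item[1],
                   reverse=True)[:10000]
     best_features = set(word for word, score in best)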
 12. TRAINING
     Now that we can extract features from text, we can train a classifier.
     The simplest and most flexible learning algorithm for text
     classification is Naive Bayes:

     P(label | features) = P(label) * P(features | label) / P(features)

     Simple to compute = fast.
     Assumes feature independence = easy to update.
     Supports multiclass = scalable.
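     The independence assumption means P(features | label) factorizes into
     one probability per word, so a prediction is just a product. A toy
     illustration with invented numbers:

     # invented per-word probabilities, for illustration only
     p_word_given = {'pos': {'love': 0.10, 'rainy': 0.01},
                     'neg': {'love': 0.01, 'rainy': 0.08}}
     p_label = {'pos': 0.5, 'neg': 0.5}

     def score(label, features):
         s = p_label[label]                        # P(label)
         for f in features:
             s *= p_word_given[label].get(f, 1e-6) # P(f | label), smoothed
         return s

     # P(features) is the same for every label, so comparing the
     # unnormalized scores is enough to pick the winner
     print(max(['pos', 'neg'], key=lambda l: score(l, ['rainy', 'love'])))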
 13. TRAINING
     NLTK provides built-in components:
     1. Train the classifier.
     2. Serialize the classifier for later use.
     3. Train once, use as much as you want.

     >>> from nltk.classify import NaiveBayesClassifier
     >>> nb_classifier = NaiveBayesClassifier.train(train_feats)
     ... wait a lot of time
     >>> nb_classifier.labels()
     ['neg', 'pos']
     >>> serializer.dump(nb_classifier, file_handle)
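     The serializer is left open in the slide; the standard library's
     pickle is the usual choice (a sketch):

     import pickle

     with open('classifier.pickle', 'wb') as file_handle:
         pickle.dump(nb_classifier, file_handle)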
 14. USING THE CLASSIFIER

     # Load the classifier from the serialized file
     classifier = pickle.loads(classifier_file.read())

     # Pick a new phrase
     new_phrase = "At Pycon Italy! Love the food and this speaker is so amazing"

     # 1) Preprocessing
     feature_vector = feature_extractor.extract(new_phrase)

     # 2) Dimensionality reduction: best_features is our set of best words
     reduced_feature_vector = reduce_features(feature_vector, best_features)

     # 3) Classify!
     print classifier.classify(reduced_feature_vector)
     # => "pos"
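     reduce_features is not shown in the deck; assuming best_features is
     the set built during the chi-square step, a sketch is a one-line
     filter:

     def reduce_features(feature_vector, best_features):
         # keep only the features that survived the chi-square cut
         return dict((word, value) for word, value in feature_vector.items()
                     if word in best_features)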
 15. BUILDING A CLASSIFICATION API
     The classifier is slow, no matter how much optimization is done.
     Classification is a blocking process, so the API must be event-driven.
 16. BUILDING A CLASSIFICATION API: ZEROMQ
     Fast, uses native sockets. Promotes horizontal scalability.
     Language-agnostic framework.
 17. BUILDING A CLASSIFICATION API: ZEROMQ

     ...
     socket = context.socket(zmq.REP)
     ...
     while True:
         message = socket.recv()
         phrase = json.loads(message)["text"]

         # 1) Feature extraction
         feature_vector = feature_extractor.extract(phrase)

         # 2) Dimensionality reduction: best_features is our set of best words
         reduced_feature_vector = reduce_features(feature_vector, best_features)

         # 3) Classify!
         result = classifier.classify(reduced_feature_vector)
         socket.send(json.dumps(result))
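     On the other end, any language with a ZeroMQ binding can act as a
     client. A minimal Python sketch, assuming the worker above is bound
     to tcp://localhost:5555 (a hypothetical endpoint):

     import json
     import zmq

     context = zmq.Context()
     socket = context.socket(zmq.REQ)         # REQ pairs with the worker's REP
     socket.connect("tcp://localhost:5555")   # hypothetical endpoint

     socket.send(json.dumps({"text": "At Pycon Italy! Love the food"}))
     print(socket.recv())                     # e.g. '"pos"'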
 18. POST-MORTEM
     Real-time sentiment analysis APIs can be implemented, and they can be
     made scalable. What if we used Redis instead of serialized
     classifiers? Deep learning is giving very good results in NLP;
     let's try it!