Machine Learning as a Service: making sentiment predictions in realtime with ZMQ and NLTK

I am a Machine Learning (ML) and Natural Language Processing enthusiast. For my university dissertation I created a realtime sentiment analysis classifier for Twitter. My talk will be about the experience and the lessons learned. I will explain how to build a scalable machine learning software as a service, consumable through a REST API. The purpose of this talk is not to dig into the mathematics behind machine learning (as I do not have that background), but to show how easy it can be to build an ML SaaS using some of the amazing libraries, such as NLTK, ZMQ and MrJob, that helped me throughout development. This talk offers several benefits: users with no ML background will get a solid introduction to the subject and will be able to replicate my project at home, while more experienced users will gain new ideas to put into practice and (most probably) build a better system than mine! Finally, I will attach a GitHub project with the slides and a finished product.

Daniel Pyrathon

May 24, 2014

Transcript

  1. Fundamental topics: Machine Learning, Natural Language Processing, overview of the platform. The process: Prepare, Analyze, Train, Use, Scale.
  2. MACHINE LEARNING. WHAT IS MACHINE LEARNING? A method of teaching computers to make and improve predictions or behaviors based on some data. It allows computers to evolve behaviors based on empirical data. Data can be anything: stock market prices, sensors and motors, email metadata.
  3. NATURAL LANGUAGE PROCESSING. WHAT IS NATURAL LANGUAGE PROCESSING? Interactions between computers and human languages; extracting information from text. Some NLTK features: bigrams, part-of-speech tagging, tokenization, stemming, WordNet lookup.
  4. NATURAL LANGUAGE PROCESSING. SOME NLTK FEATURES: tokenization and stopword removal.

     >>> phrase = "I wish to buy specified products or service"
     >>> phrase = nlp.tokenize(phrase)
     >>> phrase
     ['I', 'wish', 'to', 'buy', 'specified', 'products', 'or', 'service']
     >>> phrase = nlp.remove_stopwords(phrase)
     >>> phrase
     ['I', 'wish', 'buy', 'specified', 'products', 'service']
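     The nlp module on this slide is the speaker's own wrapper, not an NLTK API. A minimal sketch of equivalent helpers using plain NLTK (assuming the punkt and stopwords data have been downloaded) could look like this:

     from nltk.corpus import stopwords
     from nltk.tokenize import word_tokenize

     def tokenize(phrase):
         # Split a raw phrase into word tokens
         return word_tokenize(phrase)

     def remove_stopwords(tokens, language='english'):
         # Drop common function words ('to', 'or', ...) that carry little sentiment.
         # The match is case-sensitive, which is why 'I' survives on the slide.
         stops = set(stopwords.words(language))
         return [t for t in tokens if t not in stops]

     tokens = tokenize("I wish to buy specified products or service")
     print remove_stopwords(tokens)
     # ['I', 'wish', 'buy', 'specified', 'products', 'service']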
  5. CLASSIFYING TWITTER SENTIMENT IS HARD: improper language use, spelling mistakes, only 140 characters to express sentiment, different types of English (US, UK, Pidgin). Example tweet (Donnie McClurkin, @Donnieradio, 21 Apr 2014): "Gr8 picutre..God bless u RT @WhatsNextInGosp: Resurrection Sunday Service @PFCNY with @Donnieradio pic.twitter.com/nOgz65cpY5"
  6. THE DATASET: SENTIMENT140. 160,000 labelled tweets in CSV format. Each row holds the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive) and the text of the tweet ("Lyx is cool").
  7. FEATURE EXTRACTION. How are we going to find features from a phrase? With a "Bag of Words" representation:

     my_phrase = "Today was such a rainy and horrible day"

     In [12]: from nltk import word_tokenize
     In [13]: word_tokenize(my_phrase)
     Out[13]: ['Today', 'was', 'such', 'a', 'rainy', 'and', 'horrible', 'day']
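     To make the "Bag of Words" idea concrete, here is a minimal sketch (the bag_of_words helper is hypothetical) that turns a phrase into the presence-only feature dictionary NLTK classifiers accept:

     from nltk import word_tokenize

     def bag_of_words(phrase):
         # Every token becomes a feature key; word order and frequency are ignored
         return dict((word, True) for word in word_tokenize(phrase))

     print bag_of_words("Today was such a rainy and horrible day")
     # => {'Today': True, 'was': True, 'such': True, 'a': True,
     #     'rainy': True, 'and': True, 'horrible': True, 'day': True}
     # (key order may vary)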
  8. FEATURE EXTRACTION. CREATE A PIPELINE OF FEATURE EXTRACTORS:

     FORMATTER = formatting.FormatterPipeline(
         formatting.make_lowercase,
         formatting.strip_urls,
         formatting.strip_hashtags,
         formatting.strip_names,
         formatting.remove_repetitons,
         formatting.replace_html_entities,
         formatting.strip_nonchars,
         functools.partial(formatting.remove_noise,
                           stopwords=stopwords.words('english') + ['rt']),
         functools.partial(formatting.stem_words,
                           stemmer=nltk.stem.porter.PorterStemmer())
     )
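     FormatterPipeline comes from the speaker's own formatting module; a minimal sketch of how such a pipeline could be implemented is a class that threads the phrase through each callable in order:

     class FormatterPipeline(object):
         def __init__(self, *steps):
             # Each step is a callable that takes a phrase and returns it transformed
             self.steps = steps

         def format(self, phrase):
             # Apply every step in order, feeding each one's output to the next
             for step in self.steps:
                 phrase = step(phrase)
             return phrase

     # Hypothetical usage with two trivial steps:
     pipeline = FormatterPipeline(
         lambda s: s.lower(),
         lambda s: s.replace('#', ''),
     )
     print pipeline.format("Loving #PyCon")
     # loving pycon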
  9. FEATURE EXTRACTION. PASS THE REPRESENTATION DOWN THE PIPELINE:

     In [11]: feature_extractor.extract("Today was such a rainy and horrible day")
     Out[11]: {'day': True, 'horribl': True, 'raini': True, 'today': True}

     The result is a variable-length dictionary: the keys are the features, and each value is always True.
  10. DIMENSIONALITY REDUCTION. Remove features that are common across all classes (noise). This increases the performance of the classifier and decreases the size of the model, which means less memory usage and more speed.
  11. DIMENSIONALITY REDUCTION: CHI-SQUARE TEST. NLTK gives us BigramAssocMeasures.chi_sq:

     # Calculate the number of words for each class
     pos_word_count = label_word_fd['pos'].N()
     neg_word_count = label_word_fd['neg'].N()
     total_word_count = pos_word_count + neg_word_count

     # For each word and its total occurrence
     for word, freq in word_fd.iteritems():
         # Calculate a score for the positive class
         pos_score = BigramAssocMeasures.chi_sq(
             label_word_fd['pos'][word],
             (freq, pos_word_count),
             total_word_count)
         # Calculate a score for the negative class
         neg_score = BigramAssocMeasures.chi_sq(
             label_word_fd['neg'][word],
             (freq, neg_word_count),
             total_word_count)
         # The sum of the two gives the word's total score
         word_scores[word] = pos_score + neg_score
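     The later slides rely on best_features and reduce_features, which are never shown; here is a minimal sketch under the assumption that they simply keep the N highest-scoring words from word_scores (N_BEST is a hypothetical cut-off):

     N_BEST = 10000  # hypothetical cut-off

     # Keep only the N_BEST highest-scoring words
     best = sorted(word_scores.iteritems(), key=lambda item: item[1], reverse=True)
     best_features = set(word for word, score in best[:N_BEST])

     def reduce_features(feature_vector, best_features):
         # Drop every feature that did not make the chi-square cut
         return dict((word, value) for word, value in feature_vector.items()
                     if word in best_features)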
  12. TRAINING. Now that we can extract features from text, we can train a classifier. The simplest and most flexible learning algorithm for text classification is Naive Bayes:

     P(label | features) = P(label) * P(features | label) / P(features)

     Simple to compute = fast. Assumes feature independence = easy to update. Supports multiclass = scalable.
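     To make the formula concrete, here is a toy worked instance with made-up counts (not numbers from the dataset): suppose 3 of 4 training tweets are positive, and the word 'rainy' appears in 1 of the 3 positive tweets and in the single negative one.

     p_pos = 3.0 / 4              # P(label)            = P(pos)
     p_rainy_given_pos = 1.0 / 3  # P(features | label) = P(rainy | pos)
     p_rainy = 2.0 / 4            # P(features)         = P(rainy)

     # Bayes' rule: P(pos | rainy) = P(pos) * P(rainy | pos) / P(rainy)
     print p_pos * p_rainy_given_pos / p_rainy
     # 0.5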
  13. TRAINING. NLTK provides built-in components. 1. Train the classifier. 2. Serialize the classifier for later use. 3. Train once, use as much as you want.

     >>> from nltk.classify import NaiveBayesClassifier
     >>> nb_classifier = NaiveBayesClassifier.train(train_feats)
     ... wait a lot of time
     >>> nb_classifier.labels()
     ['neg', 'pos']
     >>> serializer.dump(nb_classifier, file_handle)
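     The serializer object above is the speaker's wrapper; a minimal sketch of the same train-once-use-forever idea with the standard pickle module (the file name is an assumption):

     import pickle

     # Serialize the trained classifier once...
     with open('classifier.pickle', 'wb') as file_handle:
         pickle.dump(nb_classifier, file_handle)

     # ...then reload it in any later process, with no retraining
     with open('classifier.pickle', 'rb') as file_handle:
         nb_classifier = pickle.load(file_handle)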
  14. USING THE CLASSIFIER

     # Load the classifier from the serialized file
     classifier = pickle.loads(classifier_file.read())

     # Pick a new phrase
     new_phrase = "At Pycon Italy! Love the food and this speaker is so amazing"

     # 1) Preprocessing
     feature_vector = feature_extractor.extract(new_phrase)

     # 2) Dimensionality reduction, best_features is our set of best words
     reduced_feature_vector = reduce_features(feature_vector, best_features)

     # 3) Classify!
     print classifier.classify(reduced_feature_vector)
     >>> "pos"
  15. BUILDING A CLASSIFICATION API. The classifier is slow, no matter how much optimization is done. The classifier is a blocking process, so the API must be event-driven.
  16. BUILDING A CLASSIFICATION API: ZEROMQ. Fast, uses native sockets. Promotes horizontal scalability. Language-agnostic framework.
  17. BUILDING A CLASSIFICATION API: ZEROMQ

     ...
     socket = context.socket(zmq.REP)
     ...
     while True:
         message = socket.recv()
         phrase = json.loads(message)["text"]

         # 1) Feature extraction
         feature_vector = feature_extractor.extract(phrase)

         # 2) Dimensionality reduction, best_features is our set of best words
         reduced_feature_vector = reduce_features(feature_vector, best_features)

         # 3) Classify!
         result = classifier.classify(reduced_feature_vector)
         socket.send(json.dumps(result))
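     For completeness, a minimal sketch of a client for the REP worker above (the tcp://localhost:5555 endpoint is an assumption; the worker would bind the matching address):

     import json
     import zmq

     context = zmq.Context()
     socket = context.socket(zmq.REQ)  # REQ pairs with the worker's REP socket
     socket.connect("tcp://localhost:5555")

     # Send a phrase, then block until the worker replies with the predicted label
     socket.send(json.dumps({"text": "Love the food and this speaker!"}))
     print json.loads(socket.recv())
     # pos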
  18. POST-MORTEM. Real-time sentiment analysis APIs can be implemented, and they can be scalable. What if we used Redis instead of serialized classifiers? Deep learning is giving very good results in NLP, let's try it!