Sentiment Analysis: Machine Learning with Python Scikit Learn

Slide 1

Slide 1 text

Sentiment Analysis: Machine Learning with Python Scikit-Learn
ANKIT BAHUGUNA (@codekee)
TU MUNICH / LMU, TERADATA, MOZILLA
I ❤ PLAYING WITH DATA

Slide 2

Slide 2 text

Game of Thrones Season-4 DVD (Amazon) Source: Amazon.com

Slide 3

Slide 3 text

Reviews – What do the people say?
Valar Morghulis! (All men must die) Valar Dohaeris! (All men must serve)
Love this series! Great actors and I love the characters. I am always on pins and needles waiting for each new season of this show! I think this is a fantastic series. Although the fourth year was not as exciting to me as the first three, I still look forward to seasons 5 and 6. I felt this season was not the strongest of the series. Love the series, a little disappointed that it will have to end one day!

Slide 4

Slide 4 text

Apple Watch http://blogs-images.forbes.com/anthonykosner/files/2014/10/apple-watch-selling-points.jpg

Slide 5

Slide 5 text

In the 24 hours since the launch of the Apple Watch on 9 March, Hotwire’s social media analysis picked up 981,021 mentions of the device using the terms Apple Watch, #AppleWatch, and #AppleWatchEvent. Of these mentions, a massive 42 per cent were found to contain negative sentiment towards the devices – 58 per cent was however positive. Source: http://www.thedrum.com/news/2015/03/10/apple-watch-sees-42-negative-response-twitter-battery-life-and-price-being-main

Slide 6

Slide 6 text

Sentiment Analysis
A basic task in sentiment analysis is classifying the polarity of a given text at the document, sentence, or feature/aspect level: whether the expressed opinion in a document, a sentence or an entity feature/aspect is positive, negative, or neutral. Advanced, "beyond polarity" sentiment classification looks, for instance, at emotional states such as "angry," "sad," and "happy."

Slide 7

Slide 7 text

Machine Learning: Classification
In classification, we use an object's characteristics to identify which class (or group) it belongs to.
Source: Wikipedia

Slide 8

Slide 8 text

Machine Learning: Clustering
Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).
Source: Wikipedia

Slide 9

Slide 9 text

Machine Learning: Supervised vs Unsupervised
Supervised learning is the machine learning task of inferring a function from labeled training data. TRAINING DATA: LABELED DATA
Unsupervised learning, by contrast, is the problem of finding hidden structure in unlabeled data. TRAINING DATA: UNLABELED DATA
Today we will tackle sentiment analysis of movie reviews with a supervised learning approach, using Python and Scikit-Learn.
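The difference shows up directly in how a model is fitted. The following is a minimal sketch, assuming four made-up 2-D points and the labels 0 and 4 (none of this data is from the talk): the supervised model receives the labels, the unsupervised one does not.

```python
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

# Hypothetical 2-D feature vectors: two points near (1, 1), two near (5, 5).
points = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9]]

# Supervised: every training point comes with a label (0 = negative, 4 = positive).
labels = [0, 0, 4, 4]
classifier = LinearSVC(random_state=0).fit(points, labels)
predicted = classifier.predict([[0.9, 1.1], [5.2, 5.0]])
print(predicted)

# Unsupervised: no labels are given; KMeans groups nearby points on its own.
clusterer = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
cluster_ids = clusterer.labels_.tolist()
print(cluster_ids)  # cluster ids are arbitrary; only the grouping matters
```

Note that KMeans only says which points belong together; it cannot say which group is "positive", because it never saw a label.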

Slide 10

Slide 10 text

Getting the Data https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data

Slide 11

Slide 11 text

Information about the Kaggle Data Set
◦ In-domain data, originally from Rotten Tomatoes.
◦ Training data: Kaggle Movie Reviews, 156,060 phrases
◦ Testing data: Kaggle Movie Reviews, 66,292 phrases
◦ Task: classify each test phrase into one of five categories:
  ◦ negative (0)
  ◦ somewhat negative (1)
  ◦ neutral (2)
  ◦ somewhat positive (3)
  ◦ positive (4)
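For reading results later on, it can help to keep the five categories in one lookup table. This is a small sketch; the names `SENTIMENT_NAMES` and `sentiment_name` are hypothetical helpers, not part of the data set or of Scikit-Learn.

```python
# Sentiment ids and names exactly as listed on the slide.
SENTIMENT_NAMES = {
    0: "negative",
    1: "somewhat negative",
    2: "neutral",
    3: "somewhat positive",
    4: "positive",
}

def sentiment_name(label):
    """Return the readable name for a label read from a file ('3') or predicted (3)."""
    return SENTIMENT_NAMES[int(label)]

print(sentiment_name("4"))
print(sentiment_name(2))
```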

Slide 12

Slide 12 text

Data Format – Tab Separated Values
Input Training Data
PhraseId    SentenceId    Phrase    Sentiment
64    2    This quiet , introspective and entertaining independent is worth seeking .    4
Input Testing Data
PhraseId    SentenceId    Phrase
156250    8550    All ends well , sort of , but the frenzied comic moments never click .
Output – Analyzed Test Data (Comma Separated)
PhraseId,Sentiment
156061,2
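As an illustration of the training layout, here is a sketch that parses it with Python's standard `csv` module. It assumes the file starts with the column-name row shown above; the helper name `read_training_rows` is hypothetical, and the sample is held in memory instead of being read from disk.

```python
import csv
import io

# One sample row in the training-file layout shown above (tab separated).
sample_tsv = (
    "PhraseId\tSentenceId\tPhrase\tSentiment\n"
    "64\t2\tThis quiet , introspective and entertaining independent "
    "is worth seeking .\t4\n"
)

def read_training_rows(handle):
    """Yield (phrase, sentiment) pairs from a tab-separated training file."""
    # QUOTE_NONE keeps quotation marks inside phrases as ordinary characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row["Phrase"], int(row["Sentiment"])

rows = list(read_training_rows(io.StringIO(sample_tsv)))
print(rows)
```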

Slide 13

Slide 13 text

Steps
◦ Lowercase the input text.
◦ Remove stop words (a, an, the, etc.) from the text.
◦ Vectorize with TF-IDF or a Count Vectorizer (Bag of Words counts).
◦ Normalize the vectors (L2).
◦ Feed the training data to a LIBLINEAR-based linear SVM and write the predictions in the pre-defined output format.
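The steps above can be chained into a single Scikit-Learn pipeline. This is only a sketch under stated assumptions: it uses the TF-IDF option (the talk's own code uses `CountVectorizer` followed by a separate normalization call), and the four training phrases and their labels are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical training phrases (0 = negative, 4 = positive).
train_phrases = [
    "A wonderful and moving film",
    "Great acting and a great story",
    "A dull and boring movie",
    "Terrible plot and awful acting",
]
train_labels = [4, 4, 0, 0]

model = make_pipeline(
    # Lowercasing, stop word removal, TF-IDF weighting and L2 normalization.
    TfidfVectorizer(lowercase=True, stop_words="english", norm="l2"),
    # Linear SVM backed by the LIBLINEAR library.
    LinearSVC(),
)
model.fit(train_phrases, train_labels)

predicted = model.predict(["A wonderful story"])
print(predicted)
```

A pipeline keeps the vectorizer and the classifier together, so the same preprocessing is applied automatically at training and at prediction time.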

Slide 14

Slide 14 text

Representation: Bag of Words
In this model, a text (such as a sentence or a document) is represented as the bag (multi-set) of its words, disregarding grammar and even word order but keeping multiplicity.
Example:
D1: John likes to watch movies. Mary likes movies too.
D2: John also likes to watch football games.
Vocabulary {Word : Index}
{ "John": 1, "likes": 2, "to": 3, "watch": 4, "movies": 5, "also": 6, "football": 7, "games": 8, "Mary": 9, "too": 10 }
There are 10 distinct words, and using the indexes of the vocabulary, each document is represented by a 10-entry vector:
D1: [1, 2, 1, 1, 2, 0, 0, 0, 1, 1]
D2: [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
Note: Scikit-Learn supports this vector representation directly through CountVectorizer. Similar support is available for TF-IDF.
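The two vectors can be reproduced in a few lines of plain Python. This sketch deliberately avoids `CountVectorizer`, which by default lowercases the text and orders its vocabulary alphabetically, so its columns would not line up with the slide. The helper name `bag_of_words` is hypothetical.

```python
import re
from collections import Counter

documents = [
    "John likes to watch movies. Mary likes movies too.",
    "John also likes to watch football games.",
]

# Vocabulary in the order used on the slide (the slide's indexes start at 1;
# Python list positions start at 0).
vocabulary = ["John", "likes", "to", "watch", "movies",
              "also", "football", "games", "Mary", "too"]

def bag_of_words(text, vocab):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(re.findall(r"\w+", text))
    return [counts[word] for word in vocab]

vectors = [bag_of_words(doc, vocabulary) for doc in documents]
for vector in vectors:
    print(vector)
# [1, 2, 1, 1, 2, 0, 0, 0, 1, 1]
# [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
```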

Slide 15

Slide 15 text

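Loading Data
    import numpy
    from sklearn import preprocessing, svm
    from sklearn.feature_extraction.text import CountVectorizer

    # Bag-of-words vectorizer: lowercases the text, drops English stop words,
    # and ignores words that occur in fewer than 10 phrases
    vectorizer = CountVectorizer(stop_words='english', min_df=10,
                                 lowercase=True, dtype=numpy.float64)
    corpus = []       # training phrases
    target = []       # training sentiment labels
    testdataset = []  # test phrases
    phraseid = []     # test phrase ids
    sentiment = []    # predicted sentiment labels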

Slide 16

Slide 16 text

Loading Data From Files
    # Training Data
    with open('PATH_TO_INPUT_TRAINING_FILE') as f:
        next(f)  # skip the header row (PhraseId, SentenceId, Phrase, Sentiment)
        for line in f:
            component = line.rstrip('\n').split('\t')
            corpus.append(component[2])
            target.append(component[3])

    # Test Data
    with open('PATH_TO_INPUT_TESTING_FILE') as f_test:
        next(f_test)  # skip the header row (PhraseId, SentenceId, Phrase)
        for line_test in f_test:
            component_test = line_test.rstrip('\n').split('\t')
            phraseid.append(component_test[0])
            testdataset.append(component_test[2])

Slide 17

Slide 17 text

Training Data: Fit Transform and Normalization
    # Learn the vocabulary from the training phrases and count the words,
    # then scale every row to unit L2 length
    InputData_raw = vectorizer.fit_transform(corpus)
    InputData = preprocessing.normalize(InputData_raw, norm='l2')
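To make the normalization step concrete, here is a plain-Python sketch of what L2 normalization does to one row: every count is divided by the row's Euclidean length, so the row ends up with length 1. The helper `l2_normalize` is hypothetical, and the input is the D1 count vector from the Bag of Words slide.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit Euclidean (L2) length."""
    norm = math.sqrt(sum(value * value for value in vector))
    if norm == 0:
        return list(vector)  # an all-zero row is left unchanged
    return [value / norm for value in vector]

counts = [1, 2, 1, 1, 2, 0, 0, 0, 1, 1]  # D1 from the Bag of Words slide
unit = l2_normalize(counts)
length = math.sqrt(sum(value * value for value in unit))
print(unit)
print(length)
```

After this step, long and short phrases are comparable, because only the relative word frequencies remain.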

Slide 18

Slide 18 text

Test Data: Transformed and Normalized
    # Reuse the vocabulary learned from the training data (transform, not
    # fit_transform), then scale every row to unit L2 length
    TestData_raw = vectorizer.transform(testdataset)
    TestData = preprocessing.normalize(TestData_raw, norm='l2')
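The test data goes through `transform` only, so that it is mapped onto the vocabulary learned from the training data. A minimal sketch with made-up phrases shows the effect: a word never seen during training is simply ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

# fit_transform learns the vocabulary from the (hypothetical) training phrases.
train_counts = vectorizer.fit_transform(["good movie", "bad movie"])
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)  # ['bad', 'good', 'movie']

# transform reuses that vocabulary; "soundtrack" was never seen, so it is dropped.
test_counts = vectorizer.transform(["good soundtrack"])
print(test_counts.toarray())  # [[0 1 0]]
```

Calling `fit_transform` on the test data instead would build a different vocabulary, and the columns would no longer match what the classifier was trained on.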

Slide 19

Slide 19 text

Classification with SVM (Support Vector Machine)
    # Starting Classification using SVM
    # Loading Model Parameters
    # (loss='squared_hinge' is the current name of the loss that older
    # scikit-learn releases called 'l2')
    clf_model = svm.LinearSVC(C=1.0, class_weight=None, dual=True,
                              fit_intercept=True, intercept_scaling=1,
                              loss='squared_hinge', multi_class='ovr',
                              penalty='l2', random_state=None,
                              tol=0.0001, verbose=0)
    # One can also load the model using default parameters:
    # svm.LinearSVC()
Note: There are a number of machine learning classification models, e.g. Logistic Regression, Decision Trees, Naive Bayes, Neural Networks, Support Vector Machines etc.

Slide 20

Slide 20 text

Fitting ML model to Training Data
    # fit(): fits the SVM model on the input bag-of-words
    # vectors and the target sentiment labels
    clf = clf_model.fit(InputData, target)

Slide 21

Slide 21 text

Predicting Sentiment Labels on Test Data
    # Using the machine learning model trained in the previous
    # step, we predict the sentiment labels for the test data
    out = clf.predict(TestData)
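The test file carries no sentiment labels, so these predictions cannot be scored locally. One common way to get a rough estimate, sketched here as an assumption and not as part of the talk's code, is to hold back a slice of the labeled training data and measure accuracy on it. The eight phrases below are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import normalize
from sklearn.svm import LinearSVC

# Hypothetical labeled phrases (0 = negative, 4 = positive).
phrases = [
    "wonderful film", "great story", "lovely acting", "brilliant movie",
    "boring film", "awful story", "terrible acting", "dull movie",
]
labels = [4, 4, 4, 4, 0, 0, 0, 0]

# Keep 25% of the labeled data aside; stratify keeps both classes in each part.
train_x, held_x, train_y, held_y = train_test_split(
    phrases, labels, test_size=0.25, stratify=labels, random_state=0)

# Fit the vocabulary on the training part only, as in the slides.
vectorizer = CountVectorizer()
train_vectors = normalize(vectorizer.fit_transform(train_x), norm="l2")
held_vectors = normalize(vectorizer.transform(held_x), norm="l2")

model = LinearSVC().fit(train_vectors, train_y)
accuracy = accuracy_score(held_y, model.predict(held_vectors))
print("held-out accuracy:", accuracy)
```

On a data set this tiny the number itself means little; the point is the workflow of scoring on data the model has not seen.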

Slide 22

Slide 22 text

Writing the output to file in given format.
    sentiment = out.tolist()
    # Final output file written to disk
    with open('PATH_TO_OUTPUT_FILE', 'w') as out_file:
        out_file.write('PhraseId,Sentiment\n')
        for i, j in zip(phraseid, sentiment):
            out_file.write('{0},{1}\n'.format(i, j))

Slide 23

Slide 23 text

The final output file
PhraseId,Sentiment
156348,2 (Neutral)
156349,3 (Somewhat Positive)
156350,1 (Somewhat Negative)
156351,2 (Neutral)
156352,4 (Positive)

Slide 24

Slide 24 text

And we are done https://hayleyandjoelblog.files.wordpress.com/2015/02/hurray.png

Slide 25

Slide 25 text

Or wait… Are we? http://www.stepupleader.com/wp-content/uploads/2013/06/curious.jpg

Slide 26

Slide 26 text

Conclusion
This is just a first baby step in learning data science. Kaggle hosts many interesting problems on which you can practice and learn about this rapidly growing area. In real-world data science, no single model fits all problems, so one needs to keep learning new techniques. Real-world data is notoriously messy, and each new problem brings new challenges. Python Scikit-Learn is a handy tool for this kind of work. My verdict: try it out and get fascinated with the world of Data Science.

Slide 27

Slide 27 text

Links and References
◦ B. Pang and L. Lee, Opinion Mining and Sentiment Analysis. Foundations and Trends in Information Retrieval 2(1-2), pp. 1-135, 2008
◦ Yongzheng Zhang, Dan Shen and Catherine Baudin, Sentiment Analysis in Practice, Tutorial delivered at ICDM 2011
◦ Scikit Learn Supervised Learning: http://scikit-learn.org/stable/supervised_learning.html#supervised-learning
◦ Scikit Learn Working with Text: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
◦ Andrew Ng's Machine Learning Course: https://www.coursera.org/course/ml
◦ Manning and Jurafsky, Natural Language Processing Course: https://www.coursera.org/course/nlp
◦ Learning Scikit-Learn: Machine Learning in Python: http://www.amazon.com/Learning-scikit-learn-Machine-Python/dp/1783281936

Slide 28

Slide 28 text

THANK YOU! You can write to me at: [email protected]