Sentiment Analysis: Machine Learning with Python Scikit Learn

@Foss Asia 2015, Python Track, Singapore.


Ankit Bahuguna

March 14, 2015


  1. Sentiment Analysis: Machine Learning with Python Scikit-Learn ANKIT BAHUGUNA (@codekee)

  2. Game of Thrones Season-4 DVD (Amazon) Source: Amazon.com

  3. Reviews – What do the people say?

    Valar Morghulis! (All men must die) Valar Dohaeris! (All men must serve)
    Love this series! Great actors and I love the characters. I am always on pins and needles waiting for each new season of this show! I think this is a fantastic series. Although the fourth year was not as exciting to me as the first three, I still look forward to seasons 5 and 6. I felt this season was not the strongest of the series. Love the series, a little disappointed that it will have to end one day!
  4. Apple Watch http://blogs-images.forbes.com/anthonykosner/files/2014/10/apple-watch-selling-points.jpg

  5. In the 24 hours since the launch of the Apple Watch on 9 March, Hotwire’s social media analysis picked up 981,021 mentions of the device using the terms Apple Watch, #AppleWatch, and #AppleWatchEvent. Of these mentions, a massive 42 per cent were found to contain negative sentiment towards the devices – 58 per cent was however positive.

    Source: http://www.thedrum.com/news/2015/03/10/apple-watch-sees-42-negative-response-twitter-battery-life-and-price-being-main
  6. Sentiment Analysis

    A basic task in sentiment analysis is classifying the polarity of a given text at the document, sentence, or feature/aspect level: whether the expressed opinion in a document, a sentence or an entity feature/aspect is positive, negative, or neutral. Advanced, "beyond polarity" sentiment classification looks, for instance, at emotional states such as "angry," "sad," and "happy."
  7. Machine Learning: Classification

    In classification, we use an object's characteristics to identify which class (or group) it belongs to. Source: Wikipedia
  8. Machine Learning: Clustering

    Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). Source: Wikipedia
  9. Machine Learning: Supervised vs Unsupervised

    Supervised learning is the machine learning task of inferring a function from labeled training data (training data: labeled), whereas unsupervised learning is the problem of finding hidden structure in unlabeled data (training data: unlabeled).

    Today we will try to solve the problem of sentiment analysis of movie reviews with a supervised learning approach, using Python and Scikit-Learn.
  10. Getting the Data https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data

  11. Information about Kaggle Data-set

    ◦ In-domain data, originally from Rotten Tomatoes.
    ◦ Training Data: Kaggle Movie Reviews, 156,060 phrases
    ◦ Testing Data: Kaggle Movie Reviews, 66,292 phrases
    ◦ Task: Classify test phrases into one of the five categories:
      ◦ negative (0)
      ◦ somewhat negative (1)
      ◦ neutral (2)
      ◦ somewhat positive (3)
      ◦ positive (4)
  12. Data Format – Tab Separated Values

    Input Training Data
        PhraseId    SentenceId    Phrase    Sentiment
        64    2    This quiet , introspective and entertaining independent is worth seeking .    4

    Input Testing Data
        PhraseId    SentenceId    Phrase
        156250    8550    All ends well , sort of , but the frenzied comic moments never click .

    Output – Analyzed Test Data (Comma Separated)
        PhraseId,Sentiment
        156061,2
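
A row in this format can be split into its fields with Python's standard `csv` module. The following is a minimal sketch, not the loading code used in the talk: the helper name `parse_training_rows` is hypothetical, and the sample text is just the header plus the single training row shown on the slide.

```python
import csv
import io

# Header plus the one training row shown on the slide, joined by tabs.
SAMPLE_TSV = (
    "PhraseId\tSentenceId\tPhrase\tSentiment\n"
    "64\t2\tThis quiet , introspective and entertaining independent is worth seeking .\t4\n"
)


def parse_training_rows(handle):
    """Return (phrase, sentiment) pairs from a tab-separated training file."""
    # QUOTE_NONE keeps quotation marks inside phrases as ordinary characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row["Phrase"], int(row["Sentiment"])) for row in reader]


rows = parse_training_rows(io.StringIO(SAMPLE_TSV))
print(rows)
```

Reading by column name rather than by position keeps the code readable and handles the header row without special casing.
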
  13. Steps

    ◦ Lowercase the input text
    ◦ Remove stop words (a, an, the, etc.) from the text
    ◦ Vectorize with TF-IDF or a Count Vectorizer (Bag of Words counts)
    ◦ Normalize the vectors (L2)
    ◦ Feed the training data to a LIBLINEAR SVM and write the output in the pre-defined format
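
The first two steps can be illustrated in plain Python. This is only a sketch under stated assumptions: the stop-word set below is a tiny made-up list for illustration (real lists are much longer), and the function name `preprocess` is hypothetical. In the talk itself these steps are handled by the vectorizer's options.

```python
# Tiny illustrative stop-word list (an assumption, not the list used in the talk).
STOP_WORDS = {"a", "an", "the", "and", "is", "of", "to"}


def preprocess(text):
    """Lowercase the text and drop stop words, splitting on whitespace."""
    tokens = text.lower().split()
    return [tok for tok in tokens if tok not in STOP_WORDS]


tokens = preprocess(
    "This quiet , introspective and entertaining independent is worth seeking ."
)
print(tokens)
# ['this', 'quiet', ',', 'introspective', 'entertaining', 'independent', 'worth', 'seeking', '.']
```
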
  14. Representation: Bag of Words

    In this model, a text (such as a sentence or a document) is represented as the bag (multi-set) of its words, disregarding grammar and even word order but keeping multiplicity.

    Example:
    D1: John likes to watch movies. Mary likes movies too.
    D2: John also likes to watch football games.

    Vocabulary {Word : Index}
    { "John": 1, "likes": 2, "to": 3, "watch": 4, "movies": 5, "also": 6, "football": 7, "games": 8, "Mary": 9, "too": 10 }

    There are 10 distinct words and, using the indexes of the vocabulary, each document is represented by a 10-entry vector:
    D1: [1, 2, 1, 1, 2, 0, 0, 0, 1, 1]
    D2: [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]

    Note: Scikit-Learn has direct support for this vector representation through CountVectorizer. Similar support is available for TF-IDF.
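
The two vectors in this example can be reproduced with a few lines of plain Python. This is a minimal sketch of the counting idea only, not how CountVectorizer is implemented; the function name `bag_of_words` is hypothetical, and the vocabulary order is copied from the slide.

```python
import re
from collections import Counter

# Vocabulary from the slide, in the same order (slide index 1 is list position 0).
VOCABULARY = ["John", "likes", "to", "watch", "movies",
              "also", "football", "games", "Mary", "too"]


def bag_of_words(text, vocabulary=VOCABULARY):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(re.findall(r"\w+", text))  # split on non-word characters
    return [counts[word] for word in vocabulary]


d1 = bag_of_words("John likes to watch movies. Mary likes movies too.")
d2 = bag_of_words("John also likes to watch football games.")
print(d1)  # [1, 2, 1, 1, 2, 0, 0, 0, 1, 1]
print(d2)  # [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
```
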
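  15. Loading Data

    # Imports needed by this and the following slides
    import numpy
    from sklearn import preprocessing, svm
    from sklearn.feature_extraction.text import CountVectorizer

    # Bag of Words vectorizer: lowercases text, drops English stop words and
    # ignores words that appear in fewer than 10 phrases
    vectorizer = CountVectorizer(stop_words='english', min_df=10,
                                 lowercase=True, dtype=numpy.float64)
    corpus = []
    target = []
    testdataset = []
    phraseid = []
    sentiment = []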
  16. Loading Data From Files

    # Training Data
    with open('PATH_TO_INPUT_TRAINING_FILE') as f:
        next(f)  # skip the header row (assumes the file starts with one)
        for line in f:
            component = line.split('\t')
            corpus.append(component[2].rstrip('\n'))
            target.append(component[3].rstrip('\n'))

    # Test Data
    with open('PATH_TO_INPUT_TESTING_FILE') as f_test:
        next(f_test)  # skip the header row (same assumption)
        for line_test in f_test:
            component_test = line_test.split('\t')
            phraseid.append(component_test[0].rstrip('\n'))
            testdataset.append(component_test[2].rstrip('\n'))
  17. Training Data: Fit Transform and Normalization

    # InputData: fit-transformed and normalized
    InputData_raw = vectorizer.fit_transform(corpus)
    InputData = preprocessing.normalize(InputData_raw, norm='l2')
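
L2 normalization rescales each document vector to unit Euclidean length, so long and short phrases become comparable. The sketch below shows the arithmetic on a single vector in plain Python; the function name `l2_normalize` is hypothetical, and this is an illustration of the idea rather than scikit-learn's implementation.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length; leave an all-zero vector as is."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        return list(vector)
    return [x / norm for x in vector]


unit = l2_normalize([3.0, 4.0])
print(unit)  # [0.6, 0.8]
```
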
  18. Test Data: Transformed and Normalized

    # TestData: transformed (with the vocabulary fitted on the training data) and normalized
    TestData_raw = vectorizer.transform(testdataset)
    TestData = preprocessing.normalize(TestData_raw, norm='l2')
  19. Classification with SVM (Support Vector Machine)

    # Starting Classification using SVM
    # Loading Model Parameters
    # Note: loss='squared_hinge' is the current name for the loss that
    # older scikit-learn releases accepted as loss='l2'.
    clf_model = svm.LinearSVC(C=1.0, class_weight=None, dual=True,
                              fit_intercept=True, intercept_scaling=1,
                              loss='squared_hinge', multi_class='ovr',
                              penalty='l2', random_state=None, tol=0.0001,
                              verbose=0)
    # One can also load the model using default parameters:
    # svm.LinearSVC()

    Note: There are a number of machine learning classification models, e.g. Logistic Regression, Decision Trees, Naive Bayes, Neural Networks, Support Vector Machines etc.
  20. Fitting ML model to Training Data

    # fit(): fits the SVM model on the input bag of words
    # representation and the target sentiment labels
    clf = clf_model.fit(InputData, target)
  21. Predicting Sentiment Labels on Test Data

    # Using the machine learning model trained in the last step,
    # we predict the sentiment labels for the test data
    out = clf.predict(TestData)
  22. Writing the output to file in given format.

    sentiment = out.tolist()
    # Final output file written to disk
    with open('PATH_TO_OUTPUT_FILE', 'w') as out_file:
        out_file.write('PhraseId,Sentiment\n')
        for i, j in zip(phraseid, sentiment):
            out_file.write('{0},{1}\n'.format(i, j))
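
The same output format can also be produced with the standard `csv` module. This is an alternative sketch, not the code from the talk: the helper name `write_submission` is hypothetical, and an in-memory buffer stands in for the output file so the example runs without touching disk. The two sample rows are taken from the final output shown on the next slide.

```python
import csv
import io


def write_submission(handle, phrase_ids, labels):
    """Write a header row, then one 'PhraseId,Sentiment' row per prediction."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["PhraseId", "Sentiment"])
    writer.writerows(zip(phrase_ids, labels))


buffer = io.StringIO()  # stands in for an opened output file
write_submission(buffer, ["156348", "156349"], ["2", "3"])
print(buffer.getvalue())
```
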
  23. The final output file

    PhraseId,Sentiment
    156348,2 (Neutral)
    156349,3 (Somewhat Positive)
    156350,1 (Somewhat Negative)
    156351,2 (Neutral)
    156352,4 (Positive)
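
The label names in parentheses are annotations for the reader; the file itself holds only the numeric labels. A small lookup table can turn a line of the file back into a readable label. This is a sketch only: the names `LABEL_NAMES` and `describe` are hypothetical, and the mapping follows the five categories listed in the data-set description.

```python
# Label names as listed in the data-set description (0 to 4).
LABEL_NAMES = {
    0: "negative",
    1: "somewhat negative",
    2: "neutral",
    3: "somewhat positive",
    4: "positive",
}


def describe(line):
    """Turn a 'PhraseId,Sentiment' line into 'PhraseId: label name'."""
    phrase_id, label = line.strip().split(",")
    return f"{phrase_id}: {LABEL_NAMES[int(label)]}"


print(describe("156350,1"))  # 156350: somewhat negative
```
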
  24. And we are done  https://hayleyandjoelblog.files.wordpress.com/2015/02/hurray.png

  25. Or wait… Are we? http://www.stepupleader.com/wp-content/uploads/2013/06/curious.jpg

  26. Conclusion

    This is just a first baby step in learning data science. Kaggle hosts many interesting problems where you can practice and learn about this rapidly growing area. In real-world data science, no single model fits all problems, so one needs to keep learning new techniques. Real-world data is notoriously hard to work with, and new problems constantly bring new challenges. Python Scikit-Learn is a handy tool for this task. My verdict: try it out and get fascinated with the world of data science.
  27. Links and References

    ◦ B. Pang and L. Lee, Opinion Mining and Sentiment Analysis. Foundations and Trends in Information Retrieval 2(1-2), pp. 1-135, 2008.
    ◦ Yongzheng Zhang, Dan Shen and Catherine Baudin, Sentiment Analysis in Practice. Tutorial delivered at ICDM 2011.
    ◦ Scikit Learn Supervised Learning: http://scikit-learn.org/stable/supervised_learning.html#supervised-learning
    ◦ Scikit Learn Working with Text: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
    ◦ Andrew Ng’s Machine Learning Course: https://www.coursera.org/course/ml
    ◦ Manning and Jurafsky, Natural Language Processing Course: https://www.coursera.org/course/nlp
    ◦ Learning Scikit-Learn: Machine Learning in Python: http://www.amazon.com/Learning-scikit-learn-Machine-Python/dp/1783281936
  28. THANK YOU! You can write to me at: ankit.bahuguna@cs.tum.edu