Deep Learning Techniques and Representations for Sentiment Analysis of Movie Reviews

Ankit Bahuguna, M.Sc. Informatics, TU München

Outline
• Motivation
• Problem Statement
• Sentiment Analysis - Overview
• Deep Learning – Overview
• Background Study
• Representations
• Bag of Words
• Word2Vec
• Polyglot
• GloVe
• Experiment - Setup
• Experiments and Feature Engineering
• Results
• Observations
• Future Work
• References

Motivation
The overall goal of this project in the area of Computational Linguistics is to learn and apply the core concepts involved in developing systems that facilitate human-machine interaction by processing natural language. Specifically, I chose Sentiment Analysis because of its high impact: from movie or product reviews to stock market prediction, the scope of the problem is large and its applications are highly useful. Ultimately, the task is to develop a high-accuracy sentiment analyzer that is applicable to a wide range of domains, and to explore representations learned through unsupervised deep learning methods.

Problem Statement
The Rotten Tomatoes movie review dataset is a corpus of movie reviews used for sentiment analysis, originally collected by Pang and Lee [1]. In their work on sentiment treebanks, Socher et al. [2] used Amazon's Mechanical Turk to create fine-grained labels for all parsed phrases in the corpus. The task is to label phrases on a scale of five values: negative, somewhat negative, neutral, somewhat positive, positive. Obstacles like sentence negation, sarcasm, terseness, language ambiguity, and many others make this task very challenging.

Sentiment Analysis - Overview
Sentiment analysis (also known as opinion mining) refers to the use of natural language processing, text analysis and computational linguistics to identify and extract subjective information in source materials. A basic task in sentiment analysis is classifying the polarity of a given text at the document, sentence, or feature/aspect level: whether the expressed opinion in a document, a sentence or an entity feature/aspect is positive, negative, or neutral. Advanced, "beyond polarity" sentiment classification looks, for instance, at emotional states such as "angry," "sad," and "happy."

Example – Movie Reviews
▪ Even fans of Ismail Merchant 's work , I suspect , would have a hard time sitting through this one . (Slightly Negative)
▪ have a hard time sitting through this one (Negative)
▪ Have (Neutral)
▪ A comedy-drama of nearly epic proportions rooted in a sincere performance by the title character undergoing midlife crisis . (Positive)
▪ A comedy-drama of nearly epic proportions (Slightly Positive)

Deep Learning - Overview
Deep Learning is a class of machine learning algorithms characterized as follows:
1. They use many layers/levels of non-linear processing units for feature extraction and transformation, where higher-level features are derived from lower-level features to form a hierarchical representation.
2. Layers that have been used in deep learning include the hidden layers of an artificial neural network, restricted Boltzmann machines, etc.
3. Various deep learning architectures such as deep neural networks, convolutional deep neural networks, and deep belief networks have been applied to fields like computer vision, automatic speech recognition, natural language processing, and music/audio signal recognition, where they have been shown to produce state-of-the-art results on various tasks.

Background Study
❑ Language Modelling
❑ Language Modelling using N-Grams
❑ Word Representations
❑ Language Modelling using Neural Networks (NN)
❑ Language Modelling using Recurrent Neural Networks (RNN)
❑ Recursive Neural Tensor Networks (RNTN)
❑ Domain Adaptation

Language Modelling
• A language model is a probabilistic model that assigns a probability p(w1, ..., wT) to any sequence of words.
• Language modelling is the task of learning a language model that assigns high probabilities to well-formed sentences. We rely on the nth-order Markov assumption, under which the i-th word is assumed to be generated based only on the n-1 previous words.
• Language models play a crucial role in speech recognition and machine translation systems.
Example: Translation of a Hindi sentence to English. For एक अच्छा लड़का {Ek Acha Ladka}, a good language model should assign a higher probability to "A good boy." than to "A boy good."

Language Modelling using N-Grams
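
As a rough illustration of the n-gram idea, the sketch below estimates a bigram model (each word depends only on the previous word) from a tiny corpus and scores the two candidate translations from the earlier example. The corpus, the add-one smoothing and all function names are assumptions made for this sketch; they are not taken from the original slides.

```python
from collections import Counter

# Hypothetical toy corpus, invented for illustration only.
corpus = [
    "a good boy",
    "a good girl",
    "the good boy",
]

def train_bigram_counts(sentences):
    """Count histories and bigrams, using <s> and </s> as sentence markers."""
    history_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return history_counts, bigram_counts

def sentence_probability(sentence, history_counts, bigram_counts, vocab_size):
    """Product of add-one smoothed bigram probabilities p(word | previous word)."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= (bigram_counts[(prev, word)] + 1) / (history_counts[prev] + vocab_size)
    return prob

history_counts, bigram_counts = train_bigram_counts(corpus)
# Words that can be predicted: every corpus word plus the end-of-sentence marker.
vocab = {w for s in corpus for w in s.split()} | {"</s>"}

p_good = sentence_probability("a good boy", history_counts, bigram_counts, len(vocab))
p_bad = sentence_probability("a boy good", history_counts, bigram_counts, len(vocab))
print(p_good > p_bad)
```

On this toy corpus the well-formed word order receives the higher probability, which is the behaviour a language model is meant to provide.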

Word Representations
In the context of the current problem, we use unsupervised deep learning techniques to learn a word representation C(w), a continuous vector that captures both syntactic and semantic similarity. More precisely, we learn a continuous representation of words and would like the distance ||C(w) - C(w')|| to reflect meaningful similarity between words w and w'. Chiefly, we explore the following word representations: Word2Vec, Polyglot, GloVe and their concatenated combinations, along with the Bag of Words representation, and compare their results for the task of sentiment analysis.

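Language Modelling using NN
• The N previous words are encoded using 1-of-V coding
• Words are projected by a linear operation onto the projection layer
• A softmax function is used at the output layer to ensure that 0 <= p <= 1 and that the probabilities sum to 1
• Weights are learnt using back-propagation
• Complexity per example = N*D + N*D*H + H*V, where N is the number of context words, D the projection dimensionality, H the hidden layer size and V the vocabulary size
Image Courtesy: Tomas Mikolov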
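
The role of the softmax output layer can be sketched in a few lines. The scores below are hypothetical output-layer activations for a four-word vocabulary, chosen only for illustration.

```python
import math

def softmax(scores):
    """Turn arbitrary real-valued scores into a probability distribution."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for a 4-word vocabulary (not from the slides).
probs = softmax([2.0, 1.0, 0.1, -1.0])
print(probs)
```

Each output lies between 0 and 1 and the outputs sum to 1, so they can be read as p(next word | context).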

Language Modelling using Recurrent NN
• No need to specify the context length
• No projection layer
• The hidden layer for the previous word connects to the hidden layer for the next word.
• This acts as a kind of short-term memory that holds information about the history.
• Complexity per example = H*H + H*V
Image Courtesy: Tomas Mikolov
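
To compare the two per-example complexity formulas quoted for the feed-forward and the recurrent language model, they can be evaluated for concrete sizes. The sizes used below are hypothetical and only meant to show which term dominates.

```python
def nnlm_complexity(n, d, h, v):
    """Per-example cost of the feed-forward NN language model: N*D + N*D*H + H*V."""
    return n * d + n * d * h + h * v

def rnnlm_complexity(h, v):
    """Per-example cost of the recurrent NN language model: H*H + H*V."""
    return h * h + h * v

# Hypothetical sizes: 10 context words, 500-dim projection,
# 500 hidden units, vocabulary of one million words.
N, D, H, V = 10, 500, 500, 1000000

nnlm = nnlm_complexity(N, D, H, V)
rnnlm = rnnlm_complexity(H, V)
print(nnlm, rnnlm, H * V)
```

With a large vocabulary the H*V output term accounts for almost all of the cost in both models, which is the motivation usually given for cheaper output layers.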

Recursive Neural Tensor Networks
A new composition function 'p' was introduced in a new compositional model called the RNTN, along with a new sentiment treebank, which allows training and evaluation with compositional information.
More expressive than any other recursive neural network so far!
Idea: Allow more interaction between vectors.
Image Courtesy: Socher et al. 2013 EMNLP
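
For reference, the tensor-based composition can be written as follows. This is a sketch of the form given in Socher et al. (2013); the child-vector names a and b are chosen here for illustration and may differ from the paper's notation.

```latex
p = f\left(
  \begin{bmatrix} a \\ b \end{bmatrix}^{\top}
  V^{[1:d]}
  \begin{bmatrix} a \\ b \end{bmatrix}
  + W \begin{bmatrix} a \\ b \end{bmatrix}
\right),
\qquad a, b \in \mathbb{R}^{d},\quad
V^{[1:d]} \in \mathbb{R}^{2d \times 2d \times d},\quad
W \in \mathbb{R}^{d \times 2d},\quad
f = \tanh
```

The bilinear term with the tensor V lets the two child vectors interact multiplicatively; setting V to zero recovers the composition function of a standard recursive neural network.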

Domain Adaptation
Issue of domain dependency: A classifier trained using opinionated documents from domain A often performs poorly when tested on documents from domain B.
Solution: Domain Adaptation [Blitzer et al., ACL 2007]
Step 1. Use labeled data from one domain, unlabeled data from both the source and the target domain, and general opinion words as features.
Step 2. Choose a set of pivot features which occur frequently in both domains.
Step 3. Model correlations between the pivot features and all other features by training linear pivot predictors to predict occurrences of each pivot in the unlabeled data from both domains.
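
Step 2 can be sketched as a simple frequency filter over unlabeled text from both domains. This is a simplified illustration, not the procedure of Blitzer et al.: the threshold, the ranking rule, the function name and the example snippets are all assumptions made here.

```python
from collections import Counter

def choose_pivots(source_docs, target_docs, min_count=2, top_k=3):
    """Pick features that occur at least min_count times in BOTH domains."""
    source_counts = Counter(w for doc in source_docs for w in doc.lower().split())
    target_counts = Counter(w for doc in target_docs for w in doc.lower().split())
    shared = [w for w in source_counts
              if source_counts[w] >= min_count and target_counts[w] >= min_count]
    # Rank by the smaller of the two domain frequencies; ties break alphabetically.
    shared.sort(key=lambda w: (-min(source_counts[w], target_counts[w]), w))
    return shared[:top_k]

# Hypothetical unlabeled snippets from two domains (books and DVDs).
books = [
    "an excellent read",
    "excellent plot but a boring ending",
    "boring and predictable plot",
]
dvds = [
    "excellent picture quality",
    "boring extras",
    "an excellent transfer but a boring menu",
]

pivots = choose_pivots(books, dvds)
print(pivots)
```

Words such as "plot" are frequent in only one domain and are therefore not selected, while general opinion words that appear in both domains survive the filter.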

Representations
I. Bag of Words (B)
II. Word2Vec (W) – CBOW and Skip-gram models
III. Polyglot (P)
IV. GloVe (G)

Representation - I: Bag of Words
In this model, a text (such as a sentence or a document) is represented as the bag (multi-set) of its words, disregarding grammar and even word order but keeping multiplicity.
Example:
D1: John likes to watch movies. Mary likes movies too.
D2: John also likes to watch football games.
Vocabulary {Word : Index}
{ "John": 1, "likes": 2, "to": 3, "watch": 4, "movies": 5, "also": 6, "football": 7, "games": 8, "Mary": 9, "too": 10 }
There are 10 distinct words and, using the indexes of the vocabulary, each document is represented by a 10-entry vector:
D1: [1, 2, 1, 1, 2, 0, 0, 0, 1, 1]
D2: [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
Note: scikit-learn directly supports this vector representation through its CountVectorizer. Similar support is available for TF-IDF. In our experiments we use CountVectorizer.
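
The two example vectors can be reproduced with a few lines of plain Python. This is a minimal sketch using the vocabulary from the example; it is not the CountVectorizer-based code used in the experiments, and the function name is chosen here for illustration.

```python
from collections import Counter
import re

# Vocabulary and 1-based indexes exactly as in the example above.
vocabulary = {"John": 1, "likes": 2, "to": 3, "watch": 4, "movies": 5,
              "also": 6, "football": 7, "games": 8, "Mary": 9, "too": 10}

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs, ignoring word order."""
    counts = Counter(re.findall(r"\w+", text))
    vector = [0] * len(vocabulary)
    for word, index in vocabulary.items():
        vector[index - 1] = counts[word]  # indexes in the example start at 1
    return vector

d1 = bag_of_words("John likes to watch movies. Mary likes movies too.", vocabulary)
d2 = bag_of_words("John also likes to watch football games.", vocabulary)
print(d1)
print(d2)
```

Note that scikit-learn's CountVectorizer lowercases text and orders its vocabulary alphabetically by default, so its columns would appear in a different order than in this hand-built example.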

Representation - II: Word2Vec
Mikolov et al. (2013) propose two novel model architectures for computing continuous vector representations of words from very large datasets. They are:
▪ Continuous Bag of Words (CBOW)
▪ Continuous Skip-gram (SKIP)
Word2Vec focuses on distributed representations learned by neural networks. All models are trained using stochastic gradient descent and back-propagation. For all models, the training complexity is proportional to O = E x T x Q, where E = number of training epochs; T = number of words in the training set; Q = defined further for each model architecture.

Word2Vec - CBOW and SKIP
• CBOW: Predicts the current word based on the context.
• Similar to the feed-forward neural network language model, except that the non-linear hidden layer is removed and the projection layer is shared for all words (not just the projection matrix); thus all words are projected into the same position (their vectors are averaged).
• Best performance was obtained by building a log-linear classifier with four future and four history words as input, where the training criterion is to correctly classify the current (middle) word.
• SKIP: Tries to maximize classification of a word based on another word in the same sentence.
• Each current word is used as input to a log-linear classifier with a continuous projection layer, which predicts words within a certain range before and after the current word.
• Example: The analogy "king is to queen as man is to woman" should be encoded in the vector space by the vector equation: king - queen = man - woman
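
The vector equation in the example can be illustrated with hand-made two-dimensional vectors. The embeddings below are hypothetical (one axis loosely standing for "royalty", the other for "gender"); real Word2Vec vectors have hundreds of dimensions and satisfy the equation only approximately.

```python
import math

# Hypothetical 2-d embeddings, invented for illustration: (royalty, gender).
vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "movie": [-1.0, 0.0],
}

def subtract(u, v):
    return [a - b for a, b in zip(u, v)]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def most_similar(target, exclude):
    """Return the word whose vector is closest (by cosine) to the target."""
    candidates = [w for w in vectors if w not in exclude]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# king - queen and man - woman give the same offset in this toy space.
offset_royal = subtract(vectors["king"], vectors["queen"])
offset_common = subtract(vectors["man"], vectors["woman"])

# Equivalent query form: king - man + woman should land near queen.
target = add(subtract(vectors["king"], vectors["man"]), vectors["woman"])
answer = most_similar(target, exclude={"king", "man", "woman"})
print(offset_royal, offset_common, answer)
```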

CBOW and SKIP Models Mikolov T. et al. 2013

Representation - III: Polyglot
• Al-Rfou et al. (2013) trained word embeddings for more than 100 languages using their corresponding Wikipedias.
Example: Nearest neighbors of words as they appear in the English embeddings.
• They quantitatively demonstrated the utility of their word embeddings by using them as the sole features for training a part-of-speech tagger for a subset of these languages.
Image: Al-Rfou et al. 2013

Polyglot – NN Architecture
• Words are retrieved from the embeddings matrix C and concatenated at the projection layer as input to compute the hidden layer activation. The score is a linear combination of the activation values of the hidden layer.
• The scores of the two phrases are ranked using a hinge loss to distinguish the corrupted phrase from the original one.
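
The ranking objective can be sketched as a margin-based hinge loss over the two phrase scores. The margin of 1 and the example scores are assumptions for this sketch; the scores would normally come from the network described above.

```python
def hinge_ranking_loss(score_original, score_corrupted, margin=1.0):
    """Zero once the original phrase outscores the corrupted one by the margin."""
    return max(0.0, margin - score_original + score_corrupted)

# Hypothetical scores: the original phrase is clearly preferred, so no loss.
loss_separated = hinge_ranking_loss(score_original=2.5, score_corrupted=0.5)

# Hypothetical scores: the corrupted phrase scores higher, so the loss is positive.
loss_violated = hinge_ranking_loss(score_original=0.2, score_corrupted=0.9)

print(loss_separated, loss_violated)
```

Training pushes the score of the original phrase above that of the corrupted phrase until the margin is satisfied, at which point the pair no longer contributes a gradient.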

Representation - IV: GloVe
• Pennington et al. (2014) introduce a new global log-bilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization (LSA) and local context window methods (Skip-gram model).
• The model trains only on the non-zero elements in a word-word co-occurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus.
• It is a specific weighted least squares model that trains on global word-word co-occurrence counts and thus makes efficient use of corpus statistics (word occurrences in a corpus).
• Performance is demonstrated on word similarity and Named Entity Recognition (NER) tasks.
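
The weighted least squares objective can be sketched for a single word pair. The form below follows the GloVe objective as commonly stated, where each non-zero count X_ij contributes f(X_ij) * (w_i . w_j + b_i + b_j - log X_ij)^2; the defaults x_max = 100 and alpha = 0.75 and all example values are assumptions of this sketch.

```python
import math

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(x): down-weights rare pairs, caps frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_pair_loss(w_i, w_j_context, b_i, b_j_context, x_ij):
    """Weighted squared error between the model score and log(X_ij) for one pair."""
    dot = sum(a * b for a, b in zip(w_i, w_j_context))
    error = dot + b_i + b_j_context - math.log(x_ij)
    return glove_weight(x_ij) * error ** 2

# Hypothetical vectors and biases that fit a count of 5 exactly.
perfect = glove_pair_loss([1.0, 0.0], [0.0, 1.0], math.log(5.0), 0.0, 5.0)

# The same parameters against a count of 50 leave a residual error.
imperfect = glove_pair_loss([1.0, 0.0], [0.0, 1.0], math.log(5.0), 0.0, 50.0)

print(perfect, imperfect)
```

Only pairs with a non-zero count enter the sum, which is why training touches just the non-zero entries of the co-occurrence matrix.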

GloVe – Co-occurrence Probability
Example: Consider the concept of thermodynamic phase, for which we might take i = ice and j = steam. The relationship of these words can be examined by studying the ratio of their co-occurrence probabilities with various probe words, k. For words k related to ice but not steam, say k = solid, we expect the ratio P_ik / P_jk to be large. Similarly, for words k related to steam but not ice, say k = gas, the ratio should be small. For words k like water or fashion, that are either related to both ice and steam, or to neither, the ratio should be close to one.
Table: Co-occurrence probabilities for target words ice and steam with selected context words from a 6 billion token corpus.
Only in the ratio does noise from non-discriminative words like water and fashion cancel out, so that large values (much greater than 1) correlate well with properties specific to ice, and small values (much less than 1) correlate well with properties specific to steam.
Image Courtesy: Pennington J. et al. 2014
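
The ratio argument can be reproduced with made-up counts. The numbers below are hypothetical and are not the values from the table in Pennington et al. (2014); they are chosen only so that each probe word falls into one of the three cases described above.

```python
# Hypothetical co-occurrence counts X_ik (invented for illustration).
# "<other>" stands for all remaining context words so that rows have totals.
cooccurrence = {
    "ice":   {"solid": 190, "gas": 6,  "water": 300, "fashion": 2, "<other>": 502},
    "steam": {"solid": 2,   "gas": 78, "water": 220, "fashion": 2, "<other>": 698},
}

def probability(word, context):
    """P_ik = X_ik / X_i: how often `context` appears near `word`."""
    row = cooccurrence[word]
    return row[context] / sum(row.values())

def ratio(i, j, k):
    """P_ik / P_jk for target words i, j and probe word k."""
    return probability(i, k) / probability(j, k)

ratios = {k: ratio("ice", "steam", k) for k in ["solid", "gas", "water", "fashion"]}
for probe, value in ratios.items():
    print(probe, round(value, 3))
```

With these counts the ratio is large for "solid", small for "gas", and close to one for "water" and "fashion".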

Experiment - Dataset
Dataset – Overview
◦ In-domain data, originally from Rotten Tomatoes.
◦ Training data: Kaggle Movie Reviews, 156,060 phrases
◦ Testing data: Kaggle Movie Reviews, 66,292 phrases
◦ Out-of-domain data: Google News (2.7 GB)
◦ Domain-adapted data: Amazon (Book + DVD reviews) + Google News + Kaggle in-domain training data (3.6 GB)
◦ Task: Classify test phrases into one of the five categories: negative (1), somewhat negative (2), neutral (3), somewhat positive (4) and positive (5).

Experiment Statistics and Setup
Dataset – Feature Engineering (Concatenation of Feature Vectors)
◦ Experiment statistics: 32 experiments; 4 basic representations: Bag of Words, Word2Vec, Polyglot and GloVe; concatenation of representations in all possible combinations, without repetition.
◦ Server setup: LMU – Omega (B; W; G) and Calculus (P); Python 2.7, Java 8, scikit-learn and shell scripts.
◦ Classification algorithm: linear Support Vector Machine (LIBLINEAR), scikit-learn implementation.
◦ Evaluation: The output of the task was submitted to the Kaggle server in a predefined format for evaluation.
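
The feature sets obtained by combining the four representations without repetition can be enumerated directly. This is a small sketch; the one-letter codes follow the abbreviations used in the deck.

```python
from itertools import combinations

# One-letter codes as used in the deck: Bag of Words, Word2Vec, Polyglot, GloVe.
representations = ["B", "W", "P", "G"]

# Every non-empty combination, from single representations up to all four.
feature_sets = [
    " + ".join(combo)
    for size in range(1, len(representations) + 1)
    for combo in combinations(representations, size)
]

for name in feature_sets:
    print(name)
print(len(feature_sets))
```

This yields 15 feature sets. Counting each set once per training corpus (two corpora) plus the two additional domain-adapted runs would be consistent with the 32 experiments mentioned above, although that breakdown is an inference from the result tables rather than something stated explicitly.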

Experiment Steps
Steps involved: Data Preparation + Feature Engineering:
◦ Lowercasing, stop word removal and L2 normalization.
◦ TF-IDF and Count Vectorizer (or Bag of Words counts)
◦ Normalization of vectors
◦ Retrieving Word2Vec, Polyglot and GloVe to generate individual vectors trained over in-domain and out-of-domain data.
◦ Computing the centroid of vectors and creating a lookup table with <word : centroid weight>
◦ Re-computing Count Vectorizers for training and testing data using Word2Vec, Polyglot and GloVe centroids.
◦ Training data is fed to a LIBLINEAR SVM and the output is obtained in the predefined format.
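
One plausible reading of the centroid and concatenation steps is sketched below: a phrase is represented by the average of its word vectors, and this centroid is appended to the L2-normalized Bag of Words vector. The embeddings, dimensions and function names are hypothetical, and the exact procedure used in the experiments may differ.

```python
import math

# Hypothetical 2-d word embeddings, invented for illustration.
embeddings = {"good": [0.9, 0.1], "movie": [0.2, 0.8], "bad": [-0.9, 0.1]}

def centroid(tokens, embeddings, dim):
    """Average the vectors of all known tokens; zeros if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

def l2_normalize(vector):
    """Scale a vector to unit length; leave an all-zero vector unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)

def phrase_features(phrase, bow_vector, embeddings, dim):
    """Concatenate the normalized count vector with the normalized centroid."""
    tokens = phrase.lower().split()
    return l2_normalize(bow_vector) + l2_normalize(centroid(tokens, embeddings, dim))

# Hypothetical 3-word Bag of Words vocabulary: [good, movie, bad].
features = phrase_features("Good movie", [1, 1, 0], embeddings, dim=2)
print(features)
```

The resulting vector has one block per representation, so further representations can be appended in the same way before the features are passed to a linear classifier.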

Learned Word Vectors in Bag Of Words Representation Results

Results and Evaluation - I
S. No.  Representation      In Domain (Movie Reviews)   Out Domain (Google News)
1       Bag of Words (B)    0.61175                     0.60844
2       Word2Vec (W)        0.60535                     0.61140
3       Polyglot (P)        0.60516                     0.61413
4       B + W               0.62155                     0.62116
5       W + P               0.61470                     0.62201
6       P + B               0.62172                     0.62483
7       B + W + P           0.62276                     0.62498
Table: Accuracy computed over test data (provided by the Kaggle contest: Sentiment Analysis of Movie Reviews)
Additional result: Dataset: Google News + Amazon (Books + DVD) (Blitzer et al. 2007) + Movie (Kaggle); B + W + P accuracy: 0.62489
Everything-neutral benchmark: 0.51789

Results and Evaluation - II
S. No.  Representation      In Domain (Movie Reviews)   Out Domain (Google News)
1       GloVe (G)           0.59959                     0.60448
2       W + G               0.61472                     0.61905
3       P + G               0.61569                     0.62018
4       B + G               0.62226                     0.61697
5       W + P + G           0.61896                     0.62237
6       P + B + G           0.62353                     0.62386
7       B + W + G           0.62249                     0.62143
8       B + W + P + G       0.62169                     0.62348
Table: Accuracy computed over test data (provided by the Kaggle contest: Sentiment Analysis of Movie Reviews)
Additional result: Dataset: Google News + Amazon (Books + DVD) (Blitzer et al. 2007) + Movie (Kaggle); B + W + P + G accuracy: 0.62370
Everything-neutral benchmark: 0.51789

Centroid Based Sentence Representation Results

Results and Evaluation - III
S. No.  Representation      In Domain (Movie Reviews)   Out Domain (N+A+K)
1       Bag of Words (B)    0.61175                     0.60858
2       Word2Vec (W)        0.51789                     0.51789
3       Polyglot (P)        0.51789                     0.51789
4       B + W               0.60988                     0.58802
5       W + P               0.51789                     0.51789
6       P + B               0.60840                     0.60648
7       B + W + P           0.60368                     0.56937
Table: Accuracy computed over test data (provided by the Kaggle contest: Sentiment Analysis of Movie Reviews)
N: Google News; A: Amazon Multi-Domain Sentiment Dataset; K: Kaggle movie review dataset.
Everything-neutral benchmark: 0.51789

Results and Evaluation - IV
S. No.  Representation      In Domain (Movie Reviews)   Out Domain (N+A+K)
1       GloVe (G)           0.51762                     0.51762
2       W + G               0.51780                     0.51780
3       P + G               0.51788                     0.51788
4       B + G               0.44686                     0.45545
5       W + P + G           0.51789                     0.51789
6       P + B + G           0.52365                     0.52717
7       B + W + G           0.52472                     0.56241
8       B + W + P + G       0.53935                     0.56044
Table: Accuracy computed over test data (provided by the Kaggle contest: Sentiment Analysis of Movie Reviews)
N: Google News; A: Amazon Multi-Domain Sentiment Dataset; K: Kaggle movie review dataset.
Everything-neutral benchmark: 0.51789

Observations
1. BEST RESULT: The concatenation of BOW + Polyglot + W2V with word vectors trained on Google News, i.e. out-of-domain data without the in-domain training data, gives the best accuracy of 0.62498.
2. WORST RESULT: Among the learned word vector experiments, the GloVe representation using the in-domain movie review dataset provided by Kaggle gives the lowest accuracy of 0.59959.
3. Among the 4 individual representations, Bag of Words (0.61175) performs best when training over the relatively small in-domain Kaggle movie review dataset. When training on a large dataset like Google News, however, Polyglot (0.61413) marginally outperforms the others.
4. Concatenation of vector representations improved accuracy, e.g. from 0.61175 (B) to 0.62276 (B + W + P) on in-domain data.
5. Domain adaptation also improves accuracy over in-domain training (B + W + P: 0.62489 vs. 0.62276). It was found that for the concatenated sets (B+W+P) > (B+W+P+G) (0.62489 vs. 0.62370) when trained over the Amazon + News + Kaggle dataset.
6. Current Kaggle rank with this result: 140

Future Work
▪ Explore Recursive Auto-Encoders, chiefly the mSDA (Marginalized Stacked Denoising Auto-encoders) algorithm (Chen et al., 2012), and apply them to the task of sentiment analysis.
▪ Our current study revolves around Deep Belief Networks; exploring other deep networks could prove insightful.
▪ Explore the impact on languages other than English.
▪ Test over large and small training sets of similar and different domains in various languages.
▪ Propose a new technique which can improve the accuracy on this task significantly.

References
B. Pang and L. Lee. Opinion Mining and Sentiment Analysis. Foundations and Trends in Information Retrieval 2(1-2), pp. 1-135, 2008.
Yongzheng Zhang, Dan Shen and Catherine Baudin. Sentiment Analysis in Practice. Tutorial delivered at ICDM 2011.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher Manning, Andrew Ng and Christopher Potts. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2013.
Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. Polyglot: Distributed Word Representations for Multilingual NLP. Seventeenth Conference on Computational Natural Language Learning (CoNLL), 2013.
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global Vectors for Word Representation. Empirical Methods in Natural Language Processing (EMNLP), 2014.
John Blitzer, Mark Dredze, and Fernando Pereira. Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification. Association for Computational Linguistics (ACL), 2007.
Minmin Chen, Zhixiang (Eddie) Xu, Kilian Q. Weinberger, and Fei Sha. Marginalized Stacked Denoising Autoencoders for Domain Adaptation. Proceedings of the 29th International Conference on Machine Learning (ICML), Edinburgh, Scotland, Omnipress, pages 767-774, 2012.

THANK YOU! Email: [email protected]
