[Literature review] emoji2vec: Learning Emoji Representations from their Description

Literature review: Ben Eisner, Tim Rocktäschel, Isabelle Augenstein, Matko Bošnjak, Sebastian Riedel. Proceedings of The Fourth International Workshop on Natural Language Processing for Social Media, pages 48–54, 2016
Nagaoka University of Technology
VO HUYNH QUOC VIET
Natural Language Processing Laboratory
2018 / 06 / 26

Introduction 2
• Many current natural language processing applications for social media rely on pretrained sets of word embeddings, but these contain few or no emoji representations, even as emoji usage in social media has increased.
• The Oxford Dictionary named 2015 the year of the emoji, citing an increase in usage of over 800%.
• Over 10% of Twitter posts and over 50% of text on Instagram contain one or more emoji.

Introduction 3
• This paper introduces emoji2vec, pre-trained embeddings for all Unicode emoji.
• The embeddings for emoji Unicode symbols are learned from their descriptions in the Unicode emoji standard.
• The usefulness of emoji representations trained in this way is demonstrated by evaluating on a Twitter sentiment analysis task.
• A qualitative analysis investigates emoji analogy examples and visualizes the emoji embedding space.
• emoji2vec can be readily used in social natural language processing applications alongside word2vec.

Method 4
• Maps emoji symbols into the same space as the 300-dimensional Google News word2vec embeddings.
• Crawl the emoji, their names, and their keyword phrases from the Unicode emoji list, resulting in 6088 descriptions of 1661 emoji symbols.

Method 5 Model
• For every training example consisting of an emoji and a sequence of words $w_1, \ldots, w_N$ describing that emoji, take the sum of the individual word vectors in the descriptive phrase:
$$v_j = \sum_{k=1}^{N} w_k$$
• $w_k$: the word2vec vector for word $w_k$
• $v_j$: the vector representation of the description.
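The summation on this slide can be sketched in a few lines of Python. This is only an illustration: the 3-dimensional vectors and the `WORD_VECTORS` table are made-up stand-ins for the 300-dimensional word2vec embeddings, and skipping words without a vector is an assumption rather than something stated on the slide.

```python
# Toy 3-dimensional stand-ins for word2vec vectors (hypothetical values).
WORD_VECTORS = {
    "face": [1.0, 0.0, 2.0],
    "with": [0.5, 0.5, 0.0],
    "tears": [0.0, 1.0, 1.0],
    "joy": [2.0, 1.0, 0.0],
}

def description_vector(words, word_vectors, dim=3):
    """Return v_j, the sum of the vectors of the words in one description."""
    total = [0.0] * dim
    for word in words:
        vec = word_vectors.get(word)
        if vec is None:
            continue  # assumption: words without a vector are skipped
        total = [t + x for t, x in zip(total, vec)]
    return total

# "of" has no entry in the toy table, so it contributes nothing.
v_j = description_vector(["face", "with", "tears", "of", "joy"], WORD_VECTORS)
print(v_j)
```

Because the representation is a plain sum, word order in the description has no effect on the result.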

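Method 6 Model
• Use the logistic loss for training:
$$\mathcal{L}(i, j, y_{ij}) = -\log\left(\sigma\left(y_{ij}\, x_i^\top v_j - (1 - y_{ij})\, x_i^\top v_j\right)\right)$$
• Define a trainable vector $x_i$ for every emoji in the training set.
• Sigmoid of the dot product, $\sigma(x_i^\top v_j)$: the probability of a match between the emoji representation $x_i$ and its description representation $v_j$.
• $y_{ij}$ is 1 if description j is valid for emoji i and 0 otherwise.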
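As a rough illustration of how the match probability and the loss behave, the sketch below computes both for one emoji-description pair. It assumes the usual logistic loss on the dot-product score (negative log of the sigmoid of the score for a valid pair, and of the negated score for an invalid pair). The vectors are made-up 3-dimensional values and the function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def logistic_loss(x_i, v_j, y_ij):
    """Logistic loss for one (emoji, description) pair.

    The score is the dot product of the emoji vector and the description
    vector; its sign is flipped when the pair is labeled invalid (y_ij = 0).
    """
    score = dot(x_i, v_j)
    signed_score = score if y_ij == 1 else -score
    return -math.log(sigmoid(signed_score))

x_i = [0.2, -0.1, 0.4]  # trainable emoji vector (toy values)
v_j = [1.0, 0.5, 2.0]   # description vector (toy values)

match_probability = sigmoid(dot(x_i, v_j))
loss_if_valid = logistic_loss(x_i, v_j, 1)
loss_if_invalid = logistic_loss(x_i, v_j, 0)
print(match_probability, loss_if_valid, loss_if_invalid)
```

With a positive score, the pair is penalized less when it is labeled valid than when it is labeled invalid, which is what pushes emoji vectors toward their descriptions during training.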

Method 7 Optimization
• The model is
• implemented in TensorFlow (Abadi et al., 2015)
• optimized using stochastic gradient descent with Adam (Kingma and Ba, 2015)
• The training data contains no negative examples:
• invalid descriptions of emoji do not appear in the original training set
• negative examples are therefore sampled, using the same number of negative and positive samples.
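One way to build a balanced training set from positive pairs alone is sketched below. It is an illustration under stated assumptions, not the authors' code: the pairs are toy data, `sample_negatives` is a hypothetical helper, and drawing exactly one random non-matching description per positive pair is just one simple way to get equal numbers of negatives and positives.

```python
import random

def sample_negatives(positive_pairs, seed=0):
    """Draw one invalid (emoji, description) pair per positive pair.

    Assumes every emoji has at least one description in the pool that is
    not listed for it; otherwise rng.choice would fail on an empty list.
    """
    rng = random.Random(seed)
    valid = set(positive_pairs)
    all_descriptions = sorted({description for _, description in positive_pairs})
    negatives = []
    for emoji, _ in positive_pairs:
        candidates = [d for d in all_descriptions if (emoji, d) not in valid]
        negatives.append((emoji, rng.choice(candidates)))
    return negatives

# Toy positive pairs; emoji are written as escapes for portability.
positives = [
    ("\U0001F602", "face with tears of joy"),
    ("\U0001F602", "laugh"),
    ("\U0001F355", "pizza"),
    ("\U0001F355", "slice"),
    ("\U0001F697", "car"),
]
negatives = sample_negatives(positives)

# Label 1 marks a valid description, label 0 a sampled invalid one.
training_set = [(e, d, 1) for e, d in positives] + [(e, d, 0) for e, d in negatives]
print(len(positives), len(negatives))
```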

Evaluation 8
• Evaluate on an intrinsic task (emoji-description classification) and an extrinsic task (Twitter sentiment analysis).
• Qualitative analysis by visualizing the learned emoji embedding space and investigating emoji analogy examples.

Evaluation 9
• Emoji-Description Classification
• Created a manually labeled test set containing pairs of emoji and phrases, as well as a correspondence label.
• Calculate $\sigma(x_i^\top v_j)$ for each example in the test set, measuring the similarity between the emoji vector and the sum of word vectors in the phrase.
• Vary the threshold used for this classifier to obtain a receiver operating characteristic curve.
⇒ An area under the curve of 0.933, demonstrating the high quality of the learned emoji representations.
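The threshold sweep behind a receiver operating characteristic curve can be sketched as follows. The scores and labels are made-up toy values (not the paper's test set), and the helper names are hypothetical. The area is computed with the trapezoidal rule over the (false positive rate, true positive rate) points.

```python
def roc_points(scores, labels):
    """Sweep the threshold over every distinct score and collect (FPR, TPR)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal rule over consecutive ROC points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Toy classifier scores and correspondence labels (1 = pair matches).
scores = [0.95, 0.80, 0.70, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]

auc = area_under_curve(roc_points(scores, labels))
print(round(auc, 4))
```

In this toy example, 8 of the 9 (matching, non-matching) pairs are ranked in the right order, so the area comes out at 8/9.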

Evaluation 10
• Sentiment Analysis on Tweets
• Compare the accuracy of sentiment classification of tweets for various classifiers, using each of the following embeddings:
1. The original Google News word2vec embeddings.
2. word2vec augmented with emoji embeddings trained by Barbieri et al. (2016), using the skip-gram neural embedding model of Mikolov et al. (2013).
3. word2vec augmented with emoji2vec trained from Unicode descriptions.
• Dataset:
• 67k English tweets labelled manually for positive, neutral, or negative sentiment by Kralj Novak et al. (2015)
• In both the training set and the test set, 46% of tweets are labeled neutral, 29% are labeled positive, and 25% are labeled negative.
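To make "word2vec augmented with emoji embeddings" concrete, the sketch below merges a word table and an emoji table into one lookup and turns a tokenized tweet into a single feature vector. Everything here is illustrative: the 2-dimensional vectors are made up, `tweet_features` is a hypothetical helper, and summing the token vectors is an assumed featurization that the slide does not spell out.

```python
# Toy 2-dimensional stand-ins for word2vec and emoji embeddings.
WORD_VECTORS = {"great": [1.0, 0.5], "day": [0.5, 0.25]}
EMOJI_VECTORS = {"\U0001F602": [0.25, 1.0]}  # face with tears of joy

def tweet_features(tokens, word_vectors, emoji_vectors=None, dim=2):
    """Sum the vectors of all tokens that have an embedding.

    Passing emoji_vectors extends the lookup so that emoji tokens
    contribute too; this works because both tables share one space.
    """
    lookup = dict(word_vectors)
    if emoji_vectors:
        lookup.update(emoji_vectors)
    total = [0.0] * dim
    for token in tokens:
        vec = lookup.get(token)
        if vec is not None:
            total = [t + x for t, x in zip(total, vec)]
    return total

tokens = ["great", "day", "\U0001F602"]
words_only = tweet_features(tokens, WORD_VECTORS)
with_emoji = tweet_features(tokens, WORD_VECTORS, EMOJI_VECTORS)
print(words_only, with_emoji)
```

The resulting vector would then be passed to whichever sentiment classifier is being compared. Without the emoji table, the emoji token is simply ignored.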

Evaluation 11 • Sentiment Analysis on Tweets

Evaluation 12
• Analogy Task
• In word2vec, the vector representation of 'king' minus 'man' plus 'woman' is closest to 'queen'.
• It is difficult to build such an analogy task for emoji due to the small number and semantically distinct categories of emoji.
• Although the correct answer is sometimes not the top result, it is often contained in the top three.
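The vector arithmetic behind such analogies can be sketched as a nearest-neighbor search. The example uses made-up 3-dimensional word vectors chosen so that the arithmetic works out. Ranking by cosine similarity and excluding the three query items from the candidates are assumptions (common practice, but not stated on the slide).

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def analogy(a, b, c, vectors, top_k=3):
    """Rank candidates by similarity to vec(a) - vec(b) + vec(c)."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [name for name in vectors if name not in (a, b, c)]
    ranked = sorted(candidates, key=lambda name: cosine(query, vectors[name]), reverse=True)
    return ranked[:top_k]

# Toy vectors (hypothetical values, not real word2vec entries).
TOY_VECTORS = {
    "king": [1.0, 1.0, 0.0],
    "man": [1.0, 0.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "queen": [0.0, 1.0, 1.0],
    "prince": [1.0, 1.0, 0.1],
    "apple": [1.0, 0.0, 1.0],
    "car": [0.9, 0.0, 0.1],
}

top_three = analogy("king", "man", "woman", TOY_VECTORS)
print(top_three)
```

Returning the top three candidates rather than only the best one mirrors the way the emoji analogies are judged, where the expected answer may appear second or third.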

Evaluation 13
• t-SNE Visualization
• Project the learned emoji embeddings into two-dimensional space using t-SNE (Maaten and Hinton, 2008).
• t-SNE projects high-dimensional embeddings into a lower-dimensional space while attempting to preserve relative distances.

Evaluation 14 • t-SNE Visualization

Conclusions 15
• Released emoji2vec: embeddings of 1661 emoji symbols.
• Instead of running word2vec's skip-gram model on a large collection of emoji and their contexts appearing in tweets, emoji2vec is trained directly on Unicode descriptions of emoji.
• Might prove especially useful in social NLP tasks where emoji are used frequently (e.g. Twitter, Instagram, etc.)

Future work 16
• Investigate the usefulness of this method for other Unicode symbol embeddings.
• Improve emoji2vec in the future by also reading full-text emoji descriptions from Emojipedia.
• Use a recurrent neural network instead of a bag-of-word-vectors approach for better performance.
