IIIDYT AT IEST 2018: IMPLICIT EMOTION CLASSIFICATION WITH DEEP CONTEXTUALIZED WORD REPRESENTATIONS
Jorge A. Balazs, Edison Marrese-Taylor, Yutaka Matsuo
https://arxiv.org/abs/1808.08672

INTRODUCTION

PROPOSED APPROACH

PREPROCESSING
We wanted to have a single format for special tokens.
The replacements were chosen arbitrarily.
Shorter replacements did not impact performance significantly.
Completely removing [#TRIGGERWORD#] had a negative impact on our best model.
We tokenized the data using an emoji-aware modification of the twokenize.py script.

ARCHITECTURE

HYPERPARAMETERS
ELMo layer: official implementation with default parameters.
Dimensionalities: ELMo output; BiLSTM output (for each direction); sentence vector representation; fully-connected (FC) layer input, hidden, and output.
Loss function: cross-entropy.
Optimizer: default Adam.
Learning rate: slanted triangular schedule (Howard and Ruder 2018).
Regularization: dropout (after the ELMo layer and the FC hidden layer; after the max-pooling layer).
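The pipeline implied by the architecture and hyperparameter slides (contextualized word vectors, a BiLSTM, max-pooling over time, then a fully-connected classifier) can be sketched in PyTorch. All dimensionalities and dropout rates below are placeholders, since the exact values did not survive in this slide text, and random tensors stand in for the ELMo embeddings:

```python
import torch
import torch.nn as nn

class EmotionClassifier(nn.Module):
    """Sketch: ELMo embeddings -> BiLSTM -> max-pool over time -> FC classifier.
    Sizes and dropout rates are placeholders, not the paper's values."""

    def __init__(self, elmo_dim=1024, hidden=512, fc_hidden=256, n_classes=6):
        super().__init__()
        self.bilstm = nn.LSTM(elmo_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.dropout = nn.Dropout(0.5)          # placeholder rate
        self.fc = nn.Sequential(
            nn.Linear(2 * hidden, fc_hidden),   # both LSTM directions
            nn.ReLU(),
            nn.Dropout(0.5),                    # placeholder rate
            nn.Linear(fc_hidden, n_classes),
        )

    def forward(self, elmo_embeddings):
        # elmo_embeddings: (batch, seq_len, elmo_dim)
        out, _ = self.bilstm(elmo_embeddings)
        sent, _ = out.max(dim=1)                # max-pool over time steps
        return self.fc(self.dropout(sent))
```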

ENSEMBLES
We tried combinations of 9 trained models initialized with different random seeds.
Similar to Bonab and Can (2016), we found that ensembling 6 models yielded the best results.
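One common way to combine such models is to average their class-probability outputs and take the argmax; the slides do not state which combination rule was used, so the sketch below is an assumption, with toy probabilities in place of real model outputs:

```python
import numpy as np

def ensemble_predict(prob_list):
    """Average per-model class probabilities and pick the argmax class.
    prob_list: list of (n_examples, n_classes) arrays, one per model."""
    avg = np.mean(prob_list, axis=0)        # (n_examples, n_classes)
    return avg.argmax(axis=1)

# Toy stand-in: 6 models, 4 examples, 3 classes of random probabilities.
rng = np.random.default_rng(0)
probs = [rng.dirichlet(np.ones(3), size=4) for _ in range(6)]
labels = ensemble_predict(probs)
```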

EXPERIMENTS AND ANALYSES

ABLATION STUDY
ELMo provided the biggest boost in performance.
Emoji also helped (see analysis below).
Concat pooling (Howard and Ruder 2018) did not help.
Different BiLSTM sizes did not improve results.
POS tag embeddings of dimension 50 slightly helped.
An SGD optimizer with a simpler learning-rate schedule (Conneau et al. 2017) did not help.

ABLATION STUDY: DROPOUT
The best dropout configurations concentrated around high values for word-level representations and low values for sentence-level representations.

ERROR ANALYSIS
Confusion matrix and classification report:
anger was the hardest class to predict.
joy was the easiest one (probably due to an annotation artifact).

ERROR ANALYSIS
PCA projection of test sentence representations: the separate joy cluster corresponds to the sentences containing the “un[#TRIGGERWORD#]” pattern.
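A PCA projection like the one on this slide can be computed directly from the sentence-representation matrix via SVD; random vectors below stand in for the real test-set representations:

```python
import numpy as np

def pca_project(X, k=2):
    """Project row vectors of X onto their first k principal components.
    X: (n_sentences, dim) matrix of sentence representations."""
    Xc = X - X.mean(axis=0)                 # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                    # (n_sentences, k) coordinates

# Stand-in data: 100 fake 16-dimensional sentence vectors.
rng = np.random.default_rng(0)
reps = rng.normal(size=(100, 16))
proj = pca_project(reps)
```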

AMOUNT OF TRAINING DATA
The upward trend suggests that the model is expressive enough to learn from new data and is not overfitting the training set.

EMOJI & HASHTAGS
Number of examples with and without emoji and hashtags; numbers in parentheses correspond to the percentage of examples classified correctly.
Emoji, and to a lesser extent hashtags, seem to be good discriminating features.

EMOJI & HASHTAGS
rage, mask, and cry were the most informative emoji.
Counterintuitively, sob was less informative than cry, despite representing a stronger emotion.
Removing sweat_smile and confused improved results.
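One simple way to quantify how informative an emoji is, in the spirit of this analysis, is to measure accuracy on the subset of tweets containing it. The function and toy data below are illustrative, not the authors' exact metric:

```python
def subset_accuracy(tweets, predictions, gold, emoji):
    """Accuracy restricted to tweets containing `emoji`; None if no tweet
    contains it.  `predictions` and `gold` are parallel label lists."""
    hits = [p == g for t, p, g in zip(tweets, predictions, gold) if emoji in t]
    return sum(hits) / len(hits) if hits else None

# Toy example: two tweets contain the rage emoji, one is classified correctly.
tweets = ["I feel 😡 now", "so 😡 again", "fine today"]
preds  = ["anger", "anger", "joy"]
gold   = ["anger", "sad", "joy"]
acc = subset_accuracy(tweets, preds, gold, "😡")
```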

CONCLUSIONS
We obtained competitive results with:
simple preprocessing,
almost no external data dependencies (save for the pretrained ELMo language model),
a simple architecture.

CONCLUSIONS
We showed that:
The “un[#TRIGGERWORD#]” artifact had a significant impact on the final example representations (as shown by the PCA projection); this in turn made the model better at classifying joy examples.
Emoji and hashtags were good features for implicit emotion classification.

FUTURE WORK
Ensemble models with added POS tag features.
Perform fine-grained hashtag analysis.
Implement architectural improvements.

CLOSING WORDS
Our implementation is available at: https://github.com/jabalazs/implicit_emotion

REFERENCES
Bonab, Hamed R., and Fazli Can. 2016. “A Theoretical Framework on the Ideal Number of Classifiers for Online Ensembles in Data Streams.” In Proceedings of the 25th ACM International Conference on Information and Knowledge Management (CIKM ’16), 2053–56. New York, NY, USA: ACM. https://doi.org/10.1145/2983323.2983907
Conneau, Alexis, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. “Supervised Learning of Universal Sentence Representations from Natural Language Inference Data.” In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 670–80. Copenhagen, Denmark: Association for Computational Linguistics. https://www.aclweb.org/anthology/D17-1070
Howard, Jeremy, and Sebastian Ruder. 2018. “Universal Language Model Fine-tuning for Text Classification.” ArXiv e-prints. http://arxiv.org/abs/1801.06146