
Literature review: word embeddings

1. Lenci, A. (2018). Distributional models of word meaning. Annual Review of Linguistics, 4(1), 151–171
- Combined with: Lenci, A. (2008). Distributional approaches in linguistic and cognitive research. Italian Journal of Linguistics, 20(May), 1–31
2. Zeng, Z., Yin, Y., Song, Y., & Zhang, M. (2017). Socialized word embeddings. In International Joint Conference on Artificial Intelligence (pp. 3915–3921)
3. Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. In NAACL-HLT (pp. 2227–2237)


Shuntaro Yada

June 21, 2018

Transcript

  1. Literature Review: Word Embeddings. Shuntaro Yada, PhD Student. Methodology Seminar, Tech Interest Group, 21 Jun 2018
  2. Agenda
     1. Lenci, A. (2018). Distributional models of word meaning. Annual Review of Linguistics, 4(1), 151–171
        ‣ Combined with: Lenci, A. (2008). Distributional approaches in linguistic and cognitive research. Italian Journal of Linguistics, 20(May), 1–31
     2. Zeng, Z., Yin, Y., Song, Y., & Zhang, M. (2017). Socialized word embeddings. In International Joint Conference on Artificial Intelligence (pp. 3915–3921)
     3. Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. In NAACL-HLT (pp. 2227–2237)
     • I will introduce the papers listed above
     • The first is a theoretical overview and review
     • The last two are examples of cutting-edge methods
  3. Distributional models of word meaning. Alessandro Lenci. Annual Review of Linguistics, 4(1), 151–171, 2018
  4. Overview
     • Explain what the distributional hypothesis is
     • Briefly introduce the major ways to generate distributional representations of words
       ‣ I focus on the two most popular ways (count and prediction models) and add more material to explain the latter
     • Summarise the common challenges with distributional representations of words
  5. Distributional Hypothesis
     • “Lexemes with similar linguistic contexts have similar meanings” (Lenci, 2018: p. 152)
     • One way to define word meaning
     • Example: in “I found a wonderful restaurant yesterday!” and “I found a fantastic restaurant yesterday!”, the target words wonderful and fantastic share their context words, so they look like they have a similar meaning
  6. Distributional Hypothesis
     • The distributional hypothesis (DH) forms the theoretical foundation of distributional semantics (aka vector space semantics)
     • Lenci (2008) pointed out two levels of the DH:
       ‣ Weak DH: only assumes a correlation between semantics and word distributions
       ‣ Strong DH: additionally assumes the DH is a cognitive hypothesis
  7. Distributed vs Distributional
     • According to Ferrone and Zanzotto (2017), distributed representations subsume distributional representations
     • Distributed: ways to represent each word by a vector with several dimensions instead of a symbolic representation (e.g., a one-hot vector)
     • Distributional: ways to represent each word by such a vector based on the distributional hypothesis
  8. Taxonomy of Methods: The Method of Learning
     • The two major ways to generate distributional representations:
       ‣ Count models
       ‣ Prediction models
     • The paper gives an introduction to count models (Sec. 3.2)
  9. Taxonomy of Methods: The Type of Context
     (Annotations on the slide's table of context types:)
     • Surrounding words within a window: e.g., my wonderful/fantastic example; the most popular type of context
     • Syntactic context: taking into account syntactically dependent words only (e.g., predicate–argument structure?)
     • Document as context: imagine TF-IDF
  10. Count models
     • Build a co-occurrence matrix between target words and their contexts
     • Re-weight the raw counts to reflect the importance of the contexts (the slide walks through an example; a minimal code sketch follows below)
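     A minimal sketch of the count-model starting point: building a word-by-context co-occurrence matrix from a toy corpus. The corpus, window size, and variable names are mine for illustration, not from the paper.

         # Build a word-by-context co-occurrence matrix (count model).
         from collections import Counter, defaultdict

         corpus = [
             "i found a wonderful restaurant yesterday".split(),
             "i found a fantastic restaurant yesterday".split(),
         ]
         window = 2

         cooc = defaultdict(Counter)
         for sent in corpus:
             for i, target in enumerate(sent):
                 lo, hi = max(0, i - window), min(len(sent), i + window + 1)
                 for j in range(lo, hi):
                     if j != i:
                         cooc[target][sent[j]] += 1  # raw co-occurrence count

         vocab = sorted({w for s in corpus for w in s})
         matrix = [[cooc[t][c] for c in vocab] for t in vocab]  # explicit count vectors
         print(vocab)
         print(matrix[vocab.index("wonderful")])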
  11. Count models
     • Use some method to obtain latent features among the columns of the explicit vectors → implicit vectors
     • One easy example is to apply dimensionality-reduction techniques such as singular value decomposition or principal component analysis (see the sketch below)
     • See Table 2 of Lenci (2018) for well-known tools
       ‣ GloVe (Pennington+, 2014), which is based on weighted least-squares regression, is the most popular word representation among count models
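     A minimal sketch of turning the explicit count vectors into implicit (dense) vectors with truncated SVD; it assumes the `matrix` and `vocab` variables from the previous sketch and is illustrative only.

         # Reduce explicit count vectors to low-dimensional implicit vectors via SVD.
         import numpy as np

         X = np.array(matrix, dtype=float)
         U, S, Vt = np.linalg.svd(X, full_matrices=False)

         k = 2                                # number of latent dimensions (illustrative)
         embeddings = U[:, :k] * S[:k]        # low-rank word vectors

         def cosine(a, b):
             return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

         w, f = vocab.index("wonderful"), vocab.index("fantastic")
         print(cosine(embeddings[w], embeddings[f]))  # similar contexts -> high similarity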
  12. Prediction models
     • Also known as word vectors, word embeddings, and distributed embeddings
     • Learn word representations with a neural network while it is trained as a language model
     • The most famous tool in this category is word2vec
       ‣ The remaining papers I will explain today are also based on prediction models
  13. Prediction models
     • Language model
       ‣ A probability distribution over sequences of words
       ‣ It assigns a probability to a given sequence
     • Neural language model
       ‣ A language model implemented with a neural network
       ‣ Given a sequence, it predicts the next word
  14. Prediction models
     • An example of a feed-forward neural network language model (figure: embedding layer E, hidden layers, output layer, with weights labelled W_in, W_hid, W_out, predicting the next word in “have a nice ___”; a toy forward pass is sketched below)
       ‣ The output layer has the same number of units as there are word types and represents a probability distribution over the next word
       ‣ The embedding-layer parts become word vectors during training
     • Figure modified from Tsuboi+ (2017), Deep Learning for Natural Language Processing (in Japanese)
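     A toy forward pass of a feed-forward neural LM, mirroring the slide's E / W_hid / W_out labels. Weights are random and sizes are illustrative; this only shows the shapes and the final softmax over word types.

         # Toy forward pass of a feed-forward neural LM (shapes only, weights random).
         import numpy as np

         rng = np.random.default_rng(0)
         V, d, h, ctx = 10_000, 100, 200, 3       # vocab size, emb dim, hidden dim, context length

         E     = rng.normal(size=(V, d))           # embedding layer -> rows become word vectors
         W_hid = rng.normal(size=(ctx * d, h))
         W_out = rng.normal(size=(h, V))           # output layer: one unit per word type

         def next_word_distribution(context_ids):
             x = E[context_ids].reshape(-1)        # concatenate context embeddings
             hidden = np.tanh(x @ W_hid)
             logits = hidden @ W_out
             p = np.exp(logits - logits.max())
             return p / p.sum()                    # probability distribution over the next word

         p = next_word_distribution([12, 345, 678])   # e.g. "have a nice" -> P(next word)
         print(p.shape, p.sum())                      # (10000,) 1.0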
  15. Prediction models
     • word2vec (figure: embedding layer and output layer only, predicting “one” from “have a nice ___”)
       ‣ No hidden layers: it just takes inner products, but with a bunch of optimisation techniques on top (see the scoring sketch below)
     • Figure modified from Tsuboi+ (2017), Deep Learning for Natural Language Processing (in Japanese)
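     A sketch of the word2vec-style CBOW scoring step, showing the “no hidden layer, just inner products” point. Names and sizes are mine; real word2vec adds negative sampling or hierarchical softmax on top of this score.

         # CBOW scoring: inner product between averaged context vectors and an output vector.
         import numpy as np

         rng = np.random.default_rng(0)
         V, d = 10_000, 100
         W_in, W_out = rng.normal(size=(V, d)), rng.normal(size=(V, d))

         def cbow_score(context_ids, candidate_id):
             h = W_in[context_ids].mean(axis=0)       # average of context word vectors
             return float(W_out[candidate_id] @ h)    # inner product = unnormalised score

         print(cbow_score([12, 345, 678, 901], 42))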
  16. Challenges
     • Distributional semantic models tend to mix up various types of semantic similarity/relatedness
       ‣ No distinction among hypernymy, antonymy, meronymy, locative relations, and topical relations
     • How to represent linguistic units larger than words remains open
  17. Socialized Word Embeddings. Ziqian Zeng, Yichun Yin, Yangqiu Song, Ming Zhang. International Joint Conference on Artificial Intelligence 2017
  18. Overview
     • Add the following two aspects to word embeddings:
       ‣ Personalisation (user information; not new)
       ‣ Socialisation (inter-user relationships; new)
     • Three-fold evaluation:
       ‣ Perplexity comparison with word2vec
       ‣ Application to document-level sentiment classification
         ✦ As features for an SVM (including user segmentation)
         ✦ As the attention source for neural models
  19. Proposed method: Personalisation
     • Start from the continuous bag-of-words (CBOW) model of word2vec (Mikolov et al., 2013)
     • Consider the context words for a word as user-dependent
       ‣ Maximise: J_1 = \sum_{i}^{N} \sum_{w_j \in W_i} \log P(w_j \mid C(w_j), u_i)
       ‣ ‘for each user, s/he will think about a predicted word given the global words meanings and customize them to his/her own preference’
  20. Proposed method: Personalisation
     • Word vectors and user vectors have the same dimensionality
     • The user-dependent word vector is represented as the sum of the word vector and the user vector: w_j^{(i)} = w_j + u_i (see the sketch below)
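     A minimal sketch of the personalised word vector described on the slide: the user-dependent vector is simply the global word vector plus the user's vector. Sizes and values are illustrative.

         # Personalised (user-dependent) word vector: w_j^(i) = w_j + u_i
         import numpy as np

         rng = np.random.default_rng(0)
         V, U, d = 10_000, 500, 100                 # word types, users, shared dimensionality
         word_vecs = rng.normal(size=(V, d))
         user_vecs = rng.normal(size=(U, d))

         def personalised_vector(word_id, user_id):
             return word_vecs[word_id] + user_vecs[user_id]

         print(personalised_vector(42, 7).shape)    # (100,)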
  21. Proposed method: Personalisation
     • To cope with the large vocabulary, hierarchical softmax with a Huffman tree built from word frequencies is used (see the sketch below)
     • The parameters of the objective (J_1), the word vectors, and the user vectors are updated by stochastic gradient descent (SGD)
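     A sketch of the hierarchical-softmax idea: the probability of a word is a product of binary (sigmoid) decisions along its Huffman path. The path nodes and codes are hardcoded here purely for illustration; real implementations build the tree from word frequencies.

         # Hierarchical softmax: P(word | h) as a product of sigmoid decisions.
         import numpy as np

         rng = np.random.default_rng(0)
         d = 100
         inner_node_vecs = rng.normal(size=(50, d))      # one vector per inner tree node

         def sigmoid(x):
             return 1.0 / (1.0 + np.exp(-x))

         def hs_probability(h, path_nodes, codes):
             """h: context (+ user) vector; codes[i] is the branch taken at path_nodes[i]."""
             p = 1.0
             for node, code in zip(path_nodes, codes):
                 score = sigmoid(inner_node_vecs[node] @ h)
                 p *= score if code == 1 else (1.0 - score)
             return p

         h = rng.normal(size=d)
         print(hs_probability(h, path_nodes=[0, 3, 11], codes=[1, 0, 1]))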
  22. Proposed method: Socialisation
     • Homophily in social networks (Lazarsfeld and Merton, 1954; McPherson et al., 2001)
       ‣ Someone's friends tend to share similar opinions or topics
     • User vectors of friends (neighbours) should therefore be similar
       ‣ Minimise a regularisation term that penalises the distance between a user's vector and those of his/her friends
       ‣ SGD is also applied here to update the user vectors
  23. Proposed method: Socialisation
     • Incorporating socialisation makes the user vectors update more frequently than the word vectors
     • A constraint on the user vectors' L2-norm is therefore introduced (a sketch of the combined update follows below)
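     A hedged sketch of the socialisation idea as I read it: an SGD-style step that pulls a user's vector towards his/her friends' vectors, followed by clipping the L2-norm. The exact regulariser, update order, and constant values are my assumptions for illustration, not the authors' code.

         # Social regularisation step (assumed form) plus L2-norm constraint.
         import numpy as np

         rng = np.random.default_rng(0)
         U, d = 500, 100
         user_vecs = rng.normal(size=(U, d))
         friends = {7: [3, 19, 42]}                     # toy friendship (adjacency) list

         lam, lr, r = 0.1, 0.05, 1.0                    # reg. weight, learning rate, norm bound

         def social_update(u_id):
             grad = np.zeros(d)
             for f_id in friends.get(u_id, []):
                 grad += lam * (user_vecs[u_id] - user_vecs[f_id])   # pull towards friends
             user_vecs[u_id] -= lr * grad
             norm = np.linalg.norm(user_vecs[u_id])
             if norm > r:                                # project back onto the L2 ball
                 user_vecs[u_id] *= r / norm

         social_update(7)
         print(np.linalg.norm(user_vecs[7]) <= r + 1e-9)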
  24. Dataset
  25. Perplexity
     • 6-gram perplexity
     • Varying the weight of the social regularisation (λ) and the strength of personalisation (r)
       ‣ From the shape of the curves, both can be tuned for a given dataset
  26. Sentiment Classification
     • Apply the socialised word embeddings to a downstream task
       ‣ They chose document-level sentiment classification (predicting the ratings of Yelp reviews)
     • Check two aspects:
       ‣ User segmentation (active users or not)
       ‣ Applicability as attention vectors in neural models
  27. Sentiment Classification: SVM and user segmentation
     • Split users by the number of published reviews
       ‣ The total number of reviews is the same in both segments
     • Use the average of the word vectors in a document as features for an SVM (see the sketch below)
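     A sketch of the SVM setup described above: each document is represented by the average of its word vectors and fed to a linear SVM. The data here is random; in the paper the features come from the learned socialised embeddings and the labels are Yelp ratings.

         # Averaged word vectors as document features for a linear SVM.
         import numpy as np
         from sklearn.svm import LinearSVC

         rng = np.random.default_rng(0)
         V, d = 10_000, 100
         word_vecs = rng.normal(size=(V, d))

         def doc_features(token_ids):
             return word_vecs[token_ids].mean(axis=0)        # average word vector

         docs = [rng.integers(0, V, size=30) for _ in range(200)]   # fake documents
         X = np.stack([doc_features(doc) for doc in docs])
         y = rng.integers(1, 6, size=200)                            # fake 1-5 star ratings

         clf = LinearSVC().fit(X, y)
         print(clf.predict(X[:3]))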
  28. Sentiment Classification: NN models and user attention
     • For the Yelp review rating prediction task, some papers proposed neural models that apply an (extra) attention mechanism over users (e.g., Chen et al., 2016)
  29. Sentiment Classification: NN models and user attention
     • How about using the socialised word embeddings as “fixed” attention vectors in those models?
     • Better than no attention, but slightly worse than the original models (with trained attention vectors)
  30. Comments
     • Socialisation looks like a very interesting/promising idea
       ‣ No significant performance gains so far, though
       ‣ How to regularise sociality has room for improvement
         ✦ Requiring neighbouring users to be similar seems too naive and too strong
  31. Comments
     • Their source code is available:
       ‣ https://github.com/HKUST-KnowComp/SocializedWordEmbeddings
     • They have just published an improved version of socialised word embeddings this year:
       ‣ https://github.com/HKUST-KnowComp/SRBRW

       @inproceedings{zeng2018biased,
         title={Biased Random Walk based Social Regularization for Word Embeddings},
         author={Zeng, Ziqian and Liu, Xin and Song, Yangqiu},
         booktitle={IJCAI},
         pages={XX-YY},
         year={2018},
       }
  32. Deep contextualized word representations. Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer. NAACL-HLT 2018
  33. Overview
     • Propose a new type of deep contextualised word representation (ELMo) that models:
       ‣ Complex characteristics of word use (e.g., syntax and semantics)
       ‣ How these uses vary across linguistic contexts (i.e., polysemy)
     • Show that ELMo can improve existing neural models on various NLP tasks
     • Argue that ELMo captures more abstract linguistic characteristics in the higher layers
  34. Example
     • GloVe mostly learns a sport-related context, while ELMo can distinguish the word sense based on the context
  35. Method
     • Embeddings from Language Models: ELMo
     • Learn word embeddings by building bidirectional language models (biLMs); the joint training objective is reconstructed below
       ‣ biLMs consist of a forward and a backward LM
         ✦ Forward: p(t_1, t_2, \ldots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_1, t_2, \ldots, t_{k-1})
         ✦ Backward: p(t_1, t_2, \ldots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_{k+1}, t_{k+2}, \ldots, t_N)
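     As I recall from the paper, the biLM is trained by jointly maximising the log-likelihood of both directions, sharing the token embedding parameters Θ_x and the softmax parameters Θ_s across the two LSTMs (reconstruction; check the paper for the exact statement):

         % Joint biLM training objective (forward + backward log-likelihoods)
         \sum_{k=1}^{N} \Big(
             \log p(t_k \mid t_1, \ldots, t_{k-1};\, \Theta_x, \overrightarrow{\Theta}_{\mathrm{LSTM}}, \Theta_s)
           + \log p(t_k \mid t_{k+1}, \ldots, t_N;\, \Theta_x, \overleftarrow{\Theta}_{\mathrm{LSTM}}, \Theta_s)
         \Big)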
  36. Method
     • With a long short-term memory (LSTM) network, predict the next word in both directions to build the biLMs
     • (Figure: the forward LM architecture, expanded in the forward direction of k; embedding layer x_k, hidden LSTM layers h^{LM}_{k,1} and h^{LM}_{k,2}, output o_k predicting t_k for the input “have a nice ___”)
  37. Method
     • ELMo represents a word t_k as a linear combination of the corresponding hidden layers (including its embedding); a small sketch of the combination follows below
       ‣ The forward and backward hidden states are concatenated per layer, [h^{LM}_{kj}; h^{LM}_{kj}], with the lowest layer h^{LM}_{k0} = [x_k; x_k]
       ‣ The layers are weighted by task-specific parameters s^{task}_0, s^{task}_1, s^{task}_2, summed, and scaled by γ^{task} to give ELMo^{task}_k
     • Unlike usual word embeddings, ELMo is assigned to every token instead of every type
     • ELMo is a task-specific representation: the downstream task learns the weighting parameters
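     A sketch of the ELMo combination: softmax-normalised, task-specific weights over the biLM layers, scaled by a task-specific gamma. The layer outputs here are random placeholders; in practice they come from the pretrained biLM.

         # Task-specific weighted combination of biLM layers (ELMo).
         import numpy as np

         rng = np.random.default_rng(0)
         L, dim = 3, 1024                       # e.g. embedding layer + 2 biLSTM layers
         h_layers = rng.normal(size=(L, dim))   # h_{k,j}^{LM} for one token k

         s_task = rng.normal(size=L)            # learned scalar weights (one per layer)
         gamma_task = 1.0                       # learned task-specific scale

         def elmo_vector(h_layers, s_task, gamma_task):
             w = np.exp(s_task - s_task.max())
             w /= w.sum()                       # softmax-normalised layer weights
             return gamma_task * (w[:, None] * h_layers).sum(axis=0)

         print(elmo_vector(h_layers, s_task, gamma_task).shape)   # (1024,)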
  38. Method
     • ELMo can be integrated into almost any neural NLP model by simply concatenating it to the embedding layer (see the sketch below)
     • (Figure: a corpus is used to train the biLMs; the resulting ELMo vectors enhance the usual inputs “have a nice …” of the task model)
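     A tiny sketch of that “simple concatenation”: the enhanced input for a token is its usual embedding concatenated with its ELMo vector (shapes illustrative).

         # Enhanced input = [usual token embedding ; ELMo vector]
         import numpy as np

         token_embedding = np.zeros(300)        # e.g. a pretrained vector for the token
         elmo = np.zeros(1024)                  # ELMo vector for the same token in context
         enhanced_input = np.concatenate([token_embedding, elmo])   # fed to the task model
         print(enhanced_input.shape)            # (1324,)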
  39. Evaluation
     • Many NLP tasks are improved by using ELMo: question answering, textual entailment, sentiment analysis, semantic role labelling, coreference resolution, and named entity recognition
  40. Analysis
     • The higher layer seems to learn semantics, while the lower layer probably captures syntactic features
     • (Figure: per-layer results on word sense disambiguation and PoS tagging)
  41. Analysis
     • The higher layer seems to learn semantics while the lower layer probably captures syntactic features???
       ‣ Most models preferred the “syntactic (probably)” features, even in sentiment analysis
  42. Analysis
     • ELMo-enhanced models can make use of small datasets more efficiently
     • (Figure: learning curves for textual entailment and semantic role labelling)
  43. Comments
     • Pre-trained ELMo models are available at https://allennlp.org/elmo (see the usage sketch below)
       ‣ AllenNLP is a deep-learning NLP library built on top of PyTorch
       ‣ AllenNLP is a product of AI2 (the Allen Institute for Artificial Intelligence), which works on other interesting projects such as Semantic Scholar
     • ELMo can process character-level inputs
       ‣ Japanese (Chinese, Korean, …) ELMo models are therefore likely to be possible
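     A hedged usage sketch, assuming an older AllenNLP release that ships the ElmoEmbedder helper (the exact class, default weight URLs, and output shapes may differ by version):

         # Embed a tokenised sentence with pretrained ELMo via AllenNLP (older releases).
         from allennlp.commands.elmo import ElmoEmbedder

         elmo = ElmoEmbedder()                          # downloads default pretrained weights
         vectors = elmo.embed_sentence(["have", "a", "nice", "day"])
         print(vectors.shape)                           # roughly (3 layers, 4 tokens, 1024)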