
Literature review: word embeddings

1. Lenci, A. (2018). Distributional models of word meaning. Annual Review of Linguistics, 4(1), 151–171
- Combined with: Lenci, A. (2008). Distributional approaches in linguistic and cognitive research. Italian Journal of Linguistics, 20(May), 1–31
2. Zeng, Z., Yin, Y., Song, Y., & Zhang, M. (2017). Socialized word embeddings. In International Joint Conference on Artificial Intelligence (pp. 3915–3921)
3. Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. In NAACL-HLT (pp. 2227–2237)

Shuntaro Yada

June 21, 2018

Transcript

  1. Literature Review: Word Embeddings
     Shuntaro Yada, PhD Student
     Methodology Seminar / Tech Interest Group, 21 Jun 2018
  2. Agenda
     1. Lenci, A. (2018). Distributional models of word meaning. Annual Review of Linguistics, 4(1), 151–171
        ‣ Combined with: Lenci, A. (2008). Distributional approaches in linguistic and cognitive research. Italian Journal of Linguistics, 20(May), 1–31
     2. Zeng, Z., Yin, Y., Song, Y., & Zhang, M. (2017). Socialized word embeddings. In International Joint Conference on Artificial Intelligence (pp. 3915–3921)
     3. Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. In NAACL-HLT (pp. 2227–2237)
     • I will introduce the three papers listed above
     • The first is a theoretical overview and review
     • The last two are examples of cutting-edge methods
  3. Overview
     • Explain what the distributional hypothesis is
     • Briefly introduce the major ways to generate distributional representations of words
       ‣ I focus on the two most popular approaches (count models and prediction models) and add extra material to explain the latter
     • Summarise the common challenges with distributional representations of words
  4. Distributional Hypothesis
     • “Lexemes with similar linguistic contexts have similar meanings” (Lenci, 2018: p. 152)
     • One way to define word meaning
     • Example: “I found a wonderful restaurant yesterday!” vs “I found a fantastic restaurant yesterday!”
       ‣ The target words “wonderful” and “fantastic” share almost identical context words, so they look like they have a similar meaning
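
     To make the example concrete, here is a minimal sketch (my own illustration, not from the deck) that extracts a fixed-size context window around a target word; “wonderful” and “fantastic” end up with identical context sets, which is exactly what the distributional hypothesis exploits:

        # Toy illustration of the distributional hypothesis: words that occur
        # in the same contexts end up with the same context sets.
        def context_window(tokens, target, size=2):
            """Return the bag of words within `size` positions of `target`."""
            i = tokens.index(target)
            return set(tokens[max(0, i - size):i] + tokens[i + 1:i + 1 + size])

        s1 = "i found a wonderful restaurant yesterday !".split()
        s2 = "i found a fantastic restaurant yesterday !".split()
        print(context_window(s1, "wonderful"))  # {'found', 'a', 'restaurant', 'yesterday'}
        print(context_window(s2, "fantastic"))  # {'found', 'a', 'restaurant', 'yesterday'}
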
  5. Distributional Hypothesis
     • The distributional hypothesis (DH) forms the theoretical foundation of distributional semantics (aka vector space semantics)
     • Lenci (2008) pointed out two levels of the DH:
       ‣ Weak DH: only assumes a correlation between semantics and word distributions
       ‣ Strong DH: additionally treats the DH as a cognitive hypothesis
  6. Distributed vs Distributional
     • According to Ferrone and Zanzotto (2017), distributed representations subsume distributional representations
     • Distributed: representing each word by a vector with several dimensions instead of a symbolic vector (e.g., a one-hot vector)
     • Distributional: representing each word by such a vector based on the distributional hypothesis
  7. Taxonomy of Methods: The Method of Learning
     • The two major ways to generate distributional representations:
       ‣ Count models
       ‣ Prediction models
     • The paper gives an introduction to count models (Sec 3.2)
  8. Taxonomy of Methods: The Type of Context
     • Word-window contexts: e.g., my wonderful/fantastic example; the most popular type
     • Syntactic contexts: take into account syntactically dependent words only (e.g., predicate–argument structure?)
     • Document contexts: imagine TF-IDF
  9. Count models
     • Build a co-occurrence matrix, counting how often each target word appears with each context word
     • Enhance the significance of the raw counts to reflect the importance of the contexts (the slide gives an example weighting; see the sketch below)
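
     As a concrete sketch of the counting-and-weighting step (my own illustration; the weighting formula on the original slide is not recoverable here, so positive PMI, a common choice, stands in for it):

        import numpy as np

        # Word-window co-occurrence counts (window = 1 on each side) over a toy corpus.
        corpus = ["i found a wonderful restaurant".split(),
                  "i found a fantastic restaurant".split()]
        vocab = sorted({w for sent in corpus for w in sent})
        idx = {w: i for i, w in enumerate(vocab)}
        counts = np.zeros((len(vocab), len(vocab)))
        for sent in corpus:
            for i, w in enumerate(sent):
                for j in range(max(0, i - 1), min(len(sent), i + 2)):
                    if j != i:
                        counts[idx[w], idx[sent[j]]] += 1

        # Weight the raw counts; positive PMI is an assumed example, not the slide's formula.
        total = counts.sum()
        p_w = counts.sum(axis=1, keepdims=True) / total
        p_c = counts.sum(axis=0, keepdims=True) / total
        with np.errstate(divide="ignore", invalid="ignore"):
            pmi = np.log((counts / total) / (p_w * p_c))
        ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)
        print(ppmi[idx["wonderful"]])  # explicit (weighted) context vector for "wonderful"
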
 10. Count models
     • Use some method to obtain latent features among the columns of the explicit vectors → implicit vectors
     • One easy example is applying dimensionality-reduction techniques such as singular value decomposition or principal component analysis
     • See Table 2 of Lenci (2018) for well-known tools
       ‣ GloVe (Pennington+, 2014), which is based on weighted least-squares regression, is the most popular word representation among count models
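
     A minimal sketch of turning such explicit vectors into implicit (dense) ones with truncated SVD; `ppmi` and `idx` are the weighted co-occurrence matrix and vocabulary index from the previous sketch:

        import numpy as np

        U, S, Vt = np.linalg.svd(ppmi, full_matrices=False)
        k = 2                                  # keep the top-k latent dimensions (tiny toy vocabulary)
        word_vectors = U[:, :k] * S[:k]        # implicit, dense word vectors

        def cosine(a, b):
            return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

        print(cosine(word_vectors[idx["wonderful"]], word_vectors[idx["fantastic"]]))
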
 11. Prediction models
     • Also known as word vectors, word embeddings, and distributed embeddings
     • Learn word representations with a neural network while the network is learning a language model
     • The most famous tool in this category is word2vec
       ‣ The remaining papers I explain today are also based on prediction models
 12. Prediction models
     • Language model
       ‣ A probability distribution over sequences of words
       ‣ It assigns a probability to a given sequence
     • Neural language model
       ‣ A language model implemented with a neural network
       ‣ Given a sequence, it predicts the next word
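
     Stated as a formula (this just restates the definition above, with notation matching the biLM equations later in the deck): a language model factorises the probability of a word sequence with the chain rule, and a neural LM parameterises each conditional, i.e. the prediction of the next word:

        p(t_1, t_2, \ldots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_1, \ldots, t_{k-1})
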
 13. Prediction models
     • An example of a feed-forward neural network language model
       ‣ It consists of an embedding layer (E), hidden layers (W_in, W_hid), and an output layer (W_out)
       ‣ The output layer has the same number of units as the total number of word types and represents a probability distribution over the next word
       ‣ The embedding-layer parameters become the word vectors during training
     • [Figure: a feed-forward NNLM predicting “one” from “have a nice”; adapted from Tsuboi+ (2017), Deep Learning for Natural Language Processing (in Japanese)]
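
     A minimal PyTorch sketch of such a feed-forward NNLM (my own simplification of the figure; layer names and sizes are arbitrary):

        import torch
        import torch.nn as nn

        class FeedForwardLM(nn.Module):
            """Predict the next word from a fixed window of previous words."""
            def __init__(self, vocab_size, emb_dim=64, hidden_dim=128, window=3):
                super().__init__()
                self.embed = nn.Embedding(vocab_size, emb_dim)      # E: becomes the word vectors
                self.hidden = nn.Sequential(                        # W_in / W_hid
                    nn.Linear(window * emb_dim, hidden_dim), nn.Tanh())
                self.out = nn.Linear(hidden_dim, vocab_size)        # W_out: one unit per word type

            def forward(self, context_ids):                         # (batch, window)
                e = self.embed(context_ids).flatten(1)              # concatenate window embeddings
                return self.out(self.hidden(e))                     # logits over the next word

        model = FeedForwardLM(vocab_size=10_000)
        logits = model(torch.randint(0, 10_000, (2, 3)))            # e.g., "have a nice" -> "one"
        probs = logits.softmax(dim=-1)                              # distribution over the next word
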
 14. Prediction models
     • word2vec
       ‣ No hidden layers: just taking inner products
       ‣ But with a bunch of optimisation techniques
     • [Figure: word2vec predicting “one” from “have a nice”, with only an embedding layer and an output layer; adapted from Tsuboi+ (2017), Deep Learning for Natural Language Processing (in Japanese)]
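
     In the same spirit, a minimal CBOW-style sketch of what “no hidden layers, just inner products” means (assumption: a plain softmax stands in for word2vec's negative sampling / hierarchical softmax tricks):

        import torch
        import torch.nn as nn

        class CBOW(nn.Module):
            """word2vec CBOW skeleton: no hidden layers, only inner products."""
            def __init__(self, vocab_size, dim=100):
                super().__init__()
                self.in_embed = nn.Embedding(vocab_size, dim)    # embedding layer -> word vectors
                self.out_embed = nn.Embedding(vocab_size, dim)   # output-side vectors

            def forward(self, context_ids):                      # (batch, window)
                h = self.in_embed(context_ids).mean(dim=1)       # average the context embeddings
                return h @ self.out_embed.weight.T               # inner products = logits over vocab

        model = CBOW(vocab_size=10_000)
        logits = model(torch.randint(0, 10_000, (2, 4)))         # predict the centre word from 4 context words
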
 15. Challenges
     • Distributional semantic models tend to mix up various types of semantic similarity/relatedness
       ‣ No distinction among hypernymy, antonymy, meronymy, locative relations, and topical relations
     • How to represent linguistic units larger than words
 16. Socialized Word Embeddings
     Ziqian Zeng, Yichun Yin, Yangqiu Song, Ming Zhang
     International Joint Conference on Artificial Intelligence 2017
 17. Overview
     • Add the following two aspects to word embeddings:
       ‣ Personalisation (user information; not new)
       ‣ Socialisation (inter-user relationships; new)
     • Three-fold evaluation:
       ‣ Perplexity comparison with word2vec
       ‣ Application to document-level sentiment classification
         ✦ As features for an SVM (including user segmentation)
         ✦ As the attention source for neural models
 18. Proposed method: Personalisation
     • Start from the continuous bag-of-words (CBOW) model of word2vec (Mikolov et al., 2013)
     • Consider the context words for a word as user-dependent
       ‣ Maximise:  J_1 = \sum_{i}^{N} \sum_{w_j \in W_i} \log P(w_j \mid C(w_j), u_i)
       ‣ “for each user, s/he will think about a predicted word given the global words meanings and customize them to his/her own preference”
 19. Proposed method: Personalisation
     • Word vectors and user vectors have the same dimensionality
     • The user-dependent word vector is represented as the sum of the word vector and the user vector:  w_j^{(i)} = w_j + u_i
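
     A tiny sketch of this composition (toy numbers of my own; the point is simply that word and user vectors share a dimensionality and are added):

        import numpy as np

        dim = 4
        word_vecs = {"restaurant": np.random.randn(dim)}   # global word vector w_j
        user_vecs = {"alice": np.random.randn(dim)}        # user vector u_i (same dimensionality)

        def user_dependent(word, user):
            """w_j^(i) = w_j + u_i : the word vector customised to one user."""
            return word_vecs[word] + user_vecs[user]

        print(user_dependent("restaurant", "alice"))
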
 20. Proposed method: Personalisation
     • To deal with learning over a large vocabulary, hierarchical softmax with a Huffman tree built from word frequencies is used
     • The parameters of the objective (J_1), the word vectors, and the user vectors are updated by stochastic gradient descent (SGD)
 21. Proposed method: Socialisation
     • Homophily in social networks (Lazarsfeld and Merton, 1954; McPherson et al., 2001)
       ‣ Someone’s friends tend to share similar opinions or topics
     • User vectors of friends (neighbours) should therefore be similar
       ‣ Minimise a social-regularisation term that keeps neighbouring users’ vectors close (see the sketch after slide 22)
       ‣ SGD is also applied here to update the user vectors
 22. Proposed method: Socialisation
     • Incorporating socialisation causes user vectors to be updated more frequently than word vectors
     • A constraint on the user vectors’ L2 norm is therefore introduced
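
     A minimal sketch of the socialisation idea promised above, under my own assumptions (a λ-weighted squared-distance penalty between neighbouring users' vectors, SGD updates, and a hard cap r on each user vector's L2 norm; the exact objective and update rule in the paper may differ):

        import numpy as np

        lam, r, lr = 0.1, 0.5, 0.05                 # regularisation weight, L2 cap, learning rate
        users = {u: 0.1 * np.random.randn(4) for u in "abc"}
        friends = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}

        def social_sgd_step(u):
            """Pull u's vector towards its neighbours, then project onto the L2 ball of radius r."""
            grad = sum(users[u] - users[v] for v in friends[u])   # gradient of (lam/2) * sum ||u_i - u_k||^2
            users[u] -= lr * lam * grad
            norm = np.linalg.norm(users[u])
            if norm > r:                                          # constraint on the user vector's L2 norm
                users[u] *= r / norm

        for _ in range(100):
            for u in users:
                social_sgd_step(u)
        print(users)
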
 23. Dataset
     • [Slide shows the dataset details; the reviews come from Yelp]
 24. Perplexity
     • 6-gram perplexity
     • Varying the weight of the social regularisation (λ) and the strength of personalisation (r)
       ‣ From the shape of the curves, both can be optimised for the given dataset
 25. Sentiment Classification
     • Apply socialised word embeddings to a downstream task
       ‣ They chose document-level sentiment classification (predicting the ratings of Yelp reviews)
     • Check two aspects:
       ‣ User segmentation (active users or not)
       ‣ Applicability as attention vectors in neural models
 26. Sentiment Classification: SVM and user segmentation
     • Split users by the number of published reviews
       ‣ The total number of reviews is the same in both segments
     • Use the average of the word vectors in a document as the feature vector for an SVM
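
     A minimal sketch of that SVM feature (assumes pre-trained word vectors and scikit-learn; the embeddings and ratings here are random stand-ins):

        import numpy as np
        from sklearn.svm import LinearSVC

        dim = 100
        word_vec = {w: np.random.randn(dim)                       # stand-in for trained embeddings
                    for w in "great food terrible service ok".split()}

        def doc_features(tokens):
            """Average of the word vectors in a document (unknown words are skipped)."""
            vecs = [word_vec[t] for t in tokens if t in word_vec]
            return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

        docs = [["great", "food"], ["terrible", "service"], ["ok", "food"], ["great", "service"]]
        ratings = [5, 1, 3, 5]                                    # document-level labels (e.g., Yelp stars)
        X = np.stack([doc_features(d) for d in docs])
        clf = LinearSVC().fit(X, ratings)
        print(clf.predict(X))
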
 27. Sentiment Classification: NN models and user attention
     • For the Yelp rating-prediction task, some papers have proposed neural models that apply an (extra) attention mechanism over users (e.g., Chen et al., 2016)
 28. Sentiment Classification: NN models and user attention
     • How about using socialised word embeddings as “fixed” attention vectors in those models?
     • Better than no attention, but slightly worse than the original models (whose attention vectors are trained)
 29. Comments
     • Socialisation looks like a very interesting and promising idea
       ‣ No significant performance gains, though
       ‣ How to regularise sociality has room for improvement
         ✦ Assuming that neighbouring users should be similar seems too naive and too strong
 30. Comments
     • Their source code is available:
       ‣ https://github.com/HKUST-KnowComp/SocializedWordEmbeddings
     • They have just published an improved version of socialised word embeddings this year:
       ‣ https://github.com/HKUST-KnowComp/SRBRW
       ‣ @inproceedings{zeng2018biased,
           title={Biased Random Walk based Social Regularization for Word Embeddings},
           author={Zeng, Ziqian and Liu, Xin and Song, Yangqiu},
           booktitle={IJCAI},
           pages={XX-YY},
           year={2018},
         }
 31. Deep contextualized word representations
     Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer
     NAACL-HLT 2018
 32. Overview
     • Propose a new type of deep contextualised word representation (ELMo) that models:
       ‣ Complex characteristics of word use (e.g., syntax and semantics)
       ‣ How these uses vary across linguistic contexts (i.e., polysemy)
     • Show that ELMo can improve existing neural models on various NLP tasks
     • Argue that ELMo captures more abstract linguistic characteristics in its higher layers
 33. Method
     • Embeddings from Language Models: ELMo
     • Learn word embeddings through building bidirectional language models (biLMs)
       ‣ biLMs consist of a forward and a backward LM
         ✦ Forward:  p(t_1, t_2, \ldots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_1, t_2, \ldots, t_{k-1})
         ✦ Backward: p(t_1, t_2, \ldots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_{k+1}, t_{k+2}, \ldots, t_N)
 34. Method
     • With long short-term memory (LSTM) networks, the biLM is built by predicting the next word in both directions
     • [Figure: the forward LM architecture, unrolled in the forward direction of k: an embedding layer x_k, LSTM hidden layers h^{LM}_{k,1} and h^{LM}_{k,2}, and an output layer o_k predicting the next token (e.g., “one” after “have a nice”)]
 35. Method
     • ELMo represents a word t_k as a linear combination of the corresponding hidden layers of the biLMs (including its embedding):
       ‣ h^{LM}_{k,0} = [x_k; x_k] (the embedding layer), and h^{LM}_{k,j} = [\overrightarrow{h}^{LM}_{k,j}; \overleftarrow{h}^{LM}_{k,j}] (the concatenated forward/backward hidden layers)
       ‣ ELMo^{task}_k = \gamma^{task} \sum_{j} s^{task}_j h^{LM}_{k,j}
     • Unlike usual word embeddings, ELMo is assigned to every token instead of every type
     • ELMo is a task-specific representation: a downstream task learns the weighting parameters s^{task}_j and \gamma^{task}
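
     A minimal numeric sketch of that combination (softmax-normalised weights s^task and a scalar gamma^task, as in the formula; the layer values here are random placeholders rather than real biLM states):

        import numpy as np

        L, dim = 2, 8                                   # LSTM layers; dim = 2 x hidden size (fwd ; bwd)
        h = np.random.randn(L + 1, dim)                 # h[0] = [x_k ; x_k], h[1..L] = concatenated layers

        s_raw = np.zeros(L + 1)                         # task-specific weights, learned downstream
        gamma = 1.0                                     # task-specific scalar, learned downstream

        s = np.exp(s_raw) / np.exp(s_raw).sum()         # softmax-normalised s^task
        elmo_k = gamma * (s[:, None] * h).sum(axis=0)   # ELMo^task_k = gamma * sum_j s^task_j h^LM_{k,j}
        print(elmo_k.shape)                             # (dim,)
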
 36. Method
     • ELMo can be integrated into almost all neural NLP models by simply concatenating it to the embedding layer
       ‣ Train the biLMs on a corpus, then enhance the usual inputs (e.g., “have a nice …”) with ELMo vectors
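
     A minimal PyTorch sketch of that integration step (names and sizes are mine; the key point is the concatenation along the feature dimension):

        import torch
        import torch.nn as nn

        vocab_size, emb_dim, elmo_dim = 10_000, 100, 1024
        embed = nn.Embedding(vocab_size, emb_dim)          # the task model's usual embedding layer

        token_ids = torch.randint(0, vocab_size, (2, 5))   # (batch, seq_len)
        elmo_vecs = torch.randn(2, 5, elmo_dim)            # ELMo vectors for the same tokens (placeholder)

        enhanced = torch.cat([embed(token_ids), elmo_vecs], dim=-1)   # (batch, seq_len, emb_dim + elmo_dim)
        # `enhanced` then feeds the task model (BiLSTM, etc.) instead of the plain embeddings.
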
 37. Evaluation
     • Many linguistic tasks are improved by using ELMo:
       ‣ Q&A, textual entailment, sentiment analysis, semantic role labelling, coreference resolution, named entity recognition
 38. Analysis
     • The higher layer seemed to learn semantics while the lower layer probably captured syntactic features
     • [Figure: per-layer results on word sense disambiguation and PoS tagging]
 39. Analysis
     • The higher layer seemed to learn semantics while the lower layer probably captured syntactic features???
       ‣ Most models preferred the “syntactic (probably)” features, even in sentiment analysis
 40. Analysis
     • ELMo-enhanced models can make use of small datasets more efficiently
     • [Figure: performance vs training-set size for textual entailment and semantic role labelling]
 41. Comments
     • Pre-trained ELMo models are available at https://allennlp.org/elmo
       ‣ AllenNLP is a deep NLP library built on top of PyTorch
       ‣ AllenNLP is a product of AI2 (the Allen Institute for Artificial Intelligence), which also works on other interesting projects such as Semantic Scholar
     • ELMo can process character-level inputs
       ‣ Japanese (Chinese, Korean, …) ELMo models are therefore likely to be possible
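
     For reference, a sketch of loading a pre-trained model with AllenNLP's Elmo module, as I recall its API at the time; the options/weights paths are deliberately left as placeholders and should be taken from https://allennlp.org/elmo:

        from allennlp.modules.elmo import Elmo, batch_to_ids

        options_file = "..."   # options .json listed on allennlp.org/elmo
        weight_file = "..."    # weights .hdf5 listed on allennlp.org/elmo
        elmo = Elmo(options_file, weight_file, num_output_representations=1, dropout=0)

        sentences = [["have", "a", "nice", "one"],
                     ["I", "found", "a", "wonderful", "restaurant"]]
        character_ids = batch_to_ids(sentences)            # ELMo works on character-level inputs
        output = elmo(character_ids)
        embeddings = output["elmo_representations"][0]     # (batch, seq_len, 1024)
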