Recent advances in natural language understanding and natural language generation

Introduction to recent advances in natural language processing, especially focusing on deep learning methods.

Invited talk at IEEE 5th ICOIACT 2022.

Mamoru Komachi

August 24, 2022

Transcript

  1. Recent advances in natural language understanding and natural language generation

    Mamoru Komachi Tokyo Metropolitan University IEEE 5th ICOIACT 2022 August 24, 2022
  2. Short bio • -2005.03 The University of Tokyo (B.L.A.) “The

    Language Policy in Colonial Taiwan” • 2005.04-2010.03 Nara Institute of Science and Technology (D.Eng.) “Graph-Theoretic Approaches to Minimally-Supervised Natural Language Learning” • 2010.04-2013.03 Nara Institute of Science and Technology (Assistant Prof.) • 2013.04-present Tokyo Metropolitan University (Associate Prof. → Prof.) 2
  3. Overview 1. Introduction 2. Natural language understanding via deep learning

    3. Natural language generation by pre-trained models 4. Multimodal natural language processing 5. Open problems in deep learning era 6. Summary 3
  4. History of natural language processing

    [Timeline figure] Machine translation, artificial intelligence (ELIZA), and advances in statistical methods, spanning the 1950-60s, 1970-80s, and 1990-2000s 4
  5. Classical view of NLP in translation • Analyze the source

    text (input) and generate the target text (output) • Mainstream of (commercial) translation systems until the 1990s 5 Bernard Vauquois' Pyramid (CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=683855)
  6. Progress of statistical machine translation after the 1990s

    1. Translation model: IBM models (1993) to phrase-based methods (2003) 2. Open-source toolkits: GIZA++ (1999), Moses (2003) 3. Evaluation method: BLEU, Bilingual Evaluation Understudy (2002) 4. Optimization: minimum error rate training (2003) 5. Parallel corpora: Europarl (2005), UN parallel corpus (2016), Canadian parliamentary proceedings (Hansard)
  7. Statistical machine translation: statistical learning from a parallel corpus

    Noisy channel model: ê = argmax_e P(e | f) = argmax_e P(f | e) P(e) 7 [Diagram] ① Translation model P(f | e), learned from a parallel corpus (source-target) ② Language model P(e), learned from a raw corpus of the target language ③ Decoding by beam search ④ Optimization toward BLEU on an evaluation corpus (reference)
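
To make the noisy channel model above concrete, here is a minimal Python sketch that scores candidate translations by log P(f | e) + log P(e) and picks the argmax. The toy phrase table, bigram language model, and probability values are invented for illustration; they are not from the talk or any real system.

```python
import math

# Toy translation model P(f | e) and bigram language model P(e).
# All phrases and probabilities below are invented for illustration only.
translation_model = {
    ("maji yabai", "really cool"): 0.6,
    ("maji yabai", "serious danger"): 0.3,
}
language_model = {
    ("DL", "is"): 0.4, ("is", "really"): 0.2, ("really", "cool"): 0.3,
    ("is", "serious"): 0.05, ("serious", "danger"): 0.1,
}

def lm_logprob(sentence):
    """log P(e) under a bigram language model; unseen bigrams get a small floor."""
    words = sentence.split()
    return sum(math.log(language_model.get(b, 1e-6)) for b in zip(words, words[1:]))

def channel_score(source_phrase, target_phrase, sentence):
    """Noisy channel objective: log P(f | e) + log P(e)."""
    tm_logprob = math.log(translation_model.get((source_phrase, target_phrase), 1e-6))
    return tm_logprob + lm_logprob(sentence)

# The decoder searches over candidates; here we simply compare two of them.
candidates = {"DL is really cool": "really cool", "DL is serious danger": "serious danger"}
best = max(candidates, key=lambda e: channel_score("maji yabai", candidates[e], e))
print(best)  # prints "DL is really cool": higher translation and language model scores
```
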
  8. The larger the training data, the better the performance of

    the statistical model 8 [Plot: translation performance (low to high) on the y-axis against training data size (small to large) on the x-axis] Performance increases linearly in the log of the data size • Brants et al. Large Language Models in Machine Translation. EMNLP 2007.
  9. Problems in statistical methods • Needs a large-scale dataset for

    each target task (e.g., machine translation, dialogue, summarization, …) • Generated texts are not fluent (can be recognized as machine-generated texts easily) • Not easy to integrate multiple modalities (speech, vision, …) 9
  10. Overview 1. Introduction 2. Natural language understanding via deep learning

    3. Natural language generation by pre-trained models 4. Multimodal natural language processing 5. Open problems in deep learning era 6. Summary 10
  11. Mathematical modeling of a language What is the meaning of

    king ? • King Arthur is a legendary British leader of the late 5th and early 6th centuries • Edward VIII was King of the United Kingdom and the Dominions of the British Empire, … A word can be characterized by its surrounding words → (distributional hypothesis) “You shall know a word by the company it keeps.” (Firth, 1957) 11
  12. king = (545, 0, 2330, 689, 799, 1100)^T, where each component counts how often

    it co-occurs with “empire”, “king”, “man”, “queen”, “rule”, “woman” …… Similarity of words (vectors): similarity = cos θ = (king · queen) / (|king| |queen|) Representing the meaning of a word by using a vector (vector space model) 12

    Co-occurrence counts (diagonal omitted):
             empire   king    man  queen   rule  woman
    empire        -    545    512    195    276    374
    king        545      -   2330    689    799   1100
    man         512   2330      -    915    593   2620
    queen       195    689    915      -    448    708
    rule        276    799    593    448      -   1170
    woman       374   1100   2620    708   1170      -
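
As a small illustration of the vector space model on this slide, the sketch below computes cos θ between the co-occurrence vectors of “king” and “queen” taken from the table above (components ordered empire, king, man, queen, rule, woman).

```python
import math

# Co-occurrence vectors from the table above
# (component order: empire, king, man, queen, rule, woman).
king  = [545, 0, 2330, 689, 799, 1100]
queen = [195, 689, 915, 0, 448, 708]

def cosine(u, v):
    """cos θ = (u · v) / (|u| |v|)"""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(f"similarity(king, queen) = {cosine(king, queen):.3f}")
```
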
  13. word2vec: Learning word vectors by self-supervision

    Learn word vectors (embeddings) by training a neural network to predict their context → Predict surrounding words by the dot product of vectors → Negative examples are obtained by randomly replacing the context words (self-supervised learning) • She was the mother of Queen Elizabeth II . 13 (Mikolov et al., NeurIPS 2013)
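
A minimal sketch of skip-gram training with negative sampling in the spirit of word2vec; the tiny vocabulary, embedding size, and learning rate are simplifying assumptions, not the original implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["she", "was", "the", "mother", "of", "queen", "elizabeth", "ii", "."]
word2id = {w: i for i, w in enumerate(vocab)}
dim = 16

# Two embedding tables: one for center words, one for context words.
W_center = rng.normal(scale=0.1, size=(len(vocab), dim))
W_context = rng.normal(scale=0.1, size=(len(vocab), dim))

def sgns_step(center, context, k=3, lr=0.05):
    """One skip-gram step with negative sampling: raise the dot product for the
    observed (center, context) pair, lower it for k randomly sampled words."""
    c = word2id[center]
    pairs = [(word2id[context], 1.0)]                                  # positive example
    pairs += [(int(rng.integers(len(vocab))), 0.0) for _ in range(k)]  # sampled negatives
    for t, label in pairs:
        score = 1.0 / (1.0 + np.exp(-W_center[c] @ W_context[t]))  # sigmoid of dot product
        grad = score - label                                       # logistic-loss gradient
        g_center, g_context = grad * W_context[t], grad * W_center[c]
        W_center[c] -= lr * g_center
        W_context[t] -= lr * g_context

# Train on one (center, context) pair from "She was the mother of Queen Elizabeth II ."
sgns_step("queen", "elizabeth")
```
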
  14. Word representations can quantitatively evaluate the usage/meaning of words

    Visualization of word vectors considering the usage of English learners (t-SNE) • Red: word vectors trained on a native speaker corpus • Blue: word vectors trained on an English learner corpus 14 (Kaneko et al., IJCNLP 2017)
  15. word2vec enables algebraic manipulation of word meanings (additive compositionality) The

    relationship between countries and their capital cities is preserved in the vector space learned by word2vec (Mikolov et al., 2013) → Algebraic operations such as “king - man + woman = queen” can be performed 15
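
A quick sketch of how such an analogy is checked in practice: form the offset vector king - man + woman and return the nearest vocabulary word by cosine similarity. The random embedding table here merely stands in for trained word2vec vectors, so only the procedure (not the printed answer) is meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["king", "queen", "man", "woman", "empire", "rule"]
emb = {w: rng.normal(size=50) for w in vocab}   # stand-in for trained word2vec vectors

def nearest(target, exclude):
    """Return the vocabulary word whose vector is most cosine-similar to `target`."""
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return max((w for w in vocab if w not in exclude), key=lambda w: cos(emb[w], target))

# king - man + woman ≈ queen (when real word2vec vectors are used)
target = emb["king"] - emb["man"] + emb["woman"]
print(nearest(target, exclude={"king", "man", "woman"}))
```
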
  16. Are vectors appropriate for word representation? Adjective → matrix?

    • Baroni and Zamparelli. 2010. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. EMNLP. Transitive verb → tensor? • Socher et al. 2013. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. EMNLP. 16 Baroni and Zamparelli (2010)
  17. Is it possible to obtain a word vector in a

    sentence considering its context? BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2019) → Self-supervised learning of contextualized word vectors → Modeling contexts by Transformer blocks 17
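
To see what a contextualized word vector looks like in practice, the sketch below uses the Hugging Face transformers library and the public bert-base-uncased checkpoint (tooling assumed for illustration; the talk does not prescribe it) to extract two different vectors for the same word “king” in two different sentences.

```python
# pip install transformers torch
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence, word):
    """Return the contextual vector of the (sub)word `word` in `sentence`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]          # (seq_len, hidden_size)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

v1 = word_vector("the king ruled the empire", "king")
v2 = word_vector("king arthur is a legendary british leader", "king")
# Unlike word2vec, the two vectors differ because the surrounding context differs.
print(torch.cosine_similarity(v1, v2, dim=0).item())
```
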
  18. Structured data can be extracted from the images of receipts

    and invoices Learn a vector considering the relative position in a picture • Majumder et al. 2020. Representation Learning for Information Extraction from Form-like Documents. ACL. 18 https://ai.googleblog.com/2020/06/extracting-structured-data-from.html
  19. Progress of language understanding by pre-trained language models

    Pre-train language models on large-scale data (+ fine-tune on small-scale data) → Huge improvements in many NLU tasks • 2018/11 Google BERT (NAACL 2019) • 2019/01 Facebook XLM (EMNLP 2019) • 2019/01 Microsoft MT-DNN (ACL 2019) • 2019/06 Google XLNet (NeurIPS 2019) 19

    GLUE scores (https://gluebenchmark.com/leaderboard; m: matched, mm: mismatched):
                                 BERT       XLM        MT-DNN     XLNet      Human
    Sentiment analysis (SST-2)   94.9       95.6       97.5       97.1       97.8
    Paraphrase (MRPC)            89.3       90.7       93.7       92.9       86.3
    Inference (MNLI-m/mm)        86.7/85.9  89.1/88.5  91.0/90.8  91.3/91.0  92.0/92.8
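
The pre-train + fine-tune recipe behind numbers like those above can be sketched with the Hugging Face transformers Trainer API on SST-2; this is an illustrative setup under assumed hyperparameters, not the configuration of any system in the table.

```python
# pip install transformers datasets torch
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Start from a pre-trained encoder and add a fresh classification head.
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# SST-2 (sentiment analysis) from the GLUE benchmark, as in the table above.
dataset = load_dataset("glue", "sst2")
dataset = dataset.map(lambda ex: tokenizer(ex["sentence"], truncation=True), batched=True)

args = TrainingArguments(output_dir="sst2-finetune", num_train_epochs=3,
                         per_device_train_batch_size=32)
trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
                  train_dataset=dataset["train"], eval_dataset=dataset["validation"])
trainer.train()   # only the fine-tuning stage; pre-training was already done for us
```
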
  20. BERT is also effective for numerical data stored in a

    table format Pre-training can be applied to table-like datasets • Herzig et al. 2020. TaPas: Weakly Supervised Table Parsing via Pre-training. ACL. 20 https://ai.googleblog.com/2020/04/using-neural-networks-to-find-answers.html
  21. BERT can be improved by pre-training using task-specific datasets •

    Liu et al. 2020. FinBERT: A Pre-trained Financial Language Representation Model for Financial Text Mining. IJCAI. • Chalkidis et al. 2020. LEGAL-BERT: The Muppets straight out of Law School. EMNLP. 21 [Figures: LEGAL-BERT (above), FinBERT (left)]
  22. BERT encodes grammar and meaning Syntax and semantics • Coenen

    et al. 2019. Visualizing and Measuring the Geometry of BERT. NeurIPS. Cross-lingual grammar • Chi et al. 2020. Finding Universal Grammatical Relations in Multilingual BERT. ACL. 22
  23. Problems in statistical methods • Needs a large-scale dataset for

    each target task (e.g., machine translation, dialogue, summarization, …) → Self-supervised learning can be used in pre-training • Generated texts are not fluent (can be recognized as machine-generated texts easily) • Not easy to integrate multiple modalities (speech, vision, …) 23
  24. Overview 1. Introduction 2. Natural language understanding via deep learning

    3. Natural language generation by pre-trained models 4. Multimodal natural language processing 5. Open problems in deep learning era 6. Summary 24
  25. Progress of neural machine translation after the 2010s

    1. Encoder-decoder architecture: combination of two neural network models (2013) 2. Attention mechanism: dynamically determines contextual information (2015), self-attention network (2017) 3. Cross-lingual methods: language-independent subwords (2016), joint learning of multilingual models (2017), pre-training of encoder-decoders (2019) 4. Zero-shot translation: Google NMT (2016), OpenAI GPT-3 (2020) 25
  26. Encoder-decoder models for language (understanding and) generation Combines two neural

    networks • Sutskever et al. 2014. Sequence to Sequence Learning with Neural Networks. NeurIPS. 26 [Diagram: the encoder reads the Japanese source “深層学習 マジ やばい” (“deep learning is really cool”) and the decoder generates “DL is really cool </s>” word by word] Encode a sentence vector from source word vectors
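
A minimal PyTorch sketch of the encoder-decoder idea: the encoder compresses the source sentence into a single vector that initializes the decoder, which then predicts the target sentence token by token. Vocabulary sizes, dimensions, and the single GRU layer are arbitrary assumptions for brevity.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Toy encoder-decoder: source ids -> sentence vector -> target ids."""
    def __init__(self, src_vocab=1000, tgt_vocab=1000, dim=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # The final encoder state is the single "sentence vector" of the source.
        _, sentence_vec = self.encoder(self.src_emb(src_ids))
        # The decoder is conditioned on it and predicts the next target word.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), sentence_vec)
        return self.out(dec_out)            # (batch, tgt_len, tgt_vocab) logits

model = Seq2Seq()
src = torch.randint(0, 1000, (2, 5))        # source word ids, e.g. 深層学習 マジ やばい </s>
tgt = torch.randint(0, 1000, (2, 6))        # target word ids, e.g. DL is really cool </s>
print(model(src, tgt).shape)                # torch.Size([2, 6, 1000])
```
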
  27. Decoding target texts by attending the source texts (context vectors)

    Weighted sum of hidden source vectors • Bahdanau et al. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. ICLR. 27 [Diagram: at each decoding step the model attends over the source “深層学習 マジ やばい </s>” while generating “DL is really cool </s>”] Using a combination of context (word) vectors instead of a single sentence vector
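
The attention mechanism itself is only a few lines: score each source hidden vector against the current decoder state, normalize with softmax, and take the weighted sum as the context vector. The sketch uses dot-product scores rather than the original additive scoring of Bahdanau et al., as a simplification.

```python
import torch
import torch.nn.functional as F

def attention_context(decoder_state, encoder_states):
    """decoder_state: (dim,); encoder_states: (src_len, dim).
    Returns the context vector and the attention weights."""
    scores = encoder_states @ decoder_state          # (src_len,) similarity scores
    weights = F.softmax(scores, dim=0)               # normalize into a distribution
    context = weights @ encoder_states               # weighted sum of source vectors
    return context, weights

encoder_states = torch.randn(4, 256)   # one hidden vector per source word
decoder_state = torch.randn(256)
context, weights = attention_context(decoder_state, encoder_states)
print(weights)   # how much each source word contributes to the next target word
```
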
  28. Transformer attends to both encoder- and decoder-side sequences

    Fluent output is obtained by generating the words one by one (Vaswani et al., 2017) [Diagram: the encoder reads the Japanese source “今⽇ 暑い です ね” (“it is hot today, isn’t it”) and the decoder, starting from a beginning-of-sentence token, generates “it is …” one word at a time]
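
For completeness, the scaled dot-product attention at the core of the Transformer can be written directly from its formula, Attention(Q, K, V) = softmax(QK^T / sqrt(d)) V; the tensor sizes and the causal mask below are arbitrary assumptions for illustration.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V."""
    d = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d)      # (..., q_len, k_len)
    if mask is not None:                                  # e.g. causal mask in the decoder
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return F.softmax(scores, dim=-1) @ V

# Self-attention: queries, keys, and values all come from the same sequence.
x = torch.randn(1, 6, 64)                # (batch, seq_len, dim)
causal = torch.tril(torch.ones(6, 6))    # decoder-side mask: attend only to the past
out = scaled_dot_product_attention(x, x, x, mask=causal)
print(out.shape)                         # torch.Size([1, 6, 64])
```
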
  29. Multilingual encoders can learn language-independent sentence representations

    Translation without a parallel corpus of the target language • Johnson et al. 2017. Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. TACL. 30 [Figure] Visualization of sentence vectors with the same meaning (colors represent the meaning of the sentences)
  30. Encoder-decoders enable sequence-to-sequence transformation in any form

    • Zaremba and Sutskever. 2015. Learning to Execute. → Learns an interpreter of a Python-like language • Vinyals et al. 2015. Show and Tell: A Neural Image Caption Generator. → Generates texts from an image 31
  31. Pre-trained models are not good at making (inter-sentential) inferences

    • Parikh et al. 2020. ToTTo: A Controlled Table-to-Text Generation Dataset. EMNLP. → Dataset for generating texts from tables 32 https://ai.googleblog.com/2021/01/totto-controlled-table-to-text.html
  32. GPT: Generative Pre-trained Transformer

    Performance on many tasks improves with a large decoder • Brown et al. 2020. Language Models are Few-Shot Learners. NeurIPS. 33
  33. GPT (w/ Transformer) learns from large-scale data with the help of a massive model

    Transformer is Turing complete • Yun et al. 2020. Are Transformers Universal Approximators of Sequence-to-sequence Functions? ICLR. 34 (Brown et al., 2020)
  34. Prompt engineering is a new paradigm for interacting with AI

    through language 35 (Kojima et al., 2022)
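
Kojima et al. (2022) showed that appending a reasoning trigger such as “Let’s think step by step” to the prompt (zero-shot chain-of-thought) improves reasoning with no parameter updates. The snippet below only builds such prompts as strings; the example question is in the style of the arithmetic word problems studied in that line of work.

```python
# Zero-shot chain-of-thought prompting (Kojima et al., 2022):
# the same question, with and without the reasoning trigger appended.
question = ("A juggler can juggle 16 balls. Half of the balls are golf balls, "
            "and half of the golf balls are blue. How many blue golf balls are there?")

plain_prompt = f"Q: {question}\nA:"
cot_prompt = f"Q: {question}\nA: Let's think step by step."

# Either string is sent to a large pre-trained language model as-is;
# no gradient update or task-specific training is involved.
for name, prompt in [("standard", plain_prompt), ("zero-shot CoT", cot_prompt)]:
    print(f"--- {name} ---\n{prompt}\n")
```
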
  35. Problems in statistical methods • Needs a large-scale dataset for

    each target task (e.g., machine translation, dialogue, summarization, …) → Self-supervised learning can be used in pre-training • Generated texts are not fluent (can be recognized as machine-generated texts easily) → Encoder-decoder models boost fluency of text generation • Not easy to integrate multiple modalities (speech, vision, …) 36
  36. Overview 1. Introduction 2. Natural language understanding via deep learning

    3. Natural language generation by pre-trained models 4. Multimodal natural language processing 5. Open problems in deep learning era 6. Summary 37
  37. Multimodal extension: Interaction between vision and language by deep learning

    Language: Transformer (SAN) • Speech: sequence models (RNN/CTC) • Image: convolution (CNN) 38 CNN: Convolutional Neural Network; RNN: Recurrent Neural Network; CTC: Connectionist Temporal Classification; SAN: Self-Attention Network
  38. Deep learning advanced object detection and semantic segmentation

    Extract rectangular regions and assign categories (semantic segmentation) • Girshick et al. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR. 39
  39. Attention mechanism is also one of the key components in

    object detection Object detection improved by attending to regions from a CNN • Ren et al. 2016. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. PAMI. 40
  40. From semantic (rectangular region) segmentation to instance segmentation

    Instance segmentation by masking the output of Faster R-CNN • He et al. 2017. Mask R-CNN. ICCV. 41
  41. Image captioning by combining vision (CNN) and language (RNN) networks

    Separate models for vision and language • Xu et al. 2015. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML. 42 [Figure] Visualization of the attention during generation of each word in image captioning (top: soft-attention, bottom: hard-attention)
  42. Fusing vision and language by attention Integration of the output

    of Faster R-CNN by attention • Anderson et al. 2018. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. CVPR. 43
  43. Visual information helps neural machine translation

    Attending to object regions during translation • Zhao et al. 2021. Region-Attentive Multimodal Neural Machine Translation. Neurocomputing. 44
  44. GPT can generate images from texts GPT learns to generate

    images just like texts • Ramesh et al. 2021. Zero-Shot Text-to-Image Generation. 45 [Example prompt] “an illustration of a baby daikon radish in a tutu walking a dog” (https://openai.com/blog/dall-e/)
  45. 46

  46. Problems in statistical methods • Needs a large-scale dataset for

    each target task (e.g., machine translation, dialogue, summarization, …) → Self-supervised learning can be used in pre-training • Generated texts are not fluent (can be recognized as machine-generated texts easily) → Encoder-decoder models boost fluency of text generation • Not easy to integrate multiple modalities (speech, vision, …) → Transformers can unify input/output of any modality 47
  47. Overview 1. Introduction 2. Natural language understanding via deep learning

    3. Natural language generation by pre-trained models 4. Multimodal natural language processing 5. Open problems in deep learning era 6. Summary 48
  48. Problems in pre-trained models What does BERT understand? (BERTology) •

    Not good at sequence modeling with multiple actions • Not good at dealing with knowledge based on real-world experience Bias in pre-trained models • Bias from data • Bias from models 49 Sentiment analysis via GPT-3 by generating texts from “The {race} man/woman was very …” (Brown et al., 2020)
  49. Problems in language generation Evaluation of the generated texts •

    Human evaluation and automatic evaluation (Humans prefer fluent texts over adequate ones) • Meta-evaluation or evaluation methodology (Data leakage may occur) Treatment of the generated texts • Copyright (data source and output) • Ethical consideration (socio-political issue) 50
  50. Summary • Pre-trained models can solve the data acquisition bottleneck

    • Encoder-decoder models enable fluent language generation • Deep learning opens the possibility of integrating multiple modalities 51