Slide 1

Recent advances in natural language understanding and natural language generation Mamoru Komachi Tokyo Metropolitan University IEEE 5th ICOIACT 2022 August 24, 2022

Slide 2

Short bio • -2005.03 The University of Tokyo (B.L.A.) “The Language Policy in Colonial Taiwan” • 2005.04-2010.03 Nara Institute of Science and Technology (D.Eng.) “Graph-Theoretic Approaches to Minimally-Supervised Natural Language Learning” • 2010.04-2013.03 Nara Institute of Science and Technology (Assistant Prof.) • 2013.04-present Tokyo Metropolitan University (Associate Prof. → Prof.) 2

Slide 3

Overview 1. Introduction 2. Natural language understanding via deep learning 3. Natural language generation by pre-trained models 4. Multimodal natural language processing 5. Open problems in deep learning era 6. Summary 3

Slide 4

History of natural language processing 4 • 1950-60s: machine translation, ELIZA • 1970-80s: artificial intelligence • 1990-2000s: advances in statistical methods

Slide 5

Classical view of NLP in translation • Analyze the source text (input) and generate the target text (output) • The mainstream of (commercial) translation systems until the 1990s 5 Bernard Vauquois' Pyramid (CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=683855)

Slide 6

Progress of statistical machine translation since the 1990s 6
1. Translation model: IBM models (1993) to phrase-based methods (2003)
2. Open-source toolkits: GIZA++ (1999), Moses (2007)
3. Evaluation method: BLEU, Bilingual Evaluation Understudy (2002)
4. Optimization: minimum error rate training (2003)
5. Parallel corpora: Canadian parliamentary proceedings (Hansard), Europarl (2005), UN parallel corpus (2016)

Slide 7

Statistical machine translation: statistical learning from a parallel corpus 7
Noisy channel model: ê = argmax_e P(e|f) = argmax_e P(f|e) P(e)
① Translation model P(f|e) (TM), learned from a parallel corpus of source-target sentence pairs
② Language model P(e) (LM), learned from a raw corpus of the target language
③ Decoding by beam search
④ Optimization toward BLEU on an evaluation corpus (source sentences with reference translations)
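
Not on the slide, but as a rough illustration of the noisy channel formula above, here is a self-contained toy sketch that ranks candidate translations by log P(f|e) + log P(e); the translation table, the unigram language model, and the French-English word pairs are made-up numbers for illustration, not estimated from any corpus.

```python
import math

# Toy translation table P(f_word | e_word) and toy unigram LM; real SMT
# systems estimate these from parallel and monolingual corpora.
TM_TABLE = {("chien", "dog"): 0.9, ("chien", "cat"): 0.05,
            ("le", "the"): 0.8, ("le", "a"): 0.1}
UNIGRAM_LM = {"the": 0.07, "a": 0.05, "dog": 0.01, "cat": 0.01}

def tm_logprob(f, e):
    # log P(f | e), scored word-for-word (IBM Model 1 flavour, no reordering)
    return sum(math.log(TM_TABLE.get((fw, ew), 1e-6))
               for fw, ew in zip(f.split(), e.split()))

def lm_logprob(e):
    # log P(e) under the toy unigram language model
    return sum(math.log(UNIGRAM_LM.get(w, 1e-6)) for w in e.split())

def noisy_channel_best(f, candidates):
    # argmax_e  log P(f | e) + log P(e)
    return max(candidates, key=lambda e: tm_logprob(f, e) + lm_logprob(e))

print(noisy_channel_best("le chien", ["the dog", "a cat", "the dog barks"]))
```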

Slide 8

The larger the training data, the better the performance of the statistical model 8 Performance increases linearly with data size on a log scale (Brants et al. Large Language Models in Machine Translation. EMNLP 2007)

Slide 9

Problems in statistical methods • Need a large-scale dataset for each target task (e.g., machine translation, dialogue, summarization, …) • Generated texts are not fluent (easily recognized as machine-generated) • Not easy to integrate multiple modalities (speech, vision, …) 9

Slide 10

Overview 1. Introduction 2. Natural language understanding via deep learning 3. Natural language generation by pre-trained models 4. Multimodal natural language processing 5. Open problems in deep learning era 6. Summary 10

Slide 11

Mathematical modeling of language: What is the meaning of “king”? • King Arthur is a legendary British leader of the late 5th and early 6th centuries • Edward VIII was King of the United Kingdom and the Dominions of the British Empire, … A word can be characterized by its surrounding words → the distributional hypothesis: “You shall know a word by the company it keeps.” (Firth, 1957) 11

Slide 12

Representing the meaning of a word by using a vector (vector space model) 12
king = (545, 0, 2330, 689, 799, 1100)ᵀ, where each component counts how often “king” co-occurs with “empire”, “king”, “man”, “queen”, “rule”, and “woman”
Similarity of words (vectors): similarity = cos θ = (king · queen) / (|king| |queen|)
Co-occurrence counts (diagonal left blank):
        empire   king    man  queen   rule  woman
empire       -    545    512    195    276    374
king       545      -   2330    689    799   1100
man        512   2330      -    915    593   2620
queen      195    689    915      -    448    708
rule       276    799    593    448      -   1170
woman      374   1100   2620    708   1170      -
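
A short sketch (not from the slides) of the cosine similarity above, computed directly on the co-occurrence vectors from the table; the component order is empire, king, man, queen, rule, woman, with the diagonal entry set to 0.

```python
import numpy as np

# Co-occurrence vectors taken from the table above.
king  = np.array([545, 0, 2330, 689, 799, 1100])
queen = np.array([195, 689, 915, 0, 448, 708])
woman = np.array([374, 1100, 2620, 708, 1170, 0])

def cosine(a, b):
    # cos(theta) = (a . b) / (|a| |b|)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(king, queen))  # similarity between "king" and "queen"
print(cosine(king, woman))  # similarity between "king" and "woman"
```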

Slide 13

word2vec: Learning word vectors by self-supervision Learn word vectors (embeddings) by training a neural network to predict their context → Predict surrounding words by the dot product of vectors → Negative examples are obtained by randomly replacing the context words (self-supervised learning) • She was the mother of Queen Elizabeth II . 13 (Mikolov et al., NeurIPS 2013)
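
A minimal sketch of training skip-gram vectors with negative sampling using the gensim library (an assumption, not mentioned on the slide); the two-sentence corpus is a toy stand-in, and real embeddings require far more text.

```python
from gensim.models import Word2Vec

# Tiny toy corpus of tokenized sentences.
corpus = [
    ["she", "was", "the", "mother", "of", "queen", "elizabeth", "ii"],
    ["king", "arthur", "is", "a", "legendary", "british", "leader"],
]

# sg=1 selects skip-gram; negative=5 draws five random negative samples
# per positive (word, context) pair, i.e. self-supervised learning.
model = Word2Vec(corpus, vector_size=50, window=5, min_count=1,
                 sg=1, negative=5, epochs=50)

print(model.wv["queen"][:5])                 # first dimensions of the learned vector
print(model.wv.similarity("king", "queen"))  # cosine similarity of two words
```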

Slide 14

Word representations can quantitatively evaluate the usage/meaning of words Visualization of word vectors considering the usage of English learners (t-SNE) • Red: word vectors trained on a native-speaker corpus • Blue: word vectors trained on an English-learner corpus 14 (Kaneko et al., IJCNLP 2017)
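
A sketch of this kind of visualization using scikit-learn's t-SNE and matplotlib (libraries assumed here, not named on the slide); the "native" and "learner" matrices are random placeholders standing in for the two sets of word vectors, plotted in red and blue.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
native = rng.normal(size=(50, 100))         # stand-in for native-corpus vectors
learner = rng.normal(size=(50, 100)) + 0.5  # stand-in for learner-corpus vectors

# Project both sets jointly to 2D and plot them in the two colors.
points = TSNE(n_components=2, perplexity=10, init="random",
              random_state=0).fit_transform(np.vstack([native, learner]))
plt.scatter(points[:50, 0], points[:50, 1], c="red", label="native")
plt.scatter(points[50:, 0], points[50:, 1], c="blue", label="learner")
plt.legend()
plt.show()
```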

Slide 15

word2vec enables algebraic manipulation of word meanings (additive compositionality) The relationship between countries and their capital cities is preserved in the vector space learned by word2vec (Mikolov et al., 2013) → Algebraic operations such as “king ‒ man + woman = queen” can be performed 15
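
A sketch of the analogy operation with pretrained word2vec vectors via gensim's downloader (an assumption, not part of the slide); the word2vec-google-news-300 model is a large download, roughly 1.6 GB.

```python
import gensim.downloader as api

# Load pretrained word2vec vectors trained on Google News.
vectors = api.load("word2vec-google-news-300")

# king - man + woman ≈ ?
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# The top result is typically "queen".
```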

Slide 16

Are vectors appropriate for word representation? Adjective → matrix? • Baroni and Zamparelli. 2010. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. EMNLP. Transitive verb → tensor? • Socher et al. 2013. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. EMNLP. 16 Baroni and Zamparelli (2010)
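
A toy sketch of the "adjectives are matrices" idea: the adjective acts as a linear map applied to the noun vector. The vector and matrix below are random placeholders, not learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
car = rng.normal(size=4)       # noun as a vector
RED = rng.normal(size=(4, 4))  # adjective as a matrix (a linear map)

red_car = RED @ car            # "red car": apply the adjective to the noun
print(red_car)
```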

Slide 17

Is it possible to obtain word vectors that take sentence context into account? BERT: Bidirectional Transformers for Language Understanding (2019) → Self-supervised learning of contextualized word vectors → Modeling contexts by Transformer cells 17
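
A sketch of extracting contextualized word vectors from a pretrained BERT with the Hugging Face Transformers library (an assumption, not named on the slide); each token's vector depends on its sentence context.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["King Arthur is a legendary British leader.",
             "Edward VIII was King of the United Kingdom."]
batch = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, tokens, 768)

print(hidden.shape)  # one contextualized vector per token, per sentence
```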

Slide 18

Structured data can be extracted from the images of receipts and invoices Learn vectors that take the relative position within the document image into account • Majumder et al. 2020. Representation Learning for Information Extraction from Form-like Documents. ACL. 18 https://ai.googleblog.com/2020/06/extracting-structured-data-from.html

Slide 19

Progress of language understanding by pre-trained language models
Pre-train language models on large-scale data (+ fine-tune on small-scale data) → huge improvements in many NLU tasks
• 2018/11 Google BERT (NAACL 2019)
• 2019/01 Facebook XLM (EMNLP 2019)
• 2019/01 Microsoft MT-DNN (ACL 2019)
• 2019/06 Google XLNet (NeurIPS 2019) 19
Scores (https://gluebenchmark.com/leaderboard; m: matched, mm: mismatched):
                            BERT        XLM         MT-DNN      XLNet       Human
Sentiment analysis (SST-2)  94.9        95.6        97.5        97.1        97.8
Paraphrase (MRPC)           89.3        90.7        93.7        92.9        86.3
Inference (MNLI-m/mm)       86.7/85.9   89.1/88.5   91.0/90.8   91.3/91.0   92.0/92.8
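
A sketch of the pre-train + fine-tune recipe using Hugging Face Transformers (an assumed toolkit): load a pretrained BERT with a classification head and take one gradient step on a toy two-example sentiment batch; real fine-tuning would iterate over a dataset such as SST-2.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # pretrained encoder + new classifier head

batch = tokenizer(["a wonderful film", "a tedious mess"],
                  padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])  # 1 = positive, 0 = negative

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**batch, labels=labels).loss  # cross-entropy from the classification head
loss.backward()
optimizer.step()
print(float(loss))
```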

Slide 20

BERT is also effective for numerical data stored in a table format Pre-training can be applied to table-like datasets • Herzig et al. 2020. TaPas: Weakly Supervised Table Parsing via Pre-training. ACL. 20 https://ai.googleblog.com/2020/04/using-neural-networks-to-find-answers.html

Slide 21

BERT can be improved by pre-training on domain-specific datasets • Liu et al. 2020. FinBERT: A Pre-trained Financial Language Representation Model for Financial Text Mining. IJCAI. • Chalkidis et al. 2020. LEGAL-BERT: The Muppets straight out of Law School. EMNLP. 21

Slide 22

BERT encodes grammar and meaning Syntax and semantics • Coenen et al. 2019. Visualizing and Measuring the Geometry of BERT. NeurIPS. Cross-lingual grammar • Chi et al. 2020. Finding Universal Grammatical Relations in Multilingual BERT. ACL. 22

Slide 23

Problems in statistical methods • Need a large-scale dataset for each target task (e.g., machine translation, dialogue, summarization, …) → Self-supervised learning can be used in pre-training • Generated texts are not fluent (easily recognized as machine-generated) • Not easy to integrate multiple modalities (speech, vision, …) 23

Slide 24

Overview 1. Introduction 2. Natural language understanding via deep learning 3. Natural language generation by pre-trained models 4. Multimodal natural language processing 5. Open problems in deep learning era 6. Summary 24

Slide 25

Progress of neural machine translation since the 2010s 25
1. Encoder-decoder architecture: combination of two neural network models (2013)
2. Attention mechanism: dynamically determines contextual information (2015); self-attention network (2017)
3. Cross-lingual methods: language-independent subwords (2016); joint learning of multilingual models (2017); pre-training of encoder-decoders (2019)
4. Zero-shot translation: Google NMT (2016); OpenAI GPT-3 (2020)

Slide 26

Encoder-decoder models for language (understanding and) generation Combines two neural networks • Sutskever et al. 2014. Sequence to Sequence Learning with Neural Networks. NeurIPS. 26 The encoder encodes the source word vectors of the Japanese input “深層学習 マジ やばい” (“deep learning is really cool”) into a sentence vector, and the decoder generates the English output “DL is really cool” word by word.
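
A minimal PyTorch sketch of the encoder-decoder idea (not the slides' exact model): a GRU encoder compresses the source into a vector and a GRU decoder generates the target conditioned on it. Vocabulary sizes and dimensions are arbitrary toy values, and no training loop is shown.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, dim=64):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        _, h = self.encoder(self.src_emb(src_ids))       # h: the sentence vector
        dec_states, _ = self.decoder(self.tgt_emb(tgt_ids), h)
        return self.out(dec_states)                      # logits for each target step

model = Seq2Seq(src_vocab=100, tgt_vocab=100)
src = torch.randint(0, 100, (2, 5))  # 2 source sentences of length 5 (word ids)
tgt = torch.randint(0, 100, (2, 6))  # shifted target inputs of length 6
print(model(src, tgt).shape)         # torch.Size([2, 6, 100])
```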

Slide 27

Decoding target texts by attending to the source texts (context vectors) Weighted sum of hidden source vectors • Bahdanau et al. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. ICLR. 27 Instead of a single sentence vector, the decoder uses a combination of context (word) vectors over the source “深層学習 マジ やばい” (“deep learning is really cool”) while generating “DL is really cool”.
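
A short sketch of dot-product attention as a weighted sum of source hidden vectors; the tensors are random placeholders for the encoder states and the current decoder state.

```python
import torch
import torch.nn.functional as F

enc_states = torch.randn(5, 64)  # hidden vectors of 5 source words
dec_state = torch.randn(64)      # current decoder hidden state

scores = enc_states @ dec_state     # one score per source word
weights = F.softmax(scores, dim=0)  # attention distribution over the source
context = weights @ enc_states      # weighted sum: the context vector
print(weights, context.shape)
```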

Slide 28

Transformer attends to both encoder- and decoder-side sequences Fluent output is obtained by generating the words one by one (Vaswani et al., 2017) The encoder reads the Japanese input “今⽇ 暑い です ね” (“it is hot today, isn't it”), and the decoder generates “it is …” token by token, starting from the beginning-of-sentence symbol.

Slide 29

Encoder-decoder models with attention enable fluent language generation 29 https://ai.googleblog.com/2016/09/a-neural-network-for-machine.html Achieved human parity in Chinese-English translation (Microsoft 2018)

Slide 30

Multilingual encoders can learn language-independent sentence representations Translation without a parallel corpus of the target language pair • Johnson et al. 2017. Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. TACL. 30 Visualization of sentence vectors with the same meaning (colors represent the meaning of the sentences)

Slide 31

Encoder-decoders enable sequence-to-sequence transformation of any form • Zaremba and Sutskever. 2015. Learning to Execute. → Learns an interpreter of a Python-like language • Vinyals et al. 2015. Show and Tell: A Neural Image Caption Generator. → Generates texts from an image 31

Slide 32

Pre-trained models are not good at making (inter-sentential) inferences • Parikh et al. 2020. ToTTo: A Controlled Table-to-Text Generation Dataset. EMNLP. → A dataset for generating texts from tables 32 https://ai.googleblog.com/2021/01/totto-controlled-table-to-text.html

Slide 33

GPT: Generative Pre-trained Transformer Performance on many tasks improves with a large decoder • Brown et al. 2020. Language Models are Few-Shot Learners. NeurIPS. 33

Slide 34

GPT (w/ Transformer) learns from large-scale data with the help of a massive model Transformers are expressive enough to be universal approximators of sequence-to-sequence functions • Yun et al. 2020. Are Transformers Universal Approximators of Sequence-to-sequence Functions? ICLR. 34 (Brown et al., 2020)

Slide 35

Prompt engineering is a new paradigm for interacting with AI through language 35 (Kojima et al., 2022)
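
A sketch of zero-shot chain-of-thought prompting in the spirit of Kojima et al. (2022): append “Let's think step by step.” to the question and send the prompt to whatever large language model you have access to. The generate() function below is a placeholder, not a real API client.

```python
QUESTION = ("A juggler has 16 balls. Half of the balls are golf balls, and "
            "half of the golf balls are blue. How many blue golf balls are there?")

def build_prompt(question: str) -> str:
    # Zero-shot chain-of-thought: the trigger phrase elicits step-by-step reasoning.
    return f"Q: {question}\nA: Let's think step by step."

def generate(prompt: str) -> str:
    raise NotImplementedError("call your large language model of choice here")

prompt = build_prompt(QUESTION)
print(prompt)
# answer = generate(prompt)  # the model writes its reasoning before the final answer
```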

Slide 36

Problems in statistical methods • Need a large-scale dataset for each target task (e.g., machine translation, dialogue, summarization, …) → Self-supervised learning can be used in pre-training • Generated texts are not fluent (easily recognized as machine-generated) → Encoder-decoder models boost fluency of text generation • Not easy to integrate multiple modalities (speech, vision, …) 36

Slide 37

Overview 1. Introduction 2. Natural language understanding via deep learning 3. Natural language generation by pre-trained models 4. Multimodal natural language processing 5. Open problems in deep learning era 6. Summary 37

Slide 38

Multimodal extension: Interaction between vision and language by deep learning • Language: Transformer (SAN) • Speech: sequence models (RNN/CTC) • Image: convolution (CNN) 38 (CNN: Convolutional Neural Network; RNN: Recurrent Neural Network; CTC: Connectionist Temporal Classification; SAN: Self-Attention Network)

Slide 39

Deep learning advanced object detection and semantic segmentation Extract rectangular regions and assign categories to them (object detection and semantic segmentation) • Girshick et al. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR. 39

Slide 40

Attention mechanism is also one of the key components in object detection Object detection improved by attending to regions proposed from CNN features • Ren et al. 2016. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. PAMI. 40

Slide 41

From semantic (rectangular-region) segmentation to instance segmentation Instance segmentation by adding a mask prediction branch to Faster R-CNN • He et al. 2017. Mask R-CNN. ICCV. 41

Slide 42

Image captioning by combining vision (CNN) and language (RNN) networks Separate models for vision and language • Xu et al. 2015. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML. 42 Visualization of the attention during generation of each word in image captioning (top: soft attention, bottom: hard attention)

Slide 43

Fusing vision and language by attention Integration of the outputs of Faster R-CNN by attention • Anderson et al. 2018. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. CVPR. 43

Slide 44

Visual information helps neural machine translation Attending to object regions during translation • Zhao et al. 2021. Region-Attentive Multimodal Neural Machine Translation. Neurocomputing. 44

Slide 45

GPT can generate images from texts GPT learns to generate images just like texts • Ramesh et al. 2021. Zero-Shot Text-to-Image Generation. 45 an illustration of a baby daikon radish in a tutu walking a dog (https://openai.com/blog/dall-e/)

Slide 46

46

Slide 47

Problems in statistical methods • Need a large-scale dataset for each target task (e.g., machine translation, dialogue, summarization, …) → Self-supervised learning can be used in pre-training • Generated texts are not fluent (easily recognized as machine-generated) → Encoder-decoder models boost fluency of text generation • Not easy to integrate multiple modalities (speech, vision, …) → Transformers can unify input/output of any modality 47

Slide 48

Overview 1. Introduction 2. Natural language understanding via deep learning 3. Natural language generation by pre-trained models 4. Multimodal natural language processing 5. Open problems in deep learning era 6. Summary 48

Slide 49

Problems in pre-trained models What does BERT understand? (BERTology) • Not good at sequence modeling with multiple actions • Not good at dealing with knowledge based on real-world experience Bias in pre-trained models • Bias from data • Bias from models 49 Sentiment analysis via GPT-3 by generating texts from “The {race} man/woman was very …” (Brown et al., 2020)

Slide 50

Problems in language generation Evaluation of the generated texts • Human evaluation and automatic evaluation (Humans prefer fluent texts over adequate ones) • Meta-evaluation or evaluation methodology (Data leakage may occur) Treatment of the generated texts • Copyright (data source and output) • Ethical consideration (socio-political issue) 50

Slide 51

Summary • Pre-trained models can solve the data acquisition bottleneck • Encoder-decoder models enable fluent language generation • Deep learning opens the possibility of integrating multiple modalities 51