
Recent advances in natural language understanding and natural language generation

Introduction to recent advances in natural language processing, especially focusing on deep learning methods.

Invited talk at IEEE 5th ICOIACT 2022.

Mamoru Komachi

August 24, 2022

Transcript

  1. Recent advances in natural language understanding and natural language generation

    Mamoru Komachi Tokyo Metropolitan University IEEE 5th ICOIACT 2022 August 24, 2022
  2. Short bio • -2005.03 The University of Tokyo (B.L.A.) “The

    Language Policy in Colonial Taiwan” • 2005.04-2010.03 Nara Institute of Science and Technology (D.Eng.) “Graph-Theoretic Approaches to Minimally-Supervised Natural Language Learning” • 2010.04-2013.03 Nara Institute of Science and Technology (Assistant Prof.) • 2013.04-present Tokyo Metropolitan University (Associate Prof. → Prof.) 2
  3. Overview 1. Introduction 2. Natural language understanding via deep learning

    3. Natural language generation by pre-trained models 4. Multimodal natural language processing 5. Open problems in deep learning era 6. Summary 3
  4. History of natural language processing

    Machine translation (1950-60s), artificial intelligence / ELIZA (1970-80s), advances in statistical methods (1990-2000s) 4
  5. Classical view of NLP in translation • Analyze the source

    text (input) and generate the target text (output) • Mainstream of (commercial) translation models until the 1990s 5 Bernard Vauquoisʼ Pyramid (CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=683855)
  6. Progress of statistical machine translation after 1990s 1. Translation model

    2. Open-source toolkits 3. Evaluation method 4. Optimization 5. Parallel corpora 6 IBM models (1993) to phrase-based methods (2003) GIZA++ (1999) Moses (2007) BLEU: Bilingual Evaluation Understudy (2002) Minimum error rate training (2003) Europarl (2005) UN parallel corpus (2016) Canadian parliamentary proceedings (Hansard)
  7. Statistical machine translation: statistical learning from a parallel

    corpus Noisy channel model: ê = argmax_e P(e | f) = argmax_e P(f | e) P(e) 7 ① Translation model P(f | e) learned from a parallel corpus (source-target sentence pairs) ② Language model P(e) learned from a raw target-language corpus ③ Decoding by beam search ④ Optimization toward BLEU on an evaluation corpus (reference)
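
To make the noisy-channel decomposition above concrete, here is a minimal sketch in Python (toy candidate list and made-up log-probabilities, not a real SMT decoder with beam search):

import math

# Noisy-channel reranking: choose the target sentence e that maximizes
# log P(f|e) + log P(e). Candidates and probabilities below are invented.
def best_translation(candidates, tm_logprob, lm_logprob):
    return max(candidates, key=lambda e: tm_logprob[e] + lm_logprob[e])

candidates = ["it is hot today", "today hot is it"]
tm = {"it is hot today": math.log(0.3), "today hot is it": math.log(0.4)}     # ~ log P(f|e)
lm = {"it is hot today": math.log(0.05), "today hot is it": math.log(0.001)}  # ~ log P(e)

print(best_translation(candidates, tm, lm))  # the fluent candidate wins thanks to the LM
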
  8. The larger the training data, the better the performance of

    the statistical model 8 [Plot: translation performance vs. training data size; Brants et al. Large Language Models in Machine Translation. EMNLP 2007] Performance increases linearly with data size on a log scale
  9. Problems in statistical methods • Needs a large-scale dataset for

    each target task (e.g., machine translation, dialogue, summarization, …) • Generated texts are not fluent (can be recognized as machine-generated texts easily) • Not easy to integrate multiple modalities (speech, vision, …) 9
  10. Overview 1. Introduction 2. Natural language understanding via deep learning

    3. Natural language generation by pre-trained models 4. Multimodal natural language processing 5. Open problems in deep learning era 6. Summary 10
  11. Mathematical modeling of a language What is the meaning of

    king? • King Arthur is a legendary British leader of the late 5th and early 6th centuries • Edward VIII was King of the United Kingdom and the Dominions of the British Empire, … A word can be characterized by its surrounding words → the distributional hypothesis: “You shall know a word by the company it keeps.” (Firth, 1957) 11
  12. king = (545, 0, 2330, 689, 799, 1100)ᵀ How often

    it co-occurs with “empire”, how often it co-occurs with “man”, …… Similarity of words (vectors): similarity = cos θ = (king · queen) / (|king| |queen|) Representing the meaning of a word by using a vector (vector space model) 12 Co-occurrence counts:
              empire   king    man   queen   rule   woman
    empire       -      545    512    195     276     374
    king        545      -    2330    689     799    1100
    man         512    2330     -     915     593    2620
    queen       195     689    915     -      448     708
    rule        276     799    593    448      -     1170
    woman       374    1100   2620    708    1170      -
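
A minimal sketch of the cosine similarity above, using the co-occurrence counts from the table (vector components follow the column order empire, king, man, queen, rule, woman, with the diagonal set to 0):

import math

# Co-occurrence vectors taken from the table above.
king  = [545, 0, 2330, 689, 799, 1100]
queen = [195, 689, 915, 0, 448, 708]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(cosine(king, queen))  # similarity of "king" and "queen"
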
  13. word2vec: Learning word vectors by self-supervision Learn word vectors

    (embeddings) by training a neural network for predicting their context → Predict surrounding words by the dot product of vectors → Negative examples are obtained by randomly replacing the context words (self-supervised learning) • She was the mother of Queen Elizabeth II . 13 (Mikolov et al., NeurIPS 2013)
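
A minimal sketch of skip-gram training with negative sampling using the gensim library (gensim 4.x parameter names assumed; the two-sentence corpus is far too small for meaningful vectors and is only for illustration):

from gensim.models import Word2Vec

# Toy corpus (made up); real word2vec training needs millions of sentences.
sentences = [
    ["she", "was", "the", "mother", "of", "queen", "elizabeth", "ii"],
    ["king", "arthur", "is", "a", "legendary", "british", "leader"],
]

# Skip-gram (sg=1) with negative sampling (negative=5): predict context words
# from the centre word, using randomly drawn words as negative examples.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, negative=5)
print(model.wv["queen"][:5])   # first few dimensions of the learned embedding
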
  14. Word representations can quantitatively evaluate the usage/meaning of words Visualization

    of word vectors considering the usage of English learners (t-SNE) • Red: word vectors trained on a native speaker corpus • Blue: word vectors trained on an English learner corpus 14 (Kaneko et al., IJCNLP 2017)
  15. word2vec enables algebraic manipulation of word meanings (additive compositionality) The

    relationship between countries and their capital cities is preserved in the vector space learned by word2vec (Mikolov et al., 2013) → Algebraic operations such as “king ‒ man + woman = queen” can be performed 15
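
A minimal sketch of the “king ‒ man + woman” operation with pre-trained vectors via gensim's downloader API ("glove-wiki-gigaword-100" is one of the models bundled with gensim-data; any word2vec-format model works the same way):

import gensim.downloader as api

# Load pre-trained vectors (downloaded on first use).
wv = api.load("glove-wiki-gigaword-100")

# king - man + woman ≈ ?
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# "queen" is typically the top result

The additive behaviour is a property of the learned space, not of the library: most_similar simply adds and subtracts the unit-normalized vectors and returns the nearest neighbours by cosine similarity.
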
  16. Are vectors appropriate for word representation? Adjective → Matrix?

    • Baroni and Zamparelli. 2010. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. EMNLP. Transitive verb → Tensor? • Socher et al. 2013. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. EMNLP. 16 Baroni and Zamparelli (2010)
  17. Is it possible to obtain a word vector in a

    sentence considering its context? BERT: Bidirectional Transformers for Language Understanding (2019) → Self-supervised learning of contextualized word vectors → Modeling contexts by Transformer cells 17
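
A minimal sketch of extracting contextualized word vectors with the Hugging Face transformers library (the bert-base-uncased checkpoint is assumed); unlike word2vec, the same word gets different vectors in different sentences:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def vector_of(sentence, word):
    # Return the contextualized vector of the given (single-subword) word.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]          # (tokens, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

v1 = vector_of("she sat by the river bank", "bank")
v2 = vector_of("he robbed the bank yesterday", "bank")
print(torch.cosine_similarity(v1, v2, dim=0))  # < 1: same word, different contexts
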
  18. Structured data can be extracted from the images of receipts

    and invoices Learn a vector considering the relative position in a picture • Majumder et al. 2020. Representation Learning for Information Extraction from Form-like Documents. ACL. 18 https://ai.googleblog.com/2020/06/extracting-structured-data-from.html
  19. Progress of language understanding by pre-trained language models Pre-train language

    models by large-scale data (+ fine-tune on small-scale data) → Huge improvements in many NLU tasks • 2018/11 Google BERT (NAACL 2019) • 2019/01 Facebook XLM (EMNLP 2019) • 2019/01 Microsoft MT-DNN (ACL 2019) • 2019/06 Google XLNet (NeurIPS 2019) 19 GLUE scores (https://gluebenchmark.com/leaderboard; m: match, mm: mismatch):
                                 BERT       XLM        MT-DNN     XLNet      Human
    Sentiment analysis (SST-2)   94.9       95.6       97.5       97.1       97.8
    Paraphrase (MRPC)            89.3       90.7       93.7       92.9       86.3
    Inference (MNLI-m/mm)        86.7/85.9  89.1/88.5  91.0/90.8  91.3/91.0  92.0/92.8
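
A minimal sketch of the pre-train + fine-tune recipe with Hugging Face transformers (bert-base-uncased assumed; the two sentiment examples are invented stand-ins for a task dataset such as SST-2):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Tiny made-up sentiment examples; real fine-tuning iterates over a full dataset.
texts = ["a wonderful, heartfelt film", "a dull and lifeless movie"]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):                         # a few steps only, for illustration
    out = model(**batch, labels=labels)    # cross-entropy loss on the classification head
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
print(out.loss.item())
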
  20. BERT is also effective for numerical data stored in a

    table format Pre-training can be applied to table-like datasets • Herzig et al. 2020. TaPas: Weakly Supervised Table Parsing via Pre-training. ACL. 20 https://ai.googleblog.com/2020/04/using-neural-networks-to-find-answers.html
  21. BERT can be improved by pre-training using task-specific datasets •

    Liu et al. 2020. FinBERT: A Pre-trained Financial Language Representation Model for Financial Text Mining. IJCAI. • Chalkidis et al. 2020. LEGAL-BERT: The Muppets straight out of Law School. EMNLP. 21
  22. BERT encodes grammar and meaning Syntax and semantics • Coenen

    et al. 2019. Visualizing and Measuring the Geometry of BERT. NeurIPS. Cross-lingual grammar • Chi et al. 2020. Finding Universal Grammatical Relations in Multilingual BERT. ACL. 22
  23. Problems in statistical methods • Needs a large-scale dataset for

    each target task (e.g., machine translation, dialogue, summarization, …) → Self-supervised learning can be used in pre-training • Generated texts are not fluent (can be recognized as machine-generated texts easily) • Not easy to integrate multiple modalities (speech, vision, …) 23
  24. Overview 1. Introduction 2. Natural language understanding via deep learning

    3. Natural language generation by pre-trained models 4. Multimodal natural language processing 5. Open problems in deep learning era 6. Summary 24
  25. Progress of neural machine translation after 2010s 1. Encoder-decoder architecture

    2. Attention mechanism 3. Cross-lingual methods 4. Zero-shot translation 25 Combination of two neural network models (2013) Dynamically determines contextual information (2015) Self-attention network (2017) Google NMT (2016) OpenAI GPT-3 (2020) Language-independent subwords (2016) Joint learning of multilingual models (2017) Pre-training of encoder-decoders (2019)
  26. Encoder-decoder models for language (understanding and) generation Combines two neural

    networks • Sutskever et al. 2014. Sequence to Sequence Learning with Neural Networks. NeurIPS. 26 [Figure: the encoder reads the source “深層学習 マジ やばい </s>” (“DL is really cool”) and the decoder generates “DL is really cool </s>”] Encode a sentence vector from source word vectors
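
A minimal sketch of the encoder-decoder idea in PyTorch (GRUs instead of the original LSTMs, toy vocabulary sizes, random token IDs; training loss and decoding are omitted):

import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, dim=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # The final encoder state acts as the single "sentence vector".
        _, sentence_vec = self.encoder(self.src_emb(src_ids))
        dec_states, _ = self.decoder(self.tgt_emb(tgt_ids), sentence_vec)
        return self.out(dec_states)          # logits over the target vocabulary

model = Seq2Seq(src_vocab=8000, tgt_vocab=8000)
src = torch.randint(0, 8000, (1, 4))         # e.g. the four source tokens above
tgt = torch.randint(0, 8000, (1, 5))         # shifted target tokens
print(model(src, tgt).shape)                 # (1, 5, 8000)
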
  27. Decoding target texts by attending the source texts (context vectors)

    Weighted sum of hidden source vectors • Bahdanau et al. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. ICLR. 27 [Figure: at each decoding step the attention weights over the source “深層学習 マジ やばい </s>” form a context vector] Using a combination of context (word) vectors instead of a single sentence vector
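
A minimal sketch of the attention step described above, with the alignment score simplified to a dot product (Bahdanau et al. use a small feed-forward scorer); all tensors are random placeholders:

import torch
import torch.nn.functional as F

enc_states = torch.randn(4, 256)      # one hidden vector per source token
dec_state = torch.randn(256)          # decoder state at the current step

scores = enc_states @ dec_state       # one alignment score per source position
weights = F.softmax(scores, dim=0)    # attention distribution over the source
context = weights @ enc_states        # weighted sum = context vector, shape (256,)
print(weights, context.shape)
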
  28. Transformer attends both encoder and decoder side sequences Fluent output

    is obtained by generating the words one by one (Vaswani et al., 2017) [Figure: the encoder reads “今日 暑い です ね” (“It is hot today, isn't it”) and the decoder generates “it is …” token by token, starting from the <BOS> symbol]
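
A minimal sketch of the scaled dot-product attention applied inside each Transformer layer (single head, no learned projections or masking):

import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V  (Vaswani et al., 2017)
    d_k = q.size(-1)
    weights = F.softmax(q @ k.transpose(-2, -1) / math.sqrt(d_k), dim=-1)
    return weights @ v

x = torch.randn(5, 64)                               # 5 tokens, 64-dim representations
print(scaled_dot_product_attention(x, x, x).shape)   # self-attention: (5, 64)
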
  29. Multilingual encoders can learn language independent sentence representation Translation without

    a parallel corpus of the target language pair • Johnson et al. 2017. Googleʼs Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. TACL. 30 Visualization of sentence vectors with the same meaning (colors represent the meaning of the sentences)
  30. Encoder-decoders enable sequence-to-sequence transformation in any form • Zaremba

    and Sutskever. 2015. Learning to Execute. → Learns an interpreter of a Python-like language • Vinyals et al. 2015. Show and Tell: A Neural Image Caption Generator. → Generates texts from an image 31
  31. Pre-trained models are not good at making (inter-sentential) inference •

    Parikh et al. 2020. ToTTo: A Controlled Table-to-Text Generation Dataset. EMNLP. → Dataset for generating texts from tables 32 https://ai.googleblog.com/2021/01/totto-controlled-table-to-text.html
  32. GPT: Generative Pre-trained Transformer Performance on many tasks improves with

    a large decoder • Brown et al. 2020. Language Models are Few-Shot Learners. NeurIPS. 33
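
A minimal sketch of the few-shot (in-context learning) setup from Brown et al. (2020): the task is specified only through demonstrations inside the prompt, and the model's completion of the last line is taken as the answer (no parameter updates):

# Few-shot prompt in the style of Brown et al. (2020): a task description,
# a few demonstrations, and an unfinished final example for the model to complete.
prompt = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "cheese => fromage\n"
    "hot dog =>"
)
print(prompt)   # this string, sent as-is to the model, is the only "supervision"
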
  33. GPT (w/ Transformer) learns from large-scale data with the

    help of a massive model Transformers are universal approximators of sequence-to-sequence functions • Yun et al. 2020. Are Transformers Universal Approximators of Sequence-to-sequence Functions? ICLR. 34 (Brown et al., 2020)
  34. Prompt engineering is a new paradigm for AI in interaction

    with language 35 (Kojima et al., 2022)
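
A minimal sketch of the prompting technique from Kojima et al. (2022): appending the trigger phrase "Let's think step by step." turns a plain question into a zero-shot chain-of-thought prompt (the arithmetic question below is invented):

question = "A farmer has 3 crates with 4 apples each and gives away 2 apples. How many apples remain?"

# Zero-shot chain-of-thought: the trigger phrase elicits intermediate reasoning
# before the final answer (Kojima et al., 2022).
prompt = f"Q: {question}\nA: Let's think step by step."
print(prompt)   # the model's completion contains the reasoning chain and the answer
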
  35. Problems in statistical methods • Needs a large-scale dataset for

    each target task (e.g., machine translation, dialogue, summarization, …) → Self-supervised learning can be used in pre-training • Generated texts are not fluent (can be recognized as machine-generated texts easily) → Encoder-decoder models boost fluency of text generation • Not easy to integrate multiple modalities (speech, vision, …) 36
  36. Overview 1. Introduction 2. Natural language understanding via deep learning

    3. Natural language generation by pre-trained models 4. Multimodal natural language processing 5. Open problems in deep learning era 6. Summary 37
  37. Multimodal extension: Interaction between vision and language by deep learning

    Language: Transformer (SAN); Speech: sequence models (RNN/CTC); Image: convolution (CNN) 38 CNN: Convolutional Neural Network; RNN: Recurrent Neural Network; CTC: Connectionist Temporal Classification; SAN: Self-Attention Network
  38. Deep learning advanced object detection and semantic segmentation Extract rectangular

    regions and assign categories (semantic segmentation) • Girshick et al. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR. 39
  39. Attention mechanism is also one of the key components in

    object detection Object detection improved by attending regions from CNN • Ren et al. 2016. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. PAMI. 40
  40. From semantic (rectangular region) segmentation to instance segmentation Instance segmentation

    by masking output of Faster R-CNN • He et al. 2017. Mask R-CNN. ICCV. 41
  41. Image captioning by combining vision (CNN) and language (RNN) networks

    Separate models for vision and language • Xu et al. 2015. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML. 42 Visualization of the attention during generation of each word in image captioning (top: soft-attention, bottom: hard-attention)
  42. Fusing vision and language by attention Integration of the output

    of Faster R-CNN by attention • Anderson et al. 2018. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. CVPR. 43
  43. Visual information helps neural machine translation Attending the object region

    during translation • Zhao et al. 2021. Region-Attentive Multimodal Neural Machine Translation. Neurocomputing. 44
  44. GPT can generate images from texts GPT learns to generate

    images just like texts • Ramesh et al. 2021. Zero-Shot Text-to-Image Generation. 45 an illustration of a baby daikon radish in a tutu walking a dog (https://openai.com/blog/dall-e/)
  45. 46

  46. Problems in statistical methods • Needs a large-scale dataset for

    each target task (e.g., machine translation, dialogue, summarization, …) → Self-supervised learning can be used in pre-training • Generated texts are not fluent (can be recognized as machine-generated texts easily) → Encoder-decoder models boost fluency of text generation • Not easy to integrate multiple modalities (speech, vision, …) → Transformers can unify input/output of any modalities 47
  47. Overview 1. Introduction 2. Natural language understanding via deep learning

    3. Natural language generation by pre-trained models 4. Multimodal natural language processing 5. Open problems in deep learning era 6. Summary 48
  48. Problems in pre-trained models What does BERT understand? (BERTology) •

    Not good at sequence modeling with multiple actions • Not good at dealing with knowledge based on real-world experience Bias in pre-trained models • Bias from data • Bias from models 49 Sentiment analysis via GPT-3 by generating texts from “The {race} man/woman was very …” (Brown et al., 2020)
  49. Problems in language generation Evaluation of the generated texts •

    Human evaluation and automatic evaluation (Humans prefer fluent texts over adequate ones) • Meta-evaluation or evaluation methodology (Data leakage may occur) Treatment of the generated texts • Copyright (data source and output) • Ethical consideration (socio-political issue) 50
  50. Summary • Pre-trained models can solve the data acquisition bottleneck

    • Encoder-decoder models enable fluent language generation • Deep learning opens the possibility of integrating multiple modalities 51