Language Policy in Colonial Taiwan” • 2005.04-2010.03 Nara Institute of Science and Technology (D.Eng.) “Graph-Theoretic Approaches to Minimally-Supervised Natural Language Learning” • 2010.04-2013.03 Nara Institute of Science and Technology (Assistant Prof.) • 2013.04-present Tokyo Metropolitan University (Associate Prof. → Prof.) 2
text (input) and generate the target text (output) • The mainstream of (commercial) translation systems until the 1990s 5 Bernard Vauquoisʼ Pyramid (CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=683855)
corpus Noisy channel model 7 ê = argmax_e P(e | f) = argmax_e P(f | e) P(e) ① Translation model P(f | e) (TM), trained on a parallel corpus (source–target) ② Language model P(e) (LM), trained on a raw target corpus ③ Decoding by beam search ④ Optimization toward BLEU on an evaluation corpus (reference)
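To make the pipeline above concrete, here is a minimal Python sketch of noisy-channel scoring: a candidate translation e is ranked by log P(f | e) + log P(e). The names tm_logprob and lm_logprob are hypothetical stand-ins for a trained translation model and language model, and the exhaustive max over an enumerated candidate set stands in for beam search.

```python
def noisy_channel_score(f, e, tm_logprob, lm_logprob):
    """Score candidate translation e of source f as log P(f | e) + log P(e).

    tm_logprob and lm_logprob are hypothetical callables wrapping a trained
    translation model and a target-side language model."""
    return tm_logprob(f, e) + lm_logprob(e)


def decode(f, candidates, tm_logprob, lm_logprob):
    """argmax_e P(f | e) P(e) over an enumerated candidate set.

    A real decoder explores this search space approximately with beam search
    instead of enumerating all candidates."""
    return max(candidates,
               key=lambda e: noisy_channel_score(f, e, tm_logprob, lm_logprob))
```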
the statistical model 8 [Figure: MT performance vs. training data size; x-axis: data size (small → large), y-axis: performance (low → high)] Brants et al. Large Language Models in Machine Translation. EMNLP 2007. Performance increases linearly with data size on a log scale
each target task (e.g., machine translation, dialogue, summarization, …) • Generated texts are not fluent (easily recognizable as machine-generated text) • Not easy to integrate multiple modalities (speech, vision, …) 9
king ? • King Arthur is a legendary British leader of the late 5th and early 6th centuries • Edward VIII was King of the United Kingdom and the Dominions of the British Empire, … A word can be characterized by its surrounding words → (distributional hypothesis) “You shall know a word by the company it keeps.” (Firth, 1957) 11
it co-occurs with “empire”, how often it co-occurs with “man”, …… Similarity of words (vectors): Similarity = cos θ = (king · queen) / (|king| |queen|) Representing the meaning of a word by using a vector (vector space model) 12 Co-occurrence counts:
         empire   king    man  queen   rule  woman
empire        -    545    512    195    276    374
king        545      -   2330    689    799   1100
man         512   2330      -    915    593   2620
queen       195    689    915      -    448    708
rule        276    799    593    448      -   1170
woman       374   1100   2620    708   1170      -
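A minimal sketch of this vector space model using the co-occurrence counts from the table above (the diagonal entries were not shown on the slide and are set to 0 here as a simplifying assumption): each word is represented by its row of counts, and similarity is the cosine of the angle between two such vectors.

```python
import numpy as np

# Co-occurrence counts copied from the table above; diagonals assumed 0.
words = ["empire", "king", "man", "queen", "rule", "woman"]
C = np.array([
    [   0,  545,  512,  195,  276,  374],
    [ 545,    0, 2330,  689,  799, 1100],
    [ 512, 2330,    0,  915,  593, 2620],
    [ 195,  689,  915,    0,  448,  708],
    [ 276,  799,  593,  448,    0, 1170],
    [ 374, 1100, 2620,  708, 1170,    0],
])

def cos_sim(u, v):
    # cos(theta) = (u . v) / (|u| |v|)
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

king = C[words.index("king")]
queen = C[words.index("queen")]
print(cos_sim(king, queen))  # similarity of "king" and "queen" as count vectors
```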
(embeddings) by training a neural network for predicting their context → Predict surrounding words by the dot product of vectors → Negative examples are obtained by randomly replacing the context words (self-supervised learning) • She was the mother of Queen Elizabeth II . 13 (Mikolov et al., NeurIPS 2013)
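A minimal PyTorch sketch of skip-gram with negative sampling (not Mikolov et al.'s original implementation): a (word, context) pair is scored by the dot product of their vectors, and the model is trained to separate observed pairs from randomly sampled negatives.

```python
import torch
import torch.nn as nn

class SkipGramNS(nn.Module):
    """Skip-gram with negative sampling, reduced to its core idea:
    predict whether a (word, context) pair really co-occurred from the
    dot product of their embedding vectors."""
    def __init__(self, vocab_size, dim=100):
        super().__init__()
        self.word = nn.Embedding(vocab_size, dim)  # target-word vectors
        self.ctx = nn.Embedding(vocab_size, dim)   # context-word vectors

    def forward(self, word_ids, ctx_ids, labels):
        # labels: float tensor, 1.0 for observed pairs, 0.0 for random negatives
        score = (self.word(word_ids) * self.ctx(ctx_ids)).sum(dim=-1)
        return nn.functional.binary_cross_entropy_with_logits(score, labels)

# e.g. for "She was the mother of Queen Elizabeth II .", one positive pair is
# (Queen, mother); negatives replace "mother" with random vocabulary words.
```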
of word vectors considering the usage of English learners (t-SNE) • Red: word vectors trained on a native speaker corpus • Blue: word vectors trained on an English learner corpus 14 (Kaneko et al., IJCNLP 2017)
relationship between countries and their capital cities is preserved in the vector space learned by word2vec (Mikolov et al., 2013) → Algebraic operations such as “king ‒ man + woman = queen” can be performed 15
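A sketch of that analogy operation, assuming a dict `vectors` of word2vec-style embeddings has been loaded elsewhere (the name and loading step are not from the slides): the nearest word by cosine similarity to vec(king) − vec(man) + vec(woman), excluding the query words, is expected to be "queen".

```python
import numpy as np

def analogy(vectors, a, b, c, topn=1):
    """Return the word(s) closest to vec(a) - vec(b) + vec(c) by cosine
    similarity, excluding the query words themselves."""
    target = vectors[a] - vectors[b] + vectors[c]

    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    ranked = sorted(
        (w for w in vectors if w not in {a, b, c}),
        key=lambda w: cos(vectors[w], target),
        reverse=True,
    )
    return ranked[:topn]

# analogy(vectors, "king", "man", "woman")  # expected to return ["queen"]
```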
• Baroni and Zamparelli. 2010. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. EMNLP. Transitive verb → Tensor? • Socher et al. 2013. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. EMNLP. 16 Baroni and Zamparelli (2010)
sentence considering its context? BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2019) → Self-supervised learning of contextualized word vectors → Modeling context with Transformer layers 17
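As an illustration (not code from the slides), a few lines with the Hugging Face transformers library show what "contextualized" means in practice: the same surface form receives a different vector in each sentence, because BERT conditions every token on its full bidirectional context.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# "bank" gets a different contextual vector in each of these sentences.
sentences = [
    "The river bank was muddy .",
    "She went to the bank to deposit money .",
]
batch = tokenizer(sentences, return_tensors="pt", padding=True)
with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, tokens, 768)
```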
and invoices Learn a vector that takes the relative position within the document image into account • Majumder et al. 2020. Representation Learning for Information Extraction from Form-like Documents. ACL. 18 https://ai.googleblog.com/2020/06/extracting-structured-data-from.html
table format Pre-training can be applied to table-like datasets • Herzig et al. 2020. TaPas: Weakly Supervised Table Parsing via Pre-training. ACL. 20 https://ai.googleblog.com/2020/04/using-neural-networks-to-find-answers.html
Liu et al. 2020. FinBERT: A Pre-trained Financial Language Representation Model for Financial Text Mining. IJCAI. • Chalkidis et al. 2020. LEGAL-BERT: The Muppets straight out of Law School. EMNLP. 21 ↑ LEGAL-BERT ← FinBERT
Coenen et al. 2019. Visualizing and Measuring the Geometry of BERT. NeurIPS. Cross-lingual grammar • Chi et al. 2020. Finding Universal Grammatical Relations in Multilingual BERT. ACL. 22
each target task (e.g., machine translation, dialogue, summarization, …) → Self-supervised learning can be used in pre-training • Generated texts are not fluent (easily recognizable as machine-generated text) • Not easy to integrate multiple modalities (speech, vision, …) 23
networks • Sutskever et al. 2014. Sequence to Sequence Learning with Neural Networks. NeurIPS. 26 [Figure: encoder-decoder translating the Japanese input 「深層学習 マジ やばい」 into “DL is really cool”] Encode a sentence vector from source word vectors
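A minimal PyTorch sketch in the spirit of Sutskever et al. (2014), with hypothetical vocabulary sizes and a GRU in place of the original LSTM: the encoder compresses the source sentence into a single vector, which initializes the decoder that emits target words until </s>.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder: one sentence vector carries the whole source."""
    def __init__(self, src_vocab, tgt_vocab, dim=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encode the source into a single sentence vector (last hidden state).
        _, sentence_vec = self.encoder(self.src_emb(src_ids))   # (1, batch, dim)
        # Decode the target sequence conditioned only on that vector.
        dec_states, _ = self.decoder(self.tgt_emb(tgt_ids), sentence_vec)
        return self.out(dec_states)  # logits over the target vocabulary
```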
Weighted sum of hidden source vectors • Bahdanau et al. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. ICLR. 27 [Figure: attention-based encoder-decoder translating 「深層学習 マジ やばい」 into “DL is really cool”] Using a combination of context (word) vectors instead of a single sentence vector
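A sketch of that attention step, simplified to dot-product scoring (Bahdanau et al. actually use an additive MLP score): each source word vector is weighted by its relevance to the current decoder state, and the weighted sum serves as the context vector for the next output word.

```python
import torch
import torch.nn.functional as F

def dot_product_attention(decoder_state, encoder_states):
    """Weighted sum of encoder (source word) vectors.

    decoder_state:  (batch, dim)           current decoder hidden state
    encoder_states: (batch, src_len, dim)  one vector per source word
    """
    # Relevance score of each source word to the decoder state.
    scores = torch.bmm(encoder_states, decoder_state.unsqueeze(-1)).squeeze(-1)
    weights = F.softmax(scores, dim=-1)            # attention distribution over source words
    # Context vector = attention-weighted sum of the source word vectors.
    context = torch.bmm(weights.unsqueeze(1), encoder_states).squeeze(1)
    return context, weights
```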
a parallel corpus of the target language • Johnson et al. 2017. Googleʼs Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. TACL. 30 Visualization of sentence vectors with the same meaning (Colors represent the meaning of the sentences)
Zaremba and Sutskever. 2015. Learning to Execute. → Learns an interpreter of a Python-like language • Vinyals et al. 2015. Show and Tell: A Neural Image Caption Generator. → Generates text from an image 31
Parikh et al. 2020. ToTTo: A Controlled Table-to-Text Generation Dataset. EMNLP. → Dataset for generating texts from tables 32 https://ai.googleblog.com/2021/01/totto-controlled-table-to-text.html
help of massive models Transformers are Turing complete • Yun et al. 2020. Are Transformers Universal Approximators of Sequence-to-sequence Functions? ICLR. 34 (Brown et al., 2020)
each target task (e.g., machine translation, dialogue, summarization, …) → Self-supervised learning can be used in pre-training • Generated texts are not fluent (easily recognizable as machine-generated text) → Encoder-decoder models boost the fluency of text generation • Not easy to integrate multiple modalities (speech, vision, …) 36
regions and assign categories (semantic segmentation) • Girshick et al. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR. 39
object detection Object detection improved by attending to region proposals from a CNN • Ren et al. 2016. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. PAMI. 40
Separate models for vision and language • Xu et al. 2015. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML. 42 Visualization of the attention during generation of each word in image captioning (top: soft-attention, bottom: hard-attention)
images just like texts • Ramesh et al. 2021. Zero-Shot Text-to-Image Generation. 45 an illustration of a baby daikon radish in a tutu walking a dog (https://openai.com/blog/dall-e/)
each target task (e.g., machine translation, dialogue, summarization, …) → Self-supervised learning can be used in pre-training • Generated texts are not fluent (easily recognizable as machine-generated text) → Encoder-decoder models boost the fluency of text generation • Not easy to integrate multiple modalities (speech, vision, …) → Transformers can unify the input/output of any modality 47
Not good at sequence modeling with multiple actions • Not good at dealing with knowledge based on real-world experience Bias in pre-trained models • Bias from data • Bias from models 49 Sentiment analysis via GPT-3 by generating texts from “The {race} man/woman was very …” (Brown et al., 2020)
Human evaluation and automatic evaluation (Humans prefer fluent texts over adequate ones) • Meta-evaluation or evaluation methodology (Data leakage may occur) Treatment of the generated texts • Copyright (data source and output) • Ethical consideration (socio-political issue) 50