March: M.Eng. (Information Engineering), Nagoya Institute of Technology
- Researched knowledge-grounded dialogue generation in the Li Lab
- 2nd place worldwide in DSTC7, an international dialogue-system competition
April 2020: Joined NTT
- Started R&D on Vision & Language machine reading comprehension
- Paper accepted at AAAI21; best paper award at NLP21; 2nd place worldwide in the InfographicVQA competition
Input: question text and document image. Output: answer. Preprocessing: document layout analysis and OCR.
[Figure: example document image, a news article "2007 Ig Nobel Prize winners announced", together with its OCR text.]
4. Visual machine reading: acquiring document representations and question answering

Pipeline (a minimal code sketch of these stages appears below):
1. Document layout analysis
2. OCR
3. Reading-order detection and reordering
4. Visual machine reading: given the image, OCR text, layout, etc., answer the question

Some models skip the preprocessing stages, or perform only part of them.
[Figure: the "2007 Ig Nobel Prize winners announced" example document shown at each pipeline stage.]
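To make the stages concrete, here is a minimal sketch of such a pipeline. It is an illustration only: `TextRegion` and the four stage functions passed in as arguments are hypothetical placeholders, not the API of any cited system.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class TextRegion:
    text: str                         # OCR text of this region
    bbox: Tuple[int, int, int, int]   # (x0, y0, x1, y1) position on the page
    label: str                        # region type, e.g. "title", "paragraph"

def visual_reading_pipeline(
    image,
    question: str,
    analyze_layout: Callable,    # stage 1: image -> List[TextRegion]
    run_ocr: Callable,           # stage 2: (image, bbox) -> str
    order_by_reading: Callable,  # stage 3: List[TextRegion] -> List[TextRegion]
    answer_question: Callable,   # stage 4: (image, regions, question) -> str
) -> str:
    # 1. Document layout analysis: detect text regions and their types.
    regions: List[TextRegion] = analyze_layout(image)
    # 2. OCR: recognize the text inside each detected region.
    for region in regions:
        region.text = run_ocr(image, region.bbox)
    # 3. Reading-order detection: reorder regions into natural reading order.
    regions = order_by_reading(regions)
    # 4. Visual machine reading: answer from the image, OCR text, and layout.
    return answer_question(image, regions, question)
```

OCR-free models such as Donut (see the references) fold stages 1-3 into the model itself, which is why some pipelines skip or only partially perform them.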
Word-level language understanding is sufficient. TextVQA examples (https://arxiv.org/abs/1904.08920):
- What does the white sign say? Answer: Tokyo Station
- What is the top oz? Answer: 16
- What edition is this? Answer: embossed
Most of the target documents are old, from around the 1960s. DocVQA examples (https://arxiv.org/abs/2007.00398):
- Mention the ZIP code written? Answer: 80202
- What is the date given at the top left? Answer: 03/17/98
- What is the Extension Number as per the voucher? Answer: (910) 741-0673
- Answers in a variety of styles, including arithmetic, are required:
  - Single span: How many females are affected by diabetes? Answer: 3.6%
  - Multi-span: Which all are the benefits of investing in real estate? Answer: tax, tangibility, cash returns
  - Number (non-span): What percentage of recruiters do "not" react negatively to poor spellings and punctuation errors? Answer: 35% (= 100 − 65)
Arithmetic is required:
Q: How many females are affected by diabetes? A: 3.6%
Q: What percentage of cases can not be prevented? A: 40%
Q: How many of the 3 patients do not use social media to seek out health information?
Predicted answers: BERT: 1; LayoutLM: 3; proposed method: 2 (correct).
Arithmetic process predicted by the proposed model: 3 − 1 = 2.
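A minimal sketch of this answer style, assuming the model generates an arithmetic expression over numbers read from the document and a small evaluator executes it. The evaluator and the example expressions are illustrative, not the actual proposed implementation.

```python
import ast
import operator

# Safe evaluator for the tiny arithmetic language (+ - * / over numbers)
# that a model might emit as its predicted derivation.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def eval_arithmetic(expr: str) -> float:
    """Evaluate a model-predicted expression such as '3 - 1' or '100 - 65'."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError(f"unsupported expression: {expr!r}")
    return walk(ast.parse(expr, mode="eval"))

# The point is that the model predicts the derivation, not just a number:
print(eval_arithmetic("3 - 1"))     # -> 2   (patients not using social media)
print(eval_arithmetic("100 - 65"))  # -> 35  (percentage of recruiters)
```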
Trends in visual machine reading models and datasets
- Models: how layout features are fed in, serialization of image features, use of pre-training
- Datasets: contextual understanding of text in images, joint understanding with visual objects, diversification of answer styles, multi-document settings

Future directions:
- Visual machine reading models that need no preprocessing such as OCR (see the sketch below)
- Visual machine reading models that can handle multiple languages
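As one concrete instance of the OCR-free direction, the sketch below queries a document image with Donut through Hugging Face Transformers. It assumes the publicly released naver-clova-ix/donut-base-finetuned-docvqa checkpoint, its prompt format, and a local file document.png; treat the details as indicative rather than authoritative.

```python
from transformers import DonutProcessor, VisionEncoderDecoderModel
from PIL import Image

# OCR-free visual reading: one image-to-text model, with no layout
# analysis or OCR stage in front of it.
processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")

image = Image.open("document.png").convert("RGB")
question = "What is the date given at the top left?"

# Donut encodes the task and the question as a decoder prompt.
prompt = f"<s_docvqa><s_question>{question}</s_question><s_answer>"
pixel_values = processor(image, return_tensors="pt").pixel_values
decoder_input_ids = processor.tokenizer(
    prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=model.decoder.config.max_position_embeddings,
)
print(processor.token2json(processor.batch_decode(outputs)[0]))
```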
- Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou, "LayoutLM: Pre-training of Text and Layout for Document Image Understanding", in KDD20
- Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, "LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding", in ACL21
- Ryota Tanaka, Kyosuke Nishida, Shuichi Nishioka, "VisualMRC: Machine Reading Comprehension on Document Images", in AAAI21
- Teakgyu Hong, Donghyun Kim, Mingi Ji, Wonseok Hwang, Daehyun Nam, Sungrae Park, "BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents", in AAAI22
- Chenliang Li, Bin Bi, Ming Yan, Wei Wang, Songfang Huang, Fei Huang, Luo Si, "StructuralLM: Structural Pre-training for Form Understanding", in ACL21
- Geewook Kim, Teakgyu Hong, Moonbin Yim, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, Seunghyun Park, "Donut: Document Understanding Transformer without OCR", in arXiv:2111.15664
- Srikar Appalaraju, Bhavan Jasani, Bhargava Urala Kota, Yusheng Xie, R. Manmatha, "DocFormer: End-to-End Transformer for Document Understanding", in ICCV21
- Rafał Powalski, Łukasz Borchmann, Dawid Jurkiewicz, Tomasz Dwojak, Michał Pietruszka, Gabriela Pałka, "Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer", in ICDAR21
- Xiang Deng, Prashant Shiralkar, Colin Lockard, Binxuan Huang, Huan Sun, "DOM-LM: Learning Generalizable Representations for HTML Documents", in arXiv:2201.10608
- Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh, "VQA: Visual Question Answering", in ICCV15
- Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Devi Parikh, Marcus Rohrbach, "Towards VQA Models That Can Read", in CVPR19
- Minesh Mathew, Dimosthenis Karatzas, C.V. Jawahar, "DocVQA: A Dataset for VQA on Document Images", in WACV21
- Ryota Tanaka, Kyosuke Nishida, Shuichi Nishioka, "VisualMRC: Machine Reading Comprehension on Document Images", in AAAI21
- Minesh Mathew, Viraj Bagal, Rubèn Pérez Tito, Dimosthenis Karatzas, Ernest Valveny, C.V. Jawahar, "InfographicVQA", in WACV22
- Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, "Document Collection Visual Question Answering", in arXiv:2104.14336
- Xingyu Chen, Zihan Zhao, Lu Chen, Jiabao Ji, Danyang Zhang, Ao Luo, Yuxuan Xiong, Kai Yu, "WebSRC: A Dataset for Web-Based Structural Reading Comprehension", in EMNLP21
- Zilong Wang, Yiheng Xu, Lei Cui, Jingbo Shang, Furu Wei, "LayoutReader: Pre-training of Text and Layout for Reading Order Detection", in EMNLP21
- Łukasz Borchmann, Michał Pietruszka, Tomasz Stanislawek, Dawid Jurkiewicz, Michał Turski, Karolina Szyndler, Filip Graliński, "DUE: End-to-End Document Understanding Benchmark", in NeurIPS21 datasets track
- Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, Amanpreet Singh, "TextCaps: a Dataset for Image Captioning with Reading Comprehension", in ECCV20
- Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", in arXiv:1506.01497
- Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo, "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows", in ICCV21
- Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby, "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", in ICLR21
- Carlos Soto, Shinjae Yoo, "Visual Detection with Context for Document Layout Analysis", in EMNLP19
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", in NAACL19
- Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, Omer Levy, "SpanBERT: Improving Pre-training by Representing and Predicting Spans", in TACL
- Xu Zhong, Jianbin Tang, Antonio Jimeno Yepes, "PubLayNet: largest dataset ever for document layout analysis", in ICDAR19