Donut + SegFormer ~ Enhancing Donut for position prediction

by marisakamozz

Slide 1

Slide 1 text

Donut + SegFormer: Enhancing Donut for position prediction

Slide 2

Slide 2 text

MORI Masakazu
● Joined as an ML Engineer in 2022-03
● Kaggle Competitions Master
● https://kaggle.com/marisakamozz

Slide 3

Slide 3 text

AGENDA
● Introduction
● Proposed Method
● Experiment Settings
● Experiment Results
● Conclusion
● Appendix

Slide 4

Slide 4 text

Introduction

Slide 5

Slide 5 text

What is AI-OCR?

AI-OCR is the task of extracting information, such as the billing amount, from documents like invoices. Because these documents come in many different layouts, the information cannot be read with a fixed set of rules.

The above image is cited from “DocILE Benchmark for Document Information Localization and Extraction” authored by Štěpán Šimsa et al. on 3 May 2023. https://arxiv.org/abs/2302.05658 (accessed on 23 Aug 2024)

Slide 6

Slide 6 text

Representative Existing Methods
● OCR → Named Entity Recognition (NER)
  ○ Convert the document to text using OCR
  ○ Identify the target phrases within the text (NER)
● Object Detection → OCR
  ○ Identify the position in the document where the target phrases are written (Object Detection)
  ○ Convert the text at that position using OCR
● End-to-End
  ○ Achieve this with a single model

Slide 7

Slide 7 text

Donut: Document Understanding Transformer
● An end-to-end model proposed by NAVER Corporation in 2021
● Performs image processing and OCR seamlessly in a single model

The above image is cited from “OCR-free Document Understanding Transformer” authored by Geewook Kim et al. on 6 Oct 2022. https://arxiv.org/abs/2111.15664 (accessed on 23 Aug 2024)

Slide 8

Slide 8 text

Issues with Donut

Ground Truth: “1,000.00”
Which one is the correct “1,000.00”?

During training, Donut is given only the text to be read, not the position where that text appears. Therefore, if the same text appears multiple times within a document, Donut cannot learn which position it should read from.
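The ambiguity described above can be illustrated with a small sketch. The `Word` type, the box coordinates, and the invoice contents are all hypothetical and are not taken from the slides; the point is only that a text-only label cannot distinguish between two identical strings.

```python
from typing import NamedTuple


class Word(NamedTuple):
    """A hypothetical OCR word: its text and pixel box (left, top, right, bottom)."""
    text: str
    box: tuple[int, int, int, int]


def find_candidates(words: list[Word], target: str) -> list[Word]:
    """Return every word whose text equals the ground-truth string."""
    return [w for w in words if w.text == target]


# Hypothetical invoice in which the same amount is printed twice.
words = [
    Word("Subtotal", (40, 500, 120, 520)),
    Word("1,000.00", (400, 500, 470, 520)),
    Word("Total", (40, 560, 90, 580)),
    Word("1,000.00", (400, 560, 470, 580)),
]

candidates = find_candidates(words, "1,000.00")
# Two positions match the same label, so the text alone does not say
# which one the model was supposed to read.
print(len(candidates))
for w in candidates:
    print(w.box)
```

With a text-only training target, both candidates are equally consistent with the label; a position signal is what would tell them apart.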

Slide 9

Slide 9 text

Hypothesis

Is it possible to improve Donut by utilizing position information?

Slide 10

Slide 10 text

Proposed Method

Slide 11

Slide 11 text

Donut + SegFormer

[Architecture diagram. Components: 🍩 Donut Encoder, 🍩 Donut Decoder, SegFormer Decoder. Inputs: Image, Prompt. Outputs: Extracted Text (example: “2024-09-20”), Segmentation Map.]
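To make the Segmentation Map output more concrete, here is a minimal sketch of how a binary mask could be reduced to a bounding box. This post-processing step is an assumption for illustration only; the slides do not describe how the segmentation map is consumed, and `mask_to_box` is a hypothetical helper.

```python
from typing import Optional


def mask_to_box(mask: list[list[int]]) -> Optional[tuple[int, int, int, int]]:
    """Return the inclusive (left, top, right, bottom) box around nonzero pixels.

    Returns None when the mask contains no foreground pixel.
    """
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [x for row in mask for x, value in enumerate(row) if value]
    return min(cols), rows[0], max(cols), rows[-1]


# Hypothetical 4x5 segmentation map with a 2x2 predicted region.
mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
box = mask_to_box(mask)
print(box)
```

A real segmentation map would first need thresholding and possibly resizing to the input image resolution; both are omitted here.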

Slide 12

Slide 12 text

What is SegFormer?

“SegFormer is a model for semantic segmentation introduced by Xie et al. in 2021. It has a hierarchical Transformer encoder that doesn't use positional encodings (in contrast to ViT) and a simple multi-layer perceptron decoder. SegFormer achieves state-of-the-art performance on multiple common datasets.”

The above text is cited from “Fine-Tune a Semantic Segmentation Model with a Custom Dataset” authored by Tobias Cornille and Niels Rogge.

Slide 13

Slide 13 text

SegFormer Decoder
● Only the decoder is used
● 2 MLP layers

The above image is cited from “SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers” authored by Enze Xie et al. on 28 Oct 2021. https://arxiv.org/abs/2105.15203 (accessed on 23 Aug 2024)

Slide 14

Slide 14 text

Experiment Settings

Slide 15

Slide 15 text

Dataset Used in This Experiment
● DocILE: Document Information Localization and Extraction
● https://docile.rossum.ai/
● The dataset used in a competition held at the ICDAR 2023 conference
● Includes images, the text contained in those images, and the positions of that text

The above image is cited from “DocILE Benchmark for Document Information Localization and Extraction” authored by Štěpán Šimsa et al. on 3 May 2023. https://arxiv.org/abs/2302.05658 (accessed on 23 Aug 2024)

Slide 16

Slide 16 text

Preprocessing

Only the 8 most frequently appearing field types were used.
● Training Dataset: 5,181 images
● Evaluation Dataset: 501 images

[Chart: Top 8 Field Types]
A single document may contain multiple fields, so the field counts in the chart exceed the number of training documents (5,181).
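The field-type filtering described above could be sketched as follows. The document representation (a list of `(field_type, text)` pairs per document), the helper names, and the sample data are assumptions for illustration; the slides do not show the actual preprocessing code. The example uses `k=2` only to keep the sample small, whereas the experiment kept the top 8.

```python
from collections import Counter

# One document is modeled as a list of (field_type, text) pairs (an assumption).
Document = list[tuple[str, str]]


def top_field_types(documents: list[Document], k: int = 8) -> list[str]:
    """Return the k field types that occur most often across all documents."""
    counts = Counter(field_type for doc in documents for field_type, _ in doc)
    return [name for name, _ in counts.most_common(k)]


def filter_fields(documents: list[Document], keep: list[str]) -> list[Document]:
    """Drop every field whose type is not in `keep`."""
    allowed = set(keep)
    return [[(t, v) for t, v in doc if t in allowed] for doc in documents]


# Hypothetical sample data; "tax" is an invented field type.
docs = [
    [("vendor_name", "ACME"), ("amount_due", "1,000.00"), ("tax", "80.00")],
    [("vendor_name", "Foo"), ("amount_due", "50.00")],
    [("vendor_name", "Bar"), ("date_issue", "2024-09-20")],
]

kept_types = top_field_types(docs, k=2)
filtered = filter_fields(docs, kept_types)
print(kept_types)
print(filtered[0])
```

Because one document contributes several fields, the per-type counts can sum to more than the number of documents, which matches the note on the chart.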

Slide 17

Slide 17 text

Training Configurations
● No image augmentation
● Optimizer:
  ○ AdamW(lr=1e-5)
● Scheduler:
  ○ ReduceLROnPlateau(patience=5, factor=0.2)
● Batch size: 1
● Early stopping:
  ○ 5% (= 259 images) of the training dataset was used for validation
  ○ Validate every 1000 steps
  ○ patience=10
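The early-stopping settings above can be sketched in plain Python. The `EarlyStopping` class is a hypothetical helper, and monitoring validation loss (lower is better) is an assumption; the slides state only the validation share, the evaluation interval, and `patience=10`.

```python
class EarlyStopping:
    """Stop training after `patience` consecutive evaluations without improvement."""

    def __init__(self, patience: int = 10) -> None:
        self.patience = patience
        self.best = float("inf")
        self.bad_evals = 0

    def update(self, val_loss: float) -> bool:
        """Record one validation result; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


# 5% of the 5,181 training images, rounded down, gives the 259 validation images.
n_train_total = 5181
n_val = n_train_total * 5 // 100
print(n_val)

# Hypothetical validation losses, one per evaluation (every 1000 steps in the slides).
# patience=3 is used here only to keep the demonstration short.
stopper = EarlyStopping(patience=3)
losses = [1.0, 0.9, 0.95, 0.92, 0.91, 0.8]
stopped_at = None
for i, loss in enumerate(losses):
    if stopper.update(loss):
        stopped_at = i
        break
print(stopped_at)
```

With evaluation every 1000 steps and `patience=10`, training under this scheme would end after 10,000 steps without a new best validation result.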

Slide 18

Slide 18 text

Comparison Target
● Existing Method
  ○ Donut
  ○ Uses the pre-trained model available on the Hugging Face Hub (naver-clova-ix/donut-base)
● Proposed Method
  ○ Donut + SegFormer
  ○ The Donut Encoder and Donut Decoder use the pre-trained model from the Hugging Face Hub (naver-clova-ix/donut-base)
  ○ The SegFormer Decoder is trained from scratch

Slide 19

Slide 19 text

Machine Used
● AWS EC2 instance
  ○ g4dn.xlarge
  ○ 4 vCPU, 16GB memory
  ○ NVIDIA T4
● Time to train models
  ○ Existing method (Donut)
    ■ Approx. 7.5 hours
  ○ Proposed method (Donut + SegFormer)
    ■ Approx. 8.5 hours

Slide 20

Slide 20 text

Experiment Results

Slide 21

Slide 21 text

Experiment Result Summary (All Fields)

The proposed method increased the F1 score. The improvement was particularly large when evaluated based on string similarity.

similarity = 1 - normalized Levenshtein distance
Higher is better.
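The similarity measure defined above can be written out as a short sketch. Normalizing the Levenshtein distance by the length of the longer string is an assumption, since the slides do not state which normalization was used, and they also do not specify how the similarity is turned into an F1 score.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(pred: str, truth: str) -> float:
    """similarity = 1 - normalized Levenshtein distance (normalized by max length)."""
    longest = max(len(pred), len(truth))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(pred, truth) / longest


print(similarity("1,000.00", "1,000.00"))
print(similarity("1,000.80", "1,000.00"))
```

Under this definition a prediction with one wrong character out of eight still receives most of the credit, whereas exact-match scoring would count it as a miss.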

Slide 22

Slide 22 text

Experiment Result For Each Field (F1 score)

Donut achieves higher F1 scores for frequently appearing fields such as vendor names and for long strings such as addresses.

Slide 23

Slide 23 text

Experiment Result For Each Field (Similarity Based F1 score)

When evaluated based on string similarity, the proposed method achieved scores that were comparable to or much higher than those of the existing method for almost all fields.

Slide 24

Slide 24 text

Donut + SegFormer Position Predictions
[Figures: vendor_name, vendor_address]

Slide 25

Slide 25 text

Donut + SegFormer Position Predictions
[Figures: date_issue, customer_billing_name]

Slide 26

Slide 26 text

Donut + SegFormer Position Predictions
[Figures: document_id, amount_due]

Slide 27

Slide 27 text

Donut + SegFormer Position Predictions
[Figures: amount_total_gross, customer_billing_address]

Slide 28

Slide 28 text

Conclusion

Slide 29

Slide 29 text

Conclusion

Adding the SegFormer Decoder to Donut was confirmed to make effective use of position information.

The proposed method not only makes it possible to indicate where the text is located but also improves accuracy.

On the other hand, it was found to be less effective when the location is easy to identify or when dealing with long texts.

The proposed method is patent-pending.

Slide 30

Slide 30 text

Appendix

Slide 31

Slide 31 text

Appendix: Citation

Donut
Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, Seunghyun Park (2021). OCR-free Document Understanding Transformer. https://arxiv.org/abs/2111.15664

SegFormer
Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo (2021). SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. https://arxiv.org/abs/2105.15203

DocILE
Štěpán Šimsa, Milan Šulc, Michal Uřičář, Yash Patel, Ahmed Hamdi, Matěj Kocián, Matyáš Skalický, Jiří Matas, Antoine Doucet, Mickaël Coustaty, Dimosthenis Karatzas (2023). DocILE Benchmark for Document Information Localization and Extraction. https://arxiv.org/abs/2302.05658
