Enhancing Large Vision-Language Models with Layout Modality for Table Question Answering on Japanese Annual Securities Reports

IIAI AAI / CDEF
2025/07/15
Paper: https://arxiv.org/abs/2505.17625

eida
July 14, 2025

Transcript

  1. Enhancing Large Vision-Language Models with Layout Modality for Table Question Answering on Japanese Annual Securities Reports
     Hayato Aida, Kosuke Takahashi, Takahiro Omi (Stockmark, Japan), IIAI AAI / CDEF 2025
  2. Motivation
     • Financial tables appear in diverse formats (PDFs, images, etc.)
     • This poses a major challenge for practical RAG systems
     • LVLMs offer a format-agnostic approach by treating any table as an image
     • However, their accuracy remains a critical hurdle for real-world use
     • Goal: Enhance LVLM table understanding by adding modalities to complement the image
     [Figure: clean HTML structure (given in the NTCIR-18 U4 Table QA dataset) vs. the real-world input format (our challenge). HTML example: <table><tr><td colspan="2">回次</td><td>第65期</td><td>第66期</td><td>第67期</td><td>第68期</td><td>第69期</td></tr><tr><td colspan="2">決算年月</td><td>2016年1月</td><td>2017年1月</td><td>2018年1月</td><td>2019年1月</td>…</table>]
  3. Representation of Table Structures
     We focused on PDF tables and used images, text, and layout to tackle Table QA.
     [Figure: spectrum of table representations, from highly structured formats (HTML, JSON, Markdown, e.g. <table><tr><td colspan="2">回次</td><td>第65期</td><td>第66期</td><td>第67期</td>…</table>) to low-structure / unstructured formats (image, or PDF/image + OCR), where most business documents used for RAG fall.]
  4. Our Key Contributions
     1. Data: Proposed TableCellQA, a new benchmark for evaluating pure table structure understanding, derived from the NTCIR-18 U4 task.
     2. Model: Enhanced LVLMs by integrating Text (T) and Layout (L) modalities with the image.
  5. Task Definition: TableCellQA
     Original Task (NTCIR-18 U4): included complex reasoning like unit conversion.
     Our Goal: create a task that purely evaluates a model's ability to understand table structure, separate from its reasoning skills.
     Our Solution (TableCellQA): the answer is always the raw value of a single cell, with no calculations needed.
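     To make the task format concrete, below is an illustrative TableCellQA-style instance written as a Python dict. The field names and sample values are hypothetical, not taken from the dataset; what mirrors the task definition is that the gold answer is the raw value of a single cell.

        # Illustrative TableCellQA-style instance (field names and values are
        # hypothetical). The gold answer is the raw value of one cell; no unit
        # conversion or other calculation is required.
        example = {
            "question": "What was net sales for the 67th fiscal term (第67期)?",
            "table_image": "securities_report_page.png",  # hypothetical file name
            "answer": "1,234,567",                        # raw cell value, verbatim
        }

        # Because answers are verbatim cell values, a plain string comparison is
        # enough to illustrate scoring (the paper reports Acc. and ANLS).
        def exact_match(prediction: str, gold: str) -> bool:
            return prediction.strip() == gold.strip()

        print(exact_match("1,234,567", example["answer"]))  # True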
  6. Layout and Text Modality
     We use two additional modalities: Text and Layout.
     Layout: the bounding box coordinates for each text block.
     Benefit: avoids complex table structure reconstruction.
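     The slides do not name a specific extraction tool, so the following is a minimal sketch of how text blocks and their bounding boxes could be pulled from a PDF page; PyMuPDF and the [0, 1] coordinate normalization are assumptions made here for illustration.

        # Minimal sketch: extract text spans and bounding boxes from a PDF page.
        # The tool (PyMuPDF) and the normalization are assumptions; the slide only
        # states that each text block is paired with its bounding box coordinates.
        import fitz  # PyMuPDF

        doc = fitz.open("annual_securities_report.pdf")  # hypothetical file name
        page = doc[0]
        width, height = page.rect.width, page.rect.height

        blocks = []
        for x0, y0, x1, y1, text, *_ in page.get_text("words"):
            blocks.append({
                "text": text,
                # 4-dimensional layout feature: (x0, y0, x1, y1) scaled to [0, 1].
                "bbox": (x0 / width, y0 / height, x1 / width, y1 / height),
            })

        # Each entry pairs a text span with its coordinates, so no explicit
        # reconstruction of rows, columns, or cell spans is needed.
        print(blocks[:3])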
  7. Layout-Aware LVLM Architecture
     • Layout coordinates are encoded into features via an MLP.
     • Each text token is fused with its corresponding layout feature.
     • These text-layout pairs and image features form the final input for the LLM.
     [Architecture diagram: image, text, and layout inputs are fused and fed to the LLM, which outputs the answer.]
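     Below is a minimal PyTorch sketch of this fusion. The toy hidden size, the additive fusion of text and layout features, and the module names are assumptions for illustration; the actual model is built on LLaVA-OneVision-7B, not this standalone module.

        # Minimal PyTorch sketch of the layout-aware fusion described above.
        # Sizes, additive fusion, and names are illustrative assumptions.
        import torch
        import torch.nn as nn

        class LayoutAwareFusion(nn.Module):
            def __init__(self, vocab_size=32000, hidden=512):
                super().__init__()
                self.text_emb = nn.Embedding(vocab_size, hidden)   # text tokens
                self.layout_mlp = nn.Sequential(                   # bbox -> feature
                    nn.Linear(4, hidden), nn.GELU(), nn.Linear(hidden, hidden)
                )

            def forward(self, token_ids, bboxes, image_feats):
                # token_ids:   (B, T)         text token ids
                # bboxes:      (B, T, 4)      normalized bbox per text token
                # image_feats: (B, V, hidden) features from the vision encoder
                text = self.text_emb(token_ids)      # (B, T, hidden)
                layout = self.layout_mlp(bboxes)     # (B, T, hidden)
                fused = text + layout                # fuse each token with its layout
                # Image features plus text-layout tokens form the LLM input sequence.
                return torch.cat([image_feats, fused], dim=1)

        # Shape check with dummy tensors.
        fusion = LayoutAwareFusion()
        out = fusion(torch.randint(0, 32000, (1, 8)),
                     torch.rand(1, 8, 4),
                     torch.randn(1, 16, 512))
        print(out.shape)  # torch.Size([1, 24, 512])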
  8. Experiments
     1. General Setup
        • Base model: LLaVA-OneVision-7B
        • Fine-tuning: all models were fine-tuned on the TableCellQA dataset for 2 epochs with a learning rate of 1e-5; full-parameter fine-tuning was applied.
     2. Ablation Study Conditions
        • To analyze the impact of each modality, we compared the following 5 settings:
          L+T+I  training with Layout, Text, and Image
          L+T    training with Layout and Text
          T+I    training with Text and Image
          L+I    training with Layout and Image
          I      training with Image only
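     The same setup, written out as a plain configuration for reference; only values stated on this slide are included, and the dict layout itself is just illustrative.

        # Fine-tuning setup from this slide as a plain config. Anything not stated
        # here (batch size, optimizer, prompts, etc.) is left unspecified.
        FINETUNE_CONFIG = {
            "base_model": "LLaVA-OneVision-7B",
            "dataset": "TableCellQA",
            "epochs": 2,
            "learning_rate": 1e-5,
            "full_parameter_finetuning": True,
        }

        # Ablation settings: which modalities the model is trained with.
        ABLATION_SETTINGS = {
            "L+T+I": ["layout", "text", "image"],
            "L+T":   ["layout", "text"],
            "T+I":   ["text", "image"],
            "L+I":   ["layout", "image"],
            "I":     ["image"],
        }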
  9. Results
     • Top performance was achieved by the L+T+I and L+T settings.
     • The hierarchy of importance for accuracy is: Text > Layout > Image.
     • Task-specific fine-tuning is crucial, as all our models outperform zero-shot SOTA baselines.
     [Bar chart: ANLS and Acc. for GPT-4o (I), Qwen2.5-VL-72B (I), and our fine-tuned I, L+I, T+I, L+T, and L+T+I settings; both metrics are lowest for the zero-shot baselines (roughly 0.58-0.76) and highest for L+T and L+T+I (roughly 0.95-0.97).]
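     For reference, ANLS (Average Normalized Levenshtein Similarity) gives partial credit for near-match strings on a 0-1 scale. The sketch below assumes the common formulation with a 0.5 threshold; the exact variant used in the paper is not specified on this slide.

        # Minimal ANLS sketch. The 0.5 threshold follows the common formulation;
        # the paper's exact variant may differ.
        def levenshtein(a: str, b: str) -> int:
            prev = list(range(len(b) + 1))
            for i, ca in enumerate(a, 1):
                cur = [i]
                for j, cb in enumerate(b, 1):
                    cur.append(min(prev[j] + 1,                  # deletion
                                   cur[j - 1] + 1,               # insertion
                                   prev[j - 1] + (ca != cb)))    # substitution
                prev = cur
            return prev[-1]

        def anls(predictions, golds, threshold=0.5):
            scores = []
            for pred, gold in zip(predictions, golds):
                nl = levenshtein(pred, gold) / max(len(pred), len(gold), 1)
                scores.append(1 - nl if nl < threshold else 0.0)
            return sum(scores) / len(scores)

        print(anls(["1,234,567", "2018年1月"], ["1,234,567", "2018年1月"]))  # 1.0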
  10. Case Study 1: Why Layout Modality is Crucial
      Problem: aligning row and column context in 2D space.
      Without Layout: lacks spatial context → selects an incorrect neighboring cell.
      With Layout: provides crucial spatial context → correctly identifies the target cell.
  11. Case Study 2: Why Text Modality is Crucial
      Problem: potential for OCR or recognition errors on small and dense text.
      Without Text: the image-only model (I) is susceptible to minor recognition errors.
      With Text + Layout: providing clean text data bypasses visual errors, leading to correct value extraction.
  12. Comparison with Structured Tables
      • Structured text (especially HTML) sets the performance upper bound.
      • However, this clean data is unavailable in most real-world documents like PDFs.
      • Our L+T+I model's performance is remarkably close to this upper bound, making it a practical substitute.
      [Bar chart: ANLS and Acc. for the I and L+T+I settings versus structured-text inputs (Markdown, JSON, HTML); L+T+I (roughly 0.95-0.97) nearly matches the structured formats, with HTML highest (roughly 0.97-0.98).]
  13. Limitations / Future Work
      Limitations
      • While our method is a practical substitute, perfectly structured text (like HTML) still achieves the highest absolute performance when available.
      • While our LVLM architecture has potential for visual elements like figures, the scope of the TableCellQA dataset currently limits our evaluation to text-based tables.
      Future Work
      • Robustness to noisy documents, by exploring end-to-end models or enhanced OCR.
  14. Conclusion
      • We demonstrated that enhancing LVLMs with Text (T) and Layout (L) modalities significantly improves table question answering performance.
      • Our analysis revealed a clear hierarchy of importance for accuracy: Text > Layout > Image.
      • Our multimodal approach serves as a practical intermediate solution, bridging the gap between image-only models and methods that require perfectly structured data.