Enhancing Large Vision-Language Models with Layout Modality for Table Question Answering on Japanese Annual Securities Reports

IIAI AAI / CDEF
2025/07/15
Paper: https://arxiv.org/abs/2505.17625

eida
July 14, 2025

Transcript

  1. Enhancing Large Vision-Language Models with Layout Modality for Table Question Answering on Japanese Annual Securities Reports
     Hayato Aida, Kosuke Takahashi, Takahiro Omi (Stockmark, Japan), IIAI AAI / CDEF 2025
  2. Motivation
     • Financial tables appear in diverse formats (PDFs, images, etc.)
     • This poses a major challenge for practical RAG systems
     • LVLMs offer a format-agnostic approach by treating any table as an image
     • However, their accuracy remains a critical hurdle for real-world use
     • Goal: Enhance LVLM table understanding by adding modalities to complement the image
     [Figure: clean HTML structure (given in the NTCIR-18 U4 Table QA dataset) vs. the real-world input format (our challenge). HTML example: <table><tr><td colspan="2">回次</td><td>第65期</td><td>第66期</td><td>第67期</td><td>第68期</td><td>第69期</td></tr><tr><td colspan="2">決算年月</td><td>2016年1月</td><td>2017年1月</td><td>2018年1月</td><td>2019年1月</td>…</table>]
  3. Representation of Table Structures
     We focused on PDF tables and used images, text, and layout to tackle Table QA.
     [Figure: spectrum of table representations, from highly structured formats (HTML, JSON, Markdown, e.g. <table><tr><td colspan="2">回次</td><td>第65期</td><td>第66期</td><td>第67期</td>…</table>) to low-structure / unstructured formats (image, or PDF/image + OCR), where most business documents used for RAG fall.]
  4. Our Key Contributions
     1. Data: Proposed TableCellQA, a new benchmark for evaluating pure table structure understanding, derived from the NTCIR-18 U4 task.
     2. Model: Enhanced LVLMs by integrating Text (T) and Layout (L) modalities with the image.
  5. Task Definition: TableCellQA
     Original Task (NTCIR-18 U4): included complex reasoning like unit conversion.
     Our Goal: create a task that purely evaluates a model's ability to understand table structure, separate from its reasoning skills.
     Our Solution (TableCellQA): the answer is always the raw value of a single cell, with no calculations needed.
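     To make the task format concrete, below is an illustrative TableCellQA-style instance written as a Python dict. The field names and sample values are hypothetical, not taken from the dataset; what mirrors the task definition is that the gold answer is the raw value of a single cell.

        # Illustrative TableCellQA-style instance (field names and values are
        # hypothetical). The gold answer is the raw value of one cell; no unit
        # conversion or other calculation is required.
        example = {
            "question": "What was net sales for the 67th fiscal term (第67期)?",
            "table_image": "securities_report_page.png",  # hypothetical file name
            "answer": "1,234,567",                        # raw cell value, verbatim
        }

        # Because answers are verbatim cell values, a plain string comparison is
        # enough to illustrate scoring (the paper reports Acc. and ANLS).
        def exact_match(prediction: str, gold: str) -> bool:
            return prediction.strip() == gold.strip()

        print(exact_match("1,234,567", example["answer"]))  # True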
  6. Layout and Text Modality
     We use two additional modalities: Text and Layout.
     Layout: the bounding box coordinates for each text block.
     Benefit: avoids complex table structure reconstruction.
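     The slides do not name a specific extraction tool, so the following is a minimal sketch of how text blocks and their bounding boxes could be pulled from a PDF page; PyMuPDF and the [0, 1] coordinate normalization are assumptions made here for illustration.

        # Minimal sketch: extract text spans and bounding boxes from a PDF page.
        # The tool (PyMuPDF) and the normalization are assumptions; the slide only
        # states that each text block is paired with its bounding box coordinates.
        import fitz  # PyMuPDF

        doc = fitz.open("annual_securities_report.pdf")  # hypothetical file name
        page = doc[0]
        width, height = page.rect.width, page.rect.height

        blocks = []
        for x0, y0, x1, y1, text, *_ in page.get_text("words"):
            blocks.append({
                "text": text,
                # 4-dimensional layout feature: (x0, y0, x1, y1) scaled to [0, 1].
                "bbox": (x0 / width, y0 / height, x1 / width, y1 / height),
            })

        # Each entry pairs a text span with its coordinates, so no explicit
        # reconstruction of rows, columns, or cell spans is needed.
        print(blocks[:3])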
  7. Layout-Aware LVLM Architecture
     • Layout coordinates are encoded into features via an MLP.
     • Each text token is fused with its corresponding layout feature.
     • These text-layout pairs and image features form the final input for the LLM.
     [Architecture diagram: image, text, and layout inputs are fused and fed to the LLM, which outputs the answer.]
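     Below is a minimal PyTorch sketch of this fusion. The toy hidden size, the additive fusion of text and layout features, and the module names are assumptions for illustration; the actual model is built on LLaVA-OneVision-7B, not this standalone module.

        # Minimal PyTorch sketch of the layout-aware fusion described above.
        # Sizes, additive fusion, and names are illustrative assumptions.
        import torch
        import torch.nn as nn

        class LayoutAwareFusion(nn.Module):
            def __init__(self, vocab_size=32000, hidden=512):
                super().__init__()
                self.text_emb = nn.Embedding(vocab_size, hidden)   # text tokens
                self.layout_mlp = nn.Sequential(                   # bbox -> feature
                    nn.Linear(4, hidden), nn.GELU(), nn.Linear(hidden, hidden)
                )

            def forward(self, token_ids, bboxes, image_feats):
                # token_ids:   (B, T)         text token ids
                # bboxes:      (B, T, 4)      normalized bbox per text token
                # image_feats: (B, V, hidden) features from the vision encoder
                text = self.text_emb(token_ids)      # (B, T, hidden)
                layout = self.layout_mlp(bboxes)     # (B, T, hidden)
                fused = text + layout                # fuse each token with its layout
                # Image features plus text-layout tokens form the LLM input sequence.
                return torch.cat([image_feats, fused], dim=1)

        # Shape check with dummy tensors.
        fusion = LayoutAwareFusion()
        out = fusion(torch.randint(0, 32000, (1, 8)),
                     torch.rand(1, 8, 4),
                     torch.randn(1, 16, 512))
        print(out.shape)  # torch.Size([1, 24, 512])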
  8. Experiments
     1. General Setup
        • Base model: LLaVA-OneVision-7B
        • Fine-tuning: all models were fine-tuned on the TableCellQA dataset for 2 epochs with a learning rate of 1e-5; full-parameter fine-tuning was applied.
     2. Ablation Study Conditions
        • To analyze the impact of each modality, we compared the following 5 settings:
          L+T+I  training with Layout, Text, and Image
          L+T    training with Layout and Text
          T+I    training with Text and Image
          L+I    training with Layout and Image
          I      training with Image only
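     The same setup, written out as a plain configuration for reference; only values stated on this slide are included, and the dict layout itself is just illustrative.

        # Fine-tuning setup from this slide as a plain config. Anything not stated
        # here (batch size, optimizer, prompts, etc.) is left unspecified.
        FINETUNE_CONFIG = {
            "base_model": "LLaVA-OneVision-7B",
            "dataset": "TableCellQA",
            "epochs": 2,
            "learning_rate": 1e-5,
            "full_parameter_finetuning": True,
        }

        # Ablation settings: which modalities the model is trained with.
        ABLATION_SETTINGS = {
            "L+T+I": ["layout", "text", "image"],
            "L+T":   ["layout", "text"],
            "T+I":   ["text", "image"],
            "L+I":   ["layout", "image"],
            "I":     ["image"],
        }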
  9. Results
     • Top performance was achieved by the L+T+I and L+T settings.
     • The hierarchy of importance for accuracy is: Text > Layout > Image.
     • Task-specific fine-tuning is crucial, as all our models outperform zero-shot SOTA baselines.
     [Bar chart: ANLS and Acc. for GPT-4o (I), Qwen2.5-VL-72B (I), and our fine-tuned I, L+I, T+I, L+T, and L+T+I settings; both metrics are lowest for the zero-shot baselines (roughly 0.58-0.76) and highest for L+T and L+T+I (roughly 0.95-0.97).]
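     For reference, ANLS (Average Normalized Levenshtein Similarity) gives partial credit for near-match strings on a 0-1 scale. The sketch below assumes the common formulation with a 0.5 threshold; the exact variant used in the paper is not specified on this slide.

        # Minimal ANLS sketch. The 0.5 threshold follows the common formulation;
        # the paper's exact variant may differ.
        def levenshtein(a: str, b: str) -> int:
            prev = list(range(len(b) + 1))
            for i, ca in enumerate(a, 1):
                cur = [i]
                for j, cb in enumerate(b, 1):
                    cur.append(min(prev[j] + 1,                  # deletion
                                   cur[j - 1] + 1,               # insertion
                                   prev[j - 1] + (ca != cb)))    # substitution
                prev = cur
            return prev[-1]

        def anls(predictions, golds, threshold=0.5):
            scores = []
            for pred, gold in zip(predictions, golds):
                nl = levenshtein(pred, gold) / max(len(pred), len(gold), 1)
                scores.append(1 - nl if nl < threshold else 0.0)
            return sum(scores) / len(scores)

        print(anls(["1,234,567", "2018年1月"], ["1,234,567", "2018年1月"]))  # 1.0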
  10. Case Study 1: Why Layout Modality is Crucial
      Problem: aligning row and column context in 2D space.
      Without Layout: lacks spatial context → selects an incorrect neighboring cell.
      With Layout: provides crucial spatial context → correctly identifies the target cell.
  11. Case Study 2: Why Text Modality is Crucial
      Problem: potential for OCR or recognition errors on small and dense text.
      Without Text: the image-only model (I) is susceptible to minor recognition errors.
      With Text + Layout: providing clean text data bypasses visual errors, leading to correct value extraction.
  12. Comparison with Structured Tables
      • Structured text (especially HTML) sets the performance upper bound.
      • However, this clean data is unavailable in most real-world documents like PDFs.
      • Our L+T+I model's performance is remarkably close to this upper bound, making it a practical substitute.
      [Bar chart: ANLS and Acc. for the I and L+T+I settings versus structured-text inputs (Markdown, JSON, HTML); L+T+I (roughly 0.95-0.97) nearly matches the structured formats, with HTML highest (roughly 0.97-0.98).]
  13. Limitations / Future Work
      Limitations
      • While our method is a practical substitute, perfectly structured text (like HTML) still achieves the highest absolute performance when available.
      • While our LVLM architecture has potential for visual elements like figures, the scope of the TableCellQA dataset currently limits our evaluation to text-based tables.
      Future Work
      • Robustness to noisy documents, by exploring end-to-end models or enhanced OCR.
  14. Conclusion
      • We demonstrated that enhancing LVLMs with Text (T) and Layout (L) modalities significantly improves table question answering performance.
      • Our analysis revealed a clear hierarchy of importance for accuracy: Text > Layout > Image.
      • Our multimodal approach serves as a practical intermediate solution, bridging the gap between image-only models and methods that require perfectly structured data.