
STMK24 NTCIR18 U4 Table QA Submission

eida
June 11, 2025

Slides used for the oral presentation at NTCIR-18.
https://research.nii.ac.jp/ntcir/ntcir-18/program.html


Transcript

  1. Motivation (slide 2): The U4 task is QA over clean HTML tables, but real-world tables often arrive as images or PDFs. Challenge: RAG needs robust table handling across formats. Our approach: render the HTML and solve the task as multimodal QA (a rendering sketch follows). Goal: develop a practical method usable in business RAG. The slide contrasts the clean HTML structure given in the U4 task with the real-world input format we target, e.g. <table><tr><td colspan="2">回次</td><td>第65期</td><td>第66期</td><td>第67期</td><td>第68期</td><td>第69期</td></tr><tr><td colspan="2">決算年月</td><td>2016年1月</td><td>2017年1月</td><td>2018年1月</td><td>2019年1月</td>…</table> (the Japanese headers read 回次 "fiscal term" and 決算年月 "fiscal year-end").
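A minimal sketch of the rendering step, assuming an HTML-to-image tool such as Playwright; the slides state only that the given HTML is rendered and treated as multimodal QA input, so the library choice, viewport, and file names are assumptions.

```python
from playwright.sync_api import sync_playwright

# Hypothetical rendering step: turn the U4 HTML table into an image
# that a vision-language model can consume.
html = '<table border="1"><tr><td colspan="2">回次</td><td>第65期</td><td>第66期</td></tr></table>'

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page(viewport={"width": 1280, "height": 720})
    page.set_content(html)
    # Crop the screenshot to the table element only.
    page.locator("table").screenshot(path="table.png")
    browser.close()
```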
  2. Representation of Table Structures (slide 3): We focused on PDF tables and used images, text, and layout to tackle Table QA. Table representations range from highly structured formats (HTML, JSON, Markdown, e.g. <table><tr><td colspan="2">回次</td><td>第65期</td><td>第66期</td><td>第67期</td>…</table>) to low-structure or unstructured formats (images, or PDF/image + OCR), and most business documents handled by RAG fall into the latter category.
  3. Key Points of Our Method (slide 4): (1) Our strategy: predicting the cell ID, not the cell value. (2) Our input: fusing three modalities for precision: image, text, and layout.
  4. Strategy 1: Cell-ID Embedding (slide 5): We focus on predicting cell IDs to bypass the LVLM's weakness in math. Cell IDs (e.g., "r3c1") are rendered directly onto the table image as visual objects, which transforms the QA task into a more straightforward visual recognition problem (a pre-processing sketch follows).
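A minimal sketch of how cell IDs could be stamped into the HTML before rendering; the annotate_cell_ids helper, the bracketed ID format, and the use of BeautifulSoup are illustrative assumptions rather than the authors' implementation, and colspan/rowspan handling is omitted.

```python
from bs4 import BeautifulSoup

def annotate_cell_ids(html: str) -> str:
    """Prefix every <td>/<th> with a visible ID like '[r3c1]' (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    for r, row in enumerate(soup.find_all("tr"), start=1):
        for c, cell in enumerate(row.find_all(["td", "th"]), start=1):
            label = soup.new_tag("b")
            label.string = f"[r{r}c{c}] "
            cell.insert(0, label)  # the ID becomes part of the rendered cell
    return str(soup)

# The annotated HTML is then rendered to an image, so the model only has to
# read an ID off the picture instead of computing a row/column position.
print(annotate_cell_ids('<table><tr><td colspan="2">回次</td><td>第65期</td></tr></table>'))
```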
  5. Strategy 2: Layout and Text Modality (slide 6): We use two additional modalities: text and layout. Layout means the bounding-box coordinates of each text block. Benefit: this avoids complex table-structure reconstruction (an extraction sketch follows).
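A hedged sketch of what the layout modality looks like as data, assuming a pdfplumber-style extractor; the slides do not name a PDF-parsing or OCR tool, so the library choice and the word-level granularity are assumptions.

```python
import pdfplumber

# Hypothetical extraction of (text, bounding box) pairs from a PDF table page.
with pdfplumber.open("financial_report.pdf") as pdf:  # assumed input file
    page = pdf.pages[0]
    blocks = [
        {
            "text": word["text"],
            # Bounding box in page coordinates: (x0, top, x1, bottom).
            "bbox": (word["x0"], word["top"], word["x1"], word["bottom"]),
        }
        for word in page.extract_words()
    ]

# Each pair is one layout-aware unit: the text feeds the text modality and the
# bbox feeds the layout modality; no table structure is reconstructed.
print(blocks[:3])
```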
  6. Layout-Aware LVLM Architecture (slide 7): • Layout coordinates are encoded into features via an MLP. • Each text token is fused with its corresponding layout feature. • These text-layout pairs and the image features form the final input for the LLM (a fusion sketch follows).
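A minimal PyTorch sketch of such a fusion module; the MLP depth, the additive fusion, and the 3584-dimensional hidden size (typical of a 7B backbone) are assumptions, since the slides only state that layout coordinates are MLP-encoded and fused with the text tokens.

```python
import torch
import torch.nn as nn

class LayoutFusion(nn.Module):
    """Encode (x0, y0, x1, y1) boxes with an MLP and fuse them into token embeddings."""

    def __init__(self, hidden_dim: int = 3584):
        super().__init__()
        self.layout_mlp = nn.Sequential(
            nn.Linear(4, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, token_embeds: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq_len, hidden_dim)
        # boxes:        (batch, seq_len, 4), coordinates normalized to [0, 1]
        return token_embeds + self.layout_mlp(boxes)

fusion = LayoutFusion(hidden_dim=3584)
tokens = torch.randn(1, 16, 3584)
boxes = torch.rand(1, 16, 4)
fused = fusion(tokens, boxes)  # same shape as token_embeds
# `fused` text-layout features plus the image features would then go to the LLM.
```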
  7. Experiments (slide 8): (1) General setup: base model LLaVA-OneVision-7B; all models were fine-tuned on the Table QA dataset for 3 epochs with a learning rate of 1e-5 (a configuration sketch follows). (2) Ablation study conditions: to analyze the impact of each modality, we compared four settings: I+T+L (training with image, text, and layout), T+L (training with text and layout), I+T (training with image and text), and I (training with image, w/o pre-training).
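A hedged sketch of the fine-tuning configuration using Hugging Face TrainingArguments; only the epoch count and learning rate come from the slide, while the batch size, scheduler, precision, and output directory are assumptions.

```python
from transformers import TrainingArguments

# Hypothetical configuration mirroring the reported setup (3 epochs, lr 1e-5).
training_args = TrainingArguments(
    output_dir="llava-onevision-7b-tableqa",  # assumed name
    num_train_epochs=3,
    learning_rate=1e-5,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    bf16=True,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    logging_steps=10,
    save_strategy="epoch",
)
# Training itself would use a multimodal data collator and trainer,
# which the slides do not describe.
```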
  8. Results (slide 9): • Our full model (I+T+L) achieved the highest accuracy, confirming the effectiveness of the multimodal approach. • Removing the layout information (I+T) caused a performance drop, highlighting its critical role.
  9. Case Study: Why the Layout Modality is Crucial (slide 10): Problem: a mismatch between a cell ID (r2c4) and its actual visual column. Without layout, the model is misled by the inconsistent cell ID; with layout, the bounding-box coordinates reveal the true table structure. (The slide highlights cells r2c4 and r6c5 as the example.)
  10. Limitations / Future Work (slide 11): Limitations: • Reliance on cell IDs: not applicable to general real-world documents. • Assumption of clean text: not robust to noisy or handwritten tables with OCR errors. Future work: • Direct value prediction: to eliminate the dependency on cell IDs. • Robustness for noisy documents: by exploring end-to-end models or enhanced OCR.
  11. Conclusion (slide 12): • We proposed a multimodal approach for the Table QA task, integrating image, text, and layout information. • Our experiments showed this method is highly effective, with text and layout proving to be the most critical modalities for achieving high accuracy. • This study highlights that combining visual, textual, and spatial context is key to robustly understanding complex structured data.