
Document Analysis Layout Analysis, OCR and NLP

Mahdi Khashan
June 02, 2025

TU Wien 183.628 Document Analysis SS2025


Transcript

  1. Task B - OCR

    • PSM out of the box
    • Requires preprocessing
    • Ease of use
    • Probably better results

    Vision Language Models
    • Better accuracy in complex layouts
    • Different fonts and styles
    • No PSM
    • Challenging usage
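If "PSM" on this slide refers to Tesseract's page segmentation mode (an assumption; the slide does not name the OCR engine), the mode is chosen per call with the `--psm` option. The sketch below only builds the command line; `tesseract_command` and the `deu` language default are hypothetical choices for illustration, not the author's pipeline.

```python
# A few of Tesseract's page segmentation modes (valid values are 0 to 13).
PSM_MODES = {
    3: "fully automatic page segmentation, no OSD (default)",
    6: "single uniform block of text",
    7: "single text line",
    11: "sparse text, no particular order",
}


def tesseract_command(image_path, psm=3, lang="deu"):
    """Build a Tesseract CLI call that prints recognised text to stdout."""
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    # "stdout" as the output base makes Tesseract write the text to stdout.
    return ["tesseract", image_path, "stdout", "--psm", str(psm), "-l", lang]
```

With Tesseract installed, the returned list can be passed to `subprocess.run(cmd, capture_output=True, text=True)`. For a crop that contains a single caption line, mode 7 is a plausible starting point, while mode 3 leaves the layout analysis to the engine.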
  2. Task B - OCR - Full Image

    Mean WER = 0.984, Mean CER = 0.984
  3. Task B - OCR - Color Cropped

    Mean WER = 1.009, Mean CER = 0.817
  4. Task B - OCR - Binarized, Cropped with Line Detection

    Neunkirchner Alleerte -------------------------------- Aufn. Dr . Machura 1948. Film Nr . 17 . Neg. 17 /
    WER = 1.83, CER = 0.789

    A Gr. Riedenthal Tobacco Co. mit "Neun Mauna" und'e de NOMTAY JOURNAL, J. W. W. LIMA VV- 4 0.0 0.0 0.0 0.0 0 Aufn. Meisinger 1956. Rollfilm Nr. 39, Neg. Nr. 10,9.
    WER = 2.75, CER = 0.82
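Several WER values on these slides are above 1. That is possible because WER and CER divide the edit distance by the length of the reference, so an OCR output with many spurious tokens can need more edits than the reference has words. A minimal sketch of both metrics, assuming whitespace tokenisation for words; the function names and the strings in the usage note are illustrative, not the author's evaluation code or ground truth.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

For a made-up reference `"Film Nr. 17"` and a made-up OCR output `"Film Nr . 17 . Neg"`, the word-level distance is 4 against 3 reference words, so `wer` returns 4/3, while `cer` stays below 1 because most characters still line up.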
  5. Task C - NLP (spaCy with “de_core_news_lg”)

    0001
    Neunkirchner Alleerte -------------------------------- Aufn. Dr . Machura 1948. Film Nr . 17 . Neg. 17 /
    • Neunkirchner Alleerte - MISC
    • Nr - MISC
    • Machura - PER
    • Neg - PER

    0005
    Massachusetts, Massachusetts Flugerdeverwenungen.co m CAUTH. Meisinger, J.W. Rollfilm 4x6. N. Sch. K. 258
    • Massachusetts - LOC
    • Massachusetts - LOC
    • Flugerdeverwenungen.co m\nCAUTH - MISC
    • Meisinger - PER
    • N. Sch - PER

    0030
    A Gr. Riedenthal Tobacco Co. mit "Neun Mauna" und'e de NOMTAY JOURNAL, J. W. W. LIMA VV- 4 0.0 0.0 0.0 0.0 0 Aufn. Meisinger 1956. Rollfilm Nr. 39, Neg. Nr. 10,9.
    • A Gr - MISC
    • Nr. 39 - MISC
    • Tobacco - ORG
    • LIMA VV- - ORG
    • Mauna - LOC
    • Neg - LOC
    • J. W. W. - PER
    • Meisinger - PER
  6. Task C - NLP (LLM)

    Models: Gemini 2.5 Pro, GPT o-4, gemma-3-4b-it (Local Inference)

    Prompt:
    you are an NER model, your task is to find following information on a german text: Location, Description, Date, Photographer, Film
    return the result as a JSON, use the following format:
    { "Location": "Helenental", "Description": "unerlaubte Rodung im Schutzgebiet", "Date": "April 1948", "Photographer": "Meisinger", "Film": "Neg.Nr. 3254/KIX/16, Film", }
    I will send text, are you ready?

    Prompt:
    can you act as NER and find information about this keys in the following text, the language of the text id german: 1. Location 2. Description 3. Date 4. Photographer 5. Film
    text: Regelsorunner Au.S.A.A. Aufn. Dr. Machura 1948 Film Nr. 19, Neg, Nr. 21. u. 22.
    your output should be a json like this:
    { "Location": "", "Description": "", "Date": "", "Photographer": "", "Film": "", }

    Outputs:
    { "Location": "Neunkirchner Allee", "Description": "", "Date": "1948", "Photographer": "Dr. Machura", "Film": "Film Nr. 17, Neg. 17 /" }
    { "Location": "Neunkirchner Allee", "Description": "", "Date": "1948", "Photographer": "Dr. Machura", "Film": "Film Nr. 17, Neg. 17" }
    { "Location": "Aufn. Dr. Machura", "Description": "Regelsorunner Au.S.A.A.", "Date": "1948", "Photographer": null, "Film": "Film Nr. 19, Neg, Nr. 21. u. 22." }
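The example objects in the prompts end with a trailing comma, which strict JSON parsers reject, and a model may copy that pattern or wrap its answer in extra text. A minimal sketch of a tolerant parser for replies like the ones on this slide, assuming a single flat object with the five keys named in the prompt; `parse_ner_reply` and `EXPECTED_KEYS` are hypothetical names, not part of the author's code.

```python
import json
import re

# The five fields requested in the prompts on the slide.
EXPECTED_KEYS = ("Location", "Description", "Date", "Photographer", "Film")


def parse_ner_reply(reply):
    """Extract the first flat JSON object from a model reply.

    Empty strings and null values are both returned as None.
    Assumes no nested braces and no ", }" sequence inside string values.
    """
    match = re.search(r"\{.*?\}", reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in reply")
    # Drop a trailing comma before the closing brace.
    cleaned = re.sub(r",\s*}", "}", match.group(0))
    data = json.loads(cleaned)
    return {key: (data.get(key) or None) for key in EXPECTED_KEYS}
```

Normalising empty strings and `null` to the same value makes the three outputs above directly comparable field by field.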
  7. Task C - NLP (Results)

    LLM (Agent) / spaCy
    • Great results on large models
    • Larger model, better context, better result
    • Trained on huge amounts of data
    • Cost (hardware and time)
    • Better on location labeling
    • Poor results
    • Unrelated dataset (news)
    • PCA helped, but still low variance (high overlap)
    • PER PCA (41% of the information for both principal components)
    • Better OCR, better NLP
    • Person names include "Dr. Machura", "Meisinger", "Kupelwieser".
    • Many location names appear to be Austrian or German ("Neunkirchner Allee", "Klamm bei Schottwien", "Schneeberg", "Lilienfeld-Klosteralm")
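One way to read the 41% figure is as the explained variance of the two principal components taken together; this is an interpretation, since the slide does not define it. With $\lambda_1 \ge \dots \ge \lambda_d$ the eigenvalues of the covariance matrix of the $d$-dimensional entity vectors, the explained variance ratio of component $k$ and the reading of the slide's number would be:

```latex
\mathrm{EVR}_k = \frac{\lambda_k}{\sum_{j=1}^{d} \lambda_j},
\qquad
\mathrm{EVR}_1 + \mathrm{EVR}_2 \approx 0.41
```

Under that reading, more than half of the variance lies outside the 2D projection, which fits the "high overlap" noted on the slide.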
  8. Reflection

    • Some documents contained more than one printed text region
    • TrOCR accuracy decreased for later artifacts
    • Metadata caches everything; my disk filled up multiple times, causing the pipeline to fail
    • spaCy vectors were all null!
    • PSM is probably the bottleneck of my solution
    • A baseline OCR model can be helpful and is useful for validation
    • I don’t know how to debug OCR and NLP models and validate their results
    • t-SNE: 88266 segmentation fault python tsne.py