

Transforming Raw Files into Structured Data for Your LLM

In this presentation, I explain how to build a document processing pipeline that turns raw files into structured data for use with LLMs.

This version of the talk was given at IBM TechXchange, in October 2025.


Kerim Satirli

October 08, 2025


Transcript

  1. Orlando, FL, October 6–9. IBM TechXchange 2025. Session code 4254.

    Kerim Satirli, Developer Advocate, HashiCorp. Transforming Raw Files into Structured Data for Your LLM
  2. IBM TechXchange | © 2025 IBM Corporation

    Agenda: 01 How it started, 02 Let's Build, 03 Next Steps
  3. How do we access the domain knowledge of a project?
  4. How do we access the domain knowledge that we haven't memorized?
  5. Project Files

    TODO, sensors.log, _FLAGS.txt, firmware.txt, config.txt, coords.csv, otaupdate.sh, esp-idf.pdf, esp32.pdf, vault.md, admin.md, spec.md, battery.md, swarm.md, api.md, ledctrl.c, Pathfinder doc.pdf, lidar.py, routes.md, tests.xslx, imu_cal.csv
  6. Project Files

    Diagram labels: Chat Interface, End user, Response, LLM Model, Prompt
  7. Project Files

    Diagram labels: Chat Interface, End user, Response, LLM Model, Document, Request, Prompt
  8. Project Files

    Diagram labels: Chat Interface, End user, Response, LLM Model, Document, Request, Prompt. 1. Context length
  9. Project Files

    Diagram labels: Chat Interface, End user, Response, LLM Model, Document, Request, Prompt. 1. Context length 2. Enough context?
  10. Project Files

    Diagram labels: Chat Interface, End user, Response, LLM Model, Document, Request, Prompt. 1. Context length 2. Enough context? 3. Info readable?
  11. Project Files

    Diagram labels: Chat Interface, End user, Response, LLM Model, Document, Request, Prompt. 1. Context length 2. Enough context? 3. Info readable? 4. Sensitive data
  12. Agenda: 01 How it started, 02 Let's build, 03 Next Steps
  13. Data flow

    Diagram labels: Chat Interface, End User, Response, LLM Model, Document, Request, Prompt
  14. Request flow

    Diagram labels: Vector DB, Chat Interface, End User, Response, LLM Model, Prompt, Request
  15. Request flow

    Diagram labels: Chat Interface, End User, Response, LLM Model, Vector DB, Prompt, Request, Document
  16. Document flow

    Diagram labels: LLM Model, Chat Interface, End User, Response, Prompt, Vector DB, Request, Document
  17. Document flow

    Diagram labels: End User, LLM Model, Vector DB, Prompt, Response, Documents, Docling Document Processor, Chat Interface
  18. Document flow

    Diagram labels: End User, LLM Model, Vector DB, Prompt, Response, Documents, Docling Document Processor, Chat Interface, HashiCorp Vault
  19. Document flow

    Diagram labels: End User, Vector DB, Prompt, Response, Documents, Docling Document Processor, Chat Interface, HashiCorp Vault, Ollama, IBM Granite
  20. Document flow

    Diagram labels: End User, Prompt, Response, Documents, Docling Document Processor, Chat Interface, HashiCorp Vault, Ollama, IBM Granite, Vector DB
  21. Document flow

    Diagram labels: End User, Knowledge Base, Prompt, Response, Documents, Docling Document Processor, Chat Interface, HashiCorp Vault, Ollama, IBM Granite
  22. Document flow

    Diagram labels: End User, Knowledge Base, Prompt, Response, Docling Document Processor, HashiCorp Vault, Ollama, IBM Granite, Open WebUI, Documents, /uploads, /processed
  23. Demo. Terminal: ./scripts/deploy.sh

    [SUCCESS] Phase 2 completed! All applications deployed successfully!
    [INFO] Workshop
    === WORKSHOP DEPLOYMENT COMPLETED ===
    Access URLs:
    Web Upload App: pathfinder-prism.svcs.dev:3000
    Open WebUI: pathfinder-prism.svcs.dev:8080
    Nomad UI: pathfinder-prism.svcs.dev:4646
    Azure Storage Account: zepf3rfa
    [SUCCESS] Workshop deployment completed!
  24. Agenda: 01 How it started, 02 Let's build, 03 Next Steps
  25. Lessons Learned: bad experience, garbage, inconsistency, tools req'd
  26. How to continue your learning journey

    1. 3640: Deep Dive into Docling with the Core Dev Team
    2. 3427: Empowering Local AI with IBM Granite
    HashiCorp Sandbox, Booth 300: Product demos and Practitioner Guidance
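
The document flow in the transcript (files arrive in /uploads, pass through a document processor, and land in /processed) could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the talk: the function names, the JSON record layout, the secret-matching pattern, and the character-based chunk limit are all hypothetical. The `convert` parameter marks the place where a real document converter such as Docling would plug in; the stand-in used here only reads plain text. The redaction and chunking steps loosely correspond to the "Sensitive data" and "Context length" concerns raised earlier.

```python
import json
import re
from pathlib import Path
from typing import Callable

# Illustrative pattern only (an assumption): matches "password = x", "token: y", etc.
SECRET_PATTERN = re.compile(
    r"\b(password|token|secret)[ \t]*[=:][ \t]*\S+", re.IGNORECASE
)


def redact(text: str) -> str:
    """Replace the value of anything that looks like a credential assignment."""
    return SECRET_PATTERN.sub(lambda m: f"{m.group(1)}=[REDACTED]", text)


def chunk(text: str, max_chars: int = 500) -> list[str]:
    """Pack blank-line-separated paragraphs into chunks of roughly max_chars.

    A single paragraph longer than max_chars is kept whole, not split.
    """
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks


def plain_text_converter(path: Path) -> str:
    """Stand-in converter (hypothetical): reads the file as UTF-8 text."""
    return path.read_text(encoding="utf-8", errors="replace")


def process_uploads(
    uploads: Path,
    processed: Path,
    convert: Callable[[Path], str] = plain_text_converter,
) -> list[Path]:
    """Convert, redact, and chunk every file in uploads; write JSON to processed."""
    processed.mkdir(parents=True, exist_ok=True)
    written: list[Path] = []
    for source in sorted(uploads.iterdir()):
        if not source.is_file():
            continue
        text = redact(convert(source))
        record = {"source": source.name, "chunks": chunk(text)}
        target = processed / f"{source.name}.json"
        target.write_text(json.dumps(record, indent=2), encoding="utf-8")
        written.append(target)
    return written
```

Calling `process_uploads(Path("uploads"), Path("processed"))` would write one JSON file per input file. Passing a different `convert` callable is the intended extension point for formats such as PDF, which the plain-text stand-in cannot read meaningfully.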