openthebox.be - smart publications
Niek Bartholomeus
October 02, 2019
Technology
Extracting deep insights from boring documents: a real-life story
Transcript
openthebox.be Extracting deep insights from 'boring' documents: a real-life story
Me
Niek Bartholomeus (@niekbartho)
• Background as a software developer
• Switched to data science and natural language processing in 2016
• Founded openthebox.be in 2017
openthebox.be
openthebox.be Open data
• KBO: http://kbopub.economie.fgov.be/kbopub
• NBB: https://cri.nbb.be/bc9/web/catalog
• Belgian Official Gazette: http://www.ejustice.just.fgov.be/tsv/tsvn.htm
Knowledge graph
• Sources: KBO, NBB, Belgian Official Gazette
• Data types: structured data, unstructured data
• Visualization, analytics, machine learning
Unstructured data - pipeline
Unstructured data - pipeline steps
1] OCR (optical character recognition)
2] NER (named entity recognition)
3] Relation extraction
4] Entity linking
Unstructured data - pipeline steps 1] OCR
Unstructured data - pipeline steps 2] NER
Unstructured data - pipeline steps 2] NER
Pre-processing rules:
• 1.Jan Janssens → [“1.Jan”, “Janssens”]
• Marktstraat 54,8450 Bredene → [“Marktstraat”, “54,8450”, “Bredene”]
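The two examples above show tokens that stay glued together when the publication omits a space ("1.Jan", "54,8450"). A minimal sketch of what such pre-processing rules could look like, assuming they are regular expressions that insert the missing space before tokenization. The rule list, `preprocess` and `tokenize` are hypothetical names written only to cover these two examples, not the rules used by openthebox.be.

```python
import re
from typing import List

# Hypothetical rules, written only to cover the two slide examples.
# Each rule is a (compiled pattern, replacement) pair.
PREPROCESSING_RULES = [
    # "1.Jan" -> "1. Jan": a list number glued to a capitalised word.
    (re.compile(r"\b(\d+\.)(?=[A-Z])"), r"\1 "),
    # "54,8450" -> "54, 8450": a house number glued to a 4-digit zip code.
    (re.compile(r"\b(\d{1,4}),(?=\d{4}\b)"), r"\1, "),
]


def preprocess(text: str) -> str:
    """Apply every rule in order and return the repaired text."""
    for pattern, replacement in PREPROCESSING_RULES:
        text = pattern.sub(replacement, text)
    return text


def tokenize(text: str) -> List[str]:
    """Stand-in whitespace tokenizer; a real NER tokenizer is more refined."""
    return text.split()


print(tokenize(preprocess("1.Jan Janssens")))
print(tokenize(preprocess("Marktstraat 54,8450 Bredene")))
```

With the space restored, the first name and the zip code become separate tokens that a NER model can label on their own.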
Unstructured data - pipeline steps 2] NER
Post-processing rules:
Faulty publication + Context = Improved publication
• General rules
• Legal rules
• Historic probabilities
Unstructured data - pipeline steps 2] NER
Inheritance (“is a” relationship):
• Base labels: Organization, Person
• Subclass labels: Notary, Owner, Representative, Proxy holder, Administrator, Author
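One way to picture the "is a" relationship between labels is a parent lookup that is walked upwards. This is only a sketch: the slide does not say which base label each subclass label belongs to, so the `PARENT` mapping below is an assumption made for illustration, and `is_a` and `base_label` are hypothetical helper names.

```python
from typing import Dict, Optional

# Subclass label -> base label.
# ASSUMPTION: the slide does not state which base each subclass inherits
# from; every subclass is mapped to "Person" here purely as an example.
PARENT: Dict[str, str] = {
    "Notary": "Person",
    "Owner": "Person",
    "Representative": "Person",
    "Proxy holder": "Person",
    "Administrator": "Person",
    "Author": "Person",
}


def is_a(label: str, ancestor: str) -> bool:
    """Return True if `label` equals `ancestor` or inherits from it."""
    current: Optional[str] = label
    while current is not None:
        if current == ancestor:
            return True
        current = PARENT.get(current)
    return False


def base_label(label: str) -> str:
    """Follow the parent links until a label without a parent is reached."""
    while label in PARENT:
        label = PARENT[label]
    return label


print(is_a("Notary", "Person"))        # True
print(is_a("Notary", "Organization"))  # False
print(base_label("Proxy holder"))      # Person
```

With a lookup like this, anything that works on a base label (for example name parsing for every Person) also applies to entities tagged with a more specific subclass label.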
Unstructured data - pipeline steps 2] NER
Sub entity extraction:
• Niek Roger Camiel Bartholomeus
  First name: Niek
  Middle names: Roger, Camiel
  Last name: Bartholomeus
• Gentstraat 69 9170 Sint-Pauwels
  Street: Gentstraat
  Number: 69
  Zip code: 9170
  City: Sint-Pauwels
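The sub entity split above can be approximated with simple string rules. The sketch below is an assumption about how this could be done, not the method from the talk: `split_person_name` treats the first word as the first name and the last word as the last name (which breaks on multi-word surnames), and `split_address` assumes a "street number zip city" layout with a 4-digit zip code. Both function names are hypothetical.

```python
import re
from typing import Dict, List, Optional, Union

# ASSUMPTION: addresses look like "<street> <number> <4-digit zip> <city>",
# optionally with a comma between the house number and the zip code.
ADDRESS_PATTERN = re.compile(
    r"^(?P<street>.+?)\s+(?P<number>\d+\w*)\s*,?\s*"
    r"(?P<zip_code>\d{4})\s+(?P<city>.+)$"
)


def split_person_name(full_name: str) -> Dict[str, Union[str, List[str]]]:
    """Split a name into first, middle and last names by position."""
    parts = full_name.split()
    if len(parts) < 2:
        raise ValueError("expected at least a first and a last name")
    return {
        "first_name": parts[0],
        "middle_names": parts[1:-1],
        "last_name": parts[-1],
    }


def split_address(address: str) -> Optional[Dict[str, str]]:
    """Return street, number, zip code and city, or None if no match."""
    match = ADDRESS_PATTERN.match(address.strip())
    return match.groupdict() if match else None


print(split_person_name("Niek Roger Camiel Bartholomeus"))
print(split_address("Gentstraat 69 9170 Sint-Pauwels"))
```

Positional rules like these are fragile on real publications, which is one reason to apply them only to spans that NER has already labelled as a person or an address.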
Unstructured data - pipeline steps 3] Relation extraction
Unstructured data - pipeline steps 4] Entity linking
Unstructured data - pipeline steps 4] Entity linking
Deduplication:
• Niek Roger Camiel Bartholomeus
• Niek Bartholomeus
• N. Bartholomeus
• Bartholomeus
→ Niek Roger Camiel Bartholomeus
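The deduplication example above can be sketched as a name-compatibility check: a shorter mention is treated as a variant of the fullest name when the last names match and every given name or initial is a prefix of the given name in the same position. This is an illustrative assumption, not the matching logic from the talk, and `is_variant_of` and `canonical_name` are hypothetical names. It only makes sense for mentions that already come from the same publication, since a bare "Bartholomeus" could otherwise refer to anyone with that surname.

```python
from typing import List, Optional, Tuple


def _given_and_last(name: str) -> Tuple[List[str], str]:
    """Split a name into its given names (or initials) and its last name."""
    parts = name.replace(".", "").split()
    if not parts:
        raise ValueError("empty name")
    return parts[:-1], parts[-1]


def is_variant_of(mention: str, full_name: str) -> bool:
    """True if `mention` could be a shortened form of `full_name`."""
    m_given, m_last = _given_and_last(mention)
    f_given, f_last = _given_and_last(full_name)
    if m_last.lower() != f_last.lower() or len(m_given) > len(f_given):
        return False
    # Each given name or initial must be a prefix of the full given name
    # in the same position ("N" matches "Niek").
    return all(
        f.lower().startswith(m.lower()) for m, f in zip(m_given, f_given)
    )


def canonical_name(mentions: List[str]) -> Optional[str]:
    """Pick the mention with the most words if all others are its variants."""
    candidate = max(mentions, key=lambda name: len(name.split()))
    if all(is_variant_of(mention, candidate) for mention in mentions):
        return candidate
    return None


mentions = [
    "Niek Roger Camiel Bartholomeus",
    "Niek Bartholomeus",
    "N. Bartholomeus",
    "Bartholomeus",
]
print(canonical_name(mentions))  # Niek Roger Camiel Bartholomeus
```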
Unstructured data - pipeline steps 4] Entity linking
Link with knowledge graph:
• Niek Roger Camiel Bartholomeus
• Gentstraat 69 9170 Sint-Pauwels
openthebox.be
openthebox.be Bigger picture
openthebox.be Academia + Industry (http://wpmlabs.com/, https://www.filter-concept.com/)
openthebox.be https://opensenselabs.com