Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
openthebox.be - smart publications
Search
Sponsored
·
SiteGround - Reliable hosting with speed, security, and support you can count on.
→
Niek Bartholomeus
October 02, 2019
Technology
0
180
openthebox.be - smart publications
Extracting deep insights from boring documents: a real-life story
Niek Bartholomeus
October 02, 2019
Tweet
Share
More Decks by Niek Bartholomeus
See All by Niek Bartholomeus
openthebox.be
niekbartho
1
2.6k
From idea to production with NLP, Scala and Spark
niekbartho
3
490
Going DevOps with BMC
niekbartho
0
210
Orchestration in meatspace
niekbartho
4
2k
Self-organization vs. global optimization - a comparison between traditional and modern organizations
niekbartho
2
490
DevOps for Dinosaurs
niekbartho
12
3k
Other Decks in Technology
See All in Technology
Greatest Disaster Hits in Web Performance
guaca
0
280
コンテナセキュリティの最新事情 ~ 2026年版 ~
kyohmizu
2
370
CDK対応したAWS DevOps Agentを試そう_20260201
masakiokuda
1
370
FinTech SREのAWSサービス活用/Leveraging AWS Services in FinTech SRE
maaaato
0
130
Context Engineeringが企業で不可欠になる理由
hirosatogamo
PRO
3
630
Agent Skils
dip_tech
PRO
0
120
SREのプラクティスを用いた3領域同時 マネジメントへの挑戦 〜SRE・情シス・セキュリティを統合した チーム運営術〜
coconala_engineer
2
700
What happened to RubyGems and what can we learn?
mikemcquaid
0
310
登壇駆動学習のすすめ — CfPのネタの見つけ方と書くときに意識していること
bicstone
3
120
仕様書駆動AI開発の実践: Issue→Skill→PRテンプレで 再現性を作る
knishioka
2
680
ブロックテーマでサイトをリニューアルした話 / 2026-01-31 Kansai WordPress Meetup
torounit
0
480
配列に見る bash と zsh の違い
kazzpapa3
3
160
Featured
See All Featured
Accessibility Awareness
sabderemane
0
55
Designing for Timeless Needs
cassininazir
0
130
Building Adaptive Systems
keathley
44
2.9k
The MySQL Ecosystem @ GitHub 2015
samlambert
251
13k
Put a Button on it: Removing Barriers to Going Fast.
kastner
60
4.2k
Information Architects: The Missing Link in Design Systems
soysaucechin
0
780
Building AI with AI
inesmontani
PRO
1
700
Between Models and Reality
mayunak
1
190
Making the Leap to Tech Lead
cromwellryan
135
9.7k
Darren the Foodie - Storyboard
khoart
PRO
2
2.4k
The Hidden Cost of Media on the Web [PixelPalooza 2025]
tammyeverts
2
190
Collaborative Software Design: How to facilitate domain modelling decisions
baasie
0
140
Transcript
openthebox.be Extracting deep insights from 'boring' documents: a real-life story
Me Niek Bartholomeus @niekbartho • Background as a software developer
• Switched to data science and natural language processing in 2016 • Founded openthebox.be in 2017
openthebox.be
openthebox.be Open data KBO NBB Belgian Official Gazette http://kbopub.economie.fgov.be/kbopub https://cri.nbb.be/bc9/web/catalog
http://www.ejustice.just.fgov.be/ tsv/tsvn.htm
knowledge graph Visualization Analytics Machine learning Knowledge graph Structured data
Unstructured data KBO NBB Belgian Official Gazette
Unstructured data - pipeline
Unstructured data - pipeline steps 1] OCR 2] NER 4]
Entity linking 3] Relation extraction
Unstructured data - pipeline steps 1] OCR
Unstructured data - pipeline steps 2] NER
Unstructured data - pipeline steps 2] NER Pre-processing rules: [“1.Jan”,
“Janssens”] 1.Jan Janssens [“Marktstraat”, “54,8450”, “Bredene”] Marktstraat 54,8450 Bredene
Unstructured data - pipeline steps 2] NER Post-processing rules: +
= General rules Legal rules Historic probabilities Faulty publication Context Improved publication
Unstructured data - pipeline steps 2] NER Organization Person Inheritance:
Notary Owner Representative Proxy holder Administrator Author : “is a” relationship Base labels Subclass labels
Unstructured data - pipeline steps 2] NER Gentstraat 69 Niek
Roger Camiel Bartholomeus Sub entity extraction: First name: Niek Middle names: Roger, Camiel Last name: Bartholomeus 9170 Sint-Pauwels Street: Gentstraat Number: 69 Zip code: 9170 City: Sint-Pauwels
Unstructured data - pipeline steps 3] Relation extraction
Unstructured data - pipeline steps 4] Entity linking
Unstructured data - pipeline steps Niek Roger Camiel Bartholomeus Niek
Bartholomeus N. Bartholomeus Bartholomeus } Niek Roger Camiel Bartholomeus Deduplication: 4] Entity linking
Unstructured data - pipeline steps Niek Roger Camiel Bartholomeus Link
with knowledge graph: Gentstraat 69 9170 Sint-Pauwels 4] Entity linking
openthebox.be
openthebox.be Bigger picture
openthebox.be http://wpmlabs.com/ Academia Industry https://www.filter-concept.com/ +
openthebox.be https://opensenselabs.com