Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
openthebox.be - smart publications
Search
Niek Bartholomeus
October 02, 2019
Technology
0
160
openthebox.be - smart publications
Extracting deep insights from boring documents: a real-life story
Niek Bartholomeus
October 02, 2019
Tweet
Share
More Decks by Niek Bartholomeus
See All by Niek Bartholomeus
openthebox.be
niekbartho
1
2.5k
From idea to production with NLP, Scala and Spark
niekbartho
3
470
Going DevOps with BMC
niekbartho
0
190
Orchestration in meatspace
niekbartho
4
2k
Self-organization vs. global optimization - a comparison between traditional and modern organizations
niekbartho
2
450
DevOps for Dinosaurs
niekbartho
12
3k
Other Decks in Technology
See All in Technology
【CEDEC2025】『Shadowverse: Worlds Beyond』二度目のDCG開発でゲームをリデザインする~遊びやすさと競技性の両立~
cygames
PRO
1
290
Nx × AI によるモノレポ活用 〜コードジェネレーター編〜
puku0x
0
340
2025新卒研修・HTML/CSS #弁護士ドットコム
bengo4com
3
13k
2025-07-31: GitHub Copilot Agent mode at Vibe Coding Cafe (15min)
chomado
2
370
AWS re:Inforce 2025 re:Cap Update Pickup & AWS Control Tower の運用における考慮ポイント
htan
1
210
ビジネス文書に特化した基盤モデル開発 / SaaSxML_Session_2
sansan_randd
0
260
Bet "Bet AI" - Accelerating Our AI Journey #BetAIDay
layerx
PRO
4
1.5k
【CEDEC2025】大規模言語モデルを活用したゲーム内会話パートのスクリプト作成支援への取り組み
cygames
PRO
2
770
モバイルゲームの開発を支える基盤の歩み ~再現性のある開発ラインを量産する秘訣~
qualiarts
0
1.1k
Claude Codeから我々が学ぶべきこと
s4yuba
8
2k
【CEDEC2025】『ウマ娘 プリティーダービー』における映像制作のさらなる高品質化へ!~ 豊富な素材出力と制作フローの改善を実現するツールについて~
cygames
PRO
0
230
生成AI時代におけるAI・機械学習技術を用いたプロダクト開発の深化と進化 #BetAIDay
layerx
PRO
1
1k
Featured
See All Featured
Helping Users Find Their Own Way: Creating Modern Search Experiences
danielanewman
29
2.8k
Visualizing Your Data: Incorporating Mongo into Loggly Infrastructure
mongodb
47
9.6k
Making Projects Easy
brettharned
117
6.3k
Adopting Sorbet at Scale
ufuk
77
9.5k
It's Worth the Effort
3n
185
28k
Designing Experiences People Love
moore
142
24k
XXLCSS - How to scale CSS and keep your sanity
sugarenia
248
1.3M
Git: the NoSQL Database
bkeepers
PRO
431
65k
Put a Button on it: Removing Barriers to Going Fast.
kastner
60
3.9k
Building a Modern Day E-commerce SEO Strategy
aleyda
43
7.4k
The World Runs on Bad Software
bkeepers
PRO
70
11k
RailsConf & Balkan Ruby 2019: The Past, Present, and Future of Rails at GitHub
eileencodes
139
34k
Transcript
openthebox.be Extracting deep insights from 'boring' documents: a real-life story
Me Niek Bartholomeus @niekbartho • Background as a software developer
• Switched to data science and natural language processing in 2016 • Founded openthebox.be in 2017
openthebox.be
openthebox.be Open data KBO NBB Belgian Official Gazette http://kbopub.economie.fgov.be/kbopub https://cri.nbb.be/bc9/web/catalog
http://www.ejustice.just.fgov.be/ tsv/tsvn.htm
knowledge graph Visualization Analytics Machine learning Knowledge graph Structured data
Unstructured data KBO NBB Belgian Official Gazette
Unstructured data - pipeline
Unstructured data - pipeline steps 1] OCR 2] NER 4]
Entity linking 3] Relation extraction
Unstructured data - pipeline steps 1] OCR
Unstructured data - pipeline steps 2] NER
Unstructured data - pipeline steps 2] NER Pre-processing rules: [“1.Jan”,
“Janssens”] 1.Jan Janssens [“Marktstraat”, “54,8450”, “Bredene”] Marktstraat 54,8450 Bredene
Unstructured data - pipeline steps 2] NER Post-processing rules: +
= General rules Legal rules Historic probabilities Faulty publication Context Improved publication
Unstructured data - pipeline steps 2] NER Organization Person Inheritance:
Notary Owner Representative Proxy holder Administrator Author : “is a” relationship Base labels Subclass labels
Unstructured data - pipeline steps 2] NER Gentstraat 69 Niek
Roger Camiel Bartholomeus Sub entity extraction: First name: Niek Middle names: Roger, Camiel Last name: Bartholomeus 9170 Sint-Pauwels Street: Gentstraat Number: 69 Zip code: 9170 City: Sint-Pauwels
Unstructured data - pipeline steps 3] Relation extraction
Unstructured data - pipeline steps 4] Entity linking
Unstructured data - pipeline steps Niek Roger Camiel Bartholomeus Niek
Bartholomeus N. Bartholomeus Bartholomeus } Niek Roger Camiel Bartholomeus Deduplication: 4] Entity linking
Unstructured data - pipeline steps Niek Roger Camiel Bartholomeus Link
with knowledge graph: Gentstraat 69 9170 Sint-Pauwels 4] Entity linking
openthebox.be
openthebox.be Bigger picture
openthebox.be http://wpmlabs.com/ Academia Industry https://www.filter-concept.com/ +
openthebox.be https://opensenselabs.com