Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
openthebox.be - smart publications
Search
Niek Bartholomeus
October 02, 2019
Technology
0
170
openthebox.be - smart publications
Extracting deep insights from boring documents: a real-life story
Niek Bartholomeus
October 02, 2019
Tweet
Share
More Decks by Niek Bartholomeus
See All by Niek Bartholomeus
openthebox.be
niekbartho
1
2.5k
From idea to production with NLP, Scala and Spark
niekbartho
3
470
Going DevOps with BMC
niekbartho
0
190
Orchestration in meatspace
niekbartho
4
2k
Self-organization vs. global optimization - a comparison between traditional and modern organizations
niekbartho
2
460
DevOps for Dinosaurs
niekbartho
12
3k
Other Decks in Technology
See All in Technology
[ JAWS-UG 東京 CommunityBuilders Night #2 ]SlackとAmazon Q Developerで 運用効率化を模索する
sh_fk2
3
400
品質視点から考える組織デザイン/Organizational Design from Quality
mii3king
0
200
なぜテストマネージャの視点が 必要なのか? 〜 一歩先へ進むために 〜
moritamasami
0
220
roppongirb_20250911
igaiga
1
220
dbt開発 with Claude Codeのためのガードレール設計
10xinc
2
1.2k
Firestore → Spanner 移行 を成功させた段階的移行プロセス
athug
1
470
Webブラウザ向け動画配信プレイヤーの 大規模リプレイスから得た知見と学び
yud0uhu
0
230
生成AIでセキュリティ運用を効率化する話
sakaitakeshi
0
670
企業の生成AIガバナンスにおけるエージェントとセキュリティ
lycorptech_jp
PRO
2
160
DDD集約とサービスコンテキスト境界との関係性
pandayumi
3
280
KotlinConf 2025_イベントレポート
sony
1
130
Autonomous Database - Dedicated 技術詳細 / adb-d_technical_detail_jp
oracle4engineer
PRO
4
10k
Featured
See All Featured
For a Future-Friendly Web
brad_frost
180
9.9k
Balancing Empowerment & Direction
lara
3
620
A designer walks into a library…
pauljervisheath
207
24k
VelocityConf: Rendering Performance Case Studies
addyosmani
332
24k
Product Roadmaps are Hard
iamctodd
PRO
54
11k
CoffeeScript is Beautiful & I Never Want to Write Plain JavaScript Again
sstephenson
162
15k
How to train your dragon (web standard)
notwaldorf
96
6.2k
A better future with KSS
kneath
239
17k
Reflections from 52 weeks, 52 projects
jeffersonlam
352
21k
RailsConf & Balkan Ruby 2019: The Past, Present, and Future of Rails at GitHub
eileencodes
139
34k
The Pragmatic Product Professional
lauravandoore
36
6.9k
Building a Modern Day E-commerce SEO Strategy
aleyda
43
7.6k
Transcript
openthebox.be Extracting deep insights from 'boring' documents: a real-life story
Me Niek Bartholomeus @niekbartho • Background as a software developer
• Switched to data science and natural language processing in 2016 • Founded openthebox.be in 2017
openthebox.be
openthebox.be Open data KBO NBB Belgian Official Gazette http://kbopub.economie.fgov.be/kbopub https://cri.nbb.be/bc9/web/catalog
http://www.ejustice.just.fgov.be/ tsv/tsvn.htm
knowledge graph Visualization Analytics Machine learning Knowledge graph Structured data
Unstructured data KBO NBB Belgian Official Gazette
Unstructured data - pipeline
Unstructured data - pipeline steps 1] OCR 2] NER 4]
Entity linking 3] Relation extraction
Unstructured data - pipeline steps 1] OCR
Unstructured data - pipeline steps 2] NER
Unstructured data - pipeline steps 2] NER Pre-processing rules: [“1.Jan”,
“Janssens”] 1.Jan Janssens [“Marktstraat”, “54,8450”, “Bredene”] Marktstraat 54,8450 Bredene
Unstructured data - pipeline steps 2] NER Post-processing rules: +
= General rules Legal rules Historic probabilities Faulty publication Context Improved publication
Unstructured data - pipeline steps 2] NER Organization Person Inheritance:
Notary Owner Representative Proxy holder Administrator Author : “is a” relationship Base labels Subclass labels
Unstructured data - pipeline steps 2] NER Gentstraat 69 Niek
Roger Camiel Bartholomeus Sub entity extraction: First name: Niek Middle names: Roger, Camiel Last name: Bartholomeus 9170 Sint-Pauwels Street: Gentstraat Number: 69 Zip code: 9170 City: Sint-Pauwels
Unstructured data - pipeline steps 3] Relation extraction
Unstructured data - pipeline steps 4] Entity linking
Unstructured data - pipeline steps Niek Roger Camiel Bartholomeus Niek
Bartholomeus N. Bartholomeus Bartholomeus } Niek Roger Camiel Bartholomeus Deduplication: 4] Entity linking
Unstructured data - pipeline steps Niek Roger Camiel Bartholomeus Link
with knowledge graph: Gentstraat 69 9170 Sint-Pauwels 4] Entity linking
openthebox.be
openthebox.be Bigger picture
openthebox.be http://wpmlabs.com/ Academia Industry https://www.filter-concept.com/ +
openthebox.be https://opensenselabs.com