Slide 1

Slide 1 text

openthebox.be Extracting deep insights from 'boring' documents: a real-life story

Slide 2

Slide 2 text

Me
Niek Bartholomeus (@niekbartho)
• Background as a software developer
• Switched to data science and natural language processing in 2016
• Founded openthebox.be in 2017

Slide 3

Slide 3 text

openthebox.be

Slide 4

Slide 4 text

openthebox.be
Open data:
• KBO: http://kbopub.economie.fgov.be/kbopub
• NBB: https://cri.nbb.be/bc9/web/catalog
• Belgian Official Gazette: http://www.ejustice.just.fgov.be/tsv/tsvn.htm

Slide 5

Slide 5 text

Knowledge graph
• Sources: KBO, NBB, Belgian Official Gazette (structured data and unstructured data)
• Built on the knowledge graph: visualization, analytics, machine learning

Slide 6

Slide 6 text

Unstructured data - pipeline

Slide 7

Slide 7 text

Unstructured data - pipeline steps
1] OCR
2] NER
3] Relation extraction
4] Entity linking

Slide 8

Slide 8 text

Unstructured data - pipeline steps 1] OCR

Slide 9

Slide 9 text

Unstructured data - pipeline steps 2] NER

Slide 10

Slide 10 text

Unstructured data - pipeline steps 2] NER
Pre-processing rules:
• Text: 1.Jan Janssens
  Tokens: [“1.Jan”, “Janssens”]
• Text: Marktstraat 54,8450 Bredene
  Tokens: [“Marktstraat”, “54,8450”, “Bredene”]
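
The two examples on this slide show tokens that stay glued together because the source text lacks a space ("1.Jan", "54,8450"). A minimal sketch of what such pre-processing rules could look like, assuming regular expressions are applied before a plain whitespace tokenizer. The function names and both rules are hypothetical illustrations, not the rules used by openthebox.be.

```python
import re

def preprocess(text):
    # Hypothetical rule 1: separate an enumeration marker such as "1."
    # from the word that directly follows it.
    text = re.sub(r"\b(\d+)\.(?=[A-Za-z])", r"\1. ", text)
    # Hypothetical rule 2: separate "house number,zip code"
    # (assumes a 4-digit zip code, as in Belgium).
    text = re.sub(r"\b(\d+),(?=\d{4}\b)", r"\1 , ", text)
    return text

def tokenize(text):
    # Plain whitespace tokenizer, applied after the pre-processing rules.
    return preprocess(text).split()

print(tokenize("1.Jan Janssens"))
print(tokenize("Marktstraat 54,8450 Bredene"))
```

With these rules the name and the zip code end up as separate tokens, which gives the NER step cleaner input.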

Slide 11

Slide 11 text

Unstructured data - pipeline steps 2] NER
Post-processing rules:
• General rules
• Legal rules
• Historic probabilities
Faulty publication + Context = Improved publication

Slide 12

Slide 12 text

Unstructured data - pipeline steps 2] NER
Inheritance (“is a” relationship):
• Base labels: Organization, Person
• Subclass labels: Notary, Owner, Representative, Proxy holder, Administrator, Author
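
The "is a" relationship between subclass labels and base labels can be sketched as a parent lookup. The slide does not say which base label each subclass label inherits from, so the mapping below is an assumption made purely for illustration, and `PARENT` and `is_a` are hypothetical names.

```python
# Hypothetical mapping from subclass label to base label.
# The slide lists the labels but not which base each subclass belongs to.
PARENT = {
    "Notary": "Person",
    "Representative": "Person",
    "Proxy holder": "Person",
    "Author": "Person",
    "Owner": "Organization",
    "Administrator": "Organization",
}

def is_a(label, base):
    # Walk up the hierarchy until the base label is found or the chain ends.
    while label is not None:
        if label == base:
            return True
        label = PARENT.get(label)
    return False

print(is_a("Notary", "Person"))
print(is_a("Notary", "Organization"))
```

A lookup like this lets a rule written for a base label (for example Person) also apply to every entity tagged with one of its subclass labels.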

Slide 13

Slide 13 text

Unstructured data - pipeline steps 2] NER
Sub entity extraction:
Niek Roger Camiel Bartholomeus
• First name: Niek
• Middle names: Roger, Camiel
• Last name: Bartholomeus
Gentstraat 69 9170 Sint-Pauwels
• Street: Gentstraat
• Number: 69
• Zip code: 9170
• City: Sint-Pauwels
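
A minimal sketch of sub entity extraction for the two examples on this slide, assuming a simple positional split for names and a regular expression for addresses. The function names and the pattern are hypothetical. The name split is deliberately naive: it treats the last token as the last name, so multi-word last names would need extra rules.

```python
import re

def split_person_name(full_name):
    # Naive positional split: first token, middle tokens, last token.
    parts = full_name.split()
    return {
        "first_name": parts[0],
        "middle_names": parts[1:-1],
        "last_name": parts[-1],
    }

# Hypothetical pattern: street, house number, 4-digit zip code, city.
ADDRESS_RE = re.compile(
    r"^(?P<street>.+?)\s+(?P<number>\d+\w*)\s+(?P<zip_code>\d{4})\s+(?P<city>.+)$"
)

def split_address(address):
    match = ADDRESS_RE.match(address)
    return match.groupdict() if match else None

print(split_person_name("Niek Roger Camiel Bartholomeus"))
print(split_address("Gentstraat 69 9170 Sint-Pauwels"))
```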

Slide 14

Slide 14 text

Unstructured data - pipeline steps 3] Relation extraction

Slide 15

Slide 15 text

Unstructured data - pipeline steps 4] Entity linking

Slide 16

Slide 16 text

Unstructured data - pipeline steps 4] Entity linking
Deduplication:
• Niek Roger Camiel Bartholomeus
• Niek Bartholomeus
• N. Bartholomeus
• Bartholomeus
All four mentions resolve to: Niek Roger Camiel Bartholomeus
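
A minimal sketch of deduplicating the name variants on this slide, assuming that the mention with the most tokens is taken as the canonical form and that shorter mentions are compatible when the last name matches and every given name (or initial) appears in order in the canonical form. This heuristic and the function names are hypothetical, not the method used by openthebox.be.

```python
def given_name_matches(token, name):
    # "n." matches any given name starting with "n"; otherwise require equality.
    if token.endswith("."):
        return name.startswith(token[:-1])
    return token == name

def compatible(mention, canonical):
    m, c = mention.lower().split(), canonical.lower().split()
    if m[-1] != c[-1]:  # last names must be identical
        return False
    given, pos = c[:-1], 0
    for token in m[:-1]:
        # Advance through the canonical given names until one matches.
        while pos < len(given) and not given_name_matches(token, given[pos]):
            pos += 1
        if pos == len(given):
            return False
        pos += 1
    return True

def deduplicate(mentions):
    # Assumption: the longest mention (by token count) is the canonical form.
    canonical = max(mentions, key=lambda m: len(m.split()))
    return {m: canonical if compatible(m, canonical) else m for m in mentions}

mentions = [
    "Niek Roger Camiel Bartholomeus",
    "Niek Bartholomeus",
    "N. Bartholomeus",
    "Bartholomeus",
]
print(deduplicate(mentions))
```

Mentions that do not fit the canonical form (a different last name, for instance) are left untouched rather than merged.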

Slide 17

Slide 17 text

Unstructured data - pipeline steps 4] Entity linking
Link with knowledge graph:
Niek Roger Camiel Bartholomeus
Gentstraat 69 9170 Sint-Pauwels
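
A minimal sketch of linking an extracted person to a knowledge graph node, assuming the graph is a list of records and that a link is made only when exactly one node has the same normalized name and address. The record layout, the node ids, and the function names are hypothetical.

```python
def normalize(text):
    # Lowercase and collapse whitespace so trivial differences do not block a match.
    return " ".join(text.lower().split())

def link(entity, graph_nodes):
    key = (normalize(entity["name"]), normalize(entity["address"]))
    candidates = [
        node["id"]
        for node in graph_nodes
        if (normalize(node["name"]), normalize(node["address"])) == key
    ]
    # Link only when the match is unambiguous; otherwise leave the entity unlinked.
    return candidates[0] if len(candidates) == 1 else None

# Toy graph with made-up node ids, for illustration only.
graph = [
    {"id": "person-1", "name": "Niek Roger Camiel Bartholomeus",
     "address": "Gentstraat 69 9170 Sint-Pauwels"},
    {"id": "person-2", "name": "Jan Janssens",
     "address": "Marktstraat 54 8450 Bredene"},
]
entity = {"name": "Niek Roger Camiel  Bartholomeus",
          "address": "gentstraat 69 9170 Sint-Pauwels"}
print(link(entity, graph))
```

Returning no link for ambiguous or missing matches keeps wrong edges out of the graph at the cost of some unlinked entities.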

Slide 18

Slide 18 text

openthebox.be

Slide 19

Slide 19 text

openthebox.be Bigger picture

Slide 20

Slide 20 text

openthebox.be
Academia + Industry
http://wpmlabs.com/
https://www.filter-concept.com/

Slide 21

Slide 21 text

openthebox.be https://opensenselabs.com