openthebox.be - smart publications

Extracting deep insights from boring documents: a real-life story

Niek Bartholomeus

October 02, 2019

Transcript

  1. Me: Niek Bartholomeus (@niekbartho)
    • Background as a software developer
    • Switched to data science and natural language processing in 2016
    • Founded openthebox.be in 2017
  2. Unstructured data - pipeline steps
    1] OCR
    2] NER
    3] Relation extraction
    4] Entity linking
  3. Unstructured data - pipeline steps: 2] NER
    Pre-processing rules:
    • 1.Jan Janssens: [“1.Jan”, “Janssens”]
    • Marktstraat 54,8450 Bredene: [“Marktstraat”, “54,8450”, “Bredene”]
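
The two examples on this slide show text where a missing space leaves unrelated pieces glued into one token ("1.Jan", "54,8450"). The slide does not spell out the actual rules, so the following is only a minimal sketch of what such pre-processing could look like. The rule list, the `preprocess` and `tokenize` names, and the whitespace tokenizer are all assumptions made for illustration.

```python
import re

# Hypothetical rules, not taken from the talk. Each one inserts the missing
# whitespace so that a plain whitespace tokenizer produces cleaner tokens.
RULES = [
    # "1.Jan" -> "1. Jan": a list number glued to the word that follows it
    (re.compile(r"^(\d+\.)(?=[^\W\d_])"), r"\1 "),
    # "54,8450" -> "54 , 8450": a house number glued to a four-digit zip code
    (re.compile(r"(?<=\d),(?=\d{4}\b)"), " , "),
]


def preprocess(text: str) -> str:
    """Apply every rule in order and return the repaired text."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


def tokenize(text: str) -> list[str]:
    """Whitespace tokenizer applied after pre-processing (an assumption)."""
    return preprocess(text).split()


print(tokenize("1.Jan Janssens"))
print(tokenize("Marktstraat 54,8450 Bredene"))
```

Keeping the rules as plain (pattern, replacement) pairs makes it cheap to add one whenever a new kind of glued token turns up in the OCR output.
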
  4. Unstructured data - pipeline steps: 2] NER
    Post-processing rules:
    Faulty publication + Context = Improved publication
    Context:
    • General rules
    • Legal rules
    • Historic probabilities
  5. Unstructured data - pipeline steps: 2] NER
    Inheritance (“is a” relationship):
    • Base labels: Organization, Person
    • Subclass labels: Notary, Owner, Representative, Proxy holder, Administrator, Author
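
One way to picture the "is a" relationship between subclass labels and base labels is a parent lookup. The slide lists the labels but does not say which subclass label belongs to which base label, so the parent assignments below are placeholders chosen for illustration, and `PARENT` and `is_a` are hypothetical names.

```python
from typing import Optional

# Labels as listed on the slide. The parent of each subclass label is NOT
# given there: every assignment below is an assumption for illustration.
PARENT = {
    "Organization": None,            # base label
    "Person": None,                  # base label
    "Notary": "Person",              # assumed parent
    "Owner": "Person",               # assumed parent
    "Representative": "Person",      # assumed parent
    "Proxy holder": "Person",        # assumed parent
    "Administrator": "Person",       # assumed parent
    "Author": "Person",              # assumed parent
}


def is_a(label: str, base: str) -> bool:
    """Return True if `label` equals `base` or inherits from it."""
    current: Optional[str] = label
    while current is not None:
        if current == base:
            return True
        current = PARENT.get(current)
    return False


print(is_a("Notary", "Person"))
print(is_a("Notary", "Organization"))
```

With a lookup like this, a rule written for a base label automatically applies to every entity tagged with one of its subclass labels.
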
  6. Unstructured data - pipeline steps: 2] NER
    Sub entity extraction:
    Niek Roger Camiel Bartholomeus
    • First name: Niek
    • Middle names: Roger, Camiel
    • Last name: Bartholomeus
    Gentstraat 69 9170 Sint-Pauwels
    • Street: Gentstraat
    • Number: 69
    • Zip code: 9170
    • City: Sint-Pauwels
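
The split shown on this slide can be sketched with a few lines of string handling. This is not the approach from the talk, only a minimal illustration under strong assumptions: names are ordered "first, middle names, last" with a single-word last name, and addresses follow the "street number zip city" layout with a four-digit zip code. `split_name` and `split_address` are hypothetical helper names.

```python
import re


def split_name(full_name: str) -> dict:
    """Split a name, assuming a single-word last name at the end."""
    parts = full_name.split()
    return {
        "first_name": parts[0],
        "middle_names": parts[1:-1],
        "last_name": parts[-1],
    }


# Assumed layout: "<street> <number> <4-digit zip code> <city>"
ADDRESS = re.compile(
    r"^(?P<street>.+?)\s+(?P<number>\d+\w*)\s+(?P<zip_code>\d{4})\s+(?P<city>.+)$"
)


def split_address(address: str) -> dict:
    """Split an address into its parts, or raise if the layout differs."""
    match = ADDRESS.match(address)
    if match is None:
        raise ValueError(f"unrecognized address layout: {address!r}")
    return match.groupdict()


print(split_name("Niek Roger Camiel Bartholomeus"))
print(split_address("Gentstraat 69 9170 Sint-Pauwels"))
```

Multi-word last names and addresses with box numbers would need additional rules; the sketch deliberately leaves them out.
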
  7. Unstructured data - pipeline steps: 4] Entity linking
    Deduplication:
    • Niek Roger Camiel Bartholomeus
    • Niek Bartholomeus
    • N. Bartholomeus
    • Bartholomeus
    → Niek Roger Camiel Bartholomeus
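
The deduplication step maps several spellings of a name to one full form. The matching logic is not described on the slide, so the following is a sketch of one possible heuristic: two mentions are treated as compatible when their last names are equal and the given names of the shorter mention appear, in order, among those of the longer one, with an initial such as "N." matching any name starting with that letter. The function names are hypothetical.

```python
def _same_given_name(a: str, b: str) -> bool:
    a, b = a.rstrip(".").lower(), b.rstrip(".").lower()
    # An initial such as "N." matches any name starting with that letter.
    if len(a) == 1 or len(b) == 1:
        return a[0] == b[0]
    return a == b


def compatible(mention: str, other: str) -> bool:
    """Heuristic check that two mentions could name the same person."""
    *given_a, last_a = mention.split()
    *given_b, last_b = other.split()
    if last_a.lower() != last_b.lower():
        return False
    short, long = sorted((given_a, given_b), key=len)
    # Subsequence test: each short name must match a later long name.
    remaining = iter(long)
    return all(any(_same_given_name(s, l) for l in remaining) for s in short)


def canonical(mentions: list[str]) -> str:
    """Return the longest mention if all others are compatible with it."""
    longest = max(mentions, key=len)
    if not all(compatible(m, longest) for m in mentions):
        raise ValueError("mentions do not all refer to the same name")
    return longest


mentions = [
    "Niek Roger Camiel Bartholomeus",
    "Niek Bartholomeus",
    "N. Bartholomeus",
    "Bartholomeus",
]
print(canonical(mentions))
```

A bare last name is compatible with every person who shares it, so in practice a heuristic like this would need extra context, such as appearing in the same publication, before merging mentions.
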
  8. Unstructured data - pipeline steps: 4] Entity linking
    Link with knowledge graph:
    • Niek Roger Camiel Bartholomeus
    • Gentstraat 69 9170 Sint-Pauwels