openthebox.be - smart publications

Extracting deep insights from boring documents: a real-life story

Niek Bartholomeus

October 02, 2019
Transcript

  1. openthebox.be Extracting deep insights from 'boring' documents: a real-life story

  2. Me: Niek Bartholomeus, @niekbartho

    • Background as a software developer
    • Switched to data science and natural language processing in 2016
    • Founded openthebox.be in 2017
  3. openthebox.be

  4. openthebox.be: open data

    • KBO: http://kbopub.economie.fgov.be/kbopub
    • NBB: https://cri.nbb.be/bc9/web/catalog
    • Belgian Official Gazette: http://www.ejustice.just.fgov.be/tsv/tsvn.htm
  5. Knowledge graph (diagram labels, in slide order)

    • Visualization, Analytics, Machine learning
    • Knowledge graph
    • Structured data, Unstructured data
    • KBO, NBB, Belgian Official Gazette
  6. Unstructured data - pipeline

  7. Unstructured data - pipeline steps

    1] OCR
    2] NER
    3] Relation extraction
    4] Entity linking
  8. Unstructured data - pipeline steps 1] OCR

  9. Unstructured data - pipeline steps 2] NER

  10. Unstructured data - pipeline steps 2] NER: pre-processing rules

    • Text: 1.Jan Janssens; tokens: [“1.Jan”, “Janssens”]
    • Text: Marktstraat 54,8450 Bredene; tokens: [“Marktstraat”, “54,8450”, “Bredene”]
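
The two examples on this slide show tokens that stay glued together ("1.Jan", "54,8450"). A minimal sketch of what pre-processing rules for these cases might look like, assuming the rules simply insert the missing whitespace before tokenization. The `RULES` list, `preprocess`, and `tokenize` are hypothetical names for illustration, not the pipeline actually used by openthebox.be.

```python
import re

# Hypothetical rules as (pattern, replacement) pairs; illustrative only.
RULES = [
    # "1.Jan" -> "1. Jan": a list number glued to a capitalised word.
    (re.compile(r"\b(\d+)\.(?=[A-Z])"), r"\1. "),
    # "54,8450" -> "54, 8450": a house number glued to a four-digit zip code.
    (re.compile(r"\b(\d{1,4}),(?=\d{4}\b)"), r"\1, "),
]


def preprocess(text: str) -> str:
    """Apply every rule in order and return the repaired text."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


def tokenize(text: str) -> list:
    """Stand-in whitespace tokenizer (assumption: the real one is smarter)."""
    return text.split()


if __name__ == "__main__":
    for raw in ("1.Jan Janssens", "Marktstraat 54,8450 Bredene"):
        print(tokenize(raw), "->", tokenize(preprocess(raw)))
```

Keeping the rules as data rather than code makes it easy to add one rule per recurring typesetting error without touching the tokenizer itself.
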
  11. Unstructured data - pipeline steps 2] NER: post-processing rules

    Faulty publication + Context = Improved publication
    • General rules
    • Legal rules
    • Historic probabilities
  12. Unstructured data - pipeline steps 2] NER: inheritance

    • Base labels: Organization, Person
    • Subclass labels: Notary, Owner, Representative, Proxy holder, Administrator, Author
    • Legend: “is a” relationship
  13. Unstructured data - pipeline steps 2] NER: sub entity extraction

    • Niek Roger Camiel Bartholomeus: First name: Niek; Middle names: Roger, Camiel; Last name: Bartholomeus
    • Gentstraat 69 9170 Sint-Pauwels: Street: Gentstraat; Number: 69; Zip code: 9170; City: Sint-Pauwels
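
The sub entity extraction on this slide can be sketched with plain string handling. This is a simplified illustration under stated assumptions, not the method used in the talk: names are split on whitespace (so multi-word surnames such as "Van den Berg" would be split incorrectly), and addresses are assumed to follow the Belgian pattern "street number zip city" with a four-digit zip code. `PersonName`, `Address`, `split_person_name`, and `split_address` are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PersonName:
    first_name: str
    middle_names: List[str]
    last_name: str


@dataclass
class Address:
    street: str
    number: str
    zip_code: str
    city: str


def split_person_name(full_name: str) -> PersonName:
    """Naive split: first token, last token, everything in between."""
    parts = full_name.split()
    if len(parts) < 2:
        raise ValueError(f"expected at least a first and last name: {full_name!r}")
    return PersonName(parts[0], parts[1:-1], parts[-1])


# Assumed layout: "<street> <number> <4-digit zip> <city>".
ADDRESS_RE = re.compile(
    r"^(?P<street>.+?)\s+(?P<number>\d+\w*)[\s,]+(?P<zip_code>\d{4})\s+(?P<city>.+)$"
)


def split_address(address: str) -> Optional[Address]:
    """Return the address parts, or None when the layout does not match."""
    match = ADDRESS_RE.match(address.strip())
    if match is None:
        return None
    return Address(**match.groupdict())


if __name__ == "__main__":
    print(split_person_name("Niek Roger Camiel Bartholomeus"))
    print(split_address("Gentstraat 69 9170 Sint-Pauwels"))
```
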
  14. Unstructured data - pipeline steps 3] Relation extraction

  15. Unstructured data - pipeline steps 4] Entity linking

  16. Unstructured data - pipeline steps 4] Entity linking: deduplication

    • Mentions: Niek Roger Camiel Bartholomeus, Niek Bartholomeus, N. Bartholomeus, Bartholomeus
    • Resolved to: Niek Roger Camiel Bartholomeus
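
The deduplication example on this slide collapses four mentions into the most complete one. A minimal sketch of one way to do that, assuming all mentions come from the same publication and that a shorter mention is a variant when its last name matches and its given names (or initials) appear in order in the longer one. The function names are hypothetical, and a bare surname is ambiguous when two people share it, which this sketch ignores.

```python
from typing import Dict, List


def _matches(short: str, full: str) -> bool:
    """True if `short` equals `full` or is its initial, e.g. "N." vs "Niek"."""
    if short.endswith("."):
        return full.lower().startswith(short[:-1].lower())
    return short.lower() == full.lower()


def is_variant(mention: str, canonical: str) -> bool:
    """Check whether `mention` is a shortened form of `canonical`."""
    m, c = mention.split(), canonical.split()
    if not m or not c or m[-1].lower() != c[-1].lower():
        return False
    # Given names of the mention must appear, in order, among the
    # given names of the canonical form (subsequence check).
    remaining = iter(c[:-1])
    return all(any(_matches(tok, cand) for cand in remaining) for tok in m[:-1])


def deduplicate(mentions: List[str]) -> Dict[str, str]:
    """Map every mention to the most complete compatible mention."""
    by_length = sorted(mentions, key=lambda s: len(s.split()), reverse=True)
    canonicals: List[str] = []
    result: Dict[str, str] = {}
    for mention in by_length:
        target = next((c for c in canonicals if is_variant(mention, c)), None)
        if target is None:
            canonicals.append(mention)
            target = mention
        result[mention] = target
    return result


if __name__ == "__main__":
    found = ["Niek Roger Camiel Bartholomeus", "Niek Bartholomeus",
             "N. Bartholomeus", "Bartholomeus"]
    for mention, canonical in deduplicate(found).items():
        print(f"{mention!r} -> {canonical!r}")
```
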
  17. Unstructured data - pipeline steps 4] Entity linking: link with knowledge graph

    Niek Roger Camiel Bartholomeus, Gentstraat 69 9170 Sint-Pauwels
  18. openthebox.be

  19. openthebox.be Bigger picture

  20. openthebox.be Academia + Industry http://wpmlabs.com/ https://www.filter-concept.com/

  21. openthebox.be https://opensenselabs.com