A Natural Language Pipeline

July 06, 2019

Presentation from the spaCy IRL 2019 conference.



  1. A Natural Language Pipeline

  2. More Input

  3. Knowledge” “A compendium of human...

  4. Library

  5. Physical archives became digital records, encoded with metadata

  6. The internet promised rich dynamic experiences

  7. The internet promised rich dynamic experiences but served us banner

  8. Advertising has fueled, and continues to fuel, a substantial portion of the innovation on the internet
  9. What would The Economist look like if it were founded in 2012?
  10. User

  11. First

  12. Experience

  13. “There’s a reason that tech companies are topping the lists of most valuable companies and brands. Every company is a tech company.” Maggie Chan Jones
  14. Every story, at its core, is a business story

  15. Language

  16. None
  17. None
  18. Proto-Pipeline: Stage -> Stenographer -> Editors -> spaCy -> Data Store <-> Backend <- Slack <- Users
  19. Over eight hours we created data from the content of the event, building the model in real time
  20. The model evolved over time

  21. This was the experiment that would evolve into SiO2

  22. Silicon, a key element in everything from glass to microchips, is at the core of global business
  23. Oxygen, the journalistic voice Quartz breathes into the global business news cycle
  24. Entities are linguistic anchors, defined by context and around which context can be inferred
  25. Standard Entities: PERSON, FACILITY, ORG, PRODUCT, GPE, EVENT... Additional Entities

  26. 70K articles, 1.4M blocks of text, 85K labeled sentences

  27. Entities

  28. This spaCy model made rich analysis of any given text easy to do on the fly
  29. Stored analysis of a large corpus is a vital resource

  30. The language graph...

  31. Graph

  32. The language graph is a mutable map of the language

  33. Any new content is analyzed and then mapped onto the language graph
  34. Changes made to the graph can then be incorporated into the next model iteration
  35. The language graph becomes a primary resource for extracting training

  36. Snapshots of time can be extracted from the language graph

  37. Context can be derived by looking at the relationships in the language graph
  38. Elon Musk

  39. Jeff Bezos

  40. Mark Zuckerberg

  41. Context

  42. SiO2 is a living Natural Language Pipeline of networked algorithms trained on the corpus of Quartz to understand the linguistic patterns of global business news
  43. The Pipeline(s)

    Quartz Corpus -> Training Sentences -> spaCy
    Content -> spaCy -> Language Graph
    Language Graph -> Training Data -> Statistical Models / Classifiers
    Language Graph -> Training Sentences -> spaCy
    Unseen Content -> spaCy -> Pre-Processed Text / Vectors -> Statistical Models / Classifiers
  44. Thank you
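
The slides describe the language graph only at a high level: new content is analyzed, mapped onto a mutable graph, corrected there, and fed back into training, with context read off the relationships. The following is a minimal sketch of that idea, not the SiO2 implementation. It assumes entities have already been extracted per sentence (for example by a spaCy model) and arrive as `(text, label)` pairs. The `LanguageGraph` class, its method names, and the sample sentences are all hypothetical and exist only for illustration.

```python
from collections import defaultdict
from itertools import combinations


class LanguageGraph:
    """Hypothetical co-occurrence graph: nodes are entities,
    edge weights count the sentences two entities share."""

    def __init__(self):
        self.labels = {}  # entity text -> entity label
        self.edges = defaultdict(lambda: defaultdict(int))  # entity -> neighbor -> count

    def add_sentence(self, entities):
        """Map one analyzed sentence, given as (text, label) pairs, onto the graph."""
        for text, label in entities:
            self.labels[text] = label
        names = sorted({text for text, _ in entities})
        for a, b in combinations(names, 2):
            self.edges[a][b] += 1
            self.edges[b][a] += 1

    def relabel(self, text, label):
        """Record a correction; the graph is mutable."""
        self.labels[text] = label

    def context(self, text):
        """Return an entity's neighbors, strongest relationship first."""
        neighbors = self.edges.get(text, {})
        return sorted(neighbors.items(), key=lambda kv: (-kv[1], kv[0]))

    def training_rows(self):
        """Export (entity, label) pairs for a later model iteration."""
        return sorted(self.labels.items())


# Illustrative input only: entities as an NER model might emit them.
graph = LanguageGraph()
graph.add_sentence([("Elon Musk", "PERSON"), ("Tesla", "ORG")])
graph.add_sentence([("Elon Musk", "PERSON"), ("Tesla", "ORG"), ("SpaceX", "ORG")])
graph.add_sentence([("Jeff Bezos", "PERSON"), ("Amazon", "ORG")])

musk_context = graph.context("Elon Musk")
print(musk_context)
```

A plain dictionary of counts is the simplest structure that supports the operations the slides name (mapping, mutation, export, and context lookup). A production system would more likely sit on a graph database and carry timestamps on edges so that snapshots of time can be extracted.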