A Natural Language Pipeline

ddqz
July 06, 2019

Presentation from the spaCy IRL 2019 conference.

Transcript

  1. A Natural Language Pipeline

  2. More Input

  3. “A compendium of human... Knowledge”

  4. Library

  5. Physical archives became digital records,
    encoded with metadata

  6. The internet promised rich dynamic experiences

  7. The internet promised rich dynamic experiences
    but served us banner ads

  8. Advertising has fueled, and continues to fuel, a substantial
    portion of the innovation on the internet

  9. What would The Economist look like if it were
    founded in 2012?

  10. User

  11. First

  12. Experience

  13. “There’s a reason that tech companies are topping the
    lists of most valuable companies and brands. Every
    company is a tech company.”
    Maggie Chan Jones

  14. Every story, at its core, is a business story

  15. Language

  18. Proto-Pipeline
    Stage -> Stenographer -> Editors -> spaCy -> Data Store <-> Backend <- Slack <- Users

  19. Over eight hours we created data from the content
    of the event, building the model in real time

  20. The model evolved over time

  21. This was the experiment that would evolve into SiO₂

  22. Silicon, a key element in everything from glass to
    microchips, is at the core of global business

  23. Oxygen, the journalistic voice Quartz breathes into
    the global business news cycle

  24. Entities are linguistic anchors, defined by context and
    around which context can be inferred

  25. Standard Entities: PERSON, FACILITY, ORG, PRODUCT, GPE, EVENT...
    Additional Entities: TECHNOLOGY, PROCESS, NATURE, MEDIA, CONSTRUCT

  26. 70K articles
    1.4M blocks of text
    85K labeled sentences

  27. Entities

  28. This spaCy model made rich analysis for any given
    text easy to do on the fly

  29. Stored analysis of a large corpus is a vital resource

  30. The language graph...

  31. Graph

  32. The language graph is a mutable map of the language
    model

  33. Any new content is analyzed and then mapped onto
    the language graph

  34. Changes made to the graph can then be incorporated
    into the next model iteration

  35. The language graph becomes a primary resource for
    extracting training data

  36. Snapshots of time can be extracted from the
    language graph

  37. Context can be derived by looking at the relationships
    in the language graph

  38. Elon Musk

  39. Jeff Bezos

  40. Mark Zuckerberg

  41. Context

  42. SiO₂ is a living Natural Language Pipeline of networked
    algorithms trained on the corpus of Quartz to understand
    the linguistic patterns of global business news

  43. The Pipeline(s)
    Quartz Corpus -> Training Sentences -> spaCy
    Content -> spaCy -> Language Graph
    Language Graph -> Training Data -> Statistical Models / Classifiers
    Language Graph -> Training Sentences -> spaCy
    Unseen Content -> spaCy -> Pre-Processed Text / Vectors -> Statistical Models / Classifiers

  44. Thank you

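
As a rough illustration of slides 33 and 37, the sketch below maps analyzed content onto a graph and then reads context back from its relationships. It is a minimal sketch under stated assumptions, not the pipeline described in the deck: the `LanguageGraph` class, its methods, the document ids, and the sample entities are all hypothetical, and the graph is assumed to be a plain entity co-occurrence graph. The entity tuples are hard-coded so the block runs on the standard library alone; with spaCy they could be produced by `[(ent.text, ent.label_) for ent in nlp(text).ents]`.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class LanguageGraph:
    """Hypothetical co-occurrence graph; nodes are (text, label) entities."""

    # node -> set of document ids in which the entity was seen
    mentions: dict = field(default_factory=lambda: defaultdict(set))
    # frozenset({node_a, node_b}) -> number of documents containing both
    edges: dict = field(default_factory=lambda: defaultdict(int))

    def add_document(self, doc_id, entities):
        """Map one analyzed document onto the graph."""
        nodes = sorted(set(entities))  # de-duplicate within the document
        for node in nodes:
            self.mentions[node].add(doc_id)
        for a, b in combinations(nodes, 2):
            self.edges[frozenset((a, b))] += 1

    def neighbors(self, node):
        """Return co-occurring entities, strongest relationship first."""
        found = []
        for pair, weight in self.edges.items():
            if node in pair:
                (other,) = pair - {node}
                found.append((other, weight))
        return sorted(found, key=lambda item: (-item[1], item[0]))


# Made-up sample data standing in for the output of an entity recognizer.
graph = LanguageGraph()
graph.add_document("a1", [("Elon Musk", "PERSON"), ("Tesla", "ORG"),
                          ("battery", "TECHNOLOGY")])
graph.add_document("a2", [("Elon Musk", "PERSON"), ("Tesla", "ORG")])
graph.add_document("a3", [("Jeff Bezos", "PERSON"), ("Amazon", "ORG")])

# Context for one entity, derived from its relationships in the graph.
context = graph.neighbors(("Elon Musk", "PERSON"))
print(context)
```

Because each document only adds mentions and increments edge weights, new content can be mapped onto the graph at any time. Keeping document ids on each node is one way a time-bounded snapshot could later be filtered out, assuming the ids can be tied to publication dates.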