Designing Practical NLP Solutions

Ines Montani

June 18, 2020

Transcript

  1. Designing practical NLP solutions
    Ines Montani
    Explosion

  2. Early 2015
    spaCy is first released
    • open-source library for industrial-
    strength Natural Language
    Processing
    • focused on production use

  3. Early 2015
    spaCy is first released
    • open-source library for industrial-
    strength Natural Language
    Processing
    • focused on production use
    Current stats
    17m+ total downloads
    16k+ stars on GitHub
    400+ contributors
    80+ extension packages

  4. Late 2016
    Explosion
    • new company for AI developer tools
    • bootstrapped through consulting
    for the first 6 months
    • funded through software sales
    since 2017
    • remote team, centered in Berlin

  5. Late 2016
    Explosion
    • new company for AI developer tools
    • bootstrapped through consulting
    for the first 6 months
    • funded through software sales
    since 2017
    • remote team, centered in Berlin
    Current stats
    8 team members
    100% independent & profitable

  6. Late 2017
    Prodigy
    • first commercial product
    • modern annotation tool
    • fully scriptable in Python

  7. Late 2017
    Prodigy
    • first commercial product
    • modern annotation tool
    • fully scriptable in Python
    Current stats
    4000+ users, including
    500+ companies
    1600+ forum members

  8. Coming soon
    • spaCy v2.3: Models for Chinese,
    Japanese and many more
    • spaCy v3.0: Transformer-based
    pipelines, custom models using
    any library, new training workflow
    • Prodigy v1.10: Dependencies &
    relation annotation, audio & video
    annotation & lots of new features
    • Prodigy Teams: Manage large
    annotation projects in your cloud

  9. NLP projects are
    like start-ups:
    they fail a lot

  10. How to maximize your
    project’s risk of failure

  11. How to maximize your
    project’s risk of failure
    1. Imagineer. Decide what your
    application ought to do.
    Be ambitious! Nobody
    changed the world saying
    “uh, will that work?”
    2. Forecast.
    3. Outsource.
    4. Wire.
    5. Ship.

  12. How to maximize your
    project’s risk of failure
    1. Imagineer.
    2. Forecast. Figure out
    what accuracy
    you’ll need. If you’re
    not sure here, just
    say 90%.
    3. Outsource.
    4. Wire.
    5. Ship.

  13. How to maximize your
    project’s risk of failure
    1. Imagineer.
    2. Forecast.
    3. Outsource. Pay
    someone else to
    gather your data.
    Think carefully about your
    accuracy requirements,
    and then ask for 10k rows.
    4. Wire.
    5. Ship.

  14. How to maximize your
    project’s risk of failure
    1. Imagineer.
    2. Forecast.
    3. Outsource.
    4. Wire. Implement your
    network. This is the
    fun part! Tensor all your
    flows, descend every
    gradient!
    5. Ship.

  15. How to maximize your
    project’s risk of failure
    1. Imagineer.
    2. Forecast.
    3. Outsource.
    4. Wire.
    5. Ship. Put it all
    together. If it doesn’t
    work, maybe blame
    the intern?

  16. Failure
    sucks

  17. A difficult chicken-and-egg problem
    • product vision
    • annotation scheme
    • labelled data
    • training & evaluation
    • accuracy estimate

  18. You need to
    iterate on your code
    and your data

  19. Requirements #1
    We’re building a crime database based on news
    reports. We want to label the following:
    • victim name
    • perpetrator name
    • crime location
    • offence date
    • arrest date

  23. Requirements #2
    We’re adding data from financial news about
    company sales to our internal database, so we can
    connect it to our analytics.
    We need to extract:
    • buyer (official company name) and stock ticker
    • acquired company with stock ticker
    • sale price and currency

  24. pytorch predict company acquisitions with prices and stock tickers
    No results.

  25. “Microsoft acquires software development
    platform GitHub for $7.5 billion”

  27. TEXT CLASSIFIER
    “Microsoft acquires software development
    platform GitHub for $7.5 billion”

  28. TEXT CLASSIFIER
    ENTITY RECOGNIZER
    “Microsoft acquires software development
    platform GitHub for $7.5 billion”

  29. TEXT CLASSIFIER
    ENTITY RECOGNIZER
    ENTITY LINKER
    “Microsoft acquires software development
    platform GitHub for $7.5 billion”

  30. TEXT CLASSIFIER
    ENTITY RECOGNIZER
    ENTITY LINKER
    ATTRIBUTE LOOKUP
    “Microsoft acquires software development
    platform GitHub for $7.5 billion”

  31. TEXT CLASSIFIER
    ENTITY RECOGNIZER
    ENTITY LINKER
    ATTRIBUTE LOOKUP
    CURRENCY NORMALIZER
    “Microsoft acquires software development
    platform GitHub for $7.5 billion”
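
The five components named on this slide can be chained into one small pipeline. What follows is only a minimal sketch, not code from the talk: every function name, the in-memory `KNOWLEDGE_BASE`, and the regex and keyword rules are hypothetical stand-ins for components that would normally be trained models or real database lookups. It also assumes the buyer is mentioned before the acquired company and that a bare "$" means US dollars.

```python
import re
from typing import Optional

# Hypothetical in-memory knowledge base. A real system would query a proper
# database; the keys and fields here are made up for illustration.
KNOWLEDGE_BASE = {
    "microsoft": {"name": "Microsoft Corporation", "ticker": "MSFT"},
    "github": {"name": "GitHub, Inc.", "ticker": None},  # not publicly traded
}
CURRENCIES = {"$": "USD", "€": "EUR", "£": "GBP"}  # assumption: "$" is USD
SCALES = {"thousand": 1e3, "million": 1e6, "billion": 1e9}
MONEY = r"([$€£])\s?(\d+(?:\.\d+)?)(?:\s(thousand|million|billion))?"


def is_acquisition(text: str) -> bool:
    """Text classifier stand-in: a keyword rule instead of a trained model."""
    return re.search(r"\b(acquires|acquired|buys)\b", text, re.IGNORECASE) is not None


def find_entities(text: str) -> list:
    """Entity recognizer stand-in: capitalized words are ORG candidates."""
    entities = [("ORG", m.group()) for m in re.finditer(r"\b[A-Z][A-Za-z]+\b", text)]
    money = re.search(MONEY, text)
    if money:
        entities.append(("MONEY", money.group()))
    return entities


def link_entity(mention: str) -> Optional[str]:
    """Entity linker stand-in: resolve a mention to a knowledge base ID."""
    key = mention.lower()
    return key if key in KNOWLEDGE_BASE else None


def lookup_attributes(kb_id: str) -> dict:
    """Attribute lookup: fetch the official name and ticker for a linked entity."""
    return KNOWLEDGE_BASE[kb_id]


def normalize_money(mention: str) -> tuple:
    """Currency normalizer: turn "$7.5 billion" into (7500000000.0, "USD")."""
    symbol, number, scale = re.fullmatch(MONEY, mention).groups()
    return float(number) * SCALES.get(scale, 1), CURRENCIES[symbol]


def extract_acquisition(text: str) -> Optional[dict]:
    """Run the five steps in order; return None as soon as one of them fails."""
    if not is_acquisition(text):
        return None
    entities = find_entities(text)
    orgs = [span for label, span in entities if label == "ORG"]
    money = [span for label, span in entities if label == "MONEY"]
    if len(orgs) < 2 or not money:
        return None
    # Assumption: the first organization is the buyer, the second the target.
    buyer_id, target_id = link_entity(orgs[0]), link_entity(orgs[1])
    if buyer_id is None or target_id is None:
        return None
    buyer, target = lookup_attributes(buyer_id), lookup_attributes(target_id)
    price, currency = normalize_money(money[0])
    return {
        "buyer": buyer["name"],
        "buyer_ticker": buyer["ticker"],
        "acquired": target["name"],
        "acquired_ticker": target["ticker"],
        "price": price,
        "currency": currency,
    }
```

Calling `extract_acquisition` on the example headline from the slide produces a record with the buyer's official name and ticker, the acquired company, and a numeric price with a currency code. Each step can be replaced or evaluated on its own, for instance swapping the keyword rule for a trained text classifier, without touching the rest.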

  32. Reality is not
    an end-to-end
    prediction problem

  33. #1 The great thing about practical NLP:
    you can choose to make the problem
    simpler and the solution cheaper.

  34. #2 The most interesting problems are
    very specific and also need specific
    solutions. That’s what makes them
    valuable.

  35. #3 Transfer learning means we don’t
    always need “big data” anymore.
    But we need some.

  36. Thank you!
    Explosion
    explosion.ai
    Twitter
    @_inesmontani
