Designing Practical NLP Solutions

Ines Montani

June 18, 2020

Transcript

  1. Designing practical NLP solutions • Ines Montani, Explosion

  2. Early 2015: spaCy is first released • open-source library for industrial-strength Natural Language Processing • focused on production use
  3. Early 2015: spaCy is first released • open-source library for industrial-strength Natural Language Processing • focused on production use • Current stats: 17m+ total downloads, 16k+ stars on GitHub, 400+ contributors, 80+ extension packages
  4. Late 2016: Explosion • new company for AI developer tools • bootstrapped through consulting for the first 6 months • funded through software sales since 2017 • remote team, centered in Berlin
  5. Late 2016: Explosion • new company for AI developer tools • bootstrapped through consulting for the first 6 months • funded through software sales since 2017 • remote team, centered in Berlin • Current stats: 8 team members, 100% independent & profitable
  6. Late 2017: Prodigy • first commercial product • modern annotation tool • fully scriptable in Python
  7. Late 2017: Prodigy • first commercial product • modern annotation tool • fully scriptable in Python • Current stats: 4000+ users, including 500+ companies, 1600+ forum members
  8. Coming soon • spaCy v2.3: Models for Chinese, Japanese and many more • spaCy v3.0: Transformer-based pipelines, custom models using any library, new training workflow • Prodigy v1.10: Dependencies & relation annotation, audio & video annotation & lots of new features • Prodigy Teams: Manage large annotation projects in your cloud
  9. NLP projects are like start-ups: they fail a lot

  10. How to maximize your project’s risk of failure

  11. How to maximize your project’s risk of failure • 1 Imagineer. 2 Forecast. 3 Outsource. 4 Wire. 5 Ship. • Step 1, Imagineer: Decide what your application ought to do. Be ambitious! Nobody changed the world saying “uh, will that work?”
  12. How to maximize your project’s risk of failure • 1 Imagineer. 2 Forecast. 3 Outsource. 4 Wire. 5 Ship. • Step 2, Forecast: Figure out what accuracy you’ll need. If you’re not sure here, just say 90%.
  13. How to maximize your project’s risk of failure • 1 Imagineer. 2 Forecast. 3 Outsource. 4 Wire. 5 Ship. • Step 3, Outsource: Pay someone else to gather your data. Think carefully about your accuracy requirements, and then ask for 10k rows.
  14. How to maximize your project’s risk of failure • 1 Imagineer. 2 Forecast. 3 Outsource. 4 Wire. 5 Ship. • Step 4, Wire: Implement your network. This is the fun part! Tensor all your flows, descend every gradient!
  15. How to maximize your project’s risk of failure • 1 Imagineer. 2 Forecast. 3 Outsource. 4 Wire. 5 Ship. • Step 5, Ship: Put it all together. If it doesn’t work, maybe blame the intern?
  16. Failure sucks

  17. A difficult chicken-and-egg problem: accuracy estimate • training & evaluation • labelled data • annotation scheme • product vision
  18. You need to iterate on your code and your data

  19. Requirements #1: We’re building a crime database based on news reports. We want to label the following: victim name • perpetrator name • crime location • offence date • arrest date
  20. None
  21. None
  22. None
  23. Requirements #2: We’re adding data from financial news about company sales to our internal database, so we can connect it to our analytics. We need to extract: buyer (official company name) and stock ticker • acquired company with stock ticker • sale price and currency
  24. “pytorch predict company acquisitions with prices and stock tickers” • No results.
  25. “Microsoft acquires software development platform GitHub for $7.5 billion”

  26. “Microsoft acquires software development platform GitHub for $7.5 billion”

  27. TEXT CLASSIFIER • “Microsoft acquires software development platform GitHub for $7.5 billion”
  28. TEXT CLASSIFIER • ENTITY RECOGNIZER • “Microsoft acquires software development platform GitHub for $7.5 billion”
  29. TEXT CLASSIFIER • ENTITY RECOGNIZER • ENTITY LINKER • “Microsoft acquires software development platform GitHub for $7.5 billion”
  30. TEXT CLASSIFIER • ENTITY RECOGNIZER • ENTITY LINKER • ATTRIBUTE LOOKUP • “Microsoft acquires software development platform GitHub for $7.5 billion”
  31. TEXT CLASSIFIER • ENTITY RECOGNIZER • ENTITY LINKER • ATTRIBUTE LOOKUP • CURRENCY NORMALIZER • “Microsoft acquires software development platform GitHub for $7.5 billion”
  32. Reality is not an end-to-end prediction problem

  33. #1: The great thing about practical NLP: you can choose to make the problem simpler and the solution cheaper.
  34. #2: The most interesting problems are very specific and also need specific solutions. That’s what makes them valuable.
  35. #3: Transfer learning means we don’t always need “big data” anymore. But we need some.
  36. Thank you! • Explosion: explosion.ai • Twitter: @_inesmontani
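
The decomposition built up on slides 27 to 31 can be sketched in a few lines of Python. This is only an illustration under loud assumptions, not the speaker's implementation: every stage is a deliberately naive, hypothetical stand-in (a keyword rule for the text classifier, alias matching for the entity recognizer and linker, a toy `KB` table for the attribute lookup, a regex for the currency normalizer). All function names, knowledge-base IDs and lookup tables are invented for the example; in a real project the first stages would more likely be trained components.

```python
import re
from decimal import Decimal

# Toy data, invented for this sketch. A real system would query a database.
ALIASES = {"Microsoft": "company:microsoft", "GitHub": "company:github"}
KB = {
    "company:microsoft": {"name": "Microsoft Corporation", "ticker": "MSFT"},
    "company:github": {"name": "GitHub, Inc.", "ticker": None},  # no ticker in this toy KB
}
# Assumption: a bare "$" means US dollars.
CURRENCY_CODES = {"$": "USD", "€": "EUR", "£": "GBP"}
SCALES = {"thousand": 10**3, "million": 10**6, "billion": 10**9}
MONEY = re.compile(r"([$€£])\s?(\d+(?:\.\d+)?)(?:\s(thousand|million|billion))?")


def is_acquisition(text):
    """Text classifier stand-in: a keyword rule instead of a trained model."""
    return bool(re.search(r"\b(acquires|acquired|buys|bought)\b", text, re.IGNORECASE))


def find_entities(text):
    """Entity recognizer stand-in: match known aliases and money expressions."""
    found = []
    for alias in ALIASES:
        for m in re.finditer(r"\b" + re.escape(alias) + r"\b", text):
            found.append((m.start(), m.group(0), "ORG"))
    for m in MONEY.finditer(text):
        found.append((m.start(), m.group(0), "MONEY"))
    # Return mentions in the order they appear in the text.
    return [(mention, label) for _, mention, label in sorted(found)]


def link_entity(mention):
    """Entity linker stand-in: map a mention to a knowledge-base ID."""
    return ALIASES[mention]


def lookup_ticker(kb_id):
    """Attribute lookup: fetch the stock ticker stored for a KB entry."""
    return KB[kb_id]["ticker"]


def normalize_money(mention):
    """Currency normalizer: '$7.5 billion' becomes (7500000000, 'USD')."""
    symbol, number, scale = MONEY.fullmatch(mention).groups()
    amount = Decimal(number) * SCALES.get(scale, 1)
    return int(amount), CURRENCY_CODES[symbol]


def extract_acquisition(text):
    """Chain the five stages; return None if any stage finds nothing."""
    if not is_acquisition(text):
        return None
    entities = find_entities(text)
    orgs = [m for m, label in entities if label == "ORG"]
    money = [m for m, label in entities if label == "MONEY"]
    if len(orgs) < 2 or not money:
        return None
    # Naive assumption: the first company mentioned is the buyer.
    buyer_id, acquired_id = link_entity(orgs[0]), link_entity(orgs[1])
    price, currency = normalize_money(money[0])
    return {
        "buyer": KB[buyer_id]["name"],
        "buyer_ticker": lookup_ticker(buyer_id),
        "acquired": KB[acquired_id]["name"],
        "acquired_ticker": lookup_ticker(acquired_id),
        "price": price,
        "currency": currency,
    }


headline = "Microsoft acquires software development platform GitHub for $7.5 billion"
print(extract_acquisition(headline))
```

Splitting the task this way means each stage can be evaluated, replaced or simplified on its own, which is the point of slide 32: the requirement is met by a chain of small, specific steps rather than one end-to-end prediction.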