Slide 1

Slide 1 text

Designing practical NLP solutions
Ines Montani, Explosion

Slide 2

Slide 2 text

Early 2015: spaCy is first released
• open-source library for industrial-strength Natural Language Processing
• focused on production use

Slide 3

Slide 3 text

Early 2015: spaCy is first released
• open-source library for industrial-strength Natural Language Processing
• focused on production use
Current stats
• 17m+ total downloads
• 16k+ stars on GitHub
• 400+ contributors
• 80+ extension packages

Slide 4

Slide 4 text

Late 2016: Explosion
• new company for AI developer tools
• bootstrapped through consulting for the first 6 months
• funded through software sales since 2017
• remote team, centered in Berlin

Slide 5

Slide 5 text

Late 2016: Explosion
• new company for AI developer tools
• bootstrapped through consulting for the first 6 months
• funded through software sales since 2017
• remote team, centered in Berlin
Current stats
• 8 team members
• 100% independent & profitable

Slide 6

Slide 6 text

Late 2017: Prodigy
• first commercial product
• modern annotation tool
• fully scriptable in Python

Slide 7

Slide 7 text

Late 2017: Prodigy
• first commercial product
• modern annotation tool
• fully scriptable in Python
Current stats
• 4000+ users, including 500+ companies
• 1600+ forum members

Slide 8

Slide 8 text

Coming soon
• spaCy v2.3: Models for Chinese, Japanese and many more
• spaCy v3.0: Transformer-based pipelines, custom models using any library, new training workflow
• Prodigy v1.10: Dependencies & relation annotation, audio & video annotation & lots of new features
• Prodigy Teams: Manage large annotation projects in your cloud

Slide 9

Slide 9 text

NLP projects are like start-ups: they fail a lot

Slide 10

Slide 10 text

How to maximize your project’s risk of failure

Slide 11

Slide 11 text

How to maximize your project’s risk of failure
1. Imagineer. 2. Forecast. 3. Outsource. 4. Wire. 5. Ship.
Imagineer: Decide what your application ought to do. Be ambitious! Nobody changed the world saying “uh, will that work?”

Slide 12

Slide 12 text

How to maximize your project’s risk of failure
1. Imagineer. 2. Forecast. 3. Outsource. 4. Wire. 5. Ship.
Forecast: Figure out what accuracy you’ll need. If you’re not sure here, just say 90%.

Slide 13

Slide 13 text

How to maximize your project’s risk of failure
1. Imagineer. 2. Forecast. 3. Outsource. 4. Wire. 5. Ship.
Outsource: Pay someone else to gather your data. Think carefully about your accuracy requirements, and then ask for 10k rows.

Slide 14

Slide 14 text

How to maximize your project’s risk of failure
1. Imagineer. 2. Forecast. 3. Outsource. 4. Wire. 5. Ship.
Wire: Implement your network. This is the fun part! Tensor all your flows, descend every gradient!

Slide 15

Slide 15 text

How to maximize your project’s risk of failure
1. Imagineer. 2. Forecast. 3. Outsource. 4. Wire. 5. Ship.
Ship: Put it all together. If it doesn’t work, maybe blame the intern?

Slide 16

Slide 16 text

Failure sucks

Slide 17

Slide 17 text

A difficult chicken-and-egg problem
• accuracy estimate
• training & evaluation
• labelled data
• annotation scheme
• product vision

Slide 18

Slide 18 text

You need to iterate on your code and your data

Slide 19

Slide 19 text

Requirements #1
We’re building a crime database based on news reports. We want to label the following:
• victim name
• perpetrator name
• crime location
• offence date
• arrest date

Slide 23

Slide 23 text

Requirements #2
We’re adding data from financial news about company sales to our internal database, so we can connect it to our analytics. We need to extract:
• buyer (official company name) and stock ticker
• acquired company with stock ticker
• sale price and currency
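One way to make requirements like these concrete is to write down the record the database should receive. The sketch below is only an illustration: the class and field names are assumptions, not something from the talk, and the sample values come from the GitHub headline used later in the deck.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Company:
    # Hypothetical record type; field names are illustrative assumptions.
    name: str                      # official company name
    ticker: Optional[str] = None   # stock ticker, None if not publicly listed


@dataclass(frozen=True)
class Acquisition:
    buyer: Company
    acquired: Company
    price: float     # sale price as a plain number
    currency: str    # currency code such as "USD"


# Sample record for the headline
# "Microsoft acquires software development platform GitHub for $7.5 billion"
example = Acquisition(
    buyer=Company("Microsoft Corporation", "MSFT"),
    acquired=Company("GitHub, Inc."),
    price=7.5e9,
    currency="USD",
)
print(example)
```

Writing the target record first shows that none of these fields is a raw span of text: the official name, the ticker and the normalized price all have to be looked up or computed.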

Slide 24

Slide 24 text

pytorch predict company acquisitions with prices and stock tickers
No results.

Slide 25

Slide 25 text

“Microsoft acquires software development platform GitHub for $7.5 billion”

Slide 27

Slide 27 text

“Microsoft acquires software development platform GitHub for $7.5 billion”
• TEXT CLASSIFIER

Slide 28

Slide 28 text

“Microsoft acquires software development platform GitHub for $7.5 billion”
• TEXT CLASSIFIER
• ENTITY RECOGNIZER

Slide 29

Slide 29 text

“Microsoft acquires software development platform GitHub for $7.5 billion”
• TEXT CLASSIFIER
• ENTITY RECOGNIZER
• ENTITY LINKER

Slide 30

Slide 30 text

“Microsoft acquires software development platform GitHub for $7.5 billion”
• TEXT CLASSIFIER
• ENTITY RECOGNIZER
• ENTITY LINKER
• ATTRIBUTE LOOKUP

Slide 31

Slide 31 text

“Microsoft acquires software development platform GitHub for $7.5 billion”
• TEXT CLASSIFIER
• ENTITY RECOGNIZER
• ENTITY LINKER
• ATTRIBUTE LOOKUP
• CURRENCY NORMALIZER
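The five components on this slide can be sketched as a chain of small functions. This is a toy illustration, not the speaker's implementation: keyword and regex rules stand in for the trained text classifier and entity recognizer, and the knowledge base, the function names, the "$ means USD" mapping and the "buyer is mentioned first" rule are all assumptions made for the example.

```python
import re
from typing import Optional

# Hypothetical knowledge base; a real project would query its own database.
KNOWLEDGE_BASE = {
    "microsoft": {"name": "Microsoft Corporation", "ticker": "MSFT"},
    "github": {"name": "GitHub, Inc.", "ticker": None},  # not publicly listed
}
MULTIPLIERS = {"thousand": 1e3, "million": 1e6, "billion": 1e9}
CURRENCY_SYMBOLS = {"$": "USD", "€": "EUR", "£": "GBP"}  # assumption: "$" is USD
MONEY = r"([$€£])\s?(\d+(?:\.\d+)?)(?:\s(thousand|million|billion))?"


def is_acquisition(text: str) -> bool:
    """Text classifier stand-in: keyword match instead of a trained model."""
    return re.search(r"\b(acquires|acquired|buys|bought)\b", text, re.I) is not None


def find_entities(text: str) -> dict:
    """Entity recognizer stand-in: capitalized words and money expressions."""
    money = re.search(MONEY, text)
    return {
        "ORG": re.findall(r"\b[A-Z][A-Za-z]+\b", text),
        "MONEY": money.group(0) if money else None,
    }


def link_entity(mention: str) -> Optional[dict]:
    """Entity linker plus attribute lookup: resolve a mention to a record."""
    return KNOWLEDGE_BASE.get(mention.lower())


def normalize_money(mention: str) -> tuple:
    """Currency normalizer: '$7.5 billion' becomes (7500000000.0, 'USD')."""
    match = re.fullmatch(MONEY, mention)
    if match is None:
        raise ValueError(f"unrecognized money expression: {mention!r}")
    symbol, number, scale = match.groups()
    return float(number) * MULTIPLIERS.get(scale, 1), CURRENCY_SYMBOLS[symbol]


def extract(text: str) -> Optional[dict]:
    if not is_acquisition(text):
        return None
    entities = find_entities(text)
    linked = [link_entity(mention) for mention in entities["ORG"]]
    orgs = [record for record in linked if record is not None]
    if len(orgs) < 2 or entities["MONEY"] is None:
        return None
    buyer, acquired = orgs[0], orgs[1]  # naive assumption: buyer is mentioned first
    price, currency = normalize_money(entities["MONEY"])
    return {
        "buyer": buyer["name"], "buyer_ticker": buyer["ticker"],
        "acquired": acquired["name"], "acquired_ticker": acquired["ticker"],
        "price": price, "currency": currency,
    }


headline = "Microsoft acquires software development platform GitHub for $7.5 billion"
result = extract(headline)
print(result)
```

Only the first two steps need statistical models in practice. The lookup and the normalization are ordinary code, which is why splitting the task up makes each piece easier to build, test and replace.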

Slide 32

Slide 32 text

Reality is not an end-to-end prediction problem

Slide 33

Slide 33 text

#1 The great thing about practical NLP: you can choose to make the problem simpler and the solution cheaper.

Slide 34

Slide 34 text

#2 The most interesting problems are very specific and also need specific solutions. That’s what makes them valuable.

Slide 35

Slide 35 text

#3 Transfer learning means we don’t always need “big data” anymore. But we need some.

Slide 36

Slide 36 text

Thank you!
Explosion: explosion.ai
Twitter: @_inesmontani