spaCy and Explosion: past, present & future

Ines Montani

July 06, 2019

Transcript

  1. 2015: First collaborations • demos & visualizers like displaCy • first concepts of a modern approach to NLP annotation tools
  2. 2015: First collaborations • demos & visualizers like displaCy • first concepts of a modern approach to NLP annotation tools

    Shown: “Baskets” concept
  3. 2015: First collaborations • demos & visualizers like displaCy • first concepts of a modern approach to NLP annotation tools

    Shown: “Baskets” concept • binary annotation tool concept
  4. Early 2016: First highlights: German model • first non-English model • non-projective dependencies • developed by Wolfgang Seeker
  5. Late 2016: Explosion • new company for AI developer tools • bootstrapped through consulting for the first 6 months • funded through software sales since 2017 • 100% independent and profitable
  6. Late 2016: Explosion • new company for AI developer tools • bootstrapped through consulting for the first 6 months • funded through software sales since 2017 • 100% independent and profitable

    Our bets about NLP: NLP won’t just be a cloud API • number of developers will increase • annotation is better in-house
  7. 2017: spaCy v2.0 • shift to deep learning • smaller and updatable models • custom pipeline components • custom extension attributes • built-in text classification • built-in displaCy visualizers • many other improvements

    Shown: Thinc, spaCy’s machine learning library
  8. Late 2017: Prodigy • first commercial product • modern annotation tool • fully scriptable in Python

    2,000+ users • 250+ companies incl.
  9. Early 2019: spaCy v2.1 • transfer learning and pretraining • 2-3 times faster tokenization • enhanced match pattern API • built-in rule-based NER • many other improvements
  10. Early 2019: spaCy v2.1 • transfer learning and pretraining • 2-3 times faster tokenization • enhanced match pattern API • built-in rule-based NER • many other improvements

    Transfer learning: better models with less data – huge win! • how to adapt for spaCy without bigger (and slower) models? • spacy pretrain is a pretty cool compromise
  11. July 2019: Explosion team • Matthew Honnibal • Ines Montani • Sebastián Ramírez • Guadalupe Romero • Giannis Daras • Justin DuJardin • Sofie Van Landeghem
  12. What’s next? spaCy v3.0 • morphological features • entity linking • non-entity span tagging • static analysis of processing pipeline and its components
  13. What’s next? spaCy v3.0 • morphological features • entity linking • non-entity span tagging • static analysis of processing pipeline and its components

    Vision for spaCy: focus on data structures and pipeline • build support for new tasks even if we don’t have a model • make sure it’s easy to BYO model • keep shipping good defaults
  14. What’s next? spaCy v3.0 • morphological features • entity linking • non-entity span tagging • static analysis of processing pipeline and its components

    Vision for spaCy: focus on data structures and pipeline • build support for new tasks even if we don’t have a model • make sure it’s easy to BYO model • keep shipping good defaults

    What’s out-of-scope? anything generative: summarization, machine translation, etc. • multi-modal: audio, video, etc. • research assistance: plenty of good frameworks for developing novel techniques
  15. What’s next? spaCy ecosystem in your cloud • whole systems, not just libraries • programmable, extensible cluster • running under your control • automated setup, good defaults • full data privacy – we don’t want your data!
  16. What’s next? spaCy ecosystem in your cloud • whole systems, not just libraries • programmable, extensible cluster • running under your control • automated setup, good defaults • full data privacy – we don’t want your data!

    Shown: Processing with Dask
  17. What’s next? spaCy ecosystem in your cloud • whole systems, not just libraries • programmable, extensible cluster • running under your control • automated setup, good defaults • full data privacy – we don’t want your data!

    Shown: Prodigy Scale • Processing with Dask
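
The slides mention custom pipeline components, custom extension attributes, and making it easy to bring your own model. As a rough illustration of that general idea only, here is a minimal plain-Python sketch. It does not use spaCy: `ToyDoc`, `ToyPipeline` and the three component functions are hypothetical names made up for this example, and the contract assumed here (each component is a callable that receives a document and returns it) is a simplification, not spaCy's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class ToyDoc:
    """Hypothetical document container (a stand-in, not spaCy's Doc)."""
    text: str
    tokens: List[str] = field(default_factory=list)
    # Free-form slot for extra data, loosely like custom extension attributes.
    extensions: Dict[str, object] = field(default_factory=dict)


# Assumed contract: a component takes a doc and returns the (modified) doc.
Component = Callable[[ToyDoc], ToyDoc]


class ToyPipeline:
    """Runs named components in the order they were added."""

    def __init__(self) -> None:
        self.components: List[Tuple[str, Component]] = []

    def add_pipe(self, name: str, component: Component) -> None:
        self.components.append((name, component))

    def __call__(self, text: str) -> ToyDoc:
        doc = ToyDoc(text)
        for _name, component in self.components:
            doc = component(doc)
        return doc


def whitespace_tokenizer(doc: ToyDoc) -> ToyDoc:
    # Deliberately naive: split on whitespace only.
    doc.tokens = doc.text.split()
    return doc


def token_counter(doc: ToyDoc) -> ToyDoc:
    # Attach derived data to the doc without changing the container class.
    doc.extensions["n_tokens"] = len(doc.tokens)
    return doc


def keyword_model(doc: ToyDoc) -> ToyDoc:
    # "Bring your own model": any callable honoring the contract plugs in.
    doc.extensions["mentions_spacy"] = any(
        token.lower().strip(".,!?") == "spacy" for token in doc.tokens
    )
    return doc


nlp = ToyPipeline()
nlp.add_pipe("tokenizer", whitespace_tokenizer)
nlp.add_pipe("counter", token_counter)
nlp.add_pipe("keyword_model", keyword_model)

doc = nlp("Explosion builds spaCy and Prodigy.")
print(doc.tokens)
print(doc.extensions)
```

Because the pipeline depends only on the component contract and the shared document structure, swapping `keyword_model` for a statistical model would not require changes to the other components, which is the point the "focus on data structures and pipeline" bullet seems to be making.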