The AI Revolution Will Not Be Monopolized: Behind the scenes

A more in-depth look at the concepts and ideas behind my talk "The AI Revolution Will Not Be Monopolized: How open-source beats economies of scale, even for LLMs", including academic literature, related experiments and preliminary results for distilled task-specific models.

Ines Montani

April 21, 2024

Resources

The AI Revolution Will Not Be Monopolized: How open-source beats economies of scale, even for LLMs

https://speakerdeck.com/inesmontani/the-ai-revolution-will-not-be-monopolized-how-open-source-beats-economies-of-scale-even-for-llms

Slides for PyCon Lithuania keynote

Transcript

  1. BEHIND THE SCENES. Ines Montani, Explosion
  2. PRODIGY: modern scriptable annotation tool for machine learning developers.
     prodigy.ai. 9k+ users, 800+ companies. Alex Smith, Developer; Kim Miller, Analyst.
  3.–6. UNDERSTANDING NLP TASKS. Generative tasks produce human-readable output:
     📖 single/multi-doc summarization, 🧮 reasoning, ✅ problem solving,
     ✍ paraphrasing, 🖼 style transfer, ⁉ question answering. Predictive tasks
     produce machine-readable output: 🔖 entity recognition, 🔗 relation extraction,
     👫 coreference resolution, 🧬 grammar & morphology, 🎯 semantic parsing,
     💬 discourse structure, 📚 text classification.
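     To make the human-readable vs. machine-readable distinction concrete, here is a
     minimal sketch: the predictive side uses a standard spaCy pipeline (assuming the
     en_core_web_sm package is installed), while the generative side is only hinted at
     via a hypothetical generate() helper, since any LLM API would do here.

     import spacy

     nlp = spacy.load("en_core_web_sm")  # assumption: small English pipeline installed
     doc = nlp("Apple is opening a new office in Vilnius in 2025.")

     # Predictive output: machine-readable spans you can filter, count and index
     print([(ent.text, ent.label_, ent.start_char, ent.end_char) for ent in doc.ents])

     # Generative output: free text meant for a human reader, e.g. a summary
     # summary = generate("Summarize: Apple is opening a new office in Vilnius in 2025.")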
  7.–10. PREDICTIVE QUADRANT, by objective and available data:
     generic objective, negligible task data → zero/few-shot in-context learning;
     generic objective, task data → fine-tuned in-context learning;
     task objective, no task-specific labels → nothing;
     task objective, task data → fine-tuned transfer learning, BERT etc.
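     As a sketch of the "task objective, task data" quadrant, the snippet below
     fine-tunes a small task-specific text classifier with spaCy's training API;
     the labels and the two training examples are made up purely for illustration.

     import spacy
     from spacy.training import Example

     # Toy task data (hypothetical labels and texts)
     train_data = [
         ("My order arrived broken.", {"cats": {"COMPLAINT": 1.0, "FEEDBACK": 0.0}}),
         ("Love the new dashboard!", {"cats": {"COMPLAINT": 0.0, "FEEDBACK": 1.0}}),
     ]

     nlp = spacy.blank("en")
     textcat = nlp.add_pipe("textcat")
     for label in ("COMPLAINT", "FEEDBACK"):
         textcat.add_label(label)

     # Initialize from the examples, then run a few update steps
     optimizer = nlp.initialize(
         lambda: [Example.from_dict(nlp.make_doc(t), a) for t, a in train_data]
     )
     for _ in range(20):
         losses = {}
         for text, annots in train_data:
             example = Example.from_dict(nlp.make_doc(text), annots)
             nlp.update([example], sgd=optimizer, losses=losses)

     print(nlp("The package was damaged in transit.").cats)

     The same pattern applies to the other predictive components (entity recognition,
     relation extraction, text classification): this is the quadrant the BERT-style
     transfer-learning models live in.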
  11.–12. LITERATURE: a massive number of experiments, many tasks, lots of models;
     results way below task-specific models across the board.
  13.–15. LITERATURE: fine-tuning an LLM for few-shot NER works; BERT-base still
     competitive overall; ChatGPT scores poorly.
  16.–18. LITERATURE: found ChatGPT did better than crowd workers on several text
     classification tasks; accuracy still low against trained annotators, which says
     more about crowd-worker methodology than about LLMs.
  19.–21. EXPERIMENTS: CoNLL 2003, Named Entity Recognition.

     Model          F-Score   Speed (words/s)
     GPT-3.5 [1]    78.6      < 100
     GPT-4 [1]      83.5      < 100
     spaCy          91.6      4,000
     Flair          93.1      1,000
     SOTA 2023 [2]  94.6      1,000
     SOTA 2003 [3]  88.8      > 20,000

     1. Ashok and Lipton (2023), 2. Wang et al. (2021), 3. Florian et al. (2003)
     Slide callouts: "SOTA on few-shot prompting", "RoBERTa-base".
     [Plot: FabNER, Claude 2 accuracy by number of examples, with a callout at 20 examples]
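     For reference, the F-scores above are standard span-level scores: a predicted
     entity only counts as correct if both its boundaries and its label exactly match
     the gold annotation. A small self-contained sketch of that computation, with
     made-up gold and predicted spans:

     # Span-level precision/recall/F1 for NER: exact match on (start, end, label)
     gold = {(0, 5, "ORG"), (34, 41, "GPE"), (45, 49, "DATE")}     # hypothetical gold spans
     pred = {(0, 5, "ORG"), (34, 41, "PERSON"), (45, 49, "DATE")}  # hypothetical predictions

     tp = len(gold & pred)
     precision = tp / len(pred) if pred else 0.0
     recall = tp / len(gold) if gold else 0.0
     f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
     print(f"P={precision:.2f} R={recall:.2f} F={f1:.2f}")  # P=0.67 R=0.67 F=0.67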
  22.–23. PROTOTYPE TO PRODUCTION (github.com/explosion/spacy-llm): in the
     processing pipeline prototype, prompt the model and transform its output into
     structured data, a structured, machine-facing Doc object; in the processing
     pipeline in production, swap, replace and mix components.
  24.–25. PRELIMINARY RESULTS: LLM-assisted annotation.

     Metric                           Generative LLM   Distilled Component
     Accuracy (F-score)               0.74             0.74
     Speed (words/second)             < 100            ~ 2,000
     Model Size                       ~ 5 TB           400 MB
     Parameters                       1.8t             130m
     Training Examples                0                800
     Evaluation Examples              200              200
     Data Development Time (hours)    ~ 2              ~ 8
  26.–28. Predictive tasks still matter. Generative complements predictive; it
     doesn't replace it. In-context learning with prompts alone is not optimal for
     predictive tasks. Analysis and evaluation take time: you can't get a new system
     in minutes, no matter which approach you take. Don't abandon the development
     principles that made software successful: modularity, testability and
     flexibility.
  29. See you at the conference 👋 PyCon DE & PyData Berlin, April 24, 14:45.
     Explosion: explosion.ai · spaCy: spacy.io · Prodigy: prodigy.ai
     Twitter: @_inesmontani · Mastodon: @[email protected] ·
     Bluesky: @inesmontani.bsky.social · LinkedIn