The AI Revolution Will Not Be Monopolized: How open-source beats economies of scale, even for LLMs

With the latest advancements in Natural Language Processing and Large Language Models (LLMs), and big companies like OpenAI dominating the space, many people wonder: Are we heading further into a black box era with larger and larger models, obscured behind APIs controlled by big tech monopolies?

I don’t think so, and in this talk, I’ll show you why. I’ll dive deeper into the open-source model ecosystem, address some common misconceptions about use cases for LLMs in industry, walk through practical real-world examples, and show how basic principles of software development, such as modularity, testability and flexibility, still apply. LLMs are a great new tool in our toolkits, but the end goal remains to create a system that does what you want it to do. Explicit is still better than implicit, and composable building blocks still beat huge black boxes.

As ideas develop, we’re seeing more and more ways to use compute efficiently, producing AI systems that are cheaper to run and easier to control. In this talk, I'll share some practical approaches that you can apply today. If you’re trying to build a system that does a particular thing, you don’t need to transform your request into arbitrary language and call into the largest model that understands arbitrary language the best. The people developing those models are telling that story, but the rest of us aren’t obliged to believe them.

Ines Montani

April 05, 2024
Resources

Behind the scenes

https://speakerdeck.com/inesmontani/the-ai-revolution-will-not-be-monopolized-behind-the-scenes

A more in-depth look at the concepts and ideas behind the talk, including academic literature, related experiments and preliminary results for distilled task-specific models.

A practical guide to human-in-the-loop distillation

https://explosion.ai/blog/human-in-the-loop-distillation

This blog post presents practical solutions for using the latest state-of-the-art models in real-world applications and distilling their knowledge into smaller and faster components that you can run and maintain in-house.

Taking LLMs out of the black box: A practical guide to human-in-the-loop distillation

https://speakerdeck.com/inesmontani/taking-llms-out-of-the-black-box-a-practical-guide-to-human-in-the-loop-distillation

Talk on human-in-the-loop distillation, providing a deeper and practical look at some of the concepts and ideas presented here.

Transcript

1. PRODIGY (prodigy.ai): modern, scriptable annotation tool for machine learning developers. 9k+ users, 800+ companies. Alex Smith, Developer; Kim Miller, Analyst.
2. WHY OPEN SOURCE? Transparent, no lock-in, up to date, programmable, extensible, community-vetted, runs in-house, easy to get started.
3. WHY OPEN SOURCE? Transparent, no lock-in, up to date, programmable, extensible, community-vetted, runs in-house, easy to get started. Also free!
4. OPEN-SOURCE MODELS. Task-specific models: small, often fast, cheap to run, don't always generalize well, need data to fine-tune.
5. OPEN-SOURCE MODELS. Task-specific models: small, often fast, cheap to run, don't always generalize well, need data to fine-tune. Encoder models (ELECTRA, T5).
6. OPEN-SOURCE MODELS. Task-specific models: small, often fast, cheap to run, don't always generalize well, need data to fine-tune. Encoder models (ELECTRA, T5).
7. OPEN-SOURCE MODELS. Task-specific models: small, often fast, cheap to run, don't always generalize well, need data to fine-tune. Encoder models (ELECTRA, T5): relatively small and fast, affordable to run, generalize & adapt well, need data to fine-tune.
8. OPEN-SOURCE MODELS. Task-specific models: small, often fast, cheap to run, don't always generalize well, need data to fine-tune. Encoder models (ELECTRA, T5): relatively small and fast, affordable to run, generalize & adapt well, need data to fine-tune. Large generative models (Falcon, MIXTRAL).
9. OPEN-SOURCE MODELS. Task-specific models: small, often fast, cheap to run, don't always generalize well, need data to fine-tune. Encoder models (ELECTRA, T5): relatively small and fast, affordable to run, generalize & adapt well, need data to fine-tune. Large generative models (Falcon, MIXTRAL): very large, often slower, expensive to run, generalize & adapt well, need little to no data.
10. ENCODING & DECODING TASKS. Encoder models: a network trained for the specific task uses the model to encode the input (text → model → vectors → task model → task output; task network with labels).
11. ENCODING & DECODING TASKS. Encoder models: a network trained for the specific task uses the model to encode the input (text → model → vectors → task model → task output; task network with labels). Large generative models: the model generates text that can be parsed into task-specific output (text + prompt template → model → raw output → parser → task output). See the encoder-vs-generative sketch after the transcript.
12. OPEN-SOURCE MODELS. Task-specific models: small, often fast, cheap to run, don't always generalize well, need data to fine-tune. Encoder models (ELECTRA, T5): relatively small and fast, affordable to run, generalize & adapt well, need data to fine-tune. Large generative models (Falcon, MIXTRAL): very large, often slower, expensive to run, generalize & adapt well, need little to no data.
13. OPEN-SOURCE MODELS. Task-specific models: small, often fast, cheap to run, don't always generalize well, need data to fine-tune. Encoder models (ELECTRA, T5): relatively small and fast, affordable to run, generalize & adapt well, need data to fine-tune. Large generative models (Falcon, MIXTRAL): very large, often slower, expensive to run, generalize & adapt well, need little to no data.
14. ECONOMIES OF SCALE (OpenAI, Google): access to talent, compute etc.; output costs; API request batching (high traffic vs. low traffic batches).
15. ECONOMIES OF SCALE (OpenAI, Google vs. you 🤠): access to talent, compute etc.; output costs; API request batching (high traffic vs. low traffic batches).
16. AI PRODUCTS ARE MORE THAN JUST A MODEL. Human-facing systems (ChatGPT) vs. machine-facing models (GPT-4): the most important differentiation is product, not just technology.
17. AI PRODUCTS ARE MORE THAN JUST A MODEL. Human-facing systems (ChatGPT) vs. machine-facing models (GPT-4): the most important differentiation is product, not just technology; UI/UX, marketing, customization.
18. AI PRODUCTS ARE MORE THAN JUST A MODEL. Human-facing systems (ChatGPT): the most important differentiation is product, not just technology; UI/UX, marketing, customization. Machine-facing models (GPT-4): swappable components based on research, impacts are quantifiable.
19. AI PRODUCTS ARE MORE THAN JUST A MODEL. Human-facing systems (ChatGPT): the most important differentiation is product, not just technology; UI/UX, marketing, customization. Machine-facing models (GPT-4): swappable components based on research, impacts are quantifiable (cost, speed, accuracy, latency).
20. AI PRODUCTS ARE MORE THAN JUST A MODEL. Human-facing systems (ChatGPT): the most important differentiation is product, not just technology; UI/UX, marketing, customization. Machine-facing models (GPT-4): swappable components based on research, impacts are quantifiable (cost, speed, accuracy, latency). But what about the data?
21. AI PRODUCTS ARE MORE THAN JUST A MODEL. Human-facing systems (ChatGPT): the most important differentiation is product, not just technology; UI/UX, marketing, customization. Machine-facing models (GPT-4): swappable components based on research, impacts are quantifiable (cost, speed, accuracy, latency). But what about the data? User data is an advantage for product, not the foundation for machine-facing tasks.
22. AI PRODUCTS ARE MORE THAN JUST A MODEL. Human-facing systems (ChatGPT): the most important differentiation is product, not just technology; UI/UX, marketing, customization. Machine-facing models (GPT-4): swappable components based on research, impacts are quantifiable (cost, speed, accuracy, latency). But what about the data? User data is an advantage for product, not the foundation for machine-facing tasks. You don't need specific data to gain general knowledge.
23. AI PRODUCTS ARE MORE THAN JUST A MODEL. Human-facing systems (ChatGPT): the most important differentiation is product, not just technology; UI/UX, marketing, customization. Machine-facing models (GPT-4): swappable components based on research, impacts are quantifiable (cost, speed, accuracy, latency). But what about the data? User data is an advantage for product, not the foundation for machine-facing tasks. You don't need specific data to gain general knowledge.
24. USE CASES IN INDUSTRY. Predictive tasks: entity recognition, relation extraction, coreference resolution, grammar & morphology, semantic parsing, discourse structure, text classification. Generative tasks: single/multi-doc summarization, reasoning, problem solving, paraphrasing, style transfer, question answering.
25. USE CASES IN INDUSTRY. Predictive tasks (structured data): entity recognition, relation extraction, coreference resolution, grammar & morphology, semantic parsing, discourse structure, text classification. Generative tasks: single/multi-doc summarization, reasoning, problem solving, paraphrasing, style transfer, question answering. Many industry problems have remained the same; they have just changed in scale.
26. EVOLUTION OF PROBLEM DEFINITIONS. Programming & rules: rules or instructions. Supervised learning / machine learning: examples.
27. EVOLUTION OF PROBLEM DEFINITIONS. Programming & rules: rules or instructions. Supervised learning / machine learning: examples. In-context learning: rules or instructions.
28. EVOLUTION OF PROBLEM DEFINITIONS. Programming & rules: rules or instructions. Supervised learning / machine learning: examples. Prompt engineering / in-context learning: rules or instructions.
29. EVOLUTION OF PROBLEM DEFINITIONS. Programming & rules: rules or instructions. Supervised learning / machine learning: examples. Prompt engineering / in-context learning: rules or instructions. Instructions: human-shaped, easy for non-experts, risk of data drift.
30. EVOLUTION OF PROBLEM DEFINITIONS. Programming & rules: rules or instructions. Supervised learning / machine learning: examples. Prompt engineering / in-context learning: rules or instructions. Instructions: human-shaped, easy for non-experts, risk of data drift. Examples: nuanced and intuitive behaviors, specific to use case, labor-intensive.
31. EVOLUTION OF PROBLEM DEFINITIONS. Programming & rules: rules or instructions. Supervised learning / machine learning: examples. Prompt engineering / in-context learning: rules or instructions. Instructions: human-shaped, easy for non-experts, risk of data drift. Examples: nuanced and intuitive behaviors, specific to use case, labor-intensive.
32. WORKFLOW EXAMPLE: iterative model-assisted data annotation (prodigy.ai). Prompting a large general-purpose model, baseline, domain-specific data, continuous evaluation.
33. WORKFLOW EXAMPLE: iterative model-assisted data annotation (prodigy.ai). Prompting a large general-purpose model, baseline, domain-specific data, continuous evaluation.
34. WORKFLOW EXAMPLE: iterative model-assisted data annotation (prodigy.ai). Prompting a large general-purpose model, baseline, domain-specific data, transfer learning, distilled task-specific model, continuous evaluation.
35. WORKFLOW EXAMPLE: iterative model-assisted data annotation (prodigy.ai). Prompting a large general-purpose model, baseline, domain-specific data, transfer learning, distilled task-specific model, continuous evaluation of the baseline and the distilled model. See the distillation workflow sketch after the transcript.
36. PROTOTYPE TO PRODUCTION (github.com/explosion/spacy-llm). Processing pipeline prototype: prompt the model & transform its output to structured data. Processing pipeline in production: swap, replace and mix components.
37. PROTOTYPE TO PRODUCTION (github.com/explosion/spacy-llm). Processing pipeline prototype: prompt the model & transform its output to structured data. Processing pipeline in production: swap, replace and mix components.
38. PROTOTYPE TO PRODUCTION (github.com/explosion/spacy-llm). Processing pipeline prototype: prompt the model & transform its output to structured data. Processing pipeline in production: swap, replace and mix components around the structured, machine-facing Doc object. See the spacy-llm sketch after the transcript.
39. DISTILLED TASK-SPECIFIC COMPONENTS: modular, testable, flexible, predictable, transparent, no lock-in, programmable, extensible, run in-house, cheap to run.
40. DISTILLED TASK-SPECIFIC COMPONENTS: modular, testable, flexible, predictable, transparent, no lock-in, programmable, extensible, run in-house, cheap to run.
41. The Zen of Python (>>> import this): Beautiful is better than ugly. Explicit is better than implicit. Simple is better than complex. Complex is better than complicated. Flat is better than nested. Sparse is better than dense. Readability counts. Special cases aren't special enough to break the rules. Although practicality beats purity. Errors should never pass silently. Unless explicitly silenced. In the face of ambiguity, refuse the temptation to guess.
42. The Zen of Python (>>> import this): don't abandon what's made software successful. Beautiful is better than ugly. Explicit is better than implicit. Simple is better than complex. Complex is better than complicated. Flat is better than nested. Sparse is better than dense. Readability counts. Special cases aren't special enough to break the rules. Although practicality beats purity. Errors should never pass silently. Unless explicitly silenced. In the face of ambiguity, refuse the temptation to guess.
43. MONOPOLY STRATEGIES. Control: a resource, regulation. Compounding: economies of scale, network effects. Human-facing products vs. machine-facing models.
44. MONOPOLY STRATEGIES. Control: a resource, regulation. Compounding: economies of scale, network effects. Human-facing products vs. machine-facing models.
45. THE AI REVOLUTION WON'T BE MONOPOLIZED. The software industry does not run on secret sauce. Knowledge gets shared and published. Secrets won't give anyone a monopoly.
46. THE AI REVOLUTION WON'T BE MONOPOLIZED. The software industry does not run on secret sauce. Knowledge gets shared and published. Secrets won't give anyone a monopoly. Usage data is great for improving a product, but it doesn't generalize. Data won't give anyone a monopoly.
47. THE AI REVOLUTION WON'T BE MONOPOLIZED. The software industry does not run on secret sauce. Knowledge gets shared and published. Secrets won't give anyone a monopoly. Usage data is great for improving a product, but it doesn't generalize. Data won't give anyone a monopoly. LLMs can be one part of a product or process, and swapped for different approaches. Interoperability is the opposite of monopoly.
48. THE AI REVOLUTION WON'T BE MONOPOLIZED. The software industry does not run on secret sauce. Knowledge gets shared and published. Secrets won't give anyone a monopoly. Usage data is great for improving a product, but it doesn't generalize. Data won't give anyone a monopoly. LLMs can be one part of a product or process, and swapped for different approaches. Interoperability is the opposite of monopoly. Regulation could give someone a monopoly, if we let it. It should focus on products and actions, not components.
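
Code sketches

To make the encoder-vs-generative contrast from slides 10-11 concrete, here is a minimal Python sketch using Hugging Face transformers. The specific models (distilbert-base-uncased-finetuned-sst-2-english and gpt2) are illustrative stand-ins chosen for this example, not models named in the talk: the encoder pattern returns structured output directly, while the generative pattern produces raw text that still has to be parsed.

```python
# Sketch of the two patterns from slides 10-11 (illustrative models, not from the talk).
from transformers import pipeline

text = "The new release fixed the latency issues and users love it."

# Encoder pattern: text -> encoder + task head -> structured label.
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
print(classifier(text))  # e.g. [{'label': 'POSITIVE', 'score': 0.99}]

# Generative pattern: prompt template -> model -> raw text -> parser -> structured label.
generator = pipeline("text-generation", model="gpt2")
prompt = f"Review: {text}\nSentiment (positive or negative):"
raw = generator(prompt, max_new_tokens=3)[0]["generated_text"]

# The parser step is where generative output becomes machine-facing again; with a
# small base model like gpt2 it is unreliable, which is why raw output needs
# validation before anything downstream consumes it.
completion = raw[len(prompt):].lower()
label = "POSITIVE" if "positive" in completion else "NEGATIVE"
print(label)
```

Either way, the downstream system only consumes the structured output, which is what makes the model behind it swappable.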
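
The iterative workflow on slides 32-35 (prompt a large general-purpose model, review its suggestions to build domain-specific data, distill a small task-specific model, and keep evaluating both) might look roughly like the sketch below. This is my reading of the diagram, not code from the talk: the spacy-llm llm_ner factory and the labels are assumptions, credentials for a backing LLM are expected in the environment, and the human review step (for example in Prodigy) is only indicated by a comment.

```python
# Rough sketch of the model-assisted annotation -> distillation loop (slides 32-35).
# Assumes spacy-llm is installed and credentials for the backing LLM are configured.
import spacy
from spacy.tokens import DocBin

# Prototype: an LLM-backed NER component supplies candidate annotations.
nlp_llm = spacy.blank("en")
ner = nlp_llm.add_pipe("llm_ner")  # quickstart-style factory from spacy-llm (assumed)
ner.add_label("PRODUCT")
ner.add_label("ORG")
nlp_llm.initialize()

# Pre-annotate unlabelled domain-specific texts. In the real workflow a human
# reviews and corrects these suggestions (e.g. in Prodigy) before they become
# the baseline training set.
texts = ["Explosion maintains spaCy and sells Prodigy, an annotation tool."]
db = DocBin()
for doc in nlp_llm.pipe(texts):
    db.add(doc)
db.to_disk("train.spacy")

# Distill: train a small task-specific pipeline on the reviewed data (config.cfg here
# is a regular spaCy training config), then keep evaluating it against the LLM
# baseline and iterate on the data.
#   python -m spacy train config.cfg --paths.train train.spacy --paths.dev dev.spacy
#   python -m spacy evaluate training/model-best dev.spacy
```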
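
Slides 36-38 are about the prototype-to-production swap with github.com/explosion/spacy-llm: downstream code only consumes the structured, machine-facing Doc object, so an LLM-backed component can later be replaced by a distilled in-house pipeline without touching the rest of the system. A hedged sketch of that idea, with an assumed factory name, label and model path:

```python
# Sketch: the call site only relies on Doc.ents, so the backing model is swappable.
import spacy

def extract_entities(nlp, text):
    # Machine-facing consumer: it never sees prompts or raw LLM output,
    # only the structured Doc object.
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]

# Prototype: LLM-backed pipeline built with spacy-llm (quickstart-style factory,
# requires an API key for the backing model).
nlp = spacy.blank("en")
ner = nlp.add_pipe("llm_ner")
ner.add_label("PRODUCT")
nlp.initialize()
print(extract_entities(nlp, "We prototyped the extractor with spacy-llm before distilling it."))

# Production: swap in the distilled task-specific pipeline; the call site is unchanged.
# (The path is a placeholder for wherever the trained pipeline from the previous sketch lives.)
# nlp = spacy.load("training/model-best")
# print(extract_entities(nlp, "..."))
```

The same swap also works as a mix: an LLM component can stay in the loop for the tasks where it earns its cost while cheaper distilled components handle the rest of the pipeline.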