
Smart shortcuts for bootstrapping a modern NLP project


Within NLP, people often struggle to start projects without decent training data. Nowadays there are many shortcuts that give you a head start, such as active learning, weak supervision, few-shot learning, and cross-lingual models. I will show you how to use them, and how to make them work even better using Ray!

Anyscale

February 23, 2023



Transcript

  1. Ray Europe Meetup
    David Berenstein
    Developer Advocate Engineer
    Shortcuts for
    bootstrapping
    NLP pipelines
    22 Feb 2023


  2. Agenda
    1. Very brief intro to NLP
    2. Argilla
    3. Cool Shortcuts
    a. Weak supervision
    b. Few-shot learning
    c. Active learning
    4. Wrap-up


  3. A brief intro to NLP
    1. ChatGPT
    2. Text Classification
    3. Lexical vs Semantic


  4. Argilla


  5. What is Argilla?
    Open-source labelling platform for data-centric NLP:
    - Quickly build high quality training data
    - Evaluate and improve models over time
    - Enable collaboration between data teams and
    domain experts


  6. Argilla components
    ● Argilla Python Client
    ○ Create and update datasets with annotations and predictions
    ○ Load datasets for model training and evaluation
    ○ Listen to changes in datasets for active learning, evaluation, and training
    ○ Programmatic labelling of datasets
    ● Argilla UI
    ○ Label data manually and with rules based on search queries
    ○ Analyse and review model predictions and training data
    ● Argilla Server + Elasticsearch
    ○ Store and manage datasets
    ○ Compute dataset and record-level metrics
    ● Kibana
    ○ Dashboards and alerts for model monitoring
    ○ Dashboards and alerts for data annotation management
    ● Models and Data sources connect through the Python Client


  7. Argilla Data Model
    A Dataset is a collection of Records.
    Records are defined by their NLP task (e.g., Text Classification).
    Records can contain Metadata, text Inputs, model Predictions, and “human” Annotations.
    https://docs.argilla.io/en/latest/reference/datamodel.html


  8. Argilla Users and Workspaces
    ● A User has a personal User workspace and can share Team workspaces
    ● Each workspace contains its own Datasets


  9. Argilla Users and Workspaces


  10. Argilla: Text Classification


  11. Argilla: Token Classification (NER)


  12. Argilla: Text Generation


  13. Argilla Workflow: Training
    Data source → Read dataset → Create records → Log records →
    Explore & Label in Argilla → Load records → Prepare for training → Train Model


  14. Argilla Workflow: Evaluation
    Data source → Read dataset → Create records (with Model predictions) →
    Log records → Explore & Label in Argilla → Metrics


  15. Argilla Workflow: Monitoring
    Model requests → monitored via ASGI middleware → Create records →
    Log records (async) → Explore & Label in Argilla → Metrics →
    Load records → Prepare for training


  16. Argilla Python: Create records
    https://docs.argilla.io/en/latest/reference/python/python_client.html#argilla.client.models.TextClassificationRecord


  17. Argilla Python: Create records (from pandas/datasets)
    https://docs.argilla.io/en/latest/reference/python/python_client.html#module-argilla.client.datasets


  18. Argilla Python: Log records (write data)
    https://docs.argilla.io/en/latest/reference/python/python_client.html#argilla.log


  19. Argilla Python: Load records (read data)
    https://docs.argilla.io/en/latest/reference/python/python_client.html#argilla.load


  20. Argilla Python: Prepare dataset for training
    https://docs.argilla.io/en/latest/reference/python/python_client.html#argilla.client.datasets.DatasetForTextClassification.prepare_for_training


  21. Argilla Python: Export records (write dataset to disk)
    https://docs.argilla.io/en/latest/reference/python/python_client.html#argilla.load


  22. Argilla Python: Metrics
    https://docs.argilla.io/en/latest/reference/python/python_metrics.html#python-metrics


  23. News datasets
    ● https://huggingface.co/datasets/argilla/news
    ● Basic Text Classification
    ○ Sci/Tech
    ○ Business
    ○ Sports
    ○ World


  24. Data Exploration - basic UI
    ● Keywords
    ● Text Queries (Lucene QL)
    ○ politics AND president
    ○ sports OR ball
    ○ invest*
    ● Filters


  25. Token attributions - highlight nouns


  26. Similarity search - embeddings


  27. Brief Demo


  28. Data Exploration - UMAP
    ● Fast Sentence Transformers
    ● UMAP
    ● Plotly Express + Chart Studio


  29. Ray[data] - distributed NLP pre-processing
    ● connect to (local) ray cluster
    ● distributed data and models
    ○ clean text
    ○ add POS
    ○ add embeddings
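A rough sketch of what such a pre-processing step can look like with Ray Datasets; `clean_text` is an illustrative stand-in for the clean/POS/embedding steps:

```python
import re

def clean_text(batch):
    # Normalise whitespace in a batch of records ({"text": [...]}).
    batch["text"] = [re.sub(r"\s+", " ", t).strip() for t in batch["text"]]
    return batch

def preprocess(texts):
    import ray  # only needed for the distributed part
    ray.init(ignore_reinit_error=True)  # connect to a (local) Ray cluster
    ds = ray.data.from_items([{"text": t} for t in texts])
    # map_batches runs clean_text in parallel across the cluster
    return ds.map_batches(clean_text)
```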


  30. Weak Supervision


  31. Weak supervision: an overview


  32. Weak supervision: lexical rules
    ● Query Argilla
    ● Choose Relevant Keywords
    ● Define Rules
    ○ human readable
    ○ quick and easy
    ○ no programming
    ● Apply rules
    ● Prepare for training
    rules within UI
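In Argilla the rules above are search queries defined in the UI; as a simplified pure-Python stand-in, a keyword rule is just "if the query matches, vote for the label" (keywords below are illustrative):

```python
# Illustrative keyword rules -- in Argilla these are search queries
# such as "sports OR ball" defined and tested in the UI.
RULES = {
    "Sports":   ["ball", "league", "match"],
    "Business": ["invest", "stocks", "market"],
    "Sci/Tech": ["nasa", "software", "quantum"],
}

def apply_rules(text):
    """Return every label whose rule fires; a label model or majority
    vote later resolves records with conflicting votes."""
    text = text.lower()
    return [label for label, keywords in RULES.items()
            if any(kw in text for kw in keywords)]
```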


  33. Weak supervision: semantic exemplars
    ● load annotated data
    ● average embeddings of annotations
    ● search most similar
    ● assign annotations as predictions
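The steps above can be sketched with plain NumPy (in practice the embeddings would come from e.g. a sentence-transformers model; the toy vectors in the example are 2-D):

```python
import numpy as np

def label_by_exemplar(embeddings, labels, unlabeled):
    """Average the embeddings of annotated records per class, then give
    each unlabeled embedding the label of its most similar class
    centroid (cosine similarity)."""
    classes = sorted(set(labels))
    centroids = np.stack([
        np.mean([e for e, l in zip(embeddings, labels) if l == c], axis=0)
        for c in classes
    ])
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
    unlabeled = unlabeled / np.linalg.norm(unlabeled, axis=1, keepdims=True)
    similarities = unlabeled @ centroids.T
    return [classes[i] for i in similarities.argmax(axis=1)]
```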


  34. Few-shot learning


    35. Few-shot learning
    ● 8 annotations per class
    ● classy-classification
    ○ with spaCy or sentence-transformers
    ○ uses ONNX
    ● “training” and inference runs in 1 minute (n=500)
    BONUS
    ● multi-lingual model => multi-lingual predictions
    ● solves low resource language problems
    ○ crosslingual-coreference
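classy-classification fits a classifier on sentence embeddings of a handful of examples per class; a rough scikit-learn sketch of that idea, with TF-IDF standing in for the sentence-transformer embeddings (the examples are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# A few labelled examples per class (classy-classification suggests ~8).
train = {
    "Sports":   ["the team won the match", "he scored a late goal"],
    "Business": ["stocks closed higher today", "the firm reported strong profits"],
}
texts = [t for examples in train.values() for t in examples]
labels = [label for label, examples in train.items() for _ in examples]

# Embed (here: TF-IDF) and fit a probabilistic SVM on the few shots.
classifier = make_pipeline(TfidfVectorizer(), SVC(probability=True))
classifier.fit(texts, labels)
```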


  36. Active learning


  37. Active learning: an overview


  38. Active learning
    ● work smarter
    ● listen to newly annotated samples
    ● update your model
    ● make new predictions
    demo
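The loop above annotates where the model is least sure; a minimal least-confidence sampler (the probabilities in the example are made up):

```python
def most_uncertain(probabilities, k=2):
    """Indices of the k records whose top predicted class probability is
    lowest -- these go to the annotator first, and the model is updated
    once their annotations come back."""
    confidence = [max(p) for p in probabilities]
    ranked = sorted(range(len(confidence)), key=confidence.__getitem__)
    return ranked[:k]
```

In Argilla this pairs with `argilla.listeners`, which can poll a dataset and trigger the update step whenever new annotations arrive (e.g. with small-text).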


  39. Training a formal model


  40. ray[train, tune]
    ● weights and biases


  41. Wrap up


  42. ● what resources DO I have?
    ● be creative
    ● iteration is key


  43. Crosslingual Coreference


  44. Concise Concepts


  45. Sources
    ● Argilla
    ○ Repo https://github.com/argilla-io/argilla
    ○ Cool Datasets on the hub https://huggingface.co/argilla
    ○ Deploy with the click of a button
    ○ LinkedIn https://www.linkedin.com/company/argilla-io/
    ● Me
    ○ My packages https://github.com/Pandora-Intelligence
    ○ Volunteering https://bonfari.nl/
    ● Other Packages
    ○ Few shot learning
    ■ SetFit
    ○ Weak supervision
    ■ Skweak
    ○ Active learning
    ■ small-text


  46. Feedback and questions
