Smart shortcuts for bootstrapping a modern NLP project

Slide 1

Slide 1 text

Ray Europe Meetup
David Berenstein, Developer Advocate Engineer
Shortcuts for bootstrapping NLP pipelines
22 Feb 2023

Slide 2

Slide 2 text

Agenda
1. Very brief intro to NLP
2. Argilla
3. Cool Shortcuts
   a. Weak supervision
   b. Few-shot learning
   c. Active learning
4. Wrap-up

Slide 3

Slide 3 text

A brief intro to NLP
1. ChatGPT
2. Text Classification
3. Lexical vs Semantic

Slide 4

Slide 4 text

Argilla

Slide 5

Slide 5 text

What is Argilla?
Open-source labelling platform for data-centric NLP:
- Quickly build high-quality training data
- Evaluate and improve models over time
- Enable collaboration between data teams and domain experts

Slide 6

Slide 6 text

Argilla components: Argilla Server, Elasticsearch, Argilla Python Client, Argilla UI, Kibana (alongside Models and Data sources)
● Create and update datasets with annotations and predictions
● Load datasets for model training and evaluation
● Listen to changes in datasets for active learning, evaluation, and training
● Programmatic labelling of datasets
● Label data manually and with rules based on search queries
● Analyse and review model predictions and training data
● Store and manage datasets
● Compute dataset and record-level metrics
● Dashboards and alerts for model monitoring
● Dashboards and alerts for data annotation management

Slide 7

Slide 7 text

Argilla Data Model
A Dataset is a collection of Records. Records are defined by their NLP task (e.g., text classification). Records can contain Metadata, text Inputs, model Predictions, and “human” Annotations.
https://docs.argilla.io/en/latest/reference/datamodel.html
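The data model above can be pictured with a small sketch. The `Record` and `Dataset` classes below are illustrative stand-ins written with plain dataclasses; they are assumptions for this example and not Argilla's actual classes.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class Record:
    """Illustrative text-classification record (hypothetical, not Argilla's class)."""
    text: str                                                  # text input
    metadata: Dict[str, Any] = field(default_factory=dict)     # free-form metadata
    prediction: List[Tuple[str, float]] = field(default_factory=list)  # (label, score) from a model
    annotation: Optional[str] = None                           # label assigned by a human


@dataclass
class Dataset:
    """A named collection of records that share one NLP task."""
    name: str
    task: str
    records: List[Record] = field(default_factory=list)

    def unlabelled(self) -> List[Record]:
        """Records that still need a human annotation."""
        return [r for r in self.records if r.annotation is None]


news = Dataset(name="news", task="TextClassification")
news.records.append(
    Record(
        text="Stocks rally after earnings",
        prediction=[("Business", 0.8)],
        metadata={"source": "example"},
    )
)
news.records.append(Record(text="Cup final goes to penalties", annotation="Sports"))
print(len(news.unlabelled()))
```

Keeping predictions and annotations as separate fields is what lets a labelling tool compare model output against human labels for the same record.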

Slide 8

Slide 8 text

Argilla Users and Workspaces User User workspace Team workspace Datasets Datasets

Slide 9

Slide 9 text

Argilla Users and Workspaces

Slide 10

Slide 10 text

Argilla: Text Classification

Slide 11

Slide 11 text

Argilla: Token Classification (NER)

Slide 12

Slide 12 text

Argilla: Text Generation

Slide 13

Slide 13 text

Argilla Workflow: Training Read dataset Data source Create records Model Log records Explore Label Load records Prepare for training Train Model

Slide 14

Slide 14 text

Argilla Workflow: Evaluation Metrics Explore Label Read dataset Data source Create records Model Log records

Slide 15

Slide 15 text

Argilla Workflow: Monitoring Metrics Explore Label Create records Log records (async) Model Model requests monitor ASGI middleware Load records prepare for training

Slide 16

Slide 16 text

Argilla Python: Create records https://docs.argilla.io/en/latest/reference/python/python_client.html#argilla.client.models.TextClassificationRecord

Slide 17

Slide 17 text

Argilla Python: Create records (from pandas/datasets) https://docs.argilla.io/en/latest/reference/python/python_client.html#module-argilla.client.datasets

Slide 18

Slide 18 text

Argilla Python: Log records (write data) https://docs.argilla.io/en/latest/reference/python/python_client.html#argilla.log

Slide 19

Slide 19 text

Argilla Python: Load records (read data) https://docs.argilla.io/en/latest/reference/python/python_client.html#argilla.load

Slide 20

Slide 20 text

Argilla Python: Prepare dataset for training https://docs.argilla.io/en/latest/reference/python/python_client.html#argilla.client.datasets.DatasetForTextClassification.prepare_for_training

Slide 21

Slide 21 text

Argilla Python: Export records (write dataset to disk) https://docs.argilla.io/en/latest/reference/python/python_client.html#argilla.load

Slide 22

Slide 22 text

Argilla Python: Metrics https://docs.argilla.io/en/latest/reference/python/python_metrics.html#python-metrics

Slide 23

Slide 23 text

News datasets
● https://huggingface.co/datasets/argilla/news
● Basic Text Classification
  ○ Sci/Tech
  ○ Business
  ○ Sports
  ○ World

Slide 24

Slide 24 text

Data Exploration - basic UI
● Keywords
● Text Queries (Lucene QL)
  ○ politics AND president
  ○ sports OR ball
  ○ invest*
● Filters
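To make the three example queries concrete, here is a deliberately simplified matcher written in plain Python. It is only a sketch of the semantics of `AND`, `OR`, and a trailing `*` wildcard; it is not Lucene or Elasticsearch, which also apply analyzers, scoring, and a richer grammar. The function names are hypothetical.

```python
import re


def tokens(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def match_term(term, words):
    """Match one term; a trailing '*' means prefix match."""
    term = term.lower()
    if term.endswith("*"):
        return any(w.startswith(term[:-1]) for w in words)
    return term in words


def matches(query, text):
    """Evaluate 'a AND b', 'a OR b', and 'prefix*' style queries.

    Simplification: OR separates clauses, AND binds tighter inside a clause.
    """
    words = tokens(text)
    for clause in query.split(" OR "):
        if all(match_term(t, words) for t in clause.split(" AND ")):
            return True
    return False


print(matches("politics AND president", "The president talks politics"))
print(matches("sports OR ball", "He kicked the ball"))
print(matches("invest*", "Investors are investing"))
```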

Slide 25

Slide 25 text

Token attributions - highlight nouns

Slide 26

Slide 26 text

Similarity search - embeddings

Slide 27

Slide 27 text

Brief Demo

Slide 28

Slide 28 text

Data Exploration - UMAP ● Fast Sentence Transformers ● UMAP ● Plotly Express + Chart Studio

Slide 29

Slide 29 text

Ray[data] - distributed NLP pre-processing
● connect to (local) Ray cluster
● distributed data and models
  ○ clean text
  ○ add POS
  ○ add embeddings

Slide 30

Slide 30 text

Weak Supervision

Slide 31

Slide 31 text

Weak supervision: an overview

Slide 32

Slide 32 text

Weak supervision: lexical rules
● Query Argilla
● Choose Relevant Keywords
● Define Rules
  ○ human readable
  ○ quick and easy
  ○ no programming
● Apply rules
● Prepare for training
rules within UI
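The idea behind lexical rules can be sketched without any tooling: each rule maps a keyword to a label, every matching rule casts a vote, and the majority label becomes a weak label. The rules below are hypothetical examples for the four news classes, and the majority vote with abstention on ties is an assumption of this sketch, not necessarily how Argilla combines rules.

```python
import re
from collections import Counter

# Hypothetical keyword -> label rules for the news classes.
RULES = [
    ("president", "World"),
    ("election", "World"),
    ("stocks", "Business"),
    ("investors", "Business"),
    ("match", "Sports"),
    ("software", "Sci/Tech"),
]


def weak_label(text):
    """Return the majority label of all matching rules, or None to abstain."""
    words = set(re.findall(r"\w+", text.lower()))
    votes = Counter(label for keyword, label in RULES if keyword in words)
    if not votes:
        return None  # no rule fired
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie between labels: abstain
    return ranked[0][0]


print(weak_label("Investors cheer as stocks climb"))
print(weak_label("The weather is nice"))
```

Abstaining on ties keeps noisy records out of the weakly labelled training set, at the cost of lower coverage.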

Slide 33

Slide 33 text

Weak supervision: semantic exemplars
● load annotated data
● average embeddings of annotations
● search most similar
● assign annotations as predictions
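The four steps above can be sketched in plain Python. The three-dimensional vectors are toy stand-ins for real sentence embeddings, and the record ids, `top_k` value, and helper names are assumptions made for illustration.

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# 1. Load annotated data (toy embeddings grouped by annotation).
annotated = {
    "Sports": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "Business": [[0.1, 0.9, 0.1], [0.0, 0.8, 0.2]],
}

# 2. Average the embeddings of each annotation into one exemplar.
exemplars = {label: mean_vector(vecs) for label, vecs in annotated.items()}

# Unlabelled records, keyed by a hypothetical record id.
unlabelled = {
    "r1": [0.7, 0.3, 0.0],
    "r2": [0.1, 0.7, 0.2],
    "r3": [0.2, 0.2, 0.9],
}

# 3. Search the most similar records per exemplar, and
# 4. assign the exemplar's annotation as a prediction with its similarity score.
top_k = 1
predictions = {}
for label, exemplar in exemplars.items():
    ranked = sorted(
        unlabelled.items(), key=lambda item: cosine(exemplar, item[1]), reverse=True
    )
    for record_id, embedding in ranked[:top_k]:
        predictions[record_id] = (label, cosine(exemplar, embedding))

print(predictions)
```

Records that are not close to any exemplar (here `r3`) simply receive no prediction and stay available for manual labelling.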

Slide 34

Slide 34 text

Few-shot learning

Slide 35

Slide 35 text

Few-shot learning
● 8 annotations per class
● classy-classification
  ○ with spaCy or sentence-transformers
  ○ uses ONNX
● “training” and inference run in 1 minute (n=500)
BONUS
● multi-lingual model => multi-lingual predictions
● solves low-resource language problems
  ○ crosslingual-coreference

Slide 36

Slide 36 text

Active learning

Slide 37

Slide 37 text

Active learning: an overview

Slide 38

Slide 38 text

Active learning
● work smarter
● listen to newly annotated samples
● update your model
● make new predictions
demo
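One way to "work smarter" is to decide which records to annotate next. The sketch below uses least-confidence uncertainty sampling, a common query strategy chosen here for illustration; the slides do not state which strategy the demo uses. The probabilities and record ids are made-up values.

```python
def least_confident(probabilities, batch_size=2):
    """Return the ids of the records whose top class probability is lowest."""
    ranked = sorted(probabilities, key=lambda rid: max(probabilities[rid].values()))
    return ranked[:batch_size]


# Hypothetical model predictions for four unlabelled records.
probs = {
    "r1": {"Sports": 0.95, "Business": 0.05},
    "r2": {"Sports": 0.55, "Business": 0.45},
    "r3": {"Sports": 0.20, "Business": 0.80},
    "r4": {"Sports": 0.51, "Business": 0.49},
}

# These records go to the annotator first; once they are labelled, the model
# is updated and new predictions are made for the remaining records.
queue = least_confident(probs)
print(queue)
```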

Slide 39

Slide 39 text

Training a formal model

Slide 40

Slide 40 text

ray[train, tune] ● Weights & Biases

Slide 41

Slide 41 text

Wrap up

Slide 42

Slide 42 text

● what resources DO I have ● be creative ● iteration is key

Slide 43

Slide 43 text

Crosslingual Coreference

Slide 44

Slide 44 text

Concise Concepts

Slide 45

Slide 45 text

Sources
● Argilla
  ○ Repo https://github.com/argilla-io/argilla
  ○ Cool Datasets on the hub https://huggingface.co/argilla
  ○ Deploy with the click of a button
  ○ LinkedIn https://www.linkedin.com/company/argilla-io/
● Me
  ○ My packages https://github.com/Pandora-Intelligence
  ○ Volunteering https://bonfari.nl/
● Other Packages
  ○ Few-shot learning
    ■ SetFit
  ○ Weak supervision
    ■ Skweak
  ○ Active learning
    ■ small-text

Slide 46

Slide 46 text

Feedback and questions