
Smart shortcuts for bootstrapping a modern NLP project

February 23, 2023


Within NLP, people often struggle to start projects without decent training data. Nowadays, there are many shortcuts that give such projects a head start, using techniques such as active learning, weak supervision, few-shot learning, and cross-lingual models. I will show you how to use them and make them work even better using Ray!





  1. Agenda
     1. Very brief intro to NLP
     2. Argilla
     3. Cool shortcuts
        a. Weak supervision
        b. Few-shot learning
        c. Active learning
     4. Wrap-up
  2. What is Argilla? Open-source labelling platform for data-centric NLP:
     • Quickly build high-quality training data
     • Evaluate and improve models over time
     • Enable collaboration between data teams and domain experts
  3. Argilla components
     • Argilla Server + Elasticsearch: store and manage datasets; compute dataset- and record-level metrics
     • Argilla Python Client: create and update datasets with annotations and predictions; load datasets for model training and evaluation; listen to changes in datasets for active learning, evaluation, and training; programmatic labelling of datasets
     • Argilla UI: label data manually and with rules based on search queries; analyse and review model predictions and training data
     • Kibana: dashboards and alerts for model monitoring and for data annotation management
     • Models and Data sources feed records into the platform
  4. Argilla Data Model
     • A Dataset is a collection of Records.
     • Records are defined by their NLP task (e.g., text classification).
     • Records can contain Metadata, text Inputs, model Predictions, and “human” Annotations.
     https://docs.argilla.io/en/latest/reference/datamodel.html
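As a rough illustration of the data model above, here is a minimal stand-in sketch using plain dataclasses. These are hypothetical classes for explanation only; the real client classes live in the `argilla` package.

```python
from dataclasses import dataclass, field
from typing import Optional

# Minimal sketch of the Dataset/Record model described on the slide
# (illustrative stand-ins, not the actual argilla client classes).

@dataclass
class Record:
    inputs: dict                                      # text inputs, e.g. {"text": "..."}
    metadata: dict = field(default_factory=dict)      # arbitrary extra fields
    prediction: list = field(default_factory=list)    # model predictions as (label, score)
    annotation: Optional[str] = None                  # "human" gold label

@dataclass
class Dataset:
    task: str                                         # e.g. "TextClassification"
    records: list = field(default_factory=list)       # a Dataset is a collection of Records

ds = Dataset(task="TextClassification")
ds.records.append(Record(inputs={"text": "Stocks rallied today."},
                         prediction=[("business", 0.92)],
                         annotation="business"))
```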
  5. Argilla Workflow: Training
     Data source → Read dataset → Create records → Log records → Explore / Label → Load records → Prepare for training → Train Model
  6. Argilla Workflow: Monitoring
     Model requests → monitor (ASGI middleware) → Create records → Log records (async) → Metrics / Explore / Label → Load records → Prepare for training → Model
  7. Data Exploration - basic UI
     • Keywords
     • Text Queries (Lucene QL)
       ◦ politics AND president
       ◦ sports OR ball
       ◦ invest*
     • Filters
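The three query styles above can be mimicked in plain Python for illustration. This `matches` helper is hypothetical, purely to show what AND, OR, and wildcard-prefix queries select; Argilla's real search runs on Elasticsearch/Lucene.

```python
import re

# Toy stand-in for the Lucene-style queries on the slide
# (hypothetical helper, not Argilla's actual search backend).

def matches(text, *, all_of=(), any_of=(), prefix=None):
    tokens = re.findall(r"\w+", text.lower())
    if all_of and not all(t in tokens for t in all_of):           # politics AND president
        return False
    if any_of and not any(t in tokens for t in any_of):           # sports OR ball
        return False
    if prefix and not any(t.startswith(prefix) for t in tokens):  # invest*
        return False
    return True
```

For example, `matches("The president spoke on politics", all_of=("politics", "president"))` selects the record, just as the AND query would.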
  8. Ray[data] - distributed NLP pre-processing
     • connect to a (local) Ray cluster
     • distribute data and models to:
       ◦ clean text
       ◦ add POS tags
       ◦ add embeddings
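The per-record steps above can be sketched as plain functions; with Ray they would be applied distributed over a dataset, roughly `ray.data.from_items(rows).map(clean_text).map(add_embedding)` (the exact Ray Datasets API depends on your Ray version). The "embedding" here is a deliberately trivial placeholder so the sketch runs without any model.

```python
import re

# Pre-processing steps from the slide as plain per-record functions.
# With Ray these would run distributed via Ray Datasets' map() calls.

def clean_text(record):
    # normalise whitespace and case
    text = re.sub(r"\s+", " ", record["text"]).strip().lower()
    return {**record, "text": text}

def add_embedding(record):
    # placeholder "embedding": vowel counts; a real pipeline would call
    # a sentence-transformers (or similar) model here
    vec = [record["text"].count(c) for c in "aeiou"]
    return {**record, "embedding": vec}

rows = [{"text": "  Ray  makes NLP  pre-processing easy  "}]
processed = [add_embedding(clean_text(r)) for r in rows]
```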
  9. Weak supervision: lexical rules
     • Query Argilla
     • Choose relevant keywords
     • Define rules within the UI
       ◦ human readable
       ◦ quick and easy
       ◦ no programming
     • Apply rules
     • Prepare for training
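The gist of lexical rules can be shown with a toy Python version: keyword rules vote on a label, and the votes become noisy training labels. The rules below are made up for illustration; in Argilla you define rules as search queries in the UI, with no programming.

```python
# Toy weak supervision with lexical rules (hypothetical rules).

RULES = {                      # keyword -> label
    "president": "politics",
    "ball": "sports",
    "invest": "finance",
}

def apply_rules(text):
    # collect a vote from every rule whose keyword appears in the text
    votes = [label for kw, label in RULES.items() if kw in text.lower()]
    # majority vote; None means all rules abstained (record stays unlabelled)
    return max(set(votes), key=votes.count) if votes else None
```

Records the rules abstain on simply stay unlabelled, which is why weak supervision pairs well with the exploration and annotation workflow above.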
  10. Weak supervision: semantic exemplars
     • load annotated data
     • average the embeddings of the annotations
     • search for the most similar records
     • assign annotations as predictions
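The steps above can be sketched as: average the embeddings of annotated records per label into a centroid, then assign each unlabelled record the label of its most similar centroid. The tiny 2-d vectors below are stand-ins for real sentence embeddings.

```python
import math

# Semantic-exemplar weak supervision, sketched with toy 2-d "embeddings".

def centroid(vectors):
    # average the embeddings of the annotated records, dimension-wise
    return [sum(dims) / len(vectors) for dims in zip(*vectors)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

annotated = {"sports":   [[0.9, 0.1], [0.8, 0.2]],
             "politics": [[0.1, 0.9], [0.2, 0.8]]}
centroids = {label: centroid(vecs) for label, vecs in annotated.items()}

def predict(vec):
    # most similar centroid -> assigned as a (weak) prediction
    return max(centroids, key=lambda lbl: cosine(vec, centroids[lbl]))
```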
  11. Few-shot learning
     • 8 annotations per class
     • classy-classification
       ◦ with spaCy or sentence-transformers
       ◦ uses ONNX
     • “training” and inference run in 1 minute (n=500)
     BONUS
     • multi-lingual model => multi-lingual predictions
     • solves low-resource language problems
       ◦ crosslingual-coreference
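As a rough, self-contained illustration of few-shot classification from a handful of examples per class: score each class against its examples and return the best match. classy-classification does this properly with spaCy or sentence-transformers embeddings; the word-overlap similarity below is a toy stand-in so the sketch runs without any model.

```python
# Toy few-shot classifier (illustrative; not classy-classification's API).

FEW_SHOT = {   # in the talk: ~8 annotated examples per class
    "sports":  ["they won the match", "the ball went out", "a great goal"],
    "finance": ["stocks went up", "invest in bonds", "the market crashed"],
}

def similarity(a, b):
    # Jaccard overlap of word sets, standing in for embedding similarity
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def classify(text):
    # score each class by its best-matching few-shot example
    return max(FEW_SHOT,
               key=lambda lbl: max(similarity(text, ex) for ex in FEW_SHOT[lbl]))
```

With real multi-lingual embeddings in place of the word overlap, the same scheme gives the multi-lingual predictions mentioned in the BONUS bullet.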
  12. Active learning
     • work smarter
     • listen to newly annotated samples
     • update your model
     • make new predictions
     (demo)
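One loop iteration of the idea above can be sketched as uncertainty sampling: ask the human about the sample the model is least certain on, record the annotation, then retrain and re-score. The pool of scored records below is made-up data for illustration.

```python
import random

# Sketch of one active-learning iteration (uncertainty sampling).

random.seed(0)
pool = [{"text": f"doc {i}", "score": random.random()} for i in range(10)]

def most_uncertain(records):
    # for a binary classifier, a score near 0.5 means "most uncertain"
    return min(records, key=lambda r: abs(r["score"] - 0.5))

def loop_once(records, annotate):
    query = most_uncertain(records)    # 1. pick the sample to label next
    query["label"] = annotate(query)   # 2. human annotates it
    # 3. retrain the model on the labelled records, 4. re-score the pool
    return query
```

In the Argilla workflow, step 2 happens in the UI and the Python client's dataset listeners trigger steps 3 and 4 when new annotations arrive.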
  13. Sources
     • Argilla
       ◦ Repo: https://github.com/argilla-io/argilla
       ◦ Cool datasets on the hub: https://huggingface.co/argilla
       ◦ Deploy with the click of a button
       ◦ LinkedIn: https://www.linkedin.com/company/argilla-io/
     • Me
       ◦ My packages: https://github.com/Pandora-Intelligence
       ◦ Volunteering: https://bonfari.nl/
     • Other packages
       ◦ Few-shot learning: SetFit
       ◦ Weak supervision: Skweak
       ◦ Active learning: small-text