
Smart shortcuts for bootstrapping a modern NLP project


Within NLP, people often struggle to start projects without decent training data. Nowadays there are many shortcuts that give you a head start, such as active learning, weak supervision, few-shot learning, and cross-lingual models. I will show you how to use them, and how to make them work even better using Ray!

Anyscale

February 23, 2023



Transcript

  1. Ray Europe Meetup
    David Berenstein
    Developer Advocate Engineer
    Shortcuts for
    bootstrapping
    NLP pipelines
    22 Feb 2023


  2. Agenda
    1. Very brief intro to NLP
    2. Argilla
    3. Cool Shortcuts
    a. Weak supervision
    b. Few-shot learning
    c. Active learning
    4. Wrap-up


  3. A brief intro to NLP
    1. ChatGPT
    2. Text Classification
    3. Lexical vs Semantic


  4. Argilla


  5. What is Argilla?
    Open-source labelling platform for data-centric NLP:
    - Quickly build high quality training data
    - Evaluate and improve models over time
    - Enable collaboration between data teams and
    domain experts


  6. Argilla components
    ● Argilla Python Client
    ○ Create and update datasets with annotations and predictions
    ○ Load datasets for model training and evaluation
    ○ Listen to changes in datasets for active learning, evaluation, and training
    ○ Programmatic labelling of datasets
    ● Argilla UI
    ○ Label data manually and with rules based on search queries
    ○ Analyse and review model predictions and training data
    ● Argilla Server + Elasticsearch
    ○ Store and manage datasets
    ○ Compute dataset and record-level metrics
    ● Kibana
    ○ Dashboards and alerts for model monitoring
    ○ Dashboards and alerts for data annotation management
    ● Models and Data sources connect through the Python Client


  7. Argilla Data Model
    A Dataset is a collection of Records.
    Records are defined by their NLP task (e.g., Text Classification).
    Records can contain Metadata, text Inputs, model Predictions, and “human” Annotations.
    https://docs.argilla.io/en/latest/reference/datamodel.html


  8. Argilla Users and Workspaces
    ● A User has a personal User workspace and can share Team workspaces
    ● Each workspace contains its own Datasets


  9. Argilla Users and Workspaces


  10. Argilla: Text Classification


  11. Argilla: Token Classification (NER)


  12. Argilla: Text Generation


  13. Argilla Workflow: Training
    Data source → Read dataset → Create records → Log records →
    Explore & Label in Argilla → Load records → Prepare for training → Train Model


  14. Argilla Workflow: Evaluation
    Data source → Read dataset → Create records (with Model predictions) →
    Log records → Explore & Label in Argilla → Metrics


  15. Argilla Workflow: Monitoring
    Model requests → monitored via ASGI middleware → Create records →
    Log records (async) → Explore & Label in Argilla → Metrics →
    Load records → Prepare for training


  16. Argilla Python: Create records
    https://docs.argilla.io/en/latest/reference/python/python_client.html#argilla.client.models.TextClassificationRecord


  17. Argilla Python: Create records (from pandas/datasets)
    https://docs.argilla.io/en/latest/reference/python/python_client.html#module-argilla.client.datasets


  18. Argilla Python: Log records (write data)
    https://docs.argilla.io/en/latest/reference/python/python_client.html#argilla.log


  19. Argilla Python: Load records (read data)
    https://docs.argilla.io/en/latest/reference/python/python_client.html#argilla.load


  20. Argilla Python: Prepare dataset for training
    https://docs.argilla.io/en/latest/reference/python/python_client.html#argilla.client.datasets.DatasetForTextClassification.prepare_for_training


  21. Argilla Python: Export records (write dataset to disk)
    https://docs.argilla.io/en/latest/reference/python/python_client.html#argilla.load


  22. Argilla Python: Metrics
    https://docs.argilla.io/en/latest/reference/python/python_metrics.html#python-metrics


  23. News datasets
    ● https://huggingface.co/datasets/argilla/news
    ● Basic Text Classification
    ○ Sci/Tech
    ○ Business
    ○ Sports
    ○ World


  24. Data Exploration - basic UI
    ● Keywords
    ● Text Queries (Lucene QL)
    ○ politics AND president
    ○ sports OR ball
    ○ invest*
    ● Filters


  25. Token attributions - highlight nouns


  26. Similarity search - embeddings


  27. Brief Demo


  28. Data Exploration - UMAP
    ● Fast Sentence Transformers
    ● UMAP
    ● Plotly Express + Chart Studio


  29. Ray[data] - distributed NLP pre-processing
    ● connect to (local) ray cluster
    ● distributed data and models
    ○ clean text
    ○ add POS
    ○ add embeddings
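A rough sketch of what such a pre-processing step can look like with Ray Datasets; `clean_text` is an illustrative stand-in for the clean/POS/embedding steps:

```python
import re

def clean_text(batch):
    # Normalise whitespace in a batch of records ({"text": [...]}).
    batch["text"] = [re.sub(r"\s+", " ", t).strip() for t in batch["text"]]
    return batch

def preprocess(texts):
    import ray  # only needed for the distributed part
    ray.init(ignore_reinit_error=True)  # connect to a (local) Ray cluster
    ds = ray.data.from_items([{"text": t} for t in texts])
    # map_batches runs clean_text in parallel across the cluster
    return ds.map_batches(clean_text)
```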


  30. Weak Supervision


  31. Weak supervision: an overview


  32. Weak supervision: lexical rules
    ● Query Argilla
    ● Choose Relevant Keywords
    ● Define Rules
    ○ human readable
    ○ quick and easy
    ○ no programming
    ● Apply rules
    ● Prepare for training
    rules within UI
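In Argilla the rules above are search queries defined in the UI; as a simplified pure-Python stand-in, a keyword rule is just "if the query matches, vote for the label" (keywords below are illustrative):

```python
# Illustrative keyword rules -- in Argilla these are search queries
# such as "sports OR ball" defined and tested in the UI.
RULES = {
    "Sports":   ["ball", "league", "match"],
    "Business": ["invest", "stocks", "market"],
    "Sci/Tech": ["nasa", "software", "quantum"],
}

def apply_rules(text):
    """Return every label whose rule fires; a label model or majority
    vote later resolves records with conflicting votes."""
    text = text.lower()
    return [label for label, keywords in RULES.items()
            if any(kw in text for kw in keywords)]
```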


  33. Weak supervision: semantic exemplars
    ● load annotated data
    ● average embeddings of annotations
    ● search most similar
    ● assign annotations as predictions
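The steps above can be sketched with plain NumPy (in practice the embeddings would come from e.g. a sentence-transformers model; the toy vectors in the example are 2-D):

```python
import numpy as np

def label_by_exemplar(embeddings, labels, unlabeled):
    """Average the embeddings of annotated records per class, then give
    each unlabeled embedding the label of its most similar class
    centroid (cosine similarity)."""
    classes = sorted(set(labels))
    centroids = np.stack([
        np.mean([e for e, l in zip(embeddings, labels) if l == c], axis=0)
        for c in classes
    ])
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
    unlabeled = unlabeled / np.linalg.norm(unlabeled, axis=1, keepdims=True)
    similarities = unlabeled @ centroids.T
    return [classes[i] for i in similarities.argmax(axis=1)]
```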


  34. Few-shot learning


    35. Few-shot learning
    ● 8 annotations per class
    ● classy-classification
    ○ with spaCy or sentence-transformers
    ○ uses ONNX
    ● “training” and inference runs in 1 minute (n=500)
    BONUS
    ● multi-lingual model => multi-lingual predictions
    ● solves low resource language problems
    ○ crosslingual-coreference
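classy-classification fits a classifier on sentence embeddings of a handful of examples per class; a rough scikit-learn sketch of that idea, with TF-IDF standing in for the sentence-transformer embeddings (the examples are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# A few labelled examples per class (classy-classification suggests ~8).
train = {
    "Sports":   ["the team won the match", "he scored a late goal"],
    "Business": ["stocks closed higher today", "the firm reported strong profits"],
}
texts = [t for examples in train.values() for t in examples]
labels = [label for label, examples in train.items() for _ in examples]

# Embed (here: TF-IDF) and fit a probabilistic SVM on the few shots.
classifier = make_pipeline(TfidfVectorizer(), SVC(probability=True))
classifier.fit(texts, labels)
```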


  36. Active learning


  37. Active learning: an overview


  38. Active learning
    ● work smarter
    ● listen to newly annotated samples
    ● update your model
    ● make new predictions
    demo
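The loop above annotates where the model is least sure; a minimal least-confidence sampler (the probabilities in the example are made up):

```python
def most_uncertain(probabilities, k=2):
    """Indices of the k records whose top predicted class probability is
    lowest -- these go to the annotator first, and the model is updated
    once their annotations come back."""
    confidence = [max(p) for p in probabilities]
    ranked = sorted(range(len(confidence)), key=confidence.__getitem__)
    return ranked[:k]
```

In Argilla this pairs with `argilla.listeners`, which can poll a dataset and trigger the update step whenever new annotations arrive (e.g. with small-text).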


  39. Training a formal model


  40. ray[train, tune]
    ● weights and biases


  41. Wrap up


  42. ● what resources DO I have?
    ● be creative
    ● iteration is key


  43. Crosslingual Coreference


  44. Concise Concepts


  45. Sources
    ● Argilla
    ○ Repo https://github.com/argilla-io/argilla
    ○ Cool Datasets on the hub https://huggingface.co/argilla
    ○ Deploy with the click of a button
    ○ LinkedIn https://www.linkedin.com/company/argilla-io/
    ● Me
    ○ My packages https://github.com/Pandora-Intelligence
    ○ Volunteering https://bonfari.nl/
    ● Other Packages
    ○ Few shot learning
    ■ SetFit
    ○ Weak supervision
    ■ Skweak
    ○ Active learning
    ■ small-text


  46. Feedback and questions
