
2023-06-15-odsc

Sofie Van Landeghem
June 21, 2023


spaCy: A customizable NLP toolkit designed for developers

Presentation given by Sofie Van Landeghem at ODSC 2023


Transcript

  1. spaCy: A customizable NLP toolkit designed for developers

    Sofie Van Landeghem, PhD
    Core maintainer of spaCy
    Open Source Team Lead @ Explosion
    ODSC, London, June 2023
  2. Natural Language Processing

    "... There were 26 complete responses (16%) and 0 partial responses (0%) … The median progression-free survival time was 65 months ..."

    CR (%)  PR (%)  PFS (%)
    16      0       65

    Sofie Van Landeghem, ODSC 2023
  3. Outline

    Part I   spaCy: A toolkit for Natural Language Processing, designed for developers
    Part II  spacy-llm: Integrating Large Language Models into structured NLP pipelines
  4. spaCy

    ➢ Free, open-source library
    ➢ Designed for production use
    ➢ Focus on developer productivity
    ➢ Free course: https://course.spacy.io

    https://github.com/explosion/spaCy
  5. spaCy pipelines

    ➢ A modular, pipeline approach for linguistic analysis
    ➢ Transforming unstructured text into structured data objects like spaCy’s Doc

    [Figure: annotated text; the only label recoverable from the slide is ORG]
  6. spaCy: Empower developers

    Business case
    ➢ Understand business application
    ➢ Analyse downstream needs of NLP solution

    Quick prototype
    ➢ Pre-trained models
    ➢ Built-in functionality
    ➢ Reasonable default settings

    Customize solution
    ➢ Implement custom models & algorithms
    ➢ Powerful configuration system
    ➢ Fine-tune iteratively
  7. Pre-trained models

    https://spacy.io/models

        $ python -m spacy download en_core_web_trf

        import spacy
        from spacy import displacy

        nlp = spacy.load("en_core_web_trf")
        doc = nlp(text)  # text: the input string to analyse
        for ent in doc.ents:
            print(ent.text, ent.label_)
        displacy.serve(doc, style="ent")
  8. Use-case: clinical trial results

    Hemodynamic Effects of Phenylephrine, Vasopressin, and Epinephrine in Children With Pulmonary Hypertension: A Pilot Study

    Abstract

    Objectives: During a pulmonary hypertensive crisis, the marked increase in pulmonary vascular resistance can result in acute right ventricular failure and death. Currently, there are no therapeutic guidelines for managing an acute crisis. This pilot study examined the hemodynamic effects of phenylephrine, arginine vasopressin, and epinephrine in pediatric patients with pulmonary hypertension.

    Design: In this prospective, open-label, nonrandomized pilot study, we enrolled pediatric patients previously diagnosed with pulmonary hypertension who were scheduled electively for cardiac catheterization. Primary outcome was a change in the ratio of pulmonary-to-systemic vascular resistance. Baseline hemodynamic data were collected before and after the study drug was administered.

    Patients: Eleven of 15 participants were women, median age was 9.2 years (range, 1.7-14.9 yr), and median weight was 26.8 kg (range, 8.5-55.2 kg). Baseline mean pulmonary artery pressure was 49 ± 19 mm Hg, and mean indexed pulmonary vascular resistance was 10 ± 5.4 Wood units. Etiology of pulmonary hypertension varied, and all were on systemic pulmonary hypertension medications.

    Interventions: Patients 1-5 received phenylephrine 1 μg/kg; patients 6-10 received arginine vasopressin 0.03 U/kg; and patients 11-15 received epinephrine 1 μg/kg. Hemodynamics was measured continuously for up to 10 minutes following study drug administration.

    Measurements and main results: After study drug administration, the ratio of pulmonary-to-systemic vascular resistance decreased in three of five patients receiving phenylephrine, five of five patients receiving arginine vasopressin, and three of five patients receiving epinephrine. Although all three medications resulted in an increase in aortic pressure, only arginine vasopressin consistently resulted in a decrease in the ratio of systolic pulmonary artery-to-aortic pressure.

    Conclusions: This prospective pilot study of phenylephrine, arginine vasopressin, and epinephrine in pediatric patients with pulmonary hypertension showed an increase in aortic pressure with all drugs although only vasopressin resulted in a consistent decrease in the ratio of pulmonary-to-systemic vascular resistance. Studies with more subjects are warranted to define optimal dosing strategies of these medications in an acute pulmonary hypertensive crisis.

    Stephanie L Siehr, Jeffrey A Feinstein, Weiguang Yang, Lynn F Peng, Michelle T Ogawa, Chandra Ramamoorthy. Pediatr Crit Care Med (2016) PMID: 27144689
  9. Goal: Identify treatments and outcomes

    Patients: Eleven of 15 participants were women, median age was 9.2 years (range, 1.7-14.9 yr), and median weight was 26.8 kg (range, 8.5-55.2 kg). Baseline mean pulmonary artery pressure was 49 ± 19 mm Hg, and mean indexed pulmonary vascular resistance was 10 ± 5.4 Wood units. Etiology of pulmonary hypertension varied, and all were on systemic pulmonary hypertension medications.

    Interventions: Patients 1-5 received phenylephrine 1 μg/kg; patients 6-10 received arginine vasopressin 0.03 U/kg; and patients 11-15 received epinephrine 1 μg/kg. Hemodynamics was measured continuously for up to 10 minutes following study drug administration.

    Measurements and main results: After study drug administration, the ratio of pulmonary-to-systemic vascular resistance decreased in three of five patients receiving phenylephrine, five of five patients receiving arginine vasopressin, and three of five patients receiving epinephrine. Although all three medications resulted in an increase in aortic pressure, only arginine vasopressin consistently resulted in a decrease in the ratio of systolic pulmonary artery-to-aortic pressure.

    → The pre-trained English models are not tailored to biomedical texts, so there is no out-of-the-box functionality for patient groups, drugs, etc.
  10. Dependency parsing

        nlp = spacy.load("en_core_web_trf")
        doc = nlp(sentence)
        options = {"collapse_punct": False}
        displacy.serve(doc, style="dep", options=options)

    → How can we leverage this information?
  11. spaCy’s Matcher

        from spacy import util
        from spacy.matcher import Matcher

        matcher = Matcher(nlp.vocab)
        pattern = [
            {"LOWER": "patients"},
            {"POS": {"IN": ["SYM", "NUM", "PUNCT"]}, "OP": "+"},
            {"LOWER": "received"},
            {"POS": {"IN": ["ADJ", "NOUN", "NUM", "ADP", "SYM"]}, "OP": "+"},
        ]
        matcher.add("TreatmentGroup", [pattern])
        matches = matcher(doc, as_spans=True)
        matches = util.filter_spans(matches)  # drop overlaps, preferring longer spans
        for span in matches:
            print(span.text)

        > Patients 1-5 received phenylephrine 1 μg/kg
        > patients 6-10 received arginine vasopressin 0.03 U/kg
        > patients 11-15 received epinephrine 1 μg/kg

    → Matcher rules help you explore the data & get a hands-on feel of the complexity of the task
    → They can help bootstrap annotation through weak labeling

    https://spacy.io/usage/rule-based-matching
  12. Training supervised models

    • Typically, you want to train a supervised model tailored to your domain & business case
    • In this case, we probably need (at least)
      ➢ NER & spancat: identify patient groups, drugs, doses, frequencies, outcomes, …
      ➢ Relation extraction: find the correct relations between patient groups, treatments and outcomes

    https://spacy.io/usage/training
  13. Config file: capture all training settings

        $ python -m spacy init config my_config.cfg --lang en --pipeline ner,spancat

        [nlp]
        lang = "en"
        pipeline = ["tok2vec","ner","spancat"]
        batch_size = 1000

        [training]
        seed = 342
        dropout = 0.1
        max_steps = 20000
        ...

        [components.spancat]
        factory = "spancat"
        spans_key = "sc"

        [components.spancat.model]
        @architectures = "spacy.SpanCategorizer.v1"

        [components.ner]
        factory = "ner"
        ...

    → A config file allows for serializability & reproducibility of your NLP pipelines
    → spaCy has built-in architectures for NER, spancat, textcat, tagger, dependency parser, …
    → You can also implement and register your own models and components!
      https://github.com/explosion/projects/tree/v3/tutorials/rel_component
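Because the config is plain section/key/value text, its settings are easy to inspect from a script. The following is a rough standard-library sketch, assuming every value is JSON-encoded as it appears on the slide. `load_cfg` is a hypothetical helper, not part of spaCy, and spaCy's own config loader does more (for example variable interpolation and resolving registered functions), which this sketch does not attempt.

```python
import configparser
import json

# Excerpt of the config from the slide (elided parts left out).
CONFIG_TEXT = """
[nlp]
lang = "en"
pipeline = ["tok2vec","ner","spancat"]
batch_size = 1000

[training]
seed = 342
dropout = 0.1
max_steps = 20000

[components.spancat]
factory = "spancat"
spans_key = "sc"

[components.spancat.model]
@architectures = "spacy.SpanCategorizer.v1"
"""


def load_cfg(text: str) -> dict:
    """Hypothetical helper: parse config text into nested dicts.

    Dotted section names such as "components.spancat" become nested keys.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key case exactly as written
    parser.read_string(text)
    result: dict = {}
    for section in parser.sections():
        node = result
        for part in section.split("."):
            node = node.setdefault(part, {})
        for key, raw in parser[section].items():
            node[key] = json.loads(raw)  # values are assumed to be valid JSON
    return result


cfg = load_cfg(CONFIG_TEXT)
print(cfg["nlp"]["pipeline"])
print(cfg["components"]["spancat"]["model"]["@architectures"])
```

A dump like this can be diffed between experiments to see which training settings changed.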
  14. Training a supervised model

        $ python -m spacy train my_config.cfg --output ./my_output

        E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE
        ---  ------  ------------  --------  ------  ------  ------  ------
          0       0          0.00     23.79    0.00    0.00    0.00    0.00
          6     200        105.40   2586.38   37.21   57.14   27.59    0.37
         14     400        255.98    360.81   40.00   47.62   34.48    0.40
         23     600         60.01     47.55   34.04   44.44   27.59    0.34
         33     800         35.52     20.49   40.00   47.62   34.48    0.40
         45    1000         89.50     36.39   32.00   38.10   27.59    0.32
         59    1200         47.41     22.91   43.90   75.00   31.03    0.44
        ...

    Saves best & last trained model to the specified output directory. You can load it as an ‘nlp’ object to use for inference / further fine-tuning.

        nlp = spacy.load("my_output/model-best")
        doc = nlp(text)
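To read the training log, it helps to see how the three entity columns relate: ENTS_F is consistent with the harmonic mean of ENTS_P (precision) and ENTS_R (recall), and in this log SCORE matches ENTS_F rescaled to the 0-1 range. A minimal sketch that recomputes the F-score from rows copied out of the log above; the function name `f_score` is chosen here for illustration.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0  # avoid dividing by zero when nothing was predicted correctly
    return 2 * precision * recall / (precision + recall)


# (ENTS_P, ENTS_R, ENTS_F) triples copied from the training log above.
rows = [(57.14, 27.59, 37.21), (47.62, 34.48, 40.00), (75.00, 31.03, 43.90)]
for p, r, logged_f in rows:
    print(f"P={p:.2f} R={r:.2f} F={f_score(p, r):.2f} (logged: {logged_f:.2f})")
```

The last row shows why the F-score is worth watching alongside precision: precision jumps to 75.00 while recall stays near 31, so the combined score only reaches 43.90.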
  15. ChatGPT (OpenAI)

    Prompt (ChatGPT):

        Summarize the trial results in a structured fashion like so:
        Patient group: <name>
        Number of patients in the group: <number>
        Treatment drug or substance: <drug>
        Treatment dose: <drug>
        Treatment frequency of administration: <frequency>
        Treatment duration: <duration>
        Outcome: <outcome>

    Input text:

        (...) Patients 1-5 received phenylephrine 1 μg/kg; patients 6-10 received arginine vasopressin 0.03 U/kg; and patients 11-15 received epinephrine 1 μg/kg (...) After study drug administration, the ratio of pulmonary-to-systemic vascular resistance decreased in three of five patients receiving phenylephrine, five of five patients receiving arginine vasopressin, and three of five patients receiving epinephrine.

    LLM output:

        Patient group: Phenylephrine Group
        Number of patients in the group: 5
        Treatment drug or substance: Phenylephrine
        Treatment dose: 1 μg/kg
        Treatment frequency of administration: Single administration
        Treatment duration: Not specified
        Outcome: The ratio of pulmonary-to-systemic vascular resistance decreased in three of five patients receiving phenylephrine. Increase in aortic pressure observed.

        Patient group: Arginine Vasopressin Group
        Number of patients in the group: 5
        Treatment drug or substance: Arginine vasopressin
        Treatment dose: 0.03 U/kg
        Treatment frequency of administration: Single administration
        Treatment duration: Not specified
        Outcome: The ratio of pulmonary-to-systemic vascular resistance decreased in all five patients receiving arginine vasopressin. Increase in aortic pressure observed. Consistent decrease in the ratio of systolic pulmonary artery-to-aortic pressure observed.

        Patient group: Epinephrine Group
        (...)
  16. ChatGPT in production?

    Free web interface
    No API
    No batching
    No robustness
    No data privacy
    No reproducibility
    Whole new meaning of “black box”
  17. Integrate LLMs into NLP solutions

    spacy-llm: integrate LLMs into production-ready, structured NLP pipelines

    • Backends:
      ➢ External APIs, e.g. OpenAI, Cohere, Anthropic
      ➢ Open-source models, e.g. Dolly v2, OpenLLaMa, StableLM (via HuggingFace hub)
      ➢ Connect your favourite model by writing a custom backend!
    • Tasks:
      ➢ Define prompt to send to the LLM
      ➢ Parse the LLM’s response and turn this into structured annotations on spaCy’s Doc objects
      ➢ Write a custom task definition for your specific use-case!

    https://github.com/explosion/spacy-llm
  18. spacy-llm: Empower developers

    Business case
    ➢ Understand business application
    ➢ Analyse downstream needs of NLP solution

    Quick prototype
    ➢ Built-in backends/models
    ➢ Built-in tasks
    ➢ Reasonable default settings

    Customize solution
    ➢ Implement custom tasks & backends
    ➢ Powerful configuration system
    ➢ Fine-tune iteratively
  19. Using built-in NER functionality

    my_config.cfg

        [nlp]
        lang = "en"
        pipeline = ["llm"]

        [components]

        [components.llm]
        factory = "llm"

        [components.llm.backend]
        @llm_backends = "spacy.REST.v1"
        api = "OpenAI"

        [components.llm.backend.config]
        model: "gpt-3.5-turbo"

        [components.llm.task]
        @llm_tasks = "spacy.NER.v2"
        labels = "Drug,Dose"

    Zero-shot results with spacy-llm:

        from spacy_llm.util import assemble

        text = _read_trial(pmid=27144689)
        nlp = assemble(_MY_CONFIG_DIR / "my_config.cfg")
        doc = nlp(text)
  20. Easily swap in other backends/tasks

    my_config.cfg

        [nlp]
        lang = "en"
        pipeline = ["llm"]

        [components]

        [components.llm]
        factory = "llm"

        [components.llm.backend]
        @llm_backends = "spacy.REST.v1"
        api = "OpenAI"

        [components.llm.backend.config]
        model: "gpt-3.5-turbo"

        [components.llm.task]
        @llm_tasks = "spacy.NER.v2"
        labels = "Drug,Dose"

    Alternative backend block:

        [components.llm.backend]
        @llm_backends = "spacy.DollyHF.v1"
        model = "databricks/dolly-v2-12b"

    Alternative task blocks:

        [components.llm.task]
        @llm_tasks = "spacy.NER.v2"
        labels = "Patient_group,Treatment"

        [components.llm.task]
        @llm_tasks = "spacy.TextCat.v2"
        labels = "Trial,Patent,News,Research"
  21. Writing a custom task

        from typing import Iterable

        from spacy.tokens import Doc
        from spacy_llm.registry import registry

        INSTRUCTION = """
        Summarize the trial results in a structured fashion like so:
        Patient group: <name>
        Number of patients in the group: <number>
        Treatment drug or substance: <drug>
        Treatment dose: <drug>
        Treatment frequency of administration: <frequency>
        Treatment duration: <duration>
        Outcome: <outcome>
        """

        class TrialSummaryTask:
            def generate_prompts(self, docs: Iterable[Doc]) -> Iterable[str]:
                # One prompt per document: fixed instruction followed by the abstract
                for doc in docs:
                    prompt = "Below this instruction, I will provide you with a clinical trial abstract. "
                    prompt += INSTRUCTION + doc.text
                    yield prompt

            def parse_responses(self, docs: Iterable[Doc], responses: Iterable[str]) -> Iterable[Doc]:
                ...

        @registry.llm_tasks("hedgy.TrialSummary.v1")
        def make_trial_task():
            return TrialSummaryTask()
  22. gpt-3.5-turbo (spacy-llm)

    Task (GPT 3.5):

        [components.llm.task]
        @llm_tasks = "hedgy.TrialSummary.v1"

    Prompt:

        Summarize the trial results in a structured fashion like so:
        Patient group: <name>
        Number of patients in the group: <number>
        Treatment drug or substance: <drug>
        Treatment dose: <drug>
        Treatment frequency of administration: <frequency>
        Treatment duration: <duration>
        Outcome: <outcome>

    Input text:

        Patients 1-5 received phenylephrine 1 μg/kg; patients 6-10 received arginine vasopressin 0.03 U/kg; and patients 11-15 received epinephrine 1 μg/kg (...) After study drug administration, the ratio of pulmonary-to-systemic vascular resistance decreased in three of five patients receiving phenylephrine, five of five patients receiving arginine vasopressin, and three of five patients receiving epinephrine. Although all three medications resulted in an increase in aortic pressure, only arginine vasopressin consistently resulted in a decrease in the ratio of systolic pulmonary artery-to-aortic pressure.

    LLM output:

        Patient group: Pediatric patients with pulmonary hypertension
        Number of patients in the group: 15
        Treatment drug or substance: Phenylephrine, arginine vasopressin, and epinephrine
        Treatment dose: Phenylephrine 1 μg/kg, arginine vasopressin 0.03 U/kg, epinephrine 1 μg/kg
        Treatment frequency of administration: Administered once
        Treatment duration: Hemodynamic measurements taken continuously for up to 10 minutes following drug administration
        Outcome: All three medications resulted in an increase in aortic pressure. Only arginine vasopressin consistently resulted in a decrease in the ratio of systolic pulmonary artery-to-aortic pressure. The ratio of pulmonary-to-systemic vascular resistance decreased in three out of five patients receiving phenylephrine, five out of five patients receiving arginine vasopressin, and three out of five patients receiving epinephrine. Studies with more patients are needed to define optimal dosing strategies for these medications in an acute pulmonary hypertensive crisis.
  23. Reminder: ChatGPT (browser)

    Repeats the prompt, input text and ChatGPT output shown on slide 15, where the answer is split into one record per treatment group.
  24. Prompt engineering

    Task (GPT 3.5):

        [components.llm.task]
        @llm_tasks = "hedgy.TrialSummary.v2"

    Prompt:

        Summarize the trial results in a structured fashion. First, identify all patient groups with distinct treatments. Then, for each patient group, write down the following:
        Patient group: <name>
        Number of patients in the group: <number>
        Treatment drug or substance: <drug>
        Treatment dose: <drug>
        Treatment frequency of administration: <frequency>
        Treatment duration: <duration>
        Outcome: <outcome>

    Input text:

        (...) Patients 1-5 received phenylephrine 1 μg/kg; patients 6-10 received arginine vasopressin 0.03 U/kg; and patients 11-15 received epinephrine 1 μg/kg (...) After study drug administration, the ratio of pulmonary-to-systemic vascular resistance decreased in three of five patients receiving phenylephrine, five of five patients receiving arginine vasopressin, and three of five patients receiving epinephrine.

    LLM output:

        Patient group: Phenylephrine group
        Number of patients in the group: 5
        Treatment drug or substance: Phenylephrine
        Treatment dose: 1 μg/kg
        Treatment frequency of administration: One-time administration
        Treatment duration: Not specified
        Outcome: Ratio of pulmonary-to-systemic vascular resistance decreased in three out of five patients after study drug administration

        Patient group: Arginine vasopressin group
        Number of patients in the group: 5
        Treatment drug or substance: Arginine vasopressin
        Treatment dose: 0.03 U/kg
        Treatment frequency of administration: One-time administration
        Treatment duration: Not specified
        Outcome: Ratio of pulmonary-to-systemic vascular resistance decreased in all five patients after study drug administration. Consistent decrease in the ratio of systolic pulmonary artery-to-aortic pressure noted.

        Patient group: Epinephrine group
        (...)
  25. Task: parse into structured fields

        def parse_responses(self, docs: Iterable[Doc], responses: Iterable[str]) -> Iterable[Doc]:
            for doc, response in zip(docs, responses):
                patient_groups = []
                ...
                while ...:
                    patient_group = response[start_index:end_index].strip()
                    patient_groups.append(patient_group)
                ...
                matcher.add("Patient_Group", [nlp.make_doc(text) for text in patient_groups])
                ...
                matches = matcher(doc, as_spans=True)
                doc.ents = spacy.util.filter_spans(matches)
                yield doc

    → Downstream processes can now use the LLM output in a structured way via the Doc object
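The slide elides how the response text is cut into fields. One possible way to do that step, shown here as a standard-library sketch rather than the speaker's implementation, is to treat every "Field: value" line as a key/value pair and to start a new record at each "Patient group" line. The helper name `parse_groups` and the sample response are assumptions for illustration; the field names come from the prompt shown earlier.

```python
import re
from typing import Dict, List

# A field line looks like "Some field name: some value".
FIELD_RE = re.compile(r"^\s*([^:\n]+):\s*(.*)$")


def parse_groups(response: str) -> List[Dict[str, str]]:
    """Split a 'Field: value' response into one dict per patient group."""
    groups: List[Dict[str, str]] = []
    for line in response.splitlines():
        match = FIELD_RE.match(line)
        if not match:
            continue  # skip blank lines and free text without a field name
        key, value = match.group(1).strip(), match.group(2).strip()
        if key == "Patient group":
            groups.append({})  # each "Patient group" line starts a new record
        if groups:
            groups[-1][key] = value
    return groups


response = """Patient group: Phenylephrine group
Number of patients in the group: 5
Treatment drug or substance: Phenylephrine
Treatment dose: 1 μg/kg

Patient group: Arginine vasopressin group
Number of patients in the group: 5
Treatment drug or substance: Arginine vasopressin
Treatment dose: 0.03 U/kg
"""

for group in parse_groups(response):
    print(group["Patient group"], "->", group["Treatment dose"])
```

The resulting dictionaries could then be matched back onto the text and stored on the Doc, as the slide does with patient group names.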
  26. Reliability & robustness

        Patient group: Phenylephrine group
        Number of patients in the group: 5
        Treatment drug or substance: Phenylephrine 1 μg/kg
        Treatment dose: As mentioned above

        Number of patients in the group: 15
        Treatment drug or substance:
        Group 1: Patient 1-5 received phenylephrine 1 μg/kg
        Group 2: Patient 6-10 received arginine vasopressin 0.03 U/kg
        Group 3: Patient 11-15 received epinephrine 1 μg/kg

    Treatment frequency of administration:

        “Administered once”
        “Single administration”
        “One-time dose”
        “One time”
        “Single dose”
        “One-time administration”
        “once”

        openai.error.RateLimitError
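Rate-limit errors like the one on this slide are commonly handled by retrying with exponential backoff. The sketch below is one generic way to do that with the standard library only. `RateLimitError` is a local stand-in class, not the OpenAI exception itself, and `call_with_retries` and `flaky_request` are hypothetical names used for illustration.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Stand-in for a provider's rate-limit exception (hypothetical)."""


def call_with_retries(
    request: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call request(), retrying on RateLimitError with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return request()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
    raise ValueError("max_attempts must be at least 1")


# Usage with a fake request that fails twice before succeeding.
calls = {"n": 0}


def flaky_request() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError("slow down")
    return "Patient group: Phenylephrine group"


waits: List[float] = []  # record the delays instead of actually sleeping
result = call_with_retries(flaky_request, sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter keeps the example fast to run and makes the backoff schedule easy to check.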
  27. Performance features

    Accuracy
    Inference speed
    Memory usage
    Reliability / reproducibility
    Maintainability
    Customizability
    Runtime cost
    Annotation / implementation cost
    Compute power
    Quick prototype
    Interpretability
    Data privacy
  28. Performance trade-offs (2)

    [Figure: closed source LLMs vs. open source LLMs]

    Note: make sure to inspect the license and the terms of use!
  29. From prototype to production

    [Diagram with the labels: Business application; Quick prototype (LLM); Fine-tune; Integrate / replace with supervised ML, rules, ...]
  30. Ex 1: LLM-assisted annotation

    LLM zero-shot predictions → Manual curation
    ➢ Evaluation data - Measure pipeline performance
    ➢ Training data - Train a supervised model
    ➢ Examples for few-shot learning - Tune the LLM

    https://prodigy.ai/features/large-language-models
  31. Ex 2: Pre-process texts

    PII NER → LLM

    ➢ Avoid sending sensitive data to third parties
    ➢ Recognize & replace Personally Identifiable Information
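As a rough illustration of the "recognize & replace" step, the sketch below swaps a trained PII NER component for two regular expressions, which is a much cruder technique than the one named on the slide. The patterns, the placeholder format and the sample note are all assumptions made for this example; real PII detection needs far broader coverage.

```python
import re

# Illustrative patterns only: these catch simple emails and phone numbers.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def redact(text: str) -> str:
    """Replace each matched span with a placeholder such as [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


note = "Contact jane.doe@example.org or +44 20 7946 0958 about patient 7."
print(redact(note))  # the redacted text is what would be sent to the LLM
```

Replacing spans with typed placeholders, rather than deleting them, keeps the sentence readable for the downstream model.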
  32. Ex 3: Filter input texts

    TextCat / NER → LLM

    ➢ Only send texts/sentences with certain topics/entities to the LLM
    ➢ Avoid incurring unnecessary costs
    ➢ Adjust prompt according to earlier classification and/or identified entities
    ➢ ...
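The filtering idea can be sketched without any model at all: decide cheaply which sentences are worth an LLM call and drop the rest. Here a keyword count stands in for the TextCat or NER component from the slide; the trigger terms, the threshold and the function names are assumptions chosen for this example.

```python
from typing import Iterable, List

# Hypothetical trigger terms standing in for a trained classifier's decision.
TRIGGER_TERMS = ("patients", "received", "dose", "mg/kg", "μg/kg", "u/kg")


def is_relevant(sentence: str, min_hits: int = 2) -> bool:
    """Cheap relevance check: count how many trigger terms occur in the sentence."""
    lowered = sentence.lower()
    return sum(term in lowered for term in TRIGGER_TERMS) >= min_hits


def select_for_llm(sentences: Iterable[str]) -> List[str]:
    """Keep only the sentences worth paying an LLM call for."""
    return [s for s in sentences if is_relevant(s)]


sentences = [
    "Currently, there are no therapeutic guidelines for managing an acute crisis.",
    "Patients 1-5 received phenylephrine 1 μg/kg.",
    "Hemodynamics was measured continuously for up to 10 minutes.",
]
print(select_for_llm(sentences))
```

In a real pipeline the same gate would be driven by the predictions of an earlier component, so only the treatment sentence reaches the LLM.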
  33. Ex 4: Post-process LLM responses

    LLM → Entity linking / Rules

    ➢ Normalize the (free-text) LLM responses
    ➢ Connect to a knowledge base (e.g. through entity linking)
    ➢ Make the (unpredictable) LLM responses more robust for ingestion by downstream processes
    ➢ ...
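A small rule-based normalizer shows what the first bullet can look like in practice. It maps the frequency phrasings listed on the reliability & robustness slide to one canonical value and returns None for anything unknown, so the caller can flag it. The label "ONCE" and the function name are arbitrary choices for this sketch.

```python
import re
from typing import Optional

# Variants copied from the slides, lower-cased; all mean the same thing.
ONCE_VARIANTS = {
    "administered once", "single administration", "one-time dose",
    "one time", "single dose", "one-time administration", "once",
}


def normalize_frequency(raw: str) -> Optional[str]:
    """Return "ONCE" for known variants, or None so the caller can flag the value."""
    # Lower-case and drop quotes, punctuation and digits before the lookup.
    cleaned = re.sub(r"[^a-z\- ]", "", raw.lower()).strip()
    return "ONCE" if cleaned in ONCE_VARIANTS else None


for answer in ["Administered once", "“One-time dose”", "Twice daily"]:
    print(repr(answer), "->", normalize_frequency(answer))
```

Returning None instead of guessing keeps unexpected answers visible to whoever maintains the rules.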
  34. Recap

    • NLP unlocks information from text and makes it available to down-stream business applications in a structured form
    • Large Language Models have impressive text generation/understanding abilities
    • It’s become super easy to prototype NLP applications with LLMs
    • When building a production-ready pipeline, you need to consider other traits such as customizability, robustness, inference cost, network latency, etc.
    • spaCy is a production-ready NLP framework written for developers
    • Its latest extension spacy-llm allows easy integration of LLMs into structured NLP pipelines
    • LLM-assisted annotation allows fast bootstrapping of training/evaluation data