
2024-01-23-az


Presentation given to AZ NLP Community of Practice

Sofie Van Landeghem

January 23, 2024

Transcript

  1. spacy-llm
    From quick prototyping with LLMs
    to more reliable and efficient NLP solutions
    Sofie Van Landeghem, PhD.
    Core maintainer of spaCy @ Explosion
    NLP and ML freelancer @ OxyKodit


  2. Briefly about me ...
    ● 2008 - 2012: PhD in BioNLP, mostly working on Biomedical event extraction
    ○ SVMs were the new kid on the block - feature engineering is fun!
    ● 2013 - 2014: PostDoc in Bioinformatics, combining BioNLP with network analysis
    ○ Seeing the rise of word embeddings with the publication of the word2vec paper by Mikolov et al.
    ● 2015 - 2018: Data scientist @ Johnson & Johnson
    ○ Bridging the gap between state-of-the-art NLP and (often harsh) business reality
    ● 2019 - 2024: Freelancing + maintainer of open-source NLP toolbox spaCy
    ○ Front row seat to the transformer-based revolution and LLM disruption

  3. This talk ...
    ● Will explain the design principles of the open-source toolbox spaCy
    ● Will showcase how to use its recent plugin spacy-llm to perform rapid prototyping
    with Large Language Models (LLMs)
    ● Will demonstrate how to move beyond a prototype into a more reliable,
    efficient, and maintainable solution
    ● Will use clinical trial analysis as an example application

  4. Clinical trial abstract
    Hemodynamic Effects of Phenylephrine, Vasopressin, and Epinephrine in Children With Pulmonary Hypertension: A Pilot Study
    Objectives: During a pulmonary hypertensive crisis, the marked increase in pulmonary vascular resistance can result in acute right ventricular failure and death.
    Currently, there are no therapeutic guidelines for managing an acute crisis. This pilot study examined the hemodynamic effects of phenylephrine, arginine
    vasopressin, and epinephrine in pediatric patients with pulmonary hypertension.
    Design: In this prospective, open-label, nonrandomized pilot study, we enrolled pediatric patients previously diagnosed with pulmonary hypertension who were
    scheduled electively for cardiac catheterization. Primary outcome was a change in the ratio of pulmonary-to-systemic vascular resistance. Baseline hemodynamic data
    were collected before and after the study drug was administered.
    Patients: Eleven of 15 participants were women, median age was 9.2 years (range, 1.7-14.9 yr), and median weight was 26.8 kg (range, 8.5-55.2 kg). Baseline mean
    pulmonary artery pressure was 49 ± 19 mm Hg, and mean indexed pulmonary vascular resistance was 10 ± 5.4 Wood units. Etiology of pulmonary hypertension
    varied, and all were on systemic pulmonary hypertension medications.
    Interventions: Patients 1-5 received phenylephrine 1 μg/kg; patients 6-10 received arginine vasopressin 0.03 U/kg; and patients 11-15 received epinephrine 1 μg/kg.
    Hemodynamics was measured continuously for up to 10 minutes following study drug administration.
    Measurements and main results: After study drug administration, the ratio of pulmonary-to-systemic vascular resistance decreased in three of five patients receiving
    phenylephrine, five of five patients receiving arginine vasopressin, and three of five patients receiving epinephrine. Although all three medications resulted in an
    increase in aortic pressure, only arginine vasopressin consistently resulted in a decrease in the ratio of systolic pulmonary artery-to-aortic pressure.
    Conclusions: This prospective pilot study of phenylephrine, arginine vasopressin, and epinephrine in pediatric patients with pulmonary hypertension showed an
    increase in aortic pressure with all drugs although only vasopressin resulted in a consistent decrease in the ratio of pulmonary-to-systemic vascular resistance. Studies
    with more subjects are warranted to define optimal dosing strategies of these medications in an acute pulmonary hypertensive crisis.
    Stephanie L Siehr, Jeffrey A Feinstein, Weiguang Yang, Lynn F Peng, Michelle T Ogawa, Chandra Ramamoorthy. Pediatr Crit Care Med (2016)
    PMID: 27144689


  5. Clinical trial abstract - treatment groups
    Design: In this prospective, open-label, nonrandomized pilot study, we enrolled pediatric patients
    previously diagnosed with pulmonary hypertension who were scheduled electively for cardiac
    catheterization. (...)
    Patients: Eleven of 15 participants were women, median age was 9.2 years (range, 1.7-14.9 yr), and median
    weight was 26.8 kg (range, 8.5-55.2 kg). Baseline mean pulmonary artery pressure was 49 ± 19 mm Hg,
    and mean indexed pulmonary vascular resistance was 10 ± 5.4 Wood units. Etiology of pulmonary
    hypertension varied, and all were on systemic pulmonary hypertension medications.
    Interventions: Patients 1-5 received phenylephrine 1 μg/kg; patients 6-10 received arginine vasopressin
    0.03 U/kg; and patients 11-15 received epinephrine 1 μg/kg. Hemodynamics was measured continuously
    for up to 10 minutes following study drug administration.

  6. Clinical trial abstract - outcomes
    Design: (...) Primary outcome was a change in the ratio of pulmonary-to-systemic vascular resistance.
    (...)
    Measurements and main results: After study drug administration, the ratio of pulmonary-to-systemic
    vascular resistance decreased in three of five patients receiving phenylephrine, five of five patients
    receiving arginine vasopressin, and three of five patients receiving epinephrine. Although all three
    medications resulted in an increase in aortic pressure, only arginine vasopressin consistently resulted in a
    decrease in the ratio of systolic pulmonary artery-to-aortic pressure.

  7. Clinical trial abstract - NLP output
    Ideally, you want your NLP solution to extract a structured summary:

                # patients   Treatment Drug         Treatment Dose   Outcome: Decreased ratio of PVR to SVR
    Group 1     5            phenylephrine          1 μg/kg          3
    Group 2     5            arginine vasopressin   0.03 U/kg        5
    Group 3     5            epinephrine            1 μg/kg          3

  8. NLP complexity of this challenge (1)
    ● Named Entities like drugs, diseases, ...
    ○ Standard NLP challenge, pre-trained models often exist
    ○ https://github.com/AstraZeneca/KAZU ;-)
    ● Treatment dose and frequency
    ○ Probably pretty doable with some type of pattern matching
    ● Patient/treatment groups
    ○ Non-standard, challenging NLP target
    ○ Groups can be unique because of different prior conditions, prior treatments, patient
    characteristics, behavioural patterns, treatment drug or dose, treatment frequency, ...
    ○ Group size can be mentioned in various ways, e.g. "5 patients" or "3 women and 2 men"

  9. NLP complexity of this challenge (2)
    ● Primary/secondary endpoints of the study
    ○ Can be partly dictionary-based: e.g. "Progression-free survival", "PFS", "CR", ...
    ○ Can be much more complex: e.g. "a change in the ratio of pulmonary-to-systemic vascular
    resistance"
    ● Outcomes per patient group & endpoint
    ○ Requires linking across different paragraphs, resolving references and synonyms, and
    understanding complex sentence structures
    ○ As NLP tasks go, this one's pretty challenging!
    ○ With a classical NLP approach, might require a mix of NER, entity linking, coreference
    resolution, relation extraction

  10. LLMs to the rescue?
    Generative vs predictive tasks
    ● LLMs are primarily used for text generation
    ○ Often user-facing tasks
    ○ Text summarization, question answering, writing a poem, etc.
    ● They can be useful for structured NLP as well
    ○ Extracting structured attributes such as named entities, part-of-speech tags, ...
    ○ Allows for better automated integration with downstream applications
    ■ Can we extract a structured table of results from the clinical trial abstract?

  11. Prototype with ChatGPT - prompt
    Summarize the trial results in a structured fashion like so:
    Patient group:
    Number of patients in the group:
    Treatment drug or substance:
    Treatment dose:
    Treatment frequency of administration:
    Treatment duration:
    Outcome:

  12. Prototype with ChatGPT - output
    Patient group: Phenylephrine Group
    Number of patients in the group: 5
    Treatment drug or substance: Phenylephrine
    Treatment dose: 1 μg/kg
    Treatment frequency of administration: Single administration
    Treatment duration: Not specified
    Outcome: The ratio of pulmonary-to-systemic vascular resistance decreased in three of five patients receiving
    phenylephrine. Increase in aortic pressure observed.
    Patient group: Arginine Vasopressin Group
    Number of patients in the group: 5
    Treatment drug or substance: Arginine vasopressin
    Treatment dose: 0.03 U/kg
    Treatment frequency of administration: Single administration
    Treatment duration: Not specified
    Outcome: The ratio of pulmonary-to-systemic vascular resistance decreased in all five patients receiving arginine
    vasopressin. Increase in aortic pressure observed. Consistent decrease in the ratio of systolic pulmonary
    artery-to-aortic pressure observed.
    Patient group: Epinephrine Group
    (...)

  13. A few observations ...
    ● ChatGPT manages to link information spread across paragraphs to each other,
    e.g. the drug + dose + outcome per patient group
    ● ChatGPT assumes that the absence of any mention of frequency of administration in the
    abstract means "single administration"
    ● ChatGPT is able to admit what it doesn't know, e.g. "Not specified" for the
    Treatment duration (which is, indeed, not specified in the abstract)
    ● Not bad for a quick prototype!

  14. Disadvantages of using the ChatGPT web interface
    ● No API
    ● No batching
    ● No robustness
    ● No data privacy
    ● No reproducibility
    ● Whole new meaning of "black box"

  15. spacy-llm: Integrating LLMs into structured NLP pipelines
    ● Support for external APIs (OpenAI, Cohere, Anthropic, ...)
    as well as open-source models (via HuggingFace)
    ● Built-in support for various standard NLP tasks such as text classification,
    NER, relation extraction, text summarization, ...
    ● Relatively easy to implement your own, custom tasks
    ● https://github.com/explosion/spacy-llm
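    Getting started is a pip install plus an API key for the chosen backend; a minimal sketch, assuming
    the OpenAI backend and a config file like the my_config.cfg shown on a later slide:

    # pip install spacy-llm
    import os

    from spacy_llm.util import assemble

    # OpenAI-backed models read the API key from the environment
    os.environ["OPENAI_API_KEY"] = "sk-..."

    nlp = assemble("my_config.cfg")
    doc = nlp("Patients 6-10 received arginine vasopressin 0.03 U/kg.")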

  16. spacy-llm follows the main design principles of spaCy
    ● Free, open-source library
    ● Designed for production use
    ● Focus on developer productivity
    ○ Built-in functionality to help you hit the ground running
    ○ Customizability & extensibility of the framework to implement anything your use-case needs
    ● Reproducibility of experiments by using a detailed config file
    ● Use rich data structures for results and metadata
    ● Break NLP challenge down into a pipeline of highly specific chained tasks
    ● https://github.com/explosion/spacy

  17. Built-in zero-shot NER with spacy-llm
    my_config.cfg:

    [nlp]
    lang = "en"
    pipeline = ["llm"]
    batch_size = 128

    [components]

    [components.llm]
    factory = "llm"

    [components.llm.model]
    @llm_models = "spacy.GPT-4.v2"

    [components.llm.task]
    @llm_tasks = "spacy.NER.v2"
    labels = ["Drug", "Dose"]

    my_script.py:

    from spacy_llm.util import assemble

    config_path = "my_config.cfg"
    text = _read_trial(pmid=27144689)  # helper that fetches the abstract text by PMID
    nlp = assemble(config_path)
    doc = nlp(text)
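    The LLM's answers are parsed back into standard spaCy annotations, so downstream code reads them
    like the output of any other pipeline; a minimal sketch continuing from my_script.py above:

    # Entities predicted by the zero-shot LLM component are available on doc.ents
    for ent in doc.ents:
        print(ent.text, ent.label_)   # e.g. "phenylephrine Drug", "1 μg/kg Dose"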

  18. Zero-shot NER with LLMs
    ● Performance highly dependent on the label(s)
    ○ How commonly known these types of entities are
    ○ How descriptive & accurate the label text is, e.g.
    ■ "Dose" vs. "TreatmentDose"
    ■ "Drug" vs. "Chemical" etc
    ● Reproducibility can be tricky because the LLM's responses may vary
    ○ For classification (not generation) tasks, you'll typically want to set temperature to 0.0
    ○ You can provide model-specific parameters in the config file:
    [components.llm.model]
    @llm_models = "spacy.GPT-4.v2"
    config = {"seed": 342, "temperature": 0.0}

  19. Few-shot NER with "chain-of-thought" prompting
    ● Based on the PromptNER paper by Ashok and Lipton (2023)
    ● Asking the LLM to explain its reasoning - giving it "tokens to think"
    ● Reimplemented in spacy-llm, available as spacy.NER.v3
    ● Increase of 15 percentage points F-score on an internal use-case
    ● Works best when providing label definitions and examples -
    these allow you to tune the prompt towards the desired results

  20. Built-in few-shot NER with spacy-llm (1)
    my_config.cfg:

    [components.llm.task]
    @llm_tasks = "spacy.NER.v3"
    labels = ["Drug", "Dose"]
    description = Entities are drugs or their doses. They can be uppercased, title-cased, or lowercased. Each occurrence of an entity in the text should be extracted.

    [components.llm.task.label_definitions]
    Drug = "A medicine or drug given to a patient as a treatment. Can be a generic name or brand name, e.g. paracetamol, Aspirin"
    Dose = "The measured quantity (dose) of a certain medicine given to patients, e.g. 1mg. This should exclude the drug name."

    [components.llm.task.examples]
    @misc = "spacy.FewShotReader.v1"
    path = "my_fewshot.json"

  21. Built-in few-shot NER with spacy-llm (2)
    my_fewshot.json:

    [
      {
        "text": "The patient was given 1mg of paracetamol.",
        "spans": [
          {
            "text": "paracetamol",
            "is_entity": true,
            "label": "Drug",
            "reason": "is a drug name, used as medication"
          },
          {
            "text": "1mg",
            "is_entity": true,
            "label": "Dose",
            "reason": "is the quantity or dose of the given medication"
          },
          {
            "text": "patient",
            "is_entity": false,
            "label": "==NONE==",
            "reason": "is a person, not a drug or dose"
          }
        ]
      }
    ]
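    A quick way to sanity-check the examples file before wiring it into the config; a minimal sketch
    using only the standard library:

    import json

    with open("my_fewshot.json", encoding="utf8") as f:
        examples = json.load(f)

    for example in examples:
        for span in example["spans"]:
            # every annotated span should literally occur in its example text
            assert span["text"] in example["text"], span["text"]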

  22. Implementation of a custom task
    my_task.py:

    from spacy_llm.registry import registry

    INSTRUCTION = "Summarize the following clinical trial in a structured fashion. (...)"

    @registry.llm_tasks("tutorial.TrialSummary.v1")
    def make_trial_task() -> "TrialSummaryTask":
        return TrialSummaryTask(INSTRUCTION)

    class TrialSummaryTask(LLMTask):
        def __init__(self, instruction: str):
            self.instruction = instruction

        def generate_prompts(self, docs):
            for doc in docs:
                yield self.instruction + "\n\n" + doc.text

        def parse_responses(self, docs):
            ...

    my_config.cfg:

    [nlp]
    lang = "en"
    pipeline = ["llm"]
    batch_size = 128

    [components]

    [components.llm]
    factory = "llm"

    [components.llm.model]
    @llm_models = "spacy.GPT-4.v2"

    [components.llm.task]
    @llm_tasks = "tutorial.TrialSummary.v1"

  23. Output from the spacy-llm pipeline using GPT-4
    Patient group: Group 1
    Number of patients in the group: 5
    Treatment drug or substance: Phenylephrine
    Treatment dose: 1 μg/kg
    Treatment frequency of administration: Single dose
    Treatment duration: Up to 10 minutes following study drug administration
    Outcome: The ratio of pulmonary-to-systemic vascular resistance decreased in three of five patients. An increase in
    aortic pressure was observed.
    Patient group: Group 2
    Number of patients in the group: 5
    Treatment drug or substance: Arginine vasopressin
    Treatment dose: 0.03 U/kg
    Treatment frequency of administration: Single dose
    Treatment duration: Up to 10 minutes following study drug administration
    Outcome: The ratio of pulmonary-to-systemic vascular resistance decreased in all five patients. An increase in aortic
    pressure was observed. Arginine vasopressin consistently resulted in a decrease in the ratio of systolic pulmonary
    artery-to-aortic pressure.
    Patient group: Group 3
    (...)

  24. Unfortunately, the output is still text ...
    ● The "outcome" field contains full sentences
    ● Different ways LLMs express a "single" treatment frequency:
    ○ Administered once
    ○ Single administration
    ○ One-time dose
    ○ One time
    ○ Single dose
    ○ One-time administration
    ○ Once
    ● You still need to post-process the results to structured fields
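    Post-processing those free-text fields is mostly mundane normalization work. A minimal sketch for
    the frequency field (the mapping and the normalize_frequency helper are illustrative):

    # Map the many surface forms the LLM uses for "administered once" onto one canonical value
    SINGLE_DOSE_FORMS = {
        "administered once",
        "single administration",
        "one-time dose",
        "one time",
        "single dose",
        "one-time administration",
        "once",
    }

    def normalize_frequency(value: str) -> str:
        if value.strip().lower() in SINGLE_DOSE_FORMS:
            return "1/trial"
        return "unknown"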

  25. From text to structured data: a pipeline approach
    Generic view:      .txt file  →  spaCy pipeline  →  Doc object

    [nlp]
    lang = "en"
    pipeline = ["llm_ner", "entity_linker"]

    Concrete view:     .txt file  →  llm_ner  →  entity_linker  →  Doc object

    After llm_ner, doc.ents holds:
        ent.text = "phenylephrine"
        ent.label = "Drug"

    After entity_linker, doc.ents holds:
        ent.text = "phenylephrine"
        ent.label = "Drug"
        ent.kb_id = CHEMBL1215
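    After the entity_linker has run, the same entity loop also exposes the knowledge-base identifier; a
    minimal sketch (the ChEMBL identifier is the one from the diagram):

    for ent in doc.ents:
        # kb_id_ is filled in by the entity_linker component
        print(ent.text, ent.label_, ent.kb_id_)   # e.g. "phenylephrine Drug CHEMBL1215"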


  26. From a trial text to structured output
    .txt file  →  llm_trial  →  normalizer  →  Doc object

    Output of llm_trial (parsed responses):

    {
      "Group 1": {
        "Number of patients": 5,
        "Drug": "Phenylephrine",
        "Dose": "1 μg/kg",
        "Frequency": "Single dose",
        "Outcome": "The ratio of pulmonary-to-systemic vascular resistance decreased in three of five patients."
      },
      "Group 2": ...
    }

    Output of the normalizer (entity_linker; standardize, summarize):

    {
      "Group 1": {
        "Number of patients": 5,
        "Drug": "CHEMBL1215",
        "Dose": "1 μg/kg",
        "Frequency": "1/trial",
        "Outcome": {
          "Ratio PVR to SVR": {
            "Decrease": 3
          }
        }
      },
      "Group 2": ...
    }


  27. Swap out the LLM backend
    ● Closed-source LLM
    ○ Sometimes better accuracy out-of-the box
    ○ Service can be unreliable (time-out, rate limits, ...)
    ○ All data sent to a third party
    ○ Often costly
    ● Open-source LLM
    ○ Can be customized / fine-tuned
    ○ More reliable
    ○ Requires dedicated hardware
    ○ Data privacy
    Closed-source backend:

    [components.llm]
    factory = "llm"

    [components.llm.model]
    @llm_models = "spacy.GPT-4.v2"

    Open-source backend:

    [components.llm]
    factory = "llm"

    [components.llm.model]
    @llm_models = "spacy.Mistral.v1"
    name = "Mistral-7B-v0.1"
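    The calling code stays identical when the backend is swapped; only the config changes. A minimal
    sketch (the two config file names are illustrative):

    from spacy_llm.util import assemble

    # Same task, different backend: only the [components.llm.model] block differs between the files
    nlp_gpt4 = assemble("trial_gpt4.cfg")
    nlp_mistral = assemble("trial_mistral.cfg")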

  28. Swap out a component architecture
    ● LLM
    ○ Quick prototyping
    ○ Can be unreliable/unstable
    ○ Expensive
    ● Supervised Machine Learning
    ○ Manual annotation effort
    ○ Faster & more reliable inference
    ○ Train your own or source a pretrained model
    ○ Cost-efficient
    ● Rules/patterns
    ○ Manual effort & maintenance burden
    ○ Higher customizability & interpretability
    LLM-based NER:

    [components.my_ner]
    factory = "llm"

    [components.my_ner.task]
    @llm_tasks = "spacy.NER.v3"

    Sourcing a pretrained NER component:

    [components.my_ner]
    source = "en_core_sci_lg"
    name = "ner"

    Rule-based NER:

    [components.my_ner]
    factory = "span_ruler"
    annotate_ents = true

    Training your own NER component:

    [components.my_ner]
    factory = "ner"

  29. Combine the best of both worlds (example)
    ● Machine learning textcat model
    ○ Identifies topics of sentences, paragraphs or full documents
    ○ Classifies the document as relevant or not (e.g. "Clinical trial abstract" or not?)
    ● LLM
    ○ Only processes those documents that were deemed relevant in the previous step
    ○ Avoid inducing unnecessary costs
    .txt file  →  textcat  →  llm  →  Doc object
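    One way to wire this up is with two separate pipelines, so only texts classified as relevant ever
    reach the LLM; a hedged sketch (the model name, category label, and 0.5 threshold are illustrative):

    import spacy
    from spacy_llm.util import assemble

    textcat_nlp = spacy.load("my_textcat_model")   # cheap supervised classifier
    llm_nlp = assemble("my_config.cfg")            # expensive LLM pipeline

    texts = ["...abstract 1...", "...abstract 2..."]
    relevant = [
        doc.text
        for doc in textcat_nlp.pipe(texts)
        if doc.cats.get("CLINICAL_TRIAL", 0.0) >= 0.5
    ]
    llm_docs = list(llm_nlp.pipe(relevant))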

  30. Annotations are still important
    ● Evaluation
    ○ Every project requires at least a representative evaluation set!
    ○ Measure performance of single components
    ○ Measure performance of the full pipeline end-to-end
    ○ Measure progress while changing/fine-tuning the pipeline
    ● Training a supervised model
    ○ Smaller, specialized models can be more cost-efficient
    ● Tuning an LLM
    ○ Providing "difficult" examples as few-shot examples in the prompt
    ○ Actually running a fine-tuning step on your LLM

  31. LLM-assisted annotation (NER)
    Curate zero-shot LLM predictions

  32. LLM-assisted annotation (REL)

  33. Summary
    ● LLMs are great for quick prototyping and for bootstrapping annotation
    ● NLP solutions need to balance various metrics including accuracy, reliability,
    maintainability, customizability and cost
    ○ Mix and match LLMs with supervised models or rule-based components
    ○ spaCy pipelines are very versatile
    ○ Easily swap out one component while keeping other components in the pipeline the same
    ● spacy-llm lets you easily integrate LLMs into structured NLP pipelines
    ○ Swap out backends easily, switching between closed-source LLMs (API) and open-source ones
    ○ Use built-in standard NLP tasks
    ○ Write your own custom task, fine-tune the prompt, etc

  34. Thanks!
    ● Contact
    ○ sofi[email protected]
    ○ http://www.oxykodit.com
    ○ https://www.linkedin.com/in/sofievanlandeghem/
    ○ https://twitter.com/OxyKodit
    ○ https://explosion.ai/tailored-solutions
    ● Resources
    ○ https://github.com/explosion/spacy-llm/
    ○ https://spacy.io/usage/large-language-models
    ○ https://prodi.gy/docs/large-language-models
    Core maintainer of spaCy @ Explosion
    NLP and ML freelancer @ OxyKodit