Slide 1

Slide 1 text

spacy-llm
From quick prototyping with LLMs to more reliable and efficient NLP solutions

Sofie Van Landeghem, PhD.
Core maintainer of spaCy @ Explosion
NLP and ML freelancer @ OxyKodit

Slide 2

Slide 2 text

Briefly about me ...

● 2008 - 2012: PhD in BioNLP, mostly working on Biomedical event extraction
  ○ SVMs were the new kid on the block - feature engineering is fun!
● 2013 - 2014: PostDoc in Bioinformatics, combining BioNLP with network analysis
  ○ Seeing the rise of word embeddings with the publication of the word2vec paper by Mikolov et al.
● 2015 - 2018: Data scientist @ Johnson & Johnson
  ○ Bridging the gap between state-of-the-art NLP and (often harsh) business reality
● 2019 - 2024: Freelancing + maintainer of open-source NLP toolbox spaCy
  ○ Front row seat to the transformer-based revolution and LLM disruption

Sofie Van Landeghem - NLP Community - January 23, 2024

Slide 3

Slide 3 text

This talk ...

● Will explain the design principles of the open-source toolbox spaCy
● Will showcase how to use its recent plugin spacy-llm to perform rapid prototyping with Large Language Models (LLMs)
● Will demonstrate how to move beyond a prototype into a more reliable, efficient, and maintainable solution
● Will use clinical trial analysis as an example application

Slide 4

Slide 4 text

Clinical trial abstract

Hemodynamic Effects of Phenylephrine, Vasopressin, and Epinephrine in Children With Pulmonary Hypertension: A Pilot Study

Objectives: During a pulmonary hypertensive crisis, the marked increase in pulmonary vascular resistance can result in acute right ventricular failure and death. Currently, there are no therapeutic guidelines for managing an acute crisis. This pilot study examined the hemodynamic effects of phenylephrine, arginine vasopressin, and epinephrine in pediatric patients with pulmonary hypertension.

Design: In this prospective, open-label, nonrandomized pilot study, we enrolled pediatric patients previously diagnosed with pulmonary hypertension who were scheduled electively for cardiac catheterization. Primary outcome was a change in the ratio of pulmonary-to-systemic vascular resistance. Baseline hemodynamic data were collected before and after the study drug was administered.

Patients: Eleven of 15 participants were women, median age was 9.2 years (range, 1.7-14.9 yr), and median weight was 26.8 kg (range, 8.5-55.2 kg). Baseline mean pulmonary artery pressure was 49 ± 19 mm Hg, and mean indexed pulmonary vascular resistance was 10 ± 5.4 Wood units. Etiology of pulmonary hypertension varied, and all were on systemic pulmonary hypertension medications.

Interventions: Patients 1-5 received phenylephrine 1 μg/kg; patients 6-10 received arginine vasopressin 0.03 U/kg; and patients 11-15 received epinephrine 1 μg/kg. Hemodynamics was measured continuously for up to 10 minutes following study drug administration.

Measurements and main results: After study drug administration, the ratio of pulmonary-to-systemic vascular resistance decreased in three of five patients receiving phenylephrine, five of five patients receiving arginine vasopressin, and three of five patients receiving epinephrine. Although all three medications resulted in an increase in aortic pressure, only arginine vasopressin consistently resulted in a decrease in the ratio of systolic pulmonary artery-to-aortic pressure.

Conclusions: This prospective pilot study of phenylephrine, arginine vasopressin, and epinephrine in pediatric patients with pulmonary hypertension showed an increase in aortic pressure with all drugs although only vasopressin resulted in a consistent decrease in the ratio of pulmonary-to-systemic vascular resistance. Studies with more subjects are warranted to define optimal dosing strategies of these medications in an acute pulmonary hypertensive crisis.

Stephanie L Siehr, Jeffrey A Feinstein, Weiguang Yang, Lynn F Peng, Michelle T Ogawa, Chandra Ramamoorthy. Pediatr Crit Care Med (2016) PMID: 27144689

Slide 5

Slide 5 text

Clinical trial abstract - treatment groups

Design: In this prospective, open-label, nonrandomized pilot study, we enrolled pediatric patients previously diagnosed with pulmonary hypertension who were scheduled electively for cardiac catheterization. (...)

Patients: Eleven of 15 participants were women, median age was 9.2 years (range, 1.7-14.9 yr), and median weight was 26.8 kg (range, 8.5-55.2 kg). Baseline mean pulmonary artery pressure was 49 ± 19 mm Hg, and mean indexed pulmonary vascular resistance was 10 ± 5.4 Wood units. Etiology of pulmonary hypertension varied, and all were on systemic pulmonary hypertension medications.

Interventions: Patients 1-5 received phenylephrine 1 μg/kg; patients 6-10 received arginine vasopressin 0.03 U/kg; and patients 11-15 received epinephrine 1 μg/kg. Hemodynamics was measured continuously for up to 10 minutes following study drug administration.

Slide 6

Slide 6 text

Clinical trial abstract - outcomes

Design: (...) Primary outcome was a change in the ratio of pulmonary-to-systemic vascular resistance. (...)

Measurements and main results: After study drug administration, the ratio of pulmonary-to-systemic vascular resistance decreased in three of five patients receiving phenylephrine, five of five patients receiving arginine vasopressin, and three of five patients receiving epinephrine. Although all three medications resulted in an increase in aortic pressure, only arginine vasopressin consistently resulted in a decrease in the ratio of systolic pulmonary artery-to-aortic pressure.

Slide 7

Slide 7 text

Clinical trial abstract - NLP output

Ideally, you want your NLP solution to extract a structured summary:

          # patients   Treatment Drug         Treatment Dose   Outcome: Decreased ratio of PVR to SVR
Group 1   5            phenylephrine          1 μg/kg          3
Group 2   5            arginine vasopressin   0.03 U/kg        5
Group 3   5            epinephrine            1 μg/kg          3

Slide 8

Slide 8 text

NLP complexity of this challenge (1)

● Named Entities like drugs, diseases, ...
  ○ Standard NLP challenge, pre-trained models often exist
  ○ https://github.com/AstraZeneca/KAZU ;-)
● Treatment dose and frequency
  ○ Probably pretty doable with some type of pattern matching
● Patient/treatment groups
  ○ Non-standard, challenging NLP target
  ○ Groups can be unique because of different prior conditions, prior treatments, patient characteristics, behavioural patterns, treatment drug or dose, treatment frequency, ...
  ○ Group size can be mentioned in various ways, e.g. "5 patients" or "3 women and 2 men"
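To make the "pattern matching" remark for treatment doses concrete, here is a minimal sketch using only Python's standard `re` module. The pattern, the list of units and the function name `find_doses` are assumptions made for illustration; they are not part of spaCy or spacy-llm, and a real project would probably need a broader set of units.

```python
import re

# Hypothetical pattern: a number, then a unit, optionally followed by "/kg".
# Both the Greek mu and the micro sign are accepted for microgram.
DOSE_PATTERN = re.compile(
    r"(?P<amount>\d+(?:\.\d+)?)\s*"
    r"(?P<unit>(?:[μµ]g|mcg|mg|g|U|IU|mL|ml)(?:/kg)?)"
    r"(?![A-Za-z])"  # the unit must not run into a longer word
)


def find_doses(text):
    """Return (amount, unit) pairs for every dose-like mention in the text."""
    return [
        (float(match.group("amount")), match.group("unit"))
        for match in DOSE_PATTERN.finditer(text)
    ]


sentence = (
    "Patients 1-5 received phenylephrine 1 μg/kg; patients 6-10 received "
    "arginine vasopressin 0.03 U/kg; and patients 11-15 received epinephrine 1 μg/kg."
)
doses = find_doses(sentence)
for amount, unit in doses:
    print(amount, unit)
```

Patient numbers such as "1-5" or "11-15" are not picked up because no unit follows them. Linking each dose to its drug and patient group is the harder part and is not attempted here.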

Slide 9

Slide 9 text

NLP complexity of this challenge (2)

● Primary/secondary endpoints of the study
  ○ Can be partly dictionary-based: e.g. "Progression-free survival", "PFS", "CR", ...
  ○ Can be much more complex: e.g. "a change in the ratio of pulmonary-to-systemic vascular resistance"
● Outcomes per patient group & endpoint
  ○ Requires linking across different paragraphs, resolving references and synonyms, and understanding complex sentence structures
  ○ As NLP tasks go, this one's pretty challenging!
  ○ With a classical NLP approach, might require a mix of NER, entity linking, coreference resolution, relation extraction

Slide 10

Slide 10 text

LLMs to the rescue? Generative vs predictive tasks

● LLMs are primarily used for text generation
  ○ Often user-facing tasks
  ○ Text summarization, question answering, writing a poem, etc.
● They can be useful for structured NLP as well
  ○ Extracting structured attributes such as named entities, part-of-speech tags, ...
  ○ Better allows automated integration with downstream applications
    ■ Can we extract a structured table of results from the clinical trial abstract?

Slide 11

Slide 11 text

Prototype with ChatGPT - prompt

Summarize the trial results in a structured fashion like so:

Patient group:
Number of patients in the group:
Treatment drug or substance:
Treatment dose:
Treatment frequency of administration:
Treatment duration:
Outcome:

Slide 12

Slide 12 text

Prototype with ChatGPT - output

Patient group: Phenylephrine Group
Number of patients in the group: 5
Treatment drug or substance: Phenylephrine
Treatment dose: 1 μg/kg
Treatment frequency of administration: Single administration
Treatment duration: Not specified
Outcome: The ratio of pulmonary-to-systemic vascular resistance decreased in three of five patients receiving phenylephrine. Increase in aortic pressure observed.

Patient group: Arginine Vasopressin Group
Number of patients in the group: 5
Treatment drug or substance: Arginine vasopressin
Treatment dose: 0.03 U/kg
Treatment frequency of administration: Single administration
Treatment duration: Not specified
Outcome: The ratio of pulmonary-to-systemic vascular resistance decreased in all five patients receiving arginine vasopressin. Increase in aortic pressure observed. Consistent decrease in the ratio of systolic pulmonary artery-to-aortic pressure observed.

Patient group: Epinephrine Group
(...)

Slide 13

Slide 13 text

A few observations ...

● ChatGPT manages to link information that is spread across paragraphs, e.g. the drug + dose + outcome per patient group
● ChatGPT assumes that the absence of any mention of administration frequency in the abstract means "single administration"
● ChatGPT is able to admit what it doesn't know, e.g. "Not specified" for the Treatment duration (which is, indeed, not specified in the abstract)
● Not bad for a quick prototype!

Slide 14

Slide 14 text

Disadvantages of using the ChatGPT web interface

● No API
● No batching
● No robustness
● No data privacy
● No reproducibility
● Whole new meaning of "black box"

Slide 15

Slide 15 text

spacy-llm: Integrating LLMs into structured NLP pipelines

● Support for external APIs (OpenAI, Cohere, Anthropic, ...) as well as open-source models (via HuggingFace)
● Built-in support for various standard NLP tasks such as text classification, NER, relation extraction, text summarization, ...
● Relatively easy to implement your own custom tasks
● https://github.com/explosion/spacy-llm

Slide 16

Slide 16 text

spacy-llm follows the main design principles of spaCy

● Free, open-source library
● Designed for production use
● Focus on developer productivity
  ○ Built-in functionality to help you hit the ground running
  ○ Customizability & extensibility of the framework to implement anything your use-case needs
● Reproducibility of experiments by using a detailed config file
● Use rich data structures for results and metadata
● Break NLP challenge down into a pipeline of highly specific chained tasks
● https://github.com/explosion/spacy

Slide 17

Slide 17 text

Built-in zero-shot NER with spacy-llm

my_config.cfg

[nlp]
lang = "en"
pipeline = ["llm"]
batch_size = 128

[components]

[components.llm]
factory = "llm"

[components.llm.model]
@llm_models = "spacy.GPT-4.v2"

[components.llm.task]
@llm_tasks = "spacy.NER.v2"
labels = ["Drug", "Dose"]

my_script.py

from spacy_llm.util import assemble

# Path to the config file shown above
config_path = "my_config.cfg"

# _read_trial is a helper that is not shown on the slide
text = _read_trial(pmid=27144689)
nlp = assemble(config_path)
doc = nlp(text)

Slide 18

Slide 18 text

Zero-shot NER with LLMs

● Performance highly dependent on the label(s)
  ○ How commonly known these types of entities are
  ○ How descriptive & accurate the label text is, e.g.
    ■ "Dose" vs. "TreatmentDose"
    ■ "Drug" vs. "Chemical" etc
● Reproducibility can be tricky because the LLM's responses may vary
  ○ For classification (not generation) tasks, you'll typically want to set temperature to 0.0
  ○ You can provide model-specific parameters in the config file:

[components.llm.model]
@llm_models = "spacy.GPT-4.v2"
config = {"seed": 342, "temperature": 0.0}

Slide 19

Slide 19 text

Few-shot NER: "Chain-of-thought" prompting

● Based on the PromptNER paper by Ashok and Lipton (2023)
● Asking the LLM to explain its reasoning - giving it "tokens to think"
● Reimplemented in spacy-llm, available as spacy.NER.v3
● Increase of 15 percentage points in F-score on an internal use-case
● Works best when providing label definitions and examples - these allow you to tune the prompt towards the desired results

Slide 20

Slide 20 text

Built-in few-shot NER with spacy-llm (1)

my_config.cfg

[components.llm.task]
@llm_tasks = "spacy.NER.v3"
labels = ["Drug", "Dose"]
description = Entities are drugs or their doses.
    They can be uppercased, title-cased, or lowercased.
    Each occurrence of an entity in the text should be extracted.

[components.llm.task.label_definitions]
Drug = "A medicine or drug given to a patient as a treatment. Can be a generic name or brand name, e.g. paracetamol, Aspirin"
Dose = "The measured quantity (dose) of a certain medicine given to patients, e.g. 1mg. This should exclude the drug name."

[components.llm.task.examples]
@misc = "spacy.FewShotReader.v1"
path = "my_fewshot.json"

Slide 21

Slide 21 text

Built-in few-shot NER with spacy-llm (2)

my_fewshot.json

[
  {
    "text": "The patient was given 1mg of paracetamol.",
    "spans": [
      {
        "text": "paracetamol",
        "is_entity": true,
        "label": "Drug",
        "reason": "is a drug name, used as medication"
      },
      {
        "text": "1mg",
        "is_entity": true,
        "label": "Dose",
        "reason": "is the quantity or dose of the given medication"
      },
      {
        "text": "patient",
        "is_entity": false,
        "label": "==NONE==",
        "reason": "is a person, not a drug or dose"
      }
    ]
  }
]

Slide 22

Slide 22 text

Implementation of a custom task

my_task.py

from spacy_llm.registry import registry

INSTRUCTION = "Summarize the following clinical trial in a structured fashion. (...)"

# Register the task so the config file can refer to it by name
@registry.llm_tasks("tutorial.TrialSummary.v1")
def make_trial_task() -> "TrialSummaryTask":
    return TrialSummaryTask(INSTRUCTION)

# LLMTask is used as on the slide; its import is not shown there
class TrialSummaryTask(LLMTask):
    def __init__(self, instruction: str):
        self.instruction = instruction

    def generate_prompts(self, docs):
        # One prompt per document: the instruction followed by the text
        for doc in docs:
            yield self.instruction + "\n\n" + doc.text

    def parse_responses(self, docs, responses):
        # Receives the docs together with the LLM responses
        ...

my_config.cfg

[nlp]
lang = "en"
pipeline = ["llm"]
batch_size = 128

[components]

[components.llm]
factory = "llm"

[components.llm.model]
@llm_models = "spacy.GPT-4.v2"

[components.llm.task]
@llm_tasks = "tutorial.TrialSummary.v1"

Slide 23

Slide 23 text

Output from the spacy-llm pipeline using GPT-4

Patient group: Group 1
Number of patients in the group: 5
Treatment drug or substance: Phenylephrine
Treatment dose: 1 μg/kg
Treatment frequency of administration: Single dose
Treatment duration: Up to 10 minutes following study drug administration
Outcome: The ratio of pulmonary-to-systemic vascular resistance decreased in three of five patients. An increase in aortic pressure was observed.

Patient group: Group 2
Number of patients in the group: 5
Treatment drug or substance: Arginine vasopressin
Treatment dose: 0.03 U/kg
Treatment frequency of administration: Single dose
Treatment duration: Up to 10 minutes following study drug administration
Outcome: The ratio of pulmonary-to-systemic vascular resistance decreased in all five patients. An increase in aortic pressure was observed. Arginine vasopressin consistently resulted in a decrease in the ratio of systolic pulmonary artery-to-aortic pressure.

Patient group: Group 3
(...)
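A response in this "Field: value" layout can be split into one record per patient group with a few lines of plain Python. The sketch below assumes that every field sits on its own line and that each group starts with "Patient group:"; the function name `parse_summary` is hypothetical, and this is not the parsing code used in the talk.

```python
# Field names copied from the prompt; anything else is ignored.
FIELDS = {
    "Patient group",
    "Number of patients in the group",
    "Treatment drug or substance",
    "Treatment dose",
    "Treatment frequency of administration",
    "Treatment duration",
    "Outcome",
}


def parse_summary(response):
    """Split an LLM response into one dict per patient group."""
    groups = []
    for line in response.splitlines():
        # Split on the first colon only, so values may contain colons.
        key, separator, value = line.partition(":")
        key = key.strip()
        if not separator or key not in FIELDS:
            continue  # blank line or unexpected text
        if key == "Patient group":
            groups.append({})  # a new group starts here
        if groups:
            groups[-1][key] = value.strip()
    return groups


response = """Patient group: Group 1
Number of patients in the group: 5
Treatment drug or substance: Phenylephrine
Treatment frequency of administration: Single dose

Patient group: Group 2
Number of patients in the group: 5
Treatment drug or substance: Arginine vasopressin
Treatment frequency of administration: Single dose
"""
groups = parse_summary(response)
print(groups[0]["Treatment drug or substance"])
```

All values stay strings at this point, so the number of patients is "5" rather than 5. Converting and standardizing the values is a separate step.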

Slide 24

Slide 24 text

Unfortunately, the output is still text ...

● The "outcome" field contains full sentences
● Different ways LLMs express a "single" treatment frequency:
  ○ Administered once
  ○ Single administration
  ○ One-time dose
  ○ One time
  ○ Single dose
  ○ One-time administration
  ○ Once
● You still need to post-process the results to structured fields
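One simple way to post-process the frequency field is a lookup table over the variants listed above. This is a minimal sketch: the canonical value "1/trial" and the function name `normalize_frequency` are choices made for this illustration, and anything outside the table is returned as None so that it can be reviewed by hand.

```python
import re

# The variants listed above, lowercased, all meaning "administered once".
SINGLE_DOSE_VARIANTS = {
    "administered once",
    "single administration",
    "one-time dose",
    "one time",
    "single dose",
    "one-time administration",
    "once",
}


def normalize_frequency(raw):
    """Map a free-text frequency to a canonical value, or None if unknown."""
    # Lowercase, collapse whitespace and drop a trailing full stop.
    cleaned = re.sub(r"\s+", " ", raw.strip().lower()).rstrip(".")
    if cleaned in SINGLE_DOSE_VARIANTS:
        return "1/trial"
    return None


for raw in ["Single dose", "One-time  administration", "Twice daily"]:
    print(repr(raw), "->", normalize_frequency(raw))
```

A lookup table only covers the phrasings seen so far, so new variants need to be added as they turn up in the LLM output.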

Slide 25

Slide 25 text

From text to structured data: a pipeline approach

.txt file -> spaCy pipeline -> Doc object

[nlp]
lang = "en"
pipeline = ["llm_ner", "entity_linker"]

.txt file -> llm_ner -> entity_linker -> Doc object

After llm_ner:
doc.ents
ent.text = "phenylephrine"
ent.label = "Drug"

After entity_linker:
doc.ents
ent.text = "phenylephrine"
ent.label = "Drug"
ent.kb_id = CHEMBL1215

Slide 26

Slide 26 text

From a trial text to structured output

.txt file -> llm_trial -> normalizer -> entity_linker -> Doc object

Parse responses:

{
  "Group 1": {
    "Number of patients": 5,
    "Drug": "Phenylephrine",
    "Dose": "1 μg/kg",
    "Frequency": "Single dose",
    "Outcome": "The ratio of pulmonary-to-systemic vascular resistance decreased in three of five patients."
  },
  "Group 2": ...
}

Standardize, summarize:

{
  "Group 1": {
    "Number of patients": 5,
    "Drug": "CHEMBL1215",
    "Dose": "1 μg/kg",
    "Frequency": "1/trial",
    "Outcome": {
      "Ratio PVR to SVR": {
        "Decrease": 3
      }
    }
  },
  "Group 2": ...
}
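The "standardize, summarize" step for the outcome field could start from a pattern like the one below, which turns "decreased in three of five patients" into a count. This is a rough sketch under strong assumptions: the regular expression, the number-word table and the function name `summarize_outcome` are all made up for this illustration, and only this one sentence shape is covered.

```python
import re

NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}

# Matches "decreased in three of five patients" and "decreased in all five patients".
OUTCOME_PATTERN = re.compile(
    r"(?P<direction>decreased|increased) in "
    r"(?:(?P<count>\w+) of |all )(?P<total>\w+) patients",
    re.IGNORECASE,
)


def to_int(token):
    """Convert "3" or "three" to 3; return None for anything else."""
    token = token.lower()
    if token.isdigit():
        return int(token)
    return NUMBER_WORDS.get(token)


def summarize_outcome(sentence):
    """Return e.g. {"Decrease": 3}, or None if the sentence does not match."""
    match = OUTCOME_PATTERN.search(sentence)
    if match is None:
        return None
    total = to_int(match.group("total"))
    # "all five patients" carries no separate count: every patient is counted.
    count = to_int(match.group("count")) if match.group("count") else total
    if total is None or count is None:
        return None
    direction = "Decrease" if match.group("direction").lower() == "decreased" else "Increase"
    return {direction: count}


print(summarize_outcome(
    "The ratio of pulmonary-to-systemic vascular resistance decreased in three of five patients."
))
```

Sentences such as "An increase in aortic pressure was observed." do not match and yield None. Hand-written rules like this get brittle quickly, which is one reason to consider a dedicated pipeline component instead.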

Slide 27

Slide 27 text

Swap out the LLM backend

● Closed-source LLM
  ○ Sometimes better accuracy out of the box
  ○ Service can be unreliable (time-out, rate limits, ...)
  ○ All data sent to a third party
  ○ Often costly

[components.llm]
factory = "llm"

[components.llm.model]
@llm_models = "spacy.GPT-4.v2"

● Open-source LLM
  ○ Can be customized / fine-tuned
  ○ More reliable
  ○ Requires dedicated hardware
  ○ Data privacy

[components.llm]
factory = "llm"

[components.llm.model]
@llm_models = "spacy.Mistral.v1"
name = "Mistral-7B-v0.1"

Slide 28

Slide 28 text

Swap out a component architecture

● LLM
  ○ Quick prototyping
  ○ Can be unreliable/unstable
  ○ Expensive

[components.my_ner]
factory = "llm"

[components.my_ner.task]
@llm_tasks = "spacy.NER.v3"

● Supervised Machine Learning
  ○ Manual annotation effort
  ○ Faster & more reliable inference
  ○ Train your own or source a pretrained model
  ○ Cost-efficient

[components.my_ner]
factory = "ner"

[components.my_ner]
source = "en_core_sci_lg"
name = "ner"

● Rules/patterns
  ○ Manual effort & maintenance burden
  ○ Higher customizability & interpretability

[components.my_ner]
factory = "span_ruler"
annotate_ents = true

Slide 29

Slide 29 text

Combine the best of both worlds (example)

● Machine learning textcat model
  ○ Identifies topics of sentences, paragraphs or full documents
  ○ Classifies the document as relevant or not (e.g. "Clinical trial abstract" or not?)
● LLM
  ○ Only processes those documents that were deemed relevant in the previous step
  ○ Avoids incurring unnecessary costs

.txt file -> textcat -> llm -> Doc object
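The gating idea can be sketched independently of any particular library: a cheap relevance check decides whether the expensive step runs at all. In the sketch below, `looks_like_trial` and `fake_llm` are stand-ins invented for illustration. A real pipeline would use a trained textcat component and an LLM component in their place.

```python
def gate(texts, is_relevant, expensive_step):
    """Run expensive_step only on the texts that is_relevant accepts."""
    for text in texts:
        result = expensive_step(text) if is_relevant(text) else None
        yield text, result


# Stand-in for a trained text classifier (assumption, keyword-based).
def looks_like_trial(text):
    lowered = text.lower()
    return "patients" in lowered and "study" in lowered


llm_calls = []


# Stand-in for the LLM step; it records each call so the cost is visible.
def fake_llm(text):
    llm_calls.append(text)
    return "structured summary placeholder"


documents = [
    "This pilot study examined the hemodynamic effects of phenylephrine in pediatric patients.",
    "Quarterly revenue grew by four percent.",
]
results = list(gate(documents, looks_like_trial, fake_llm))
print(len(llm_calls), "LLM call(s) for", len(documents), "documents")
```

Irrelevant documents are still returned, with None as their result, so downstream code keeps a one-to-one mapping between inputs and outputs.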

Slide 30

Slide 30 text

Annotations are still important

● Evaluation
  ○ Every project requires at least a representative evaluation set!
  ○ Measure performance of single components
  ○ Measure performance of the full pipeline end-to-end
  ○ Measure progress while changing/fine-tuning the pipeline
● Training a supervised model
  ○ Smaller and specialized models can be more cost-efficient
● Tuning an LLM
  ○ Providing "difficult" examples as few-shot examples in the prompt
  ○ Actually running a fine-tuning step on your LLM
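For an NER component, evaluating against an annotated set usually comes down to precision, recall and F-score over exact span matches. The sketch below computes these with plain Python sets. The spans are made-up character offsets into the sentence "The patient was given 1mg of paracetamol.", and the function name `precision_recall_f1` is an assumption for this example.

```python
def precision_recall_f1(gold, predicted):
    """Score predicted (start, end, label) spans against gold spans, exact match."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Gold annotation: "1mg" is a Dose, "paracetamol" is a Drug.
gold_spans = [(22, 25, "Dose"), (29, 40, "Drug")]
# Hypothetical prediction: finds the drug, misses the dose, and wrongly tags "patient".
predicted_spans = [(29, 40, "Drug"), (4, 11, "Drug")]

precision, recall, f1 = precision_recall_f1(gold_spans, predicted_spans)
print(f"P={precision:.2f} R={recall:.2f} F={f1:.2f}")
```

Running the same function on the same evaluation set after every change to the prompt, model or component gives a consistent measure of progress.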

Slide 31

Slide 31 text

LLM-assisted annotation (NER)

Curate zero-shot LLM predictions

Slide 32

Slide 32 text

LLM-assisted annotation (REL)

Slide 33

Slide 33 text

Summary

● LLMs are great for quick prototyping and for bootstrapping annotation
● NLP solutions need to balance various metrics including accuracy, reliability, maintainability, customizability and cost
  ○ Mix and match LLMs with supervised models or rule-based components
  ○ spaCy pipelines are very versatile
  ○ Easily swap out one component while keeping other components in the pipeline the same
● spacy-llm lets you easily integrate LLMs into structured NLP pipelines
  ○ Swap out backends easily, switching between closed-source LLMs (API) and open-source ones
  ○ Use built-in standard NLP tasks
  ○ Write your own custom task, fine-tune the prompt, etc

Slide 34

Slide 34 text

Thanks!

● Contact
  ○ sofi[email protected]
  ○ http://www.oxykodit.com
  ○ https://www.linkedin.com/in/sofievanlandeghem/
  ○ https://twitter.com/OxyKodit
  ○ https://explosion.ai/tailored-solutions
● Resources
  ○ https://github.com/explosion/spacy-llm/
  ○ https://spacy.io/usage/large-language-models
  ○ https://prodi.gy/docs/large-language-models

Core maintainer of spaCy @ Explosion
NLP and ML freelancer @ OxyKodit