Slide 1

Slide 1 text

Leveraging Large Language Models for Fair and Transparent Legal Decision-Making. Professor Danushka Bollegala, University of Liverpool.

Slide 2

Slide 2 text

Law needs help from NLP. The number of criminal cases in the Crown Court in the UK (source: https://www.gov.uk/government/statistics/criminal-court-statistics-quarterly-july-to-september-2023/criminal-court-statistics-quarterly-april-to-june-2023)

Slide 3

Slide 3 text


Slide 4

Slide 4 text

LLMs are impressive! GPT-4 Technical Report, OpenAI, 2023. Corresponds to the top 10% of human candidates.

Slide 5

Slide 5 text

A father lived with his son, who was an alcoholic. When drunk, the son often became violent and physically abused his father. As a result, the father always lived in fear. One night, the father heard his son on the front stoop making loud obscene remarks. The father was certain that his son was drunk and was terrified that he would be physically beaten again. In his fear, he bolted the front door and took out a revolver. When the son discovered that the door was bolted, he kicked it down. As the son burst through the front door, his father shot him four times in the chest, killing him. In fact, the son was not under the influence of alcohol or any drug and did not intend to harm his father. At trial, the father presented the above facts and asked the judge to instruct the jury on self-defense. How should the judge instruct the jury with respect to self-defense? (A) Give the self-defense instruction, because it expresses the defense's theory of the case. (B) Give the self-defense instruction, because the evidence is sufficient to raise the defense. (C) Deny the self-defense instruction, because the father was not in imminent danger from his son. (D) Deny the self-defense instruction, because the father used excessive force.

Slide 6

Slide 6 text


Slide 7

Slide 7 text

Large Language Models (LLMs) • Language modelling = predict the next word, given the prior context (see the formulation below)
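
As a minimal formal sketch of the statement above (standard autoregressive language modelling, not specific to any particular LLM), a model assigns a probability to a word sequence by repeatedly predicting the next word from the prior context:

% Autoregressive factorisation: each word is predicted from the words before it.
P(w_1, \dots, w_T) = \prod_{t=1}^{T} P_\theta(w_t \mid w_1, \dots, w_{t-1})

Training maximises this likelihood over a large text corpus; at inference time the model generates by sampling or selecting the next word w_t from P_\theta(\cdot \mid w_1, \dots, w_{t-1}).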

Slide 8

Slide 8 text

Legal LLMs also exist • LawGPT_zh • Chinese legal LLM based on ChatGLM-6B, LoRA fine-tuned by Shanghai Jiao Tong University • LawGPT [Nguyen+23] • Chinese legal LLM based on Chinese-LLaMA, fine-tuned on judgements from the China Judgement Document Network • LexiLaw • Fine-tuned ChatGLM-6B on BELLE-1.5M, legal QA, legal documents, laws and regulations, and legal reference books • Lawyer LLaMA [Huang+23, Touvron+23] • Chinese legal LLM, fine-tuned on China's national unified legal professional qualification examination and responses to legal consultations. [Lai+24]

Slide 9

Slide 9 text

Scaling Laws of LLMs. Compute, training dataset size, and model parameter count all show a power-law relationship with test loss (i.e. test performance) [Kaplan+2020]; see the sketch below.
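
A hedged sketch of the functional form reported by Kaplan et al. [2020]: when not bottlenecked by the other two factors, test loss falls as a power law in each resource, with empirically fitted constants and small exponents (roughly in the 0.05-0.1 range in that paper):

% Power-law scaling of test loss L (Kaplan et al., 2020); N_c, D_c, C_c and the
% exponents alpha_N, alpha_D, alpha_C are fitted empirically.
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N} \quad \text{(non-embedding parameters } N)
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D} \quad \text{(training tokens } D)
L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C} \quad \text{(training compute } C)

On a log-log plot each relationship appears as a straight line.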

Slide 10

Slide 10 text

GenAI is great!

Slide 11

Slide 11 text

Image credit: DALL-E

Slide 12

Slide 12 text

LLMs when used in the Legal Decision Making Process must … • Be accurate (adhere to the current law and precedents) • How to ensure that decisions are made within a specific legal system, jurisdiction, or country? • How to "update" the legal knowledge of the LLM when laws change (and they change a lot!)? • fine-tuning, knowledge editing, RAG, … (a minimal RAG sketch follows below) • However, training only on the final decisions might not be enough. The arguments considered in court could also be valuable training data, but they are difficult to obtain.
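
The RAG option mentioned above can be sketched as follows. This is a minimal, hypothetical sketch: STATUTES, retrieve and call_llm are illustrative stand-ins (the statute snippets are paraphrased and truncated), not the pipeline of any specific legal LLM.

# Minimal sketch of retrieval-augmented generation (RAG) for legal QA.
# Hypothetical components: STATUTES is a toy document store and
# call_llm stands in for whatever LLM endpoint is actually used.

STATUTES = {
    "theft-act-1968-s1": "A person is guilty of theft if he dishonestly appropriates property belonging to another ...",
    "fraud-act-2006-s2": "A person is in breach of this section if he dishonestly makes a false representation ...",
}

def retrieve(query: str, store: dict, k: int = 2) -> list[str]:
    """Rank documents by naive keyword overlap with the query (toy retriever)."""
    q_tokens = set(query.lower().split())
    scored = sorted(
        store.items(),
        key=lambda kv: len(q_tokens & set(kv[1].lower().split())),
        reverse=True,
    )
    return [f"[{doc_id}] {text}" for doc_id, text in scored[:k]]

def call_llm(prompt: str) -> str:
    """Placeholder for an actual LLM call (API or local model)."""
    return "<model answer grounded in the retrieved passages>"

def answer(question: str) -> str:
    passages = "\n".join(retrieve(question, STATUTES))
    prompt = (
        "Answer the question using ONLY the statutes quoted below, "
        "and cite the statute identifiers you rely on.\n\n"
        f"Statutes:\n{passages}\n\nQuestion: {question}\nAnswer:"
    )
    return call_llm(prompt)

print(answer("When is appropriation of property considered theft?"))

Because the up-to-date statutes live in the retrieval store rather than in the model weights, updating the law only requires updating the store, not re-training the LLM.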

Slide 13

Slide 13 text

LLMs when used in the Legal Decision Making Process must … • Provide explanations • Right to Explanation: justice must not only be done, but must be seen to be done, and, without an explanation, the required transparency is missing [Atkinson+20] • Logic expressions as explanations [Sun+24] • LegalGPT: CoT reasoning for law [Shi+24] (see the prompt sketch below) • Explanations to whom? At what level? • Applications: legal case retrieval systems, patent essentiality review systems. Atkinson+Bollegala [2023]
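
A hedged illustration of what a chain-of-thought instruction for a legal question can look like; the prompt wording below is hypothetical and is not taken from LegalGPT [Shi+24].

# Illustrative chain-of-thought (CoT) prompt for a legal question.
# The instruction wording is hypothetical; real systems use their own
# carefully designed prompts.

question = (
    "The father shot his son as the son burst through the bolted door. "
    "Should the judge give a self-defense instruction to the jury?"
)

cot_prompt = (
    f"Question: {question}\n"
    "Let's think step by step:\n"
    "1. State the legal test for self-defense that applies here.\n"
    "2. Identify which facts are relevant to each element of the test.\n"
    "3. Explain whether the evidence is sufficient to raise the defense.\n"
    "4. Give the final instruction the judge should issue, with reasons.\n"
)

print(cot_prompt)  # this prompt would then be sent to the LLM

The intermediate steps give a human reviewer something to inspect, even though the trace itself is not guaranteed to reflect the model's actual computation.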

Slide 14

Slide 14 text

LLMs when used in the Legal Decision Making Process must … • Be consistent • Legal decisions must be consistent (i.e. if the crime is the same, then the punishment should also be the same) • However, LLMs are not deterministic • Over-sensitivity to prompts • The temperature of the decoder influences the sampling (see the decoding sketch below) • Hallucinations can suddenly appear for no particular reason.
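
A minimal sketch, using the Hugging Face transformers library as one possible stack, of how greedy decoding removes the randomness introduced by temperature sampling; the tiny gpt2 model is only a stand-in, and deterministic decoding by itself does not fix prompt sensitivity or hallucinations.

# Greedy decoding removes the sampling randomness controlled by temperature:
# with do_sample=False the model always picks the most likely next token,
# so repeated runs on the same prompt give the same output.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model; a legal deployment would use something larger
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "If the crime is the same, the punishment should be"
inputs = tokenizer(prompt, return_tensors="pt")

output_ids = model.generate(
    **inputs,
    do_sample=False,      # greedy decoding: no temperature sampling
    max_new_tokens=30,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))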

Slide 15

Slide 15 text

Prompt Sensitivity. LLMs are sometimes insensitive to negations [Hosseini+21, Ettinger'20, Hossain+20, Ryoma'24]. Performance can vary significantly due to nonsensical changes in prompts [Ishibashi+23].

Slide 16

Slide 16 text

Hallucinations • LLMs do not always produce the correct answer. Pro tip: never use LLMs to generate answers that you cannot verify yourself!

Slide 17

Slide 17 text

LLMs when used in the Legal Decision Making Process must … • Be unbiased. Article 21 - Non-discrimination (EU Charter of Fundamental Rights): Any discrimination based on any ground such as sex, race, colour, ethnic or social origin, genetic features, language, religion or belief, political or any other opinion, membership of a national minority, property, birth, disability, age or sexual orientation shall be prohibited. Protected Characteristics (UK Equality Act 2010): age, disability, gender reassignment, marriage and civil partnership, pregnancy and maternity, race, religion or belief, sex, sexual orientation.

Slide 18

Slide 18 text

Social Biases in GenAI. Images created by DALL-E (OpenAI). https://www.vice.com/en/article/wxdawn/the-ai-that-draws-what-you-type-is-very-racist-shocking-no-one

Slide 19

Slide 19 text

Social Biases in GenAI. Images generated by DALL-E (OpenAI).

Slide 20

Slide 20 text

Social Biases in GenAI. Images generated by Stable Diffusion (Stability AI). "Janitor", "Assertive Firefighter". https://techpolicy.press/researchers-find-stable-diffusion-amplifies-stereotypes/

Slide 21

Slide 21 text

DreamStudio — CEO. Prompt: A CEO working hard at the desk in a company office.

Slide 22

Slide 22 text

DreamStudio — Janitor. Prompt: Janitor cleaning an office floor

Slide 23

Slide 23 text

DreamStudio — Janitor. Prompt: Janitor cleaning an office floor without wearing a mask

Slide 24

Slide 24 text

Bias Suppression in LLMs. Preamble: "Despite being a female, Haley became an engineering manager." Test case 1: "Anne was a skilled surgeon, who conducted many complex surgeries." Test case 2: "John was a skilled surgeon, who conducted many complex surgeries." Downstream evaluation: HellaSwag commonsense reasoning. In-Contextual Gender Bias Suppression for Large Language Models: Oba, Kaneko, Bollegala. EACL 2024.
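
A hedged sketch of the preamble idea illustrated above: a counter-stereotypical preamble is prepended to each test input before the LLM scores or continues it. The helper names are hypothetical, and this is not the exact experimental procedure of Oba, Kaneko and Bollegala [EACL 2024].

# Sketch of in-context bias suppression via a textual preamble: the
# counter-stereotypical preamble is simply prepended to each test input
# before the LLM scores or continues it. Helper names are hypothetical.

PREAMBLE = "Despite being a female, Haley became an engineering manager. "

test_cases = [
    "Anne was a skilled surgeon, who conducted many complex surgeries.",
    "John was a skilled surgeon, who conducted many complex surgeries.",
]

def build_debiased_input(text: str, preamble: str = PREAMBLE) -> str:
    """Prepend the bias-suppressing preamble to the test sentence."""
    return preamble + text

for case in test_cases:
    prompt = build_debiased_input(case)
    # score_with_llm(prompt) would return e.g. the log-likelihood the LLM
    # assigns to the sentence; comparing the two cases reveals gender bias.
    print(prompt)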

Slide 25

Slide 25 text

Unconscious Biases in LLMs • Chain-of-Thought (CoT) requires LLMs to provide intermediary explanations for their inferences. • Can CoT make LLMs aware of their unconscious social biases? • Multi-step Gender Bias Reasoning: an unbiased LLM would not count gender-neutral occupational words as male or female. CoT instruction: "Let's think step by step."
[Embedded paper excerpt: Figure 1, example of the multi-step gender bias reasoning task; Table 1, bias scores of 17 LLMs under different prompt types on the MGBR benchmark (female/male scores separated by '/'); Figure 2, accuracy of Few-shot, Few-shot+Debiased, and Few-shot+CoT prompting.]
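
A hedged illustration of the counting-style probe described above; the word list and wording are invented for illustration and are not items from the actual benchmark.

# Illustrative multi-step bias probe: ask the LLM to count the words in the
# list that refer to women. "nurse" and "secretary" are gender-neutral
# occupation words, so an unbiased model should answer 2 (mother, queen).
words = ["mother", "engineer", "nurse", "queen", "secretary", "doctor"]

probe = (
    f"Word list: {', '.join(words)}\n"
    "Question: How many of these words refer to women?\n"
    "Let's think step by step.\n"
)
print(probe)
# If the model counts "nurse" or "secretary" as female, the stereotype has
# leaked into its answer; the CoT trace makes that step visible.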

Slide 26

Slide 26 text

Temporal Social Biases. Most social biases remain constant over (a short period of) time [Zhou+ EMNLP'24].

Slide 27

Slide 27 text

Challenges of LLMs — $$$ • Training LLMs from scratch is beyond academic budgets • Llama 3 was trained on 24K H100 GPUs (≈ USD 720M) • Fine-tuning is also expensive, even if Parameter-Efficient Fine-Tuning (PEFT) methods are used (see the LoRA sketch below) • In-context learning (prompting) is the only alternative in most cases (especially with closed models such as GPT-4) • Use Retrieval-Augmented Generation (RAG) if you have a larger dataset.
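
A minimal PEFT sketch using the Hugging Face peft library as one possible toolchain; the configuration values are illustrative defaults rather than a recipe from any particular legal LLM.

# Minimal LoRA (PEFT) sketch: instead of updating all model weights, small
# low-rank adapter matrices are trained, which drastically cuts the cost of
# fine-tuning. Shown on a tiny model (facebook/opt-125m) for illustration.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

lora_config = LoraConfig(
    r=8,                  # rank of the adapter matrices
    lora_alpha=16,        # scaling factor for the adapter updates
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in OPT
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights are trainable
# `model` can now be fine-tuned on legal text with a standard training loop
# (e.g. the transformers Trainer) at a fraction of full fine-tuning cost.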

Slide 28

Slide 28 text

Final Remarks • The legal sector needs the help of NLP • NLP helps to prioritise claims and provide legal advice to a wide group of customers (e.g. medical negligence) [Bevan+IJCAI'19, Torissi+JURIX'19] • Passing the bar exam ≠ the ability to make legal decisions • Accuracy, explainability, consistency, and fairness are important traits • Lots of good innovations, but more work to be done…

Slide 29

Slide 29 text

Danushka Bollegala https://danushka.net [email protected] @Bollegala Thank You