Legal NLP - Breaking the Legal Language Barrier ?

Short Presentation at Stanford CodeX 2022 Future Law Conference

Featured Paper --
Ilias Chalkidis, Abhik Jana, Dirk Hartung, Michael Bommarito, Ion Androutsopoulos, Daniel Martin Katz & Nikolaos Aletras, LexGLUE: A Benchmark Dataset for Legal Language Understanding in English, In Proceedings of the 60th Annual Meeting of the ACL - Association for Computational Linguistics (2022 Forthcoming)

arXiv Version - https://arxiv.org/abs/2110.00976

We would welcome your contributions and feedback regarding adding additional tasks!


Daniel Martin Katz

April 08, 2022


  1. Legal NLP—Breaking the Legal Language Barrier? Dirk Hartung & Daniel

    Martin Katz Presentation at Stanford CodeX Future Law 2022—Law, Education and Experience Talks (LEX)
  2. Daniel Martin Katz Illinois Tech Chicago-Kent Bucerius Law School Stanford

    CodeX Dirk Hartung Bucerius Law School Stanford CodeX Your Presenters

    and DOMAIN-SPECIFIC JARGON Where Natural Language is the Coin of the Realm …
  4. The Encoding of this Neural Net Begins Early On …

    Indeed, many of my law students consider their initial foray into the field … as an exercise in learning a ‘new language’ Law School as a Neural Network Encoder 🧠
  5. Specialized Dictionaries In Support of the Linguistic Immersion Program, We

    Offer Our Students a Steady Diet of Language Text Based Summaries Case Books
  6. Law / Lawyering is (in part) an exercise in linguistic

    construction and interpretation But Law is Not Just About the Consumption of Natural Language
  7. Text Production at a Massive Scale: Lawyers, Judges, Regulators …

    are massive producers of text. Briefs, Memos, Statutes, Opinions, Regulations, Contracts: these are just some of the legal work product being produced on a daily basis across the world’s various legal systems
  8. Stanford CodeX TechIndex - 1800+ Companies and Counting … Significant

    Growth in LegalTech Over the Last Decade techindex.law.stanford.edu
  9. There are a wide variety of companies and solutions which

    have been developed … Tools to help individuals & organizations navigate the scale and complexity of the law … Helping Navigate the Scale and Complexity of the Law Perspectives on Legal Complexity from some of our other work
  10. Many (most) of these software offerings had to have some

    account for natural language because …
  11. In Law, most roads lead to a document … And

    that document is very likely to be expressed in natural language … Most Roads in Law Lead to a Document
  12. Despite laudable efforts to move the law away from natural

    language and toward code … We are unlikely to see natural language displaced as the method to encode the law for the foreseeable future … Law as Code, Code as Law and the (Stubborn) Persistence of Natural Language
  13. So in order to make progress on certain problems we

    are likely to have to confront natural language … Good News is that there is a subfield in CS / AI whose primary focus is at the intersection of language and computation … Some Good News from Comp Sci
  14. YOU ARE HERE Language Computer Science Natural Language Processing (Computational

    Linguistics) NLP is a Branch of AI NLP as a Branch of AI

  16. “It is the Statistical Representation of Language …

  17. Semantic Methods (Fairly Difficult) Syntax Methods (Fairly Easy) Historically, Big

    Divide between Semantics and Syntax
  18. “There have been a series of clever approaches to backdoor

    into semantics* … (*while also being scalable)
  19. Semantic Methods (Fairly Difficult) Syntax Methods (Fairly Easy) Historically, Big

    Divide between Semantics and Syntax Quasi-Semantic Methods
  20. The Age of ‘Neural’ NLP

  21. Word2Vec (2013)

  22. ELMo (2018)

  23. BERT (2019)

  24. The GPT Trilogy 2018 - Present

  25. Big Bird 2021

  28. “Okay, that is general NLP, but what about ‘LEGAL NLP’ … ?”

  30. Historically there were a fairly limited number of commercial applications

    which leveraged LEGAL NLP…
  31. In this talk, Richard Susskind notes that when he entered

    the scene in the early 1980s, there were fewer than 40 papers on AI+Law TOTAL (let alone NLP+Law) reinventlawchannel.com/richard-susskind-future-of-artificial-intelligence-and-law
  32. Both the AI + Law Conference and the Jurix conference

    thereafter began to focus on these topics … With a strong focus on academic topics such as legal argumentation / legal reasoning …
  33. The 2010s is the Decade Where the Academic and Commercial

    Worlds Began to Really Collide …
  34. But if we look at both the academic and commercial

    sphere, we still observe a fairly thin account for legal language … Certainly as compared to humans and expert lawyers … But this is not an uncommon issue across the NLP world
  35. The need to understand Sub-Dialects of English is a familiar

    problem …
  36. So the Scientific / Engineering task at hand is to

    improve the performance of Legal NLP Models … By further breaking down the legal language barrier By grafting broader NLP developments to Domain Specific Needs in Law

  39. Present • Most current NLP in Law Survey ◦ Zhong

    et al. ◦ ACL Main Conference 2020 ◦ Embedding/Symbol-based • 3 applications ◦ Judgment prediction ◦ Case similarity ◦ Legal Q’n’A • A good starting point to understand the state of the field in 2020 aclanthology.org/2020.acl-main.466.pdf
  40. Present • Domain-Specific NLP toolkit ◦ Bommarito et al. ◦

    2018 ◦ NLTK for law • Python package ◦ Standard NLP capabilities ◦ Legal information extraction ◦ Embeddings and classifiers ◦ Legal lexica arxiv.org/abs/1806.03688
  41. Present • Led by Our Co-Author Ilias Chalkidis (who is

    one of the world’s most published Legal NLP experts) • Legal-BERT is an effort to PreTrain on Legal Information • 12 GB of diverse English legal text from several fields (e.g., legislation, court cases, contracts) scraped from publicly available resources • Demonstrated Utility of Pre-Training on Model Performance arxiv.org/abs/2010.02559
  42. Present • Latest large data set contribution ◦ Zheng et

    al. ◦ Stanford ◦ ICAIL 2021 • The role of legal language domain specificity ◦ Legal NLP tasks are too easy ◦ Problems often not meaningful for practitioners ◦ Task similarity determines meaningfulness of pretraining dl.acm.org/doi/abs/10.1145/3462757.3466088
  43. Foundations

  44. Inspiration • A diverse dataset ◦ Tasks ◦ Set Size

    ◦ Text Genres ◦ Degrees of Difficulty • A diagnostic dataset to evaluate and analyze model performance • Public Leaderboard and Visualization gluebenchmark.com
  45. Inspiration (Super) GLUE

  46. Our LexGLUE Colleagues! buceri.us/lexglue

  47. Dataset Overview

  48. Model Overview

  49. Results per Data Set

  50. Overall Aggregated Scores

  51. How to engage? github.com/coastalcph/lex-glue

  54. Future for LexGLUE Future for LegalNLP

  55. Where is LexGLUE going strategically and tech-wise? • Extend the

    benchmark to datasets in other languages • Reduce legal restrictions inhibiting the creation of datasets • Develop better anonymization tools to free data • Add human evaluation to ground truth • Integrated submission environment • Automatic evaluation and Leaderboard updates • Incorporate approaches for Transformer-based models and long documents • Build models which leverage document structure • Curate a large-scale legal pre-training corpus • Create even larger legal language models
  56. Summary Points on the Future of LexGLUE • Build Proper

    Infrastructure including a Leaderboard • Extend the benchmark to datasets in other languages • Expand the number of tasks • Include More Difficult Tasks as part of a ‘Super LexGLUE’ • Curate a large-scale legal pre-training corpus (Legal C4) ◦ (more on this in a second)
  57. Future for LegalNLP

  58. Training Two Forms of Neural Networks: As a law professor I live

    this dualism … between training my students’ neural networks and working with these neural networks …
  59. Take a Look at the Information Diet And Compare this

    to the Info Diet of My Law Students
  60. How Might One Optimize this Pre-Training Diet ? No Real

    Pre-Training Diet Optimization has been undertaken thus far … What is the Ideal Pre-Training Mixture? General Language / Legal Language. Which pre-training ‘diet’ will perform best? On a Per Task Basis / On a Cross / Overall Task Basis
  61. Daniel Martin Katz Illinois Tech Chicago-Kent Bucerius Law School Stanford

    CodeX Dirk Hartung Bucerius Law School Stanford CodeX Your Presenters
  62. Our LexGLUE Colleagues! buceri.us/lexglue

  63. Legal NLP—Breaking the Legal Language Barrier? Dirk Hartung & Daniel

    Martin Katz Presentation at Stanford CodeX Future Law 2022—Law, Education and Experience Talks (LEX)
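
One way to make the pre-training ‘diet’ question from slide 60 concrete is to compare candidate mixtures both per task and on an aggregated, cross-task basis (in the spirit of the aggregated scores on slide 50). The following is a minimal Python sketch under stated assumptions: the mixture names, the task names, and every score are hypothetical placeholders, not results from the talk or from the LexGLUE paper, and the three means used here (arithmetic, harmonic, geometric) are only one plausible way to aggregate across tasks.

```python
from statistics import fmean, geometric_mean, harmonic_mean

# HYPOTHETICAL placeholder scores (NOT results from the talk or the paper):
# one score per task for three made-up pre-training mixtures.
SCORES = {
    "general_only": {"task_a": 0.70, "task_b": 0.60, "task_c": 0.80},
    "half_legal": {"task_a": 0.74, "task_b": 0.68, "task_c": 0.79},
    "legal_only": {"task_a": 0.72, "task_b": 0.71, "task_c": 0.75},
}


def aggregate(task_scores):
    """Collapse per-task scores into three cross-task summary means."""
    values = list(task_scores.values())
    return {
        "arithmetic": fmean(values),
        "harmonic": harmonic_mean(values),
        "geometric": geometric_mean(values),
    }


def best_per_task(scores):
    """For each task, return the mixture with the highest score."""
    tasks = next(iter(scores.values())).keys()
    return {
        task: max(scores, key=lambda diet: scores[diet][task])
        for task in tasks
    }


def best_overall(scores, mean="harmonic"):
    """Return the mixture with the highest aggregated score."""
    return max(scores, key=lambda diet: aggregate(scores[diet])[mean])


if __name__ == "__main__":
    print("Best per task:", best_per_task(SCORES))
    for mean in ("arithmetic", "harmonic", "geometric"):
        print(f"Best overall ({mean} mean):", best_overall(SCORES, mean))
```

With these placeholder numbers each task favours a different mixture while a single mixture wins on every aggregate, which is the tension the per-task versus cross-task framing on slide 60 points at.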