Legal NLP - Breaking the Legal Language Barrier ?

Short Presentation at Stanford CodeX 2022 Future Law Conference

Featured Paper --
Ilias Chalkidis, Abhik Jana, Dirk Hartung, Michael Bommarito, Ion Androutsopoulos, Daniel Martin Katz & Nikolaos Aletras, LexGLUE: A Benchmark Dataset for Legal Language Understanding in English, in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022, forthcoming)

arXiv Version - https://arxiv.org/abs/2110.00976

We would welcome your contributions and feedback regarding adding additional tasks!

Daniel Martin Katz

April 08, 2022

Transcript

  1. Legal NLP—Breaking the Legal Language Barrier?
    Dirk Hartung & Daniel Martin Katz
    Presentation at Stanford CodeX Future Law 2022—Law, Education and Experience Talks (LEX)

  2. Your Presenters
    Daniel Martin Katz: Illinois Tech Chicago-Kent, Bucerius Law School, Stanford CodeX
    Dirk Hartung: Bucerius Law School, Stanford CodeX

  3. A LEGAL COMPLEXITY PICTURE
    LAW LAW LAND
    Featuring
    NATURAL LANGUAGE and
    DOMAIN-SPECIFIC JARGON
    Where Natural Language is the Coin of the Realm …

  4. Law School as a Neural Network Encoder 🧠
    The Encoding of this Neural Net Begins Early On …
    Indeed, many of my law students consider their initial foray into the field … as an exercise in learning a ‘new language’

  5. In Support of the Linguistic Immersion Program, We Offer Our Students a Steady Diet of Language
    Specialized Dictionaries
    Text-Based Summaries
    Casebooks

  6. Law / Lawyering is (in
    part) an exercise in
    linguistic construction
    and interpretation
    But Law is Not Just About the
    Consumption of Natural Language

  7. Text Production at a Massive Scale
    Lawyers, Judges and Regulators … are massive producers of text
    Briefs, Memos, Statutes, Opinions, Regulations, Contracts
    These are just some examples of the legal work product being produced on a daily basis across the world’s various legal systems

  8. Significant Growth in LegalTech Over the Last Decade
    Stanford CodeX TechIndex: 1800+ Companies and Counting …
    techindex.law.stanford.edu

  9. Helping Navigate the Scale and Complexity of the Law
    A wide variety of companies and solutions have been developed …
    Tools to help individuals & organizations navigate the scale and complexity of the law …
    Perspectives on Legal Complexity from some of our other work

  10. Many (most) of these software offerings had to account for natural language in some way because …

  11. Most Roads in Law Lead to a Document
    In Law, most roads lead to a document …
    And that document is very likely to be expressed in natural language …

  12. Law as Code, Code as Law and the (Stubborn) Persistence of Natural Language
    Despite laudable efforts to move the law away from natural language and toward code …
    We are unlikely to see natural language displaced as the method to encode the law for the foreseeable future …

  13. Some Good News from Comp Sci
    So in order to make progress on certain problems we are likely to have to confront natural language …
    The good news is that there is a subfield in CS / AI whose primary focus is at the intersection of language and computation …

  14. NLP as a Branch of AI
    [Diagram labels: Language; Computer Science; Natural Language Processing (Computational Linguistics); “YOU ARE HERE”]

  15. WHAT IS A ROUGH
    DEFINITION OF
    NLP ?

  16. “It is the Statistical
    Representation of
    Language …
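
As a minimal sketch of what a "statistical representation" of language can look like, the snippet below turns two short sentences into word-count vectors over a shared vocabulary (a bag-of-words model). The example sentences and the helper names `tokenize` and `bag_of_words` are illustrative assumptions, not material from the talk, and real systems use far richer representations.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(docs):
    """Return (vocabulary, vectors): one word-count vector per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    # Shared, alphabetically sorted vocabulary across all documents.
    vocab = sorted(set().union(*counts))
    # A Counter returns 0 for words that a document does not contain.
    vectors = [[c[w] for w in vocab] for c in counts]
    return vocab, vectors


# Hypothetical example sentences (not taken from the talk).
docs = [
    "The tenant shall pay the rent.",
    "The landlord shall return the deposit.",
]
vocab, vectors = bag_of_words(docs)
print(vocab)
for doc, vec in zip(docs, vectors):
    print(vec, doc)
```

Each sentence becomes a row of counts; word order is discarded, which is one reason such purely count-based representations capture little of a text's meaning.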

  17. Historically, Big Divide between Semantics and Syntax
    Semantic Methods (Fairly Difficult)
    Syntax Methods (Fairly Easy)

  18. “There have been a series of
    clever approaches to
    backdoor into semantics* …
    (*while also being scalable)

  19. Historically, Big Divide between Semantics and Syntax
    Semantic Methods (Fairly Difficult)
    Syntax Methods (Fairly Easy)
    Quasi-Semantic Methods

  20. The Age of ‘Neural’ NLP

  21. Word2Vec (2013)

  22. ELMo (2018)

  23. BERT (2019)

  24. The GPT Trilogy
    2018 - Present

  25. Big Bird
    2021

  28. “Okay that is general
    NLP but what about
    ‘LEGAL NLP’ … ?

  29. PAST PRESENT FUTURE
    LEGAL NLP

  30. Historically there were a fairly
    limited number of commercial
    applications which leveraged
    LEGAL NLP…

  31. In this talk, Richard Susskind notes that when he entered the scene in the early 1980s, there were fewer than 40 papers on AI+Law TOTAL (let alone NLP+Law)
    reinventlawchannel.com/richard-susskind-future-of-artificial-intelligence-and-law

  32. Both the AI + Law
    Conference and the Jurix
    conference thereafter
    began to focus on these
    topics …
    With a strong focus on
    academic topics such as
    legal argumentation /
    legal reasoning …

  33. The 2010s is the Decade Where the Academic and Commercial Worlds Began to Really Collide

  34. But if we look at both the academic and commercial spheres, we still observe a fairly thin account of legal language …
    Certainly as compared to humans and expert lawyers …
    But this is not an uncommon issue across the NLP world

  35. The need to understand
    Sub-Dialects of English
    is a familiar problem …

  36. So the Scientific / Engineering task at hand is to
    improve the performance of Legal NLP Models …
    By further breaking down the legal language barrier
    By grafting broader NLP developments onto Domain-Specific Needs in Law

  38. PAST PRESENT FUTURE
    LEGAL NLP

  39. Present
    ● Most current NLP in Law Survey
    ○ Zhong et al.
    ○ ACL Main Conference 2020
    ○ Embedding/Symbol-based
    ● 3 applications
    ○ Judgment prediction
    ○ Case similarity
    ○ Legal Q&A
    ● A good starting point to understand the state of the field in 2020
    aclanthology.org/2020.acl-main.466.pdf

  40. Present
    ● Domain-Specific NLP toolkit (LexNLP)
    ○ Bommarito et al.
    ○ 2018
    ○ NLTK for law
    ● Python package
    ○ Standard NLP capabilities
    ○ Legal information extraction
    ○ Embeddings and classifiers
    ○ Legal lexica
    arxiv.org/abs/1806.03688

  41. Present
    ● Led by Our Co-Author Ilias Chalkidis (who is one of the world’s most published Legal NLP experts)
    ● Legal-BERT is an effort to Pre-Train on Legal Information
    ● 12 GB of diverse English legal text from several fields (e.g., legislation, court cases, contracts) scraped from publicly available resources
    ● Demonstrated Utility of Pre-Training on Model Performance
    arxiv.org/abs/2010.02559

  42. Present
    ● Latest large data set contribution
    ○ Zheng et al.
    ○ Stanford
    ○ ICAIL 2021
    ● The role of legal language domain specificity
    ○ Legal NLP tasks are too easy
    ○ Problems often not meaningful for practitioners
    ○ Task similarity determines meaningfulness of pretraining
    dl.acm.org/doi/abs/10.1145/3462757.3466088

  43. Foundations

  44. Inspiration
    ● A diverse dataset
    ○ Tasks
    ○ Set Size
    ○ Text Genres
    ○ Degrees of Difficulty
    ● A diagnostic dataset to evaluate and analyze model performance
    ● Public Leaderboard and Visualization
    gluebenchmark.com

  45. Inspiration (Super) GLUE

  46. Our LexGLUE Colleagues!
    buceri.us/lexglue

  47. Dataset Overview

  48. Model Overview

  49. Results per Data Set

  50. Overall Aggregated Scores
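
A benchmark with several tasks needs some way to collapse per-task scores into one number per model; the LexGLUE paper reports arithmetic, harmonic and geometric means over tasks. The sketch below shows that aggregation using Python's standard library. The task names and scores are made-up placeholders, not results from the paper, and the function name `aggregate` is hypothetical.

```python
from statistics import geometric_mean, harmonic_mean, mean

# Placeholder per-task scores (e.g. F1 in percent); NOT results from the paper.
scores = {"task_1": 80.0, "task_2": 60.0, "task_3": 40.0}


def aggregate(per_task):
    """Collapse per-task scores into three summary statistics."""
    values = list(per_task.values())
    return {
        "arithmetic": mean(values),
        "harmonic": harmonic_mean(values),
        "geometric": geometric_mean(values),
    }


summary = aggregate(scores)
for name, value in summary.items():
    print(f"{name:>10}: {value:.1f}")
```

The harmonic and geometric means are pulled down more strongly by a single weak task than the arithmetic mean is, so they reward models that perform consistently across tasks.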

  51. How to engage?
    github.com/coastalcph/lex-glue

  53. PAST PRESENT FUTURE
    LEGAL NLP

  54. Future for LexGLUE
    Future for LegalNLP

  55. Where is LexGLUE going strategically and tech-wise ?
    ● Extend the benchmark to datasets in other languages
    ● Reduce legal restrictions inhibiting the creation of datasets
    ● Develop better anonymization tools to free data
    ● Add human evaluation to ground truth
    ● Integrated submission environment
    ● Automatic evaluation and Leaderboard updates
    ● Incorporate approaches for Transformer-based models and long documents
    ● Build models which leverage document structure
    ● Curate a large-scale legal pre-training corpus
    ● Create even larger legal language models

  56. Summary Points on the Future of LexGLUE
    ● Build Proper Infrastructure including a Leaderboard
    ● Extend the benchmark to datasets in other languages
    ● Expand the number of tasks
    ● Include More Difficult Tasks as part of a ‘Super LexGLUE’
    ● Curate a large-scale legal pre-training corpus (Legal C4)
    ○ (more on this in a second)

  57. Future for LegalNLP

  58. Training Two Forms of Neural Networks
    As a law professor I live this dualism … between training my students’ neural networks and working with these neural networks …

  59. Take a Look at the Information Diet
    And Compare this to the Info Diet of My Law Students

  60. How Might One Optimize this Pre-Training Diet ?
    No Real Pre-Training Diet Optimization has
    been undertaken thus far …
    What is the Ideal Pre-Training Mixture ?
    General Language
    Legal Language
    Which pre-training ‘diet’ will perform best ?
    On a Per Task Basis
    On a Cross /Overall Task Basis
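
The two questions above (best diet per task, best diet overall) could be answered from a table of evaluation results along the following lines. This is only a sketch under stated assumptions: the legal-text shares, task names and scores are invented placeholders rather than measurements, and `best_per_task` and `best_overall` are hypothetical helper names.

```python
from statistics import mean

# Hypothetical results: share of legal text in the pre-training mix
# -> per-task score. All numbers are placeholders, NOT measurements.
results = {
    0.0: {"task_1": 70.0, "task_2": 55.0},
    0.5: {"task_1": 74.0, "task_2": 61.0},
    1.0: {"task_1": 72.0, "task_2": 64.0},
}


def best_per_task(results):
    """For each task, return the legal-text share with the highest score."""
    tasks = next(iter(results.values())).keys()
    return {t: max(results, key=lambda share: results[share][t]) for t in tasks}


def best_overall(results):
    """Return the legal-text share with the highest mean score across tasks."""
    return max(results, key=lambda share: mean(list(results[share].values())))


print("best per task:", best_per_task(results))
print("best overall:", best_overall(results))
```

With these placeholder numbers the per-task and overall winners differ, which illustrates why the two questions are worth asking separately.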

  61. Your Presenters
    Daniel Martin Katz: Illinois Tech Chicago-Kent, Bucerius Law School, Stanford CodeX
    Dirk Hartung: Bucerius Law School, Stanford CodeX

  62. Our LexGLUE Colleagues!
    buceri.us/lexglue

  63. Legal NLP—Breaking the Legal Language Barrier?
    Dirk Hartung & Daniel Martin Katz
    Presentation at Stanford CodeX Future Law 2022—Law, Education and Experience Talks (LEX)
