Legal NLP - Breaking the Legal Language Barrier ?

Short Presentation at Stanford CodeX 2022 Future Law Conference

Featured Paper --
Ilias Chalkidis, Abhik Jana, Dirk Hartung, Michael Bommarito, Ion Androutsopoulos, Daniel Martin Katz & Nikolaos Aletras, LexGLUE: A Benchmark Dataset for Legal Language Understanding in English, In Proceedings of the 60th Annual Meeting of the ACL - Association for Computational Linguistics (2022 Forthcoming)

arXiv Version - https://arxiv.org/abs/2110.00976

We would welcome your contributions and feedback regarding adding additional tasks!

Daniel Martin Katz

April 08, 2022

Transcript

  1. Legal NLP—Breaking the Legal Language Barrier?
    Dirk Hartung & Daniel Martin Katz
    Presentation at Stanford CodeX Future Law 2022—Law, Education and Experience Talks (LEX)

  2. Daniel Martin Katz
    Illinois Tech Chicago-Kent
    Bucerius Law School
    Stanford CodeX
    Dirk Hartung
    Bucerius Law School
    Stanford CodeX
    Your Presenters

  3. A LEGAL COMPLEXITY PICTURE
    LAW LAW LAND
    Featuring
    NATURAL LANGUAGE and
    DOMAIN-SPECIFIC JARGON
    Where Natural Language is the Coin of the Realm …

  4. The Encoding of this
    Neural Net Begins Early On …
    Indeed, many of my law
    students consider their initial
    foray into the field … as an
    exercise in learning a ‘new
    language’
    Law School as a Neural Network Encoder
    🧠

  5. Specialized
    Dictionaries
    In Support of the Linguistic Immersion Program,
    We Offer Our Students a Steady Diet of Language
    Text Based
    Summaries
    Case
    Books

  6. Law / Lawyering is (in
    part) an exercise in
    linguistic construction
    and interpretation
    But Law is Not Just About the
    Consumption of Natural Language

  7. Text Production at a Massive Scale
    these are just some of the
    legal work products being
    produced on a daily basis
    across the world’s various
    legal systems
    …are massive
    producers of
    text
    Lawyers
    Judges
    Regulators
    Briefs
    Memos
    Statutes
    Opinions
    Regulations
    Contracts

  8. Stanford CodeX
    TechIndex -
    1800+ Companies
    and Counting …
    Significant Growth in LegalTech Over the Last Decade
    techindex.law.stanford.edu

  9. There are a wide variety of
    companies and solutions which
    have been developed …
    Tools to help individuals &
    organizations navigate the scale
    and complexity of the law …
    Helping Navigate the Scale and Complexity of the Law
    Perspectives on Legal Complexity from
    some of our other work

  10. Many (most) of these software
    offerings had to take some
    account of natural language
    because …

  11. In Law, most roads lead to a
    document …
    And that document is very likely to
    be expressed in natural language …
    Most Roads in Law Lead to a Document

  12. Despite laudable efforts to move
    the law away from natural language and
    toward code …
    We are unlikely to see natural language
    displaced as the method to encode the
    law for the foreseeable future …
    Law as Code, Code as Law and the
    (Stubborn) Persistence of Natural Language

  13. So in order to make progress on
    certain problems we are likely to have
    to confront natural language …
    Good News is that there is a subfield
    in CS / AI whose primary focus is at
    the intersection of language and
    computation …
    Some Good News from Comp Sci

  14. YOU
    ARE
    HERE
    Language Computer
    Science
    Natural Language Processing
    (Computational Linguistics)
    NLP is a Branch of AI
    NLP as a Branch of AI

  15. WHAT IS A ROUGH
    DEFINITION OF
    NLP ?

  16. “It is the Statistical
    Representation of
    Language …”
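
One very simple illustration of a statistical representation of language is a bag-of-words count. The sketch below is not taken from the talk: the function name `bag_of_words` and the sample clause are hypothetical, and real NLP pipelines use far richer representations.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text, split it into alphabetic tokens, and count each token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


# Hypothetical sample text, made up for illustration only.
clause = "The tenant shall pay the rent. The landlord shall repair the premises."
counts = bag_of_words(clause)

# Show the three most frequent tokens and their counts.
print(counts.most_common(3))
```

The counts discard word order entirely, which is part of why purely statistical representations of this kind struggle with meaning.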

  17. Semantic Methods
    (Fairly Difficult)
    Syntax Methods
    (Fairly Easy)
    Historically, Big Divide between Semantics and Syntax

  18. “There has been a series of
    clever approaches to
    backdoor into semantics* …”
    (*while also being scalable)

  19. Semantic Methods
    (Fairly Difficult)
    Syntax Methods
    (Fairly Easy)
    Historically, Big Divide between Semantics and Syntax
    Quasi-Semantic
    Methods

  20. The
    Age of
    ‘Neural’
    NLP

  21. Word2Vec (2013)

  22. The GPT Trilogy
    2018 - Present

  23. Big Bird
    2021

  24. “Okay, that is general
    NLP, but what about
    ‘LEGAL NLP’ …?”

  25. PAST PRESENT FUTURE
    LEGAL NLP

  26. Historically there were a fairly
    limited number of commercial
    applications which leveraged
    LEGAL NLP…

  27. In this talk, Richard Susskind notes that when he
    entered the scene in the early 1980s, there were
    fewer than 40 papers on AI+Law TOTAL
    (let alone NLP+Law)
    reinventlawchannel.com/richard-susskind-future-of-artificial-intelligence-and-law

  28. Both the AI + Law
    Conference and the Jurix
    conference thereafter
    began to focus on these
    topics …
    With a strong focus on
    academic topics such as
    legal argumentation /
    legal reasoning …

  29. The 2010s Are the Decade Where the Academic
    and Commercial Worlds Began to Really Collide

  30. But if we look at both the academic and
    commercial spheres, we still observe a
    fairly thin account of legal language …
    Certainly as compared to humans and
    expert lawyers …
    But this is not an uncommon issue across
    the NLP world

  31. The need to understand
    Sub-Dialects of English
    is a familiar problem …

  32. So the Scientific / Engineering task at hand is to
    improve the performance of Legal NLP Models …
    By further breaking down the legal language barrier
    By grafting broader NLP developments to Domain
    Specific Needs in Law

  33. PAST PRESENT FUTURE
    LEGAL NLP

  34. Present
    ● Most recent survey of NLP in law
    ○ Zhong et al.
    ○ ACL Main Conference 2020
    ○ Embedding/Symbol-based
    ● 3 applications
    ○ Judgment prediction
    ○ Case similarity
    ○ Legal question answering
    ● A good starting point to
    understand the state of the field
    in 2020
    aclanthology.org/2020.acl-main.466.pdf

  35. Present
    ● Domain-Specific NLP toolkit
    ○ Bommarito et al.
    ○ 2018
    ○ NLTK for law
    ● Python package
    ○ Standard NLP capabilities
    ○ Legal information
    extraction
    ○ Embeddings and classifiers
    ○ Legal lexica
    arxiv.org/abs/1806.03688

  36. Present
    ● Led by Our Co-Author Ilias
    Chalkidis (who is one of the world’s
    most published Legal NLP experts)
    ● Legal-BERT is an effort to Pre-Train
    on Legal Information
    ● 12 GB of diverse English legal text
    from several fields (e.g.,
    legislation, court cases, contracts)
    scraped from
    publicly available resources
    ● Demonstrated Utility of
    Pre-Training on Model Performance
    arxiv.org/abs/2010.02559

  37. Present
    ● Latest large data set contribution
    ○ Zheng et al.
    ○ Stanford
    ○ ICAIL 2021
    ● The role of legal language domain
    specificity
    ○ Legal NLP tasks are too easy
    ○ Problems often not
    meaningful for practitioners
    ○ Task similarity determines
    meaningfulness of pretraining
    dl.acm.org/doi/abs/10.1145/3462757.3466088

  38. Inspiration
    ● A diverse dataset
    ○ Tasks
    ○ Set Size
    ○ Text Genres
    ○ Degrees of Difficulty
    ● A diagnostic dataset to
    evaluate and analyze model
    performance
    ● Public Leaderboard and
    Visualization
    gluebenchmark.com

  39. Inspiration (Super) GLUE

  40. Our LexGLUE Colleagues!
    buceri.us/lexglue

  41. Dataset Overview

  42. Model Overview

  43. Results per Data Set

  44. Overall Aggregated Scores
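
The slide content itself did not survive in this transcript. As a hedged illustration of how per-task scores can be collapsed into a single figure, the sketch below computes arithmetic, harmonic, and geometric means with Python's standard `statistics` module. The task names and scores are hypothetical placeholders, not results from the paper, and this is not presented as the authors' exact procedure.

```python
from statistics import geometric_mean, harmonic_mean, mean

# Hypothetical per-task scores, for illustration only (not results from the paper).
task_scores = {"task_a": 0.80, "task_b": 0.60, "task_c": 0.70}


def aggregate(scores: dict) -> dict:
    """Summarize per-task scores with three different averages."""
    values = list(scores.values())
    return {
        "arithmetic": mean(values),
        "harmonic": harmonic_mean(values),
        "geometric": geometric_mean(values),
    }


summary = aggregate(task_scores)
for name, value in summary.items():
    print(f"{name} mean: {value:.4f}")
```

The harmonic and geometric means penalize a model that does poorly on any single task more than the arithmetic mean does, which is one reason a benchmark might report more than one aggregate.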

  45. How to engage?
    github.com/coastalcph/lex-glue

  46. PAST PRESENT FUTURE
    LEGAL NLP

  47. Future for LexGLUE
    Future for LegalNLP

  48. Where is LexGLUE going strategically and tech-wise ?
    Extend the benchmark to
    datasets in other languages
    Reduce legal restrictions
    inhibiting the creation of
    datasets
    Develop better anonymization
    tools to free data
    Add human evaluation to
    ground truth
    Integrated
    submission
    environment
    Automatic
    evaluation and
    Leaderboard
    updates
    Incorporate approaches for
    Transformer-based models and
    long documents
    Build models which
    leverage document
    structure
    Curate a
    large-scale
    legal
    pre-training
    corpus
    Create even
    larger legal
    language
    models

  49. Summary Points on the Future of LexGLUE
    ● Build Proper Infrastructure including a Leaderboard
    ● Extend the benchmark to datasets in other languages
    ● Expand the number of tasks
    ● Include More Difficult Tasks as part of a ‘Super LexGLUE’
    ● Curate a large-scale legal pre-training corpus (Legal C4)
    ○ (more on this in a second)

  50. Future for LegalNLP

  51. Training Two Forms of Neural Networks
    As a law professor I live this dualism …
    between training my
    students’ neural networks
    and working with these
    neural networks …

  52. Take a Look at the Information Diet
    And Compare this to the Info Diet of My Law Students

  53. How Might One Optimize this Pre-Training Diet ?
    No Real Pre-Training Diet Optimization has
    been undertaken thus far …
    What is the Ideal Pre-Training Mixture ?
    General Language
    Legal Language
    Which pre-training ‘diet’ will perform best ?
    On a Per Task Basis
    On a Cross /Overall Task Basis
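
The question of a pre-training mixture can be made concrete with a small sketch. Assuming two hypothetical toy corpora and a single hypothetical knob, `legal_fraction`, the code below samples a training batch in which each document comes from the legal corpus with that probability. It only illustrates the idea of a tunable 'diet'. It is not a pre-training pipeline, and none of these names come from the talk.

```python
import random


def mix_corpora(general, legal, legal_fraction, n_samples, seed=0):
    """Draw n_samples documents, choosing the legal corpus with probability legal_fraction."""
    if not 0.0 <= legal_fraction <= 1.0:
        raise ValueError("legal_fraction must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    batch = []
    for _ in range(n_samples):
        source = legal if rng.random() < legal_fraction else general
        batch.append(rng.choice(source))
    return batch


# Hypothetical toy corpora, made up for illustration only.
general_docs = ["news article", "encyclopedia entry", "forum post"]
legal_docs = ["court opinion", "statute", "contract clause"]

for fraction in (0.0, 0.5, 1.0):
    sample = mix_corpora(general_docs, legal_docs, fraction, n_samples=6)
    n_legal = sum(doc in legal_docs for doc in sample)
    print(f"legal_fraction={fraction}: {n_legal} of {len(sample)} documents are legal")
```

Answering the slide's question would mean training a model at each candidate fraction and comparing downstream scores, both per task and in aggregate.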

  54. Daniel Martin Katz
    Illinois Tech Chicago-Kent
    Bucerius Law School
    Stanford CodeX
    Dirk Hartung
    Bucerius Law School
    Stanford CodeX
    Your Presenters

  55. Our LexGLUE Colleagues!
    buceri.us/lexglue

  56. Legal NLP—Breaking the Legal Language Barrier?
    Dirk Hartung & Daniel Martin Katz
    Presentation at Stanford CodeX Future Law 2022—Law, Education and Experience Talks (LEX)