PyData Global 2022: Spancat

Named entity recognition models might not be able to handle a wide variety of spans, but Spancat certainly can! Within our open-source library for NLP, spaCy, we’ve created an NER model to handle overlapping and arbitrary text spans. Dive into named entity recognition, its limitations, and how we’ve solved them with a solution-focused talk and practical applications.

Victoria Slocum

December 05, 2022

Transcript

  1. Entities within Entities within Entities with Spancat

  2. Sam Bankman-Fried [PERSON] announced he would resign as
    CEO of FTX [ORG] amid the crisis in late 2022 [DATE].
    This drug helped my joint pain and increased my ability to be active.
    However, now I am starting to feel dizzy and get headaches.

  3. Sam Bankman-Fried [PERSON] announced he would resign as
    CEO of FTX [ORG] amid the crisis in late 2022 [DATE].
    This drug helped my joint pain and increased my ability to be active.
    However, now I am starting to feel dizzy and get headaches.
    Span labels: COND ×3, BENEFIT ×2, ADE ×2

  4. - What is named entity recognition?
    - What are the limitations of NER
    - Spancat - and how it’s different from NER

  5. Sam Bankman-Fried [PERSON] announced he would wind down
    operations at Alameda Research [ORG] and resigned as CEO
    of FTX [ORG] amid the crisis in late 2022 [DATE].

  6. import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Sam Bankman-Fried was the CEO of FTX")
    print([token.ent_iob_ for token in doc])
    >>> ['B', 'I', 'O', 'O', 'O', 'O', 'B']

    Sam [B-PER] Bankman-Fried [I-PER] was the CEO of FTX [B-ORG]
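
The IOB scheme on this slide assigns exactly one tag per token. As a minimal plain-Python sketch of that idea (it does not call spaCy; the `iob_tags` helper and the seven-token split are assumptions that mirror the slide's simplified tokenization):

```python
def iob_tags(n_tokens, spans):
    """Return one IOB tag per token for non-overlapping (start, end, label) spans."""
    tags = ["O"] * n_tokens          # every token starts "outside" an entity
    for start, end, _label in spans:
        tags[start] = "B"            # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I"            # remaining tokens inside the entity
    return tags


# Token split assumed to match the slide (hyphenated name kept as one token).
tokens = ["Sam", "Bankman-Fried", "was", "the", "CEO", "of", "FTX"]
spans = [(0, 2, "PER"), (6, 7, "ORG")]  # end index is exclusive

tags = iob_tags(len(tokens), spans)
print(tags)
```

Because each token holds a single tag, this representation has room for only one entity per token.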

  7. This drug helped my joint pain and increased my ability to be active.
    However, now I am starting to feel dizzy and get headaches.
    Span labels: COND ×3

  8. This is great for joint pain but it also caused headaches
    This is great for joint pain -> Text Classification
    it also caused headaches -> Text Classification
    More on this: https://explosion.ai/blog/healthsea

  9. This drug helped my joint pain and increased my ability to be active.
    However, now I am starting to feel dizzy and get headaches.
    Span labels: COND ×3, BENEFIT ×2, ADE ×2
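
To see why per-token tagging struggles with labels like these, here is a rough plain-Python sketch. The span boundaries are assumptions for illustration (the transcript does not record which tokens each label covers), and `encode_iob` is a hypothetical helper, not a spaCy function:

```python
def encode_iob(n_tokens, spans):
    """Encode (start, end, label) spans as IOB tags; fail if two spans share a token."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        for i in range(start, end):
            if tags[i] != "O":
                raise ValueError(
                    f"token {i} is already tagged {tags[i]}, cannot also tag it {label}"
                )
            tags[i] = ("B-" if i == start else "I-") + label
    return tags


tokens = ["This", "drug", "helped", "my", "joint", "pain"]
cond = (4, 6, "COND")        # "joint pain" (assumed boundary)
benefit = (2, 6, "BENEFIT")  # "helped my joint pain" (assumed boundary)

flat = encode_iob(len(tokens), [cond])  # a single layer of spans encodes fine

try:
    encode_iob(len(tokens), [cond, benefit])
    nested_ok = True
except ValueError as err:
    nested_ok = False
    print(err)

# A plain list of spans, by contrast, can hold both without conflict.
span_group = [cond, benefit]
```

The nested pair cannot be written as one tag per token, while a list of spans keeps both.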

  10. Spancat [COMPONENT]
    https://spacy.io/api/spancategorizer

  11. Customizability without compromising the developer experience
    - swappable components so you can customize to your specific use
    - sensible defaults and easy to get started with a project
    - transparent configuration and implementation
    More on this: https://explosion.ai/blog/spacy-design-concepts

  12. # Construction via add_pipe with default model
    spancat = nlp.add_pipe("spancat")

    # Construction via add_pipe with custom model
    config = {"model": {"@architectures": "my_spancat"}}
    spancat = nlp.add_pipe("spancat", config=config)

    - trainable component
    - different ways to get started
    - part of the text processing pipeline
    Text -> nlp [tokenizer -> tagger -> parser -> spancat -> ...] -> Doc

  13. config.cfg
    [components.spancat.suggester]
    @misc = "spacy.ngram_suggester.v1"
    sizes = [1,2,3]

    config.cfg
    [components.spancat.suggester]
    @misc = "custom_suggester.v1"
    max_output = 10

    Generate a config: https://spacy.io/usage/training#quickstart
    - the single source of truth, includes all settings and records all defaults
    - customize the architecture by swapping out components
    - preset with sensible defaults to get you started

  14. import spacy
    from spacy import displacy
    from spacy.tokens import Span

    text = "Welcome to the Bank of China."
    nlp = spacy.blank("en")
    doc = nlp(text)

    doc.spans["sc"] = [
        Span(doc, 3, 6, "ORG"),
        Span(doc, 5, 6, "GPE"),
    ]

    displacy.serve(doc, style="span")

    displaCy: https://spacy.io/usage/visualizers

  15. This has helped my joint pain [COND].
    Suggester (n-gram 2): This has | has helped | helped my | my joint | joint pain
    Classifier (label: condition):
    This has: COND 0.1
    has helped: COND 0.1
    helped my: COND 0.1
    my joint: COND 0.25
    joint pain: COND 0.99
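
The suggester step on this slide can be sketched in plain Python. This is only an illustration of n-gram candidate generation, not spaCy's implementation; `ngram_spans` is a hypothetical helper and the token list leaves out the final period:

```python
def ngram_spans(n_tokens, sizes):
    """Return (start, end) token offsets for every n-gram of the given sizes."""
    spans = []
    for size in sizes:
        for start in range(n_tokens - size + 1):
            spans.append((start, start + size))  # end index is exclusive
    return spans


tokens = ["This", "has", "helped", "my", "joint", "pain"]

# n-gram size 2, as on the slide
bigrams = [" ".join(tokens[s:e]) for s, e in ngram_spans(len(tokens), [2])]
print(bigrams)

# Several sizes at once produce many more candidates for the classifier to score.
all_candidates = ngram_spans(len(tokens), [1, 2, 3])
print(len(all_candidates))
```

Every candidate is then passed to the classifier, which is what makes overlapping predictions possible.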

  16. Suggested span: joint pain
    Tok2vec: [3,2] [1,8]

  17. Suggested span: joint pain
    Tok2vec: [3,2] [1,8]
    Pooling: First [3,2], Last [1,8], Mean [2,5], Max [3,8]
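
The pooling numbers on this slide can be reproduced with a few lines of plain Python. This is a sketch of the arithmetic only; `pool_span` is a hypothetical helper, and concatenating the four pooled vectors into one feature vector is an assumption of this example:

```python
def pool_span(vectors):
    """Pool a span's token vectors into first, last, mean and max vectors."""
    dims = range(len(vectors[0]))
    first = list(vectors[0])
    last = list(vectors[-1])
    mean = [sum(v[d] for v in vectors) / len(vectors) for d in dims]
    maximum = [max(v[d] for v in vectors) for d in dims]
    return first, last, mean, maximum


# Token vectors for "joint" and "pain", taken from the slide
token_vectors = [[3, 2], [1, 8]]

first, last, mean, maximum = pool_span(token_vectors)
print(first, last, mean, maximum)

# Assumed for illustration: one fixed-size feature vector per span
features = first + last + mean + maximum
```

Whatever the span length, the pooled result has the same size, so spans of different lengths can share one scoring layer.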

  18. Suggested span: joint pain
    Tok2vec: [3,2] [1,8]
    Pooling: First [3,2], Last [1,8], Mean [2,5], Max [3,8]
    Scoring: COND: 0.99, EFFECT: 0.06
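
The scores on this slide add up to more than 1, which is consistent with each label being scored independently. A minimal sketch of that idea, assuming a sigmoid per label; the logit values below are hypothetical and were chosen only so the output matches the slide:

```python
import math


def sigmoid(x):
    """Squash a raw score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def score_span(logits):
    """Turn per-label raw scores into independent per-label probabilities."""
    return {label: sigmoid(z) for label, z in logits.items()}


# Hypothetical raw scores for the suggested span "joint pain"
logits = {"COND": 4.6, "EFFECT": -2.75}

scores = score_span(logits)
rounded = {label: round(p, 2) for label, p in scores.items()}
print(rounded)
```

Since the labels do not compete with each other, one span can receive several labels or none.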

  19. Swappable suggester functions
    - Subtree suggester: syntactic dependencies
    - Chunk suggester: noun chunk iterator
    - Sentence suggester: full sentences
    Learn more: github.com/explosion/spacy-experimental#span-finder

  20. Swappable suggester functions
    - Subtree suggester: syntactic dependencies
    - Chunk suggester: noun chunk iterator
    - Sentence suggester: full sentences
    - n-gram suggester: a certain number of tokens
    Learn more: github.com/explosion/spacy-experimental#span-finder

  21. Swappable suggester functions
    - Subtree suggester: syntactic dependencies
    - Chunk suggester: noun chunk iterator
    - Sentence suggester: full sentences
    - n-gram suggester: a certain number of tokens
    - SpanFinder: machine learning approach, learns start and end tokens
    Learn more: github.com/explosion/spacy-experimental#span-finder
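
The "learns start and end tokens" idea can be sketched roughly as follows. This is not the spacy-experimental implementation: `pair_boundaries`, its parameters, and the probabilities are all hypothetical, and a trained model would supply the per-token probabilities in practice:

```python
def pair_boundaries(start_probs, end_probs, threshold=0.5, max_length=3):
    """Pair likely start tokens with likely end tokens into candidate spans."""
    starts = [i for i, p in enumerate(start_probs) if p >= threshold]
    ends = [i for i, p in enumerate(end_probs) if p >= threshold]
    spans = []
    for s in starts:
        for e in ends:
            length = e - s + 1
            if 1 <= length <= max_length:
                spans.append((s, e + 1))  # end index is exclusive
    return spans


tokens = ["This", "has", "helped", "my", "joint", "pain"]

# Made-up per-token probabilities of being a span start / span end
start_probs = [0.1, 0.0, 0.7, 0.1, 0.9, 0.0]
end_probs = [0.0, 0.0, 0.1, 0.0, 0.1, 0.95]

candidates = pair_boundaries(start_probs, end_probs, max_length=4)
texts = [" ".join(tokens[s:e]) for s, e in candidates]
print(texts)
```

Compared with an n-gram suggester, this kind of approach proposes far fewer candidates, at the cost of needing a trained boundary model.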

  22. import spacy
    from spacy.util import registry

    nlp = spacy.blank("en")
    doc = nlp("Sam Bankman-Fried, CEO of FTX.")

    build_suggester = registry.misc.get("spacy.ngram_suggester.v1")
    suggester = build_suggester(sizes=[1, 2, 3])

    spans = [doc[start:end] for (start, end) in suggester([doc]).data]

    >>> [Sam, Bankman-Fried, CEO, of, FTX,
         Sam Bankman-Fried, Bankman-Fried CEO, CEO of, of FTX,
         Sam Bankman-Fried CEO, Bankman-Fried CEO of, CEO of FTX]

  23. Explicit control of candidate spans
    - via suggester function
    - bias your model towards precision or recall

  24. Explicit control of candidate spans
    - via suggester function
    - bias your model towards precision or recall
    Access to confidence scores
    - label probabilities over the whole span
    - includes the full context of the span

  25. Explicit control of candidate spans
    - via suggester function
    - bias your model towards precision or recall
    Access to confidence scores
    - label probabilities over the whole span
    - includes the full context of the span
    Less edge-sensitivity
    - doesn’t predict single token-based tags
    - more useful for other types of phrases or overlapping spans
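
Having a confidence score per span gives one simple way to trade precision against recall: move the cutoff. The sketch below reuses the scores from the earlier "joint pain" slide; `keep_spans` is a hypothetical helper, and the cutoff plays the role that a setting such as spancat's `threshold` is assumed to play:

```python
def keep_spans(scores, threshold):
    """Keep only the candidate spans whose score reaches the threshold."""
    return [span for span, score in scores.items() if score >= threshold]


# COND scores per candidate span, as shown on the classifier slide
scores = {
    "This has": 0.1,
    "has helped": 0.1,
    "helped my": 0.1,
    "my joint": 0.25,
    "joint pain": 0.99,
}

strict = keep_spans(scores, 0.5)   # favours precision
lenient = keep_spans(scores, 0.2)  # favours recall
print(strict)
print(lenient)
```

A lower cutoff lets more spans through, including the weaker "my joint" candidate.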

  26. youtube.com/ExplosionAI
    - Annotate data for spancat
    - Speed up process with patterns and training temporary models
    - Tips and tricks for consistent annotation
    - ... and even more
    explosion.ai/blog/spancat
    - SpanCategorizer vs Named Entity Recognizer
    - How spancat works
    - Debug spancat datasets
    - Visualize spancat with displaCy
    - ... and more
    github.com/explosion/projects
    - Annotate data for spancat
    - See an application of spancat
    - Template for your project
    - ... and more

  27. Thank you for listening!
    - victoriaslocum.com
    - twitter.com/victorialslocum
    - linkedin.com/in/victorialslocum
    - victoria@explosion.ai