Natural Language Processing Expert Briefing @ PyData Global 2022

Slides for the presentation at the Expert Briefings @ PyData Global 2022

Speaker: Marco Bonzanini
https://www.twitter.com/marcobonzanini
https://marcobonzanini.com/

November 24, 2022

Transcript

  1. Natural Language Processing
    Trends, Challenges and Opportunities
    @MarcoBonzanini
    marcobonzanini.com
    PyData Global 2022

  2. © Bonzanini Consulting Ltd — BonzaniniConsulting.com
    Agenda for Today
    • Quick overview of NLP and current trends
    • Round table discussion
    • Your challenges?
    • Your success stories?

  3. Nice to meet you
    • Consulting, training and coaching on Python + Data Science
    • Chair @ PyData London

  8. Language is Challenging
    • Language is evolving
    • Language is ambiguous
    • (Understanding) Language requires context

  13. We need annotated data
    • Variability: domains and languages
    • Available data: sparse + biased?
    • Annotated data is the bottleneck
    • Vincent Warmerdam on Tools to Improve Training Data: https://www.youtube.com/watch?v=KRQJDLyc1uM

  19. Evolution of Models
    Bag-of-words
    Word Embeddings (circa 2013)
    “Traditional” ML models
    RNN/LSTM (circa 2015)
    Transformers (circa 2017)
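
The first step on this timeline, bag-of-words, can be sketched in a few lines of plain Python. This is a minimal illustration, not something shown in the talk: the `bag_of_words` helper, its regex tokeniser and the two example sentences are assumptions made for the example. It shows why the representation is simple but limited: word counts are kept, word order is thrown away.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return word counts for a text (hypothetical helper, naive tokeniser)."""
    # Lowercase the text and keep runs of letters/apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Two example sentences with the same words in a different order.
doc_a = "The dog bites the man"
doc_b = "The man bites the dog"

bow_a = bag_of_words(doc_a)
bow_b = bag_of_words(doc_b)

print(bow_a["the"])    # count of a single word in the first sentence
print(bow_a == bow_b)  # True: word order is not part of the representation
```

Both sentences map to the same bag although they mean different things, which is one reason the later models on the timeline try to capture context.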

  22. Transformers
    Attention is all you need (Vaswani et al., 2017)
    57K citations in November 2022

  23. Transformers
    • Parallelisation → training on bigger datasets
    • Fine-tuning on a specific task

  25. Bigger and Bigger Models
    • BERT-large (2018): 340M parameters
    • GPT-2 (2019): 1.5B parameters
    • GPT-3 (2020): 175B parameters
    • Galactica (2022): 120B parameters
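
As a rough back-of-the-envelope illustration of what these parameter counts imply, the sketch below estimates the storage needed for the raw weights alone. The 2 bytes per parameter (16-bit floats) is an assumption, and `weights_size_gb` is a hypothetical helper; running a model also needs memory for activations, and training needs considerably more for gradients and optimiser state.

```python
# Parameter counts as listed on the slide.
PARAMS = {
    "GPT-2 (2019)": 1.5e9,
    "GPT-3 (2020)": 175e9,
    "Galactica (2022)": 120e9,
}

# Assumption: each weight is stored as a 16-bit float (2 bytes).
BYTES_PER_PARAM_FP16 = 2


def weights_size_gb(n_params, bytes_per_param=BYTES_PER_PARAM_FP16):
    """Approximate size of the raw weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


for name, n_params in PARAMS.items():
    print(f"{name}: ~{weights_size_gb(n_params):,.0f} GB of weights in fp16")

# Growth factor between two consecutive generations of the same family.
growth = PARAMS["GPT-3 (2020)"] / PARAMS["GPT-2 (2019)"]
print(f"GPT-2 -> GPT-3: ~{growth:.0f}x more parameters")
```

Under these assumptions the largest models on the list need hundreds of gigabytes just to hold their weights, which is far more than a single consumer GPU offers.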

  28. Big Hype, Yet…
    Bolukbasi et al., 2016 NIPS
    • King - man + woman = Queen
    • Doctor - man + woman = Nurse?
    • Word embeddings are not “neutral”
    Bias in the data
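
The analogy arithmetic on this slide can be sketched with a handful of made-up vectors. The numbers below are hypothetical and hand-crafted for illustration (three dimensions loosely standing for royalty, gender and medicine); they are not taken from Bolukbasi et al. or from any trained model. The `analogy` helper is likewise an assumption, following the common convention of excluding the query words from the candidates. The toy vectors for "doctor" and "nurse" deliberately carry a gender component, mimicking the kind of bias that embeddings may pick up from their training data.

```python
import math

# Hypothetical toy embeddings: [royalty, gender (+1 male, -1 female), medicine].
EMBEDDINGS = {
    "man":    [0.0,  1.0, 0.0],
    "woman":  [0.0, -1.0, 0.0],
    "king":   [1.0,  1.0, 0.0],
    "queen":  [1.0, -1.0, 0.0],
    # Occupations that should be gender-neutral, but are not in this toy data.
    "doctor": [0.0,  0.5, 1.0],
    "nurse":  [0.0, -0.5, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, embeddings=EMBEDDINGS):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    target = [
        x - y + z
        for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    candidates = (w for w in embeddings if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(target, embeddings[w]))


print(analogy("king", "man", "woman"))    # queen
print(analogy("doctor", "man", "woman"))  # nurse
```

With a gender-neutral "doctor" vector the second analogy would have no reason to land on "nurse"; the result comes entirely from the bias built into the data.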

  29. Big Hype, Yet…
    https://twitter.com/Michael_J_Black/status/1593133722316189696

  30. Big Hype, Yet…
    https://arstechnica.com/gadgets/2022/11/amazon-alexa-is-a-colossal-failure-on-pace-to-lose-10-billion-this-year/

  31. Bender et al., 2021, ACM FAccT

  33. Python NLP Ecosystem
    NLTK

  34. Discussion
    • “Let’s just use Deep Learning (TM)”
    • What if we don’t have millions of $$$?
    • Data annotation / quality: still the main issue?
    • Your Success Stories?
    • Your Horror Stories?