Natural Language Processing Expert Briefing @ PyData Global 2022

Slides for the presentation at the Expert Briefings @ PyData Global 2022

Speaker: Marco Bonzanini
https://www.twitter.com/marcobonzanini
https://marcobonzanini.com/

Marco Bonzanini

November 24, 2022

Transcript

  1. Natural Language Processing: Trends, Challenges and Opportunities
    @MarcoBonzanini • marcobonzanini.com • PyData Global 2022
  2. Agenda for Today (© Bonzanini Consulting Ltd — BonzaniniConsulting.com)
    • Quick overview of NLP and current trends
    • Round table discussion
    • Your challenges?
    • Your success stories?
  3. Nice to meet you
    • Consulting, training and coaching on Python + Data Science
    • Chair @ PyData London
  4-8. Language is Challenging
    • Language is evolving
    • Language is ambiguous
    • (Understanding) Language requires context
  9-13. We need annotated data
    • Variability: domains and languages
    • Available data: sparse + biased?
    • Annotated data is the bottleneck
    • Vincent Warmerdam on Tools to Improve Training Data: https://www.youtube.com/watch?v=KRQJDLyc1uM
  14-19. Evolution of Models
    • Bag-of-words
    • Word Embeddings (circa 2013)
    • “Traditional” ML models
    • RNN/LSTM (circa 2015)
    • Transformers (circa 2017)
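
To make the starting point of this timeline concrete, here is a minimal bag-of-words sketch using only the Python standard library. The tokenisation rule and the two example sentences are illustrative assumptions, not taken from the talk. The sketch shows that a bag-of-words representation keeps word counts but discards word order, which is one reason later models were needed to capture context.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Map a text to its token counts, ignoring word order."""
    # Assumed tokenisation: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


doc_a = bag_of_words("The dog bites the man")
doc_b = bag_of_words("The man bites the dog")

# Both sentences contain the same words the same number of times,
# so their bag-of-words representations are identical.
print(doc_a)
print(doc_a == doc_b)
```

The two sentences mean very different things, yet they compare as equal here: the representation cannot tell who bites whom.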
  20-22. Transformers
    “Attention Is All You Need” (Vaswani et al., 2017): 57K citations in November 2022
  23. Transformers
    • Parallelisation → training on bigger datasets
    • Fine-tuning on a specific task
  24-25. Bigger and Bigger Models
    • BERT (2018): 340M parameters (BERT-large)
    • GPT-2 (2019): 1.5B parameters
    • GPT-3 (2020): 175B parameters
    • Galactica (2022): 120B parameters
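
As a rough sense of scale, the following sketch estimates how much storage the raw weights of the three largest models listed would need. It assumes 2 bytes per parameter (16-bit floats) and 1 GB = 1e9 bytes; both are assumptions made here for illustration, and real deployments also need memory for activations, optimiser state and so on. The helper name `weights_size_gb` is hypothetical.

```python
def weights_size_gb(n_params, bytes_per_param=2):
    """Approximate size of the raw weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


# Parameter counts for the three largest models listed on the slide.
models = {
    "GPT-2 (2019)": 1.5e9,
    "GPT-3 (2020)": 175e9,
    "Galactica (2022)": 120e9,
}

for name, n_params in models.items():
    print(f"{name}: ~{weights_size_gb(n_params):.0f} GB of weights")
```

Under these assumptions GPT-2 fits on a single consumer GPU, while the weights of GPT-3 alone would have to be spread across many accelerators.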
  26-28. Big Hype, Yet…
    Bolukbasi et al., 2016, NIPS
    • King - man + woman = Queen
    • Doctor - man + woman = Nurse?
    • Word embeddings are not “neutral”
    Bias in the data
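
The analogy arithmetic on this slide can be sketched with a few hand-made vectors. The 2-dimensional vectors below are hypothetical toy values chosen for illustration (first axis roughly "royalty", second axis roughly "gender"); they are not real embeddings, and `analogy` is a helper name introduced here.

```python
import math

# Hypothetical toy vectors, NOT trained embeddings.
vectors = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "apple": (-1.0, 0.0),
}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    target = tuple(
        x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])
    )
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy("king", "man", "woman"))
```

With real embeddings the same arithmetic returns whatever associations the training text contains, which is how the stereotyped "Doctor - man + woman" result reported by Bolukbasi et al. arises.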
  29. Big Hype, Yet…
    https://twitter.com/Michael_J_Black/status/1593133722316189696
  30. Big Hype, Yet…
    https://arstechnica.com/gadgets/2022/11/amazon-alexa-is-a-colossal-failure-on-pace-to-lose-10-billion-this-year/
  31. Bender et al., 2021, ACM FAccT
  32-33. Python NLP Ecosystem
    NLTK
  34. Discussion
    • “Let’s just use Deep Learning (TM)”
    • What if we don’t have millions of $$$?
    • Data annotation / quality: still the main issue?
    • Your Success Stories?
    • Your Horror Stories?