Slide 1
Slide 1 text
Natural Language Processing: Trends, Challenges and Opportunities
@MarcoBonzanini
marcobonzanini.com
PyData Global 2022
Slide 2
Slide 2 text
© Bonzanini Consulting Ltd, BonzaniniConsulting.com
Agenda for Today
• Quick overview on NLP and current trends
• Round table discussion
• Your challenges?
• Your success stories?
Slide 3
Slide 3 text
Nice to meet you
• Consulting, training and coaching on Python + Data Science
• Chair @ PyData London
Slide 4
Slide 4 text
Language is Challenging
Slide 5
Slide 5 text
Language is Challenging
Slide 6
Slide 6 text
Language is Challenging
• Language is evolving
Slide 7
Slide 7 text
Language is Challenging
• Language is evolving
• Language is ambiguous
Slide 8
Slide 8 text
Language is Challenging
• Language is evolving
• Language is ambiguous
• (Understanding) Language requires context
Slide 9
Slide 9 text
We need annotated data
Slide 10
Slide 10 text
We need annotated data
• Variability: domains and languages
Slide 11
Slide 11 text
We need annotated data
• Variability: domains and languages
• Available data: sparse + biased?
Slide 12
Slide 12 text
We need annotated data
• Variability: domains and languages
• Available data: sparse + biased?
• Annotated data is the bottleneck
Slide 13
Slide 13 text
We need annotated data
• Variability: domains and languages
• Available data: sparse + biased?
• Annotated data is the bottleneck
• Vincent Warmerdam on Tools to Improve Training Data: https://www.youtube.com/watch?v=KRQJDLyc1uM
Slide 14
Slide 14 text
Evolution of Models
Slide 15
Slide 15 text
Evolution of Models
Bag-of-words
Slide 16
Slide 16 text
Evolution of Models
Bag-of-words
Word Embeddings (circa 2013)
Slide 17
Slide 17 text
Evolution of Models
Bag-of-words
Word Embeddings (circa 2013)
“Traditional” ML models
Slide 18
Slide 18 text
Evolution of Models
Bag-of-words
Word Embeddings (circa 2013)
“Traditional” ML models
RNN/LSTM (circa 2015)
Slide 19
Slide 19 text
Evolution of Models
Bag-of-words
Word Embeddings (circa 2013)
“Traditional” ML models
RNN/LSTM (circa 2015)
Transformers (circa 2017)
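Bag-of-words, the starting point of this progression, can be sketched in a few lines. This is a minimal illustration and not code from the talk: it assumes lowercase whitespace tokenisation, and the two example sentences are made up. It shows why the representation struggles with context, since both sentences get the same vector even though they mean different things.

```python
from collections import Counter


def bag_of_words(text):
    """Count lowercase, whitespace-separated tokens, ignoring word order."""
    return Counter(text.lower().split())


def vectorize(texts):
    """Map each text to a count vector over a shared, sorted vocabulary."""
    bags = [bag_of_words(t) for t in texts]
    vocab = sorted(set().union(*bags))
    # Counter returns 0 for words a text does not contain.
    return vocab, [[bag[word] for word in vocab] for bag in bags]


# Hypothetical example sentences: same words, opposite meaning.
docs = ["the dog bites the man", "the man bites the dog"]
vocab, vectors = vectorize(docs)

print(vocab)    # ['bites', 'dog', 'man', 'the']
print(vectors)  # [[1, 1, 1, 2], [1, 1, 1, 2]]
```

Because word order is discarded, the two vectors are identical. Later entries in the list (embeddings, RNN/LSTM, Transformers) each try to recover more of that lost context.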
Slide 20
Slide 20 text
Transformers
Slide 21
Slide 21 text
Transformers
Slide 22
Slide 22 text
Transformers
Attention Is All You Need (Vaswani et al., 2017)
57K citations in November 2022
Slide 23
Slide 23 text
Transformers
• Parallelisation → training on bigger datasets
• Fine-tuning on a specific task
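The parallelisation point can be seen in scaled dot-product attention, the core operation of Vaswani et al. (2017): softmax(QKᵀ / √d_k) V. The sketch below is a simplified, pure-Python illustration and not code from the talk. It assumes a single head with no masking and no learned projections, and the input values are made up. Each query is processed independently of the others, which is what allows all positions to be computed at once instead of step by step as in an RNN.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention (no masking, no projections)."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:  # no query depends on another, so this loop is parallelisable
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values))
             for j in range(len(values[0]))]
        )
    return outputs


# Hypothetical toy input: two positions, two dimensions.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
for row in attention(q, k, v):
    print(row)
```

In practice the loop is replaced by matrix multiplications on a GPU, which is why training on much larger datasets became feasible.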
Slide 24
Slide 24 text
Bigger and Bigger Models
Slide 25
Slide 25 text
Bigger and Bigger Models
• BERT-large (2018): 340M parameters
• GPT-2 (2019): 1.5B parameters
• GPT-3 (2020): 175B parameters
• Galactica (2022): 120B parameters
Slide 26
Slide 26 text
Big Hype, Yet…
Slide 27
Slide 27 text
Big Hype, Yet…
Bolukbasi et al., 2016, NIPS
Slide 28
Slide 28 text
Big Hype, Yet…
Bolukbasi et al., 2016, NIPS
• King - man + woman = Queen
• Doctor - man + woman = Nurse?
• Word embeddings are not “neutral”
Bias in the data
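The analogy arithmetic above can be sketched with toy vectors. Everything in this example is an assumption made for illustration. The three-dimensional vectors are hand-made, not trained embeddings, and the gender component on "doctor" and "nurse" is planted deliberately to mimic the kind of bias that real embeddings pick up from their training corpus. The sketch shows only the mechanism, not results from Bolukbasi et al.

```python
import math

# Hypothetical hand-made vectors: (royalty, gender, medicine).
# Real embeddings are learned from text and have hundreds of dimensions.
VECTORS = {
    "king":   [1.0,  1.0, 0.0],
    "queen":  [1.0, -1.0, 0.0],
    "man":    [0.0,  1.0, 0.0],
    "woman":  [0.0, -1.0, 0.0],
    "doctor": [0.0,  0.3, 1.0],   # planted bias: slight "male" component
    "nurse":  [0.0, -0.3, 1.0],   # planted bias: slight "female" component
}


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c):
    """Return the word closest to a - b + c, excluding the three query words."""
    target = [x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    candidates = [w for w in VECTORS if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(target, VECTORS[w]))


print(analogy("king", "man", "woman"))    # queen
print(analogy("doctor", "man", "woman"))  # nurse
```

The same arithmetic that produces the celebrated "queen" result also produces "nurse", because the vectors encode whatever associations the data contained.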
Slide 29
Slide 29 text
Big Hype, Yet…
https://twitter.com/Michael_J_Black/status/1593133722316189696
Slide 30
Slide 30 text
Big Hype, Yet…
https://arstechnica.com/gadgets/2022/11/amazon-alexa-is-a-colossal-failure-on-pace-to-lose-10-billion-this-year/
Slide 31
Slide 31 text
Bender et al., 2021, ACM FAccT
Slide 32
Slide 32 text
Python NLP Ecosystem
Slide 33
Slide 33 text
Python NLP Ecosystem
NLTK
Slide 34
Slide 34 text
Discussion
• “Let’s just use Deep Learning (TM)”
• What if we don’t have millions of $$$?
• Data annotation / quality: still the main issue?
• Your Success Stories?
• Your Horror Stories?