Slide 1

Slide 1 text

Natural Language Processing
Trends, Challenges and Opportunities
@MarcoBonzanini
marcobonzanini.com
PyData Global 2022

Slide 2

Slide 2 text

© Bonzanini Consulting Ltd — BonzaniniConsulting.com
Agenda for Today
• Quick overview of NLP and current trends
• Round table discussion
• Your challenges?
• Your success stories?

Slide 3

Slide 3 text

Nice to meet you
• Consulting, training and coaching on Python + Data Science
• Chair @ PyData London

Slide 4

Slide 4 text

Language is Challenging

Slide 5

Slide 5 text

Language is Challenging

Slide 6

Slide 6 text

Language is Challenging
• Language is evolving

Slide 7

Slide 7 text

Language is Challenging
• Language is evolving
• Language is ambiguous

Slide 8

Slide 8 text

Language is Challenging
• Language is evolving
• Language is ambiguous
• (Understanding) Language requires context

Slide 9

Slide 9 text

We need annotated data

Slide 10

Slide 10 text

We need annotated data
• Variability: domains and languages

Slide 11

Slide 11 text

We need annotated data
• Variability: domains and languages
• Available data: sparse + biased?

Slide 12

Slide 12 text

We need annotated data
• Variability: domains and languages
• Available data: sparse + biased?
• Annotated data is the bottleneck

Slide 13

Slide 13 text

We need annotated data
• Variability: domains and languages
• Available data: sparse + biased?
• Annotated data is the bottleneck
• Vincent Warmerdam on Tools to Improve Training Data: https://www.youtube.com/watch?v=KRQJDLyc1uM

Slide 14

Slide 14 text

Evolution of Models

Slide 15

Slide 15 text

Evolution of Models
Bag-of-words
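To make the bag-of-words idea concrete, here is a minimal sketch in plain Python. It is not taken from the talk: the function name `bag_of_words`, the whitespace tokenisation, and the two example sentences are assumptions chosen for illustration. Each document becomes a vector of word counts over a shared vocabulary.

```python
from collections import Counter


def bag_of_words(documents):
    """Turn each document into a vector of word counts (hypothetical helper)."""
    # Naive tokenisation: lowercase and split on whitespace.
    tokenised = [doc.lower().split() for doc in documents]
    # The vocabulary is the sorted set of all tokens seen in any document.
    vocabulary = sorted({tok for doc in tokenised for tok in doc})
    vectors = []
    for doc in tokenised:
        counts = Counter(doc)
        vectors.append([counts[tok] for tok in vocabulary])
    return vocabulary, vectors


# Two made-up sentences with the same words in a different order.
docs = ["the dog bites the man", "the man bites the dog"]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)
print(vectors)
```

Both sentences map to the same vector because the representation discards word order, which is one reason later approaches try to capture more context.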

Slide 16

Slide 16 text

Evolution of Models
Bag-of-words
Word Embeddings (circa 2013)

Slide 17

Slide 17 text

Evolution of Models
Bag-of-words
Word Embeddings (circa 2013)
“Traditional” ML models

Slide 18

Slide 18 text

Evolution of Models
Bag-of-words
Word Embeddings (circa 2013)
“Traditional” ML models
RNN/LSTM (circa 2015)

Slide 19

Slide 19 text

Evolution of Models
Bag-of-words
Word Embeddings (circa 2013)
“Traditional” ML models
RNN/LSTM (circa 2015)
Transformers (circa 2017)

Slide 20

Slide 20 text

Transformers

Slide 21

Slide 21 text

Transformers

Slide 22

Slide 22 text

Transformers
Attention Is All You Need (Vaswani et al., 2017)
57K citations in November 2022

Slide 23

Slide 23 text

Transformers
• Parallelisation → training on bigger datasets
• Fine-tuning on specific tasks
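The parallelisation point can be illustrated with a small sketch of scaled dot-product attention, softmax(QKᵀ / √d_k) V, the core operation in Vaswani et al. (2017). This is a simplified, pure-Python version written for illustration only: it has a single head, no learned projections and no masking, and the token vectors are made-up numbers. The thing to notice is that the output for each position depends only on the inputs, not on the output of the previous position, so all positions could be computed at the same time (unlike an RNN, which processes tokens one after another).

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention (illustrative sketch)."""
    d_k = len(keys[0])
    outputs = []
    # Each loop iteration is independent of the others: in a real
    # implementation this loop is a single matrix multiplication.
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


# Toy sequence of three 2-dimensional token vectors (hypothetical values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
result = attention(tokens, tokens, tokens)
for row in result:
    print(row)
```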

Slide 24

Slide 24 text

Bigger and Bigger Models

Slide 25

Slide 25 text

Bigger and Bigger Models
• BERT (2018): 340M parameters
• GPT-2 (2019): 1.5B parameters
• GPT-3 (2020): 175B parameters
• Galactica (2022): 120B parameters
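As a back-of-the-envelope illustration of what these sizes mean in practice, the sketch below estimates the memory needed just to hold the weights. The helper `weights_memory_gb` is hypothetical, and the default of 2 bytes per parameter (16-bit floats) is an assumption; activations, optimiser state and any training overhead are ignored, so real requirements are higher.

```python
def weights_memory_gb(num_parameters, bytes_per_parameter=2):
    """Rough size of the model weights in GB (1 GB = 1e9 bytes).

    Assumes every parameter is stored with the same width; 2 bytes
    corresponds to 16-bit floats, 4 bytes to 32-bit floats.
    """
    return num_parameters * bytes_per_parameter / 1e9


# Parameter counts as listed on the slide.
gpt2 = weights_memory_gb(1_500_000_000)
gpt3 = weights_memory_gb(175_000_000_000)
print(f"GPT-2 weights: ~{gpt2:.0f} GB")
print(f"GPT-3 weights: ~{gpt3:.0f} GB")
```

Under these assumptions the GPT-2 weights fit on a single consumer GPU, while the GPT-3 weights alone would need to be spread across many accelerators.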

Slide 26

Slide 26 text

Big Hype, Yet…

Slide 27

Slide 27 text

Big Hype, Yet…
Bolukbasi et al., 2016 NIPS

Slide 28

Slide 28 text

Big Hype, Yet…
Bolukbasi et al., 2016 NIPS
• King - man + woman = Queen
• Doctor - man + woman = Nurse?
• Word embeddings are not “neutral”
Bias in the data
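The analogy arithmetic above can be sketched with a toy example. The vectors below are hand-made, not learned from any corpus: the three dimensions are labelled royalty, gender and medicine purely for illustration, and the small gender component on "doctor" and "nurse" is deliberately planted to mimic an association a model might pick up from biased text. The `analogy` helper is hypothetical and simply returns the word whose vector has the highest cosine similarity to a - b + c.

```python
import math

# Hypothetical 3-d embeddings: [royalty, gender, medicine].
# The +/-0.3 gender component on doctor/nurse is planted on purpose.
EMBEDDINGS = {
    "king":   [1.0,  1.0, 0.0],
    "queen":  [1.0, -1.0, 0.0],
    "man":    [0.0,  1.0, 0.0],
    "woman":  [0.0, -1.0, 0.0],
    "doctor": [0.0,  0.3, 2.0],
    "nurse":  [0.0, -0.3, 2.0],
}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c):
    """Return the word closest to a - b + c, excluding the three inputs."""
    target = [
        x - y + z
        for x, y, z in zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])
    ]
    candidates = [w for w in EMBEDDINGS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, EMBEDDINGS[w]))


print(analogy("king", "man", "woman"))
print(analogy("doctor", "man", "woman"))
```

In this rigged setup the same arithmetic that recovers "queen" also maps "doctor" to "nurse", because the vectors encode a gender association for the two professions. With real embeddings, any such association comes from the training data rather than from hand-set numbers.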

Slide 29

Slide 29 text

Big Hype, Yet…
https://twitter.com/Michael_J_Black/status/1593133722316189696

Slide 30

Slide 30 text

Big Hype, Yet…
https://arstechnica.com/gadgets/2022/11/amazon-alexa-is-a-colossal-failure-on-pace-to-lose-10-billion-this-year/

Slide 31

Slide 31 text

Bender et al., 2021, ACM FAccT

Slide 32

Slide 32 text

Python NLP Ecosystem

Slide 33

Slide 33 text

Python NLP Ecosystem
NLTK

Slide 34

Slide 34 text

Discussion
• “Let’s just use Deep Learning (TM)”
• What if we don’t have millions of $$$?
• Data annotation / quality: still the main issue?
• Your Success Stories?
• Your Horror Stories?