Slide 1

Slide 1 text

Can my computer make jokes? by Diogo Pinto 2018-11-28

Slide 2

Slide 2 text

Can my computer make jokes? by Diogo Pinto 2018-11-28

Slide 3

Slide 3 text

About me • Trained as Software Engineer • Grown into a Data Scientist generalist • Other interests • Blockchain • Distributed systems • (Cyber- and non-cyber-) Security • Psychology and Philosophy • Kung Fu and Parkour practitioner #nofilters @diogojapinto /in/diogojapinto/ [email protected]

Slide 4

Slide 4 text

@diogojapinto /in/diogojapinto/ [email protected] About me • Trained as Software Engineer • Grown into a Data Scientist generalist • Other interests • Blockchain • Distributed systems • (Cyber- and non-cyber-) Security • Psychology and Philosophy • Kung Fu and Parkour practitioner #nofilters

Slide 5

Slide 5 text

@diogojapinto /in/diogojapinto/ [email protected] About me • Trained as Software Engineer • Grown into a Data Scientist generalist • Other interests • Blockchain • Distributed systems • (Cyber- and non-cyber-) Security • Psychology and Philosophy • Kung Fu and Parkour practitioner #nofilters

Slide 6

Slide 6 text

The plan for today What is the role of Humor? How can a computer use words? Does my computer have a better sense of humor than I do? What are the takeaways?

Slide 7

Slide 7 text

“Know your audience, Luke”

Slide 8

Slide 8 text

“Know your audience, Luke” • Who here is aware of recent Machine Learning achievements?

Slide 9

Slide 9 text

“Know your audience, Luke” • Who here is aware of recent Machine Learning achievements? • Who here has some intuition about the inner workings of Neural Networks (aka Differentiable Programming)?

Slide 10

Slide 10 text

“Know your audience, Luke” • Who here is aware of recent Machine Learning achievements? • Who here has some intuition about the inner workings of Neural Networks (aka Differentiable Programming)? • Who here has a dark sense of humor?

Slide 11

Slide 11 text

What is the role of Humor? “A day without laughter is a day wasted” Charlie Chaplin

Slide 12

Slide 12 text

History of Humor [1] [1] The First Joke: Exploring the Evolutionary Origins of Humor

Slide 13

Slide 13 text

History of Humor [1] • Humor is complex and a high cognitive function [1] The First Joke: Exploring the Evolutionary Origins of Humor

Slide 14

Slide 14 text

History of Humor [1] • Humor is complex and a high cognitive function ● Nuanced verbal phrasing + Prevailing social dynamics [1] The First Joke: Exploring the Evolutionary Origins of Humor

Slide 15

Slide 15 text

History of Humor [1] • Humor is complex and a high cognitive function ● Nuanced verbal phrasing + Prevailing social dynamics ● Can combine language skills, theory-of-mind, symbolism, abstract thinking, social perception… [1] The First Joke: Exploring the Evolutionary Origins of Humor

Slide 16

Slide 16 text

History of Humor [1] • Humor is complex and a high cognitive function ● Nuanced verbal phrasing + Prevailing social dynamics ● Can combine language skills, theory-of-mind, symbolism, abstract thinking, social perception… • The basic ability seems “instinctive” [1] The First Joke: Exploring the Evolutionary Origins of Humor

Slide 17

Slide 17 text

History of Humor [1] • Humor is complex and a high cognitive function ● Nuanced verbal phrasing + Prevailing social dynamics ● Can combine language skills, theory-of-mind, symbolism, abstract thinking, social perception… • The basic ability seems “instinctive” ● Ubiquitous and Universal [1] The First Joke: Exploring the Evolutionary Origins of Humor

Slide 18

Slide 18 text

History of Humor [1] • Humor is complex and a high cognitive function ● Nuanced verbal phrasing + Prevailing social dynamics ● Can combine language skills, theory-of-mind, symbolism, abstract thinking, social perception… • The basic ability seems “instinctive” ● Ubiquitous and Universal ● It is probably coded somehow in our genetic code ● People laugh without appreciation for the causal factors [1] The First Joke: Exploring the Evolutionary Origins of Humor

Slide 19

Slide 19 text

History of Humor [1] • Humor is complex and a high cognitive function ● Nuanced verbal phrasing + Prevailing social dynamics ● Can combine language skills, theory-of-mind, symbolism, abstract thinking, social perception… • The basic ability seems “instinctive” ● Ubiquitous and Universal ● It is probably coded somehow in our genetic code ● People laugh without appreciation for the causal factors • Humor dates back thousands of years [1] The First Joke: Exploring the Evolutionary Origins of Humor

Slide 20

Slide 20 text

History of Humor [1] • Humor is complex and a high cognitive function ● Nuanced verbal phrasing + Prevailing social dynamics ● Can combine language skills, theory-of-mind, symbolism, abstract thinking, social perception… • The basic ability seems “instinctive” ● Ubiquitous and Universal ● It is probably coded somehow in our genetic code ● People laugh without appreciation for the causal factors • Humor dates back thousands of years ● Greek “laughing philosopher” Democritus [1] The First Joke: Exploring the Evolutionary Origins of Humor

Slide 21

Slide 21 text

History of Humor [1] • Humor is complex and a high cognitive function ● Nuanced verbal phrasing + Prevailing social dynamics ● Can combine language skills, theory-of-mind, symbolism, abstract thinking, social perception… • The basic ability seems “instinctive” ● Ubiquitous and Universal ● It is probably coded somehow in our genetic code ● People laugh without appreciation for the causal factors • Humor dates back thousands of years ● Greek “laughing philosopher” Democritus ● Humorous conversations observed among Australian Aboriginals ● Lived genetically isolated for at least 35,000 years [1] The First Joke: Exploring the Evolutionary Origins of Humor

Slide 22

Slide 22 text

Why Humor is relevant for Machine Learning [1] • Humor is complex and a high cognitive function ● Nuanced verbal phrasing + Prevailing social dynamics ● Can combine language skills, theory-of-mind, symbolism, abstract thinking, social perception… • The basic ability seems “instinctive” ● Ubiquitous and Universal ● It is probably coded somehow in our genetic code ● People laugh without appreciation for the causal factors • Humor dates back thousands of years ● Greek “laughing philosopher” Democritus ● Humorous conversations observed among Australian Aboriginals ● Lived genetically isolated for at least 35,000 years [1] The First Joke: Exploring the Evolutionary Origins of Humor

Slide 23

Slide 23 text

Let’s look at the data

Slide 24

Slide 24 text

Let’s look at the data • Dataset of short jokes from Kaggle user avmoudgil95

Slide 25

Slide 25 text

Let’s look at the data • Dataset of short jokes from Kaggle user avmoudgil95 • E.g.: ● It's crazy how my ex was so upset about losing me that he had to build a life with a new woman.

Slide 26

Slide 26 text

Let’s look at the data • Dataset of short jokes from Kaggle user avmoudgil95 • E.g.: ● It's crazy how my ex was so upset about losing me that he had to build a life with a new woman.

Slide 27

Slide 27 text

Let’s look at the data • Dataset of short jokes from Kaggle user avmoudgil95 • E.g.: ● It's crazy how my ex was so upset about losing me that he had to build a life with a new woman. ● Where does Noah keep his bees? In the Ark Hives

Slide 28

Slide 28 text

Let’s look at the data • Dataset of short jokes from Kaggle user avmoudgil95 • E.g.: ● It's crazy how my ex was so upset about losing me that he had to build a life with a new woman. ● Where does Noah keep his bees? In the Ark Hives

Slide 29

Slide 29 text

Let’s look at the data • Dataset of short jokes from Kaggle user avmoudgil95 • E.g.: ● It's crazy how my ex was so upset about losing me that he had to build a life with a new woman. ● Where does Noah keep his bees? In the Ark Hives ● What sex position produces the ugliest children? Ask your mother.

Slide 30

Slide 30 text

Let’s look at the data • Dataset of short jokes from Kaggle user avmoudgil95 • E.g.: ● It's crazy how my ex was so upset about losing me that he had to build a life with a new woman. ● Where does Noah keep his bees? In the Ark Hives ● What sex position produces the ugliest children? Ask your mother.

Slide 31

Slide 31 text

Let’s look at the data • Dataset of short jokes from Kaggle user avmoudgil95 • E.g.: ● It's crazy how my ex was so upset about losing me that he had to build a life with a new woman. ● Where does Noah keep his bees? In the Ark Hives ● What sex position produces the ugliest children? Ask your mother. ● Chuck Norris doesn't have blood. He is filled with magma.

Slide 32

Slide 32 text

Let’s look at the data • Dataset of short jokes from Kaggle user avmoudgil95 • E.g.: ● It's crazy how my ex was so upset about losing me that he had to build a life with a new woman. ● Where does Noah keep his bees? In the Ark Hives ● What sex position produces the ugliest children? Ask your mother. ● Chuck Norris doesn't have blood. He is filled with magma.

Slide 33

Slide 33 text

Let’s look at the data • Dataset of short jokes from Kaggle user avmoudgil95 • E.g.: ● It's crazy how my ex was so upset about losing me that he had to build a life with a new woman. ● Where does Noah keep his bees? In the Ark Hives ● What sex position produces the ugliest children? Ask your mother. ● Chuck Norris doesn't have blood. He is filled with magma. • Length: (distribution of joke lengths shown as a chart on the slide)
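As a rough sketch of how the dataset might be loaded (the file name shortjokes.csv and the Joke column are assumptions based on the Kaggle download, not shown in the slides):

>>> import pandas as pd
>>> df = pd.read_csv('shortjokes.csv')   # assumed file name from the Kaggle dataset
>>> jokes = df['Joke'].tolist()          # assumed column holding the joke text
>>> jokes[20]                            # the list reused in the code slides later on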

Slide 34

Slide 34 text

How can a computer use words? “A picture may be worth a thousand words, but well-chosen words will take you where pictures never can” Unknown

Slide 35

Slide 35 text

Why does representation matter?

Slide 36

Slide 36 text

Why does representation matter? Sender country risky “fee” count Spam Detector Machine Learning Model Words count

Slide 37

Slide 37 text

Why does representation matter? Sender country risky “fee” count Spam Detector Machine Learning Model Words count Sender country risky “fee” count Words count Spam? Email 1 True 0 203 True Email 2 False 2 345 False Email 3 True 10 180 True

Slide 38

Slide 38 text

Why does representation matter? Sender country risky “fee” count Spam Detector Machine Learning Model Words count Sender country risky “fee” count Words count “prince” count Spam? Email 1 True 0 203 5 True Email 2 False 2 345 0 False Email 3 True 10 180 3 True

Slide 39

Slide 39 text

Why does representation matter? Sender country risky “fee” count Spam Detector Machine Learning Model Words count Sender country risky “fee” count Words count Spam? Email 1 True 0 203 True Email 2 False 2 345 False Email 3 True 10 180 True

Slide 40

Slide 40 text

Why does representation matter? Sender country risky “fee” count Spam Detector Machine Learning Model Words count Sender country risky “fee” count Spam? Email 1 True 0 True Email 2 False 2 False Email 3 True 10 True

Slide 41

Slide 41 text

Why does representation matter? Sender country risky “fee” count Spam Detector Machine Learning Model Words count Sender country risky “fee” count Words count Spam? Email 1 True 0 203 True Email 2 False 2 345 False Email 3 True 10 180 True Data entries should be comparable and consistent
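For illustration only, the feature table above could look like this in code (the column names are mine, and the classifier choice is arbitrary):

>>> import pandas as pd
>>> from sklearn.linear_model import LogisticRegression
>>> emails = pd.DataFrame({
...     'sender_country_risky': [True, False, True],
...     'fee_count':            [0, 2, 10],
...     'words_count':          [203, 345, 180],
...     'spam':                 [True, False, True]})
>>> spam_detector = LogisticRegression()
>>> spam_detector.fit(emails.drop(columns='spam'), emails['spam'])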

Slide 42

Slide 42 text

The problem of Words Representation

Slide 43

Slide 43 text

The problem of Words Representation • An array of characters, each one a byte • Example: “I like beer a lot” “I own a lot of wine” • Representation: 49206c696b6520626565722061206c6f74 49206f776e2061206c6f74206f662077696e65

Slide 44

Slide 44 text

The problem of Words Representation • An array of characters, each one a byte ● Variable length ● Difficult comparison between entries • Example: “I like beer a lot” “I own a lot of wine” • Representation: 49206c696b6520626565722061206c6f74 49206f776e2061206c6f74206f662077696e65
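A quick way to reproduce the byte-level representation shown above (assuming plain ASCII text):

>>> "I like beer a lot".encode('ascii').hex()
'49206c696b6520626565722061206c6f74'
>>> "I own a lot of wine".encode('ascii').hex()
'49206f776e2061206c6f74206f662077696e65'
>>> len("I like beer a lot"), len("I own a lot of wine")   # variable length
(17, 19)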

Slide 45

Slide 45 text

The problem of Words Representation • An array of characters, each one a byte ● Variable length ● Difficult comparison between entries • Bag-of-Words • Example: “I like beer a lot” “I own a lot of wine” • Representation: I like beer a lot own of wine 1 1 1 1 1 0 0 0 1 0 0 1 1 1 1 1

Slide 46

Slide 46 text

The problem of Words Representation • An array of characters, each one a byte ● Variable length ● Difficult comparison between entries • Bag-of-Words ● By discarding order we are able to generalize • Example: “I like beer a lot” “I own a lot of wine” • Representation: I like beer a lot own of wine 1 1 1 1 1 0 0 0 1 0 0 1 1 1 1 1
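A minimal sketch of the Bag-of-Words representation with scikit-learn (a library not used in the talk's own code; the token pattern is overridden so one-letter words like "I" and "a" are kept):

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> docs = ["I like beer a lot", "I own a lot of wine"]
>>> vec = CountVectorizer(token_pattern=r'\b\w+\b')
>>> bow = vec.fit_transform(docs)
>>> vec.get_feature_names()            # get_feature_names_out() in newer scikit-learn
['a', 'beer', 'i', 'like', 'lot', 'of', 'own', 'wine']
>>> bow.toarray()
array([[1, 1, 1, 1, 1, 0, 0, 0],
       [1, 0, 1, 0, 1, 1, 1, 1]])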

Slide 47

Slide 47 text

The problem of Words Representation • An array of characters, each one a byte ● Variable length ● Difficult comparison between entries • Bag-of-Words ● By discarding order we are able to generalize • Term Frequency – Inverse Document Frequency • Example: “I like beer a lot” “I own a lot of wine” • Representation: I like beer a lot own of wine 0.1 0.3 0.6 0.1 0.4 0 0 0 0.1 0 0 0.1 0.4 0.5 0.1 0.7

Slide 48

Slide 48 text

The problem of Words Representation • An array of characters, each one a byte ● Variable length ● Difficult comparison between entries • Bag-of-Words ● By discarding order we are able to generalize • Term Frequency – Inverse Document Frequency ● Words frequency in document • Example: “I like beer a lot” “I own a lot of wine” • Representation: I like beer a lot own of wine 0.1 0.3 0.6 0.1 0.4 0 0 0 0.1 0 0 0.1 0.4 0.5 0.1 0.7

Slide 49

Slide 49 text

The problem of Words Representation • An array of characters, each one a byte ● Variable length ● Difficult comparison between entries • Bag-of-Words ● By discarding order we are able to generalize • Term Frequency – Inverse Document Frequency ● Words frequency in document ● Rarity of words across documents • Example: “I like beer a lot” “I own a lot of wine” • Representation: I like beer a lot own of wine 0.1 0.3 0.6 0.1 0.4 0 0 0 0.1 0 0 0.1 0.4 0.5 0.1 0.7

Slide 50

Slide 50 text

The problem of Words Representation • An array of characters, each one a byte ● Variable length ● Difficult comparison between entries • Bag-of-Words ● By discarding order we are able to generalize • Term Frequency – Inverse Document Frequency ● Words frequency in document ● Rarity of words across documents • Example: “I like beer a lot” “I own a lot of wine” • Representation: I like beer a lot own of wine 0.1 0.3 0.6 0.1 0.4 0 0 0 0.1 0 0 0.1 0.4 0.5 0.1 0.7

Slide 51

Slide 51 text

The problem of Words Representation • An array of characters, each one a byte ● Variable length ● Difficult comparison between entries • Bag-of-Words ● By discarding order we are able to generalize • Term Frequency – Inverse Document Frequency ● Words frequency in document ● Rarity of words across documents • Example: “I like beer a lot” “I own a lot of wine” • Representation: I like beer a lot own of wine 0.1 0.3 0.6 0.1 0.4 0 0 0 0.1 0 0 0.1 0.4 0.5 0.1 0.7
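The same two sentences through TF-IDF, again as a sketch (the numbers on the slide are illustrative; scikit-learn's actual weights differ because it smooths the IDF and L2-normalizes each document):

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> docs = ["I like beer a lot", "I own a lot of wine"]
>>> tfidf = TfidfVectorizer(token_pattern=r'\b\w+\b')
>>> weights = tfidf.fit_transform(docs)
>>> # words shared by both documents ("i", "a", "lot") get a lower idf than
>>> # words unique to one document ("beer", "like", "own", "of", "wine")
>>> sorted(zip(tfidf.get_feature_names(), tfidf.idf_.round(2)))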

Slide 52

Slide 52 text

What about semantics?

Slide 53

Slide 53 text

What about semantics? Doc I like beer a lot own of wine beer wine own

Slide 54

Slide 54 text

What about semantics? • Meaning of words is lost Doc I like beer a lot own of wine beer 0 0 1 0 0 0 0 0 wine 0 0 0 0 0 0 0 1 own 0 0 0 0 0 1 0 0

Slide 55

Slide 55 text

What about semantics? • Meaning of words is lost ● Distance(wine, beer) = Distance(wine, own) wine own beer Doc I like beer a lot own of wine beer 0 0 1 0 0 0 0 0 wine 0 0 0 0 0 0 0 1 own 0 0 0 0 0 1 0 0

Slide 56

Slide 56 text

What about semantics? • Meaning of words is lost ● Distance(wine, beer) = Distance(wine, own) • Distributed representations can help

Slide 57

Slide 57 text

What about semantics? • Meaning of words is lost ● Distance(wine, beer) = Distance(wine, own) • Distributed representations can help Doc D1 D2 beer 0.2 0.7 wine 0.1 0.8 own 0.8 0.1

Slide 58

Slide 58 text

What about semantics? • Meaning of words is lost ● Distance(wine, beer) = Distance(wine, own) • Distributed representations can help Doc D1 D2 beer 0.2 0.7 wine 0.1 0.8 own 0.8 0.1 0 0,3 0,5 0,8 1 0 0,2 0,4 0,6 0,8 wine own beer

Slide 59

Slide 59 text

What about semantics? • Meaning of words is lost ● Distance(wine, beer) = Distance(wine, own) • Distributed representations can help ● Reduce the dimensionality footprint ● Semantics encoded as “proximity” Doc D1 D2 beer 0.2 0.7 wine 0.1 0.8 own 0.8 0.1 0 0,3 0,5 0,8 1 0 0,2 0,4 0,6 0,8 wine own beer
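A toy check of the "semantics as proximity" idea, using the 2-D vectors from the table above (the values are the slide's illustrative numbers, not learned embeddings):

>>> import numpy as np
>>> emb = {'beer': np.array([0.2, 0.7]),
...        'wine': np.array([0.1, 0.8]),
...        'own':  np.array([0.8, 0.1])}
>>> def dist(a, b):
...     return np.linalg.norm(emb[a] - emb[b])   # Euclidean distance in embedding space
>>> round(dist('wine', 'beer'), 3)               # close together: related meanings
0.141
>>> round(dist('wine', 'own'), 3)                # far apart: unrelated words
0.99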

Slide 60

Slide 60 text

What about semantics? • Meaning of words is lost ● Distance(wine, beer) = Distance(wine, own) • Distributed representations can help ● Reduce the dimensionality footprint ● Semantics encoded as “proximity” • Word2Vec ● Start with “random” word representations with dimension d

Slide 61

Slide 61 text

What about semantics? • Meaning of words is lost ● Distance(wine, beer) = Distance(wine, own) • Distributed representations can help ● Reduce the dimensionality footprint ● Semantics encoded as “proximity” • Word2Vec ● Start with “random” word representations with dimension d I [0.8, 0.12, …] like … beer … a … lot …

Slide 62

Slide 62 text

What about semantics? • Meaning of words is lost ● Distance(wine, beer) = Distance(wine, own) • Distributed representations can help ● Reduce the dimensionality footprint ● Semantics encoded as “proximity” • Word2Vec ● Start with “random” word representations with dimension d ● From the representation of a given word predict a randomly sampled context word

Slide 63

Slide 63 text

What about semantics? • Meaning of words is lost ● Distance(wine, beer) = Distance(wine, own) • Distributed representations can help ● Reduce the dimensionality footprint ● Semantics encoded as “proximity” • Word2Vec ● Start with “random” word representations with dimension d ● From the representation of a given word predict a randomly sampled context word I like beer a lot

Slide 64

Slide 64 text

What about semantics? • Meaning of words is lost ● Distance(wine, beer) = Distance(wine, own) • Distributed representations can help ● Reduce the dimensionality footprint ● Semantics encoded as “proximity” • Word2Vec ● Start with “random” word representations with dimension d ● From the representation of a given word predict a randomly sampled context word I like beer a lot • Training examples: ● (I, like)

Slide 65

Slide 65 text

What about semantics? • Meaning of words is lost ● Distance(wine, beer) = Distance(wine, own) • Distributed representations can help ● Reduce the dimensionality footprint ● Semantics encoded as “proximity” • Word2Vec ● Start with “random” word representations with dimension d ● From the representation of a given word predict a randomly sampled context word I like beer a lot • Training examples: ● (I, like) ● (like, I) ● (like, beer)

Slide 66

Slide 66 text

What about semantics? • Meaning of words is lost ● Distance(wine, beer) = Distance(wine, own) • Distributed representations can help ● Reduce the dimensionality footprint ● Semantics encoded as “proximity” • Word2Vec ● Start with “random” word representations with dimension d ● From the representation of a given word predict a randomly sampled context word I like beer a lot • Training examples: ● (I, like) ● (like, I) ● (like, beer) ● (beer, like) ● (beer, a)

Slide 67

Slide 67 text

What about semantics? • Meaning of words is lost ● Distance(wine, beer) = Distance(wine, own) • Distributed representations can help ● Reduce the dimensionality footprint ● Semantics encoded as “proximity” • Word2Vec ● Start with “random” word representations with dimension d ● From the representation of a given word predict a randomly sampled context word I like beer a lot • Training examples: ● (I, like) ● (like, I) ● (like, beer) ● (beer, like) ● (beer, a) ● (a, beer) ● (a, lot)

Slide 68

Slide 68 text

What about semantics? • Meaning of words is lost ● Distance(wine, beer) = Distance(wine, own) • Distributed representations can help ● Reduce the dimensionality footprint ● Semantics encoded as “proximity” • Word2Vec ● Start with “random” word representations with dimension d ● From the representation of a given word predict a randomly sampled context word I like beer a lot • Training examples: ● (I, like) ● (like, I) ● (like, beer) ● (beer, like) ● (beer, a) ● (a, beer) ● (a, lot) ● (lot, a)
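The training pairs listed above can be generated with a few lines of Python (a sketch with a fixed context window of 1; real Word2Vec samples the window size per word):

>>> def skipgram_pairs(tokens, window=1):
...     pairs = []
...     for i, center in enumerate(tokens):
...         for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
...             if j != i:
...                 pairs.append((center, tokens[j]))
...     return pairs
>>> skipgram_pairs("I like beer a lot".split(), window=1)
[('I', 'like'), ('like', 'I'), ('like', 'beer'), ('beer', 'like'), ('beer', 'a'), ('a', 'beer'), ('a', 'lot'), ('lot', 'a')]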

Slide 69

Slide 69 text

What about semantics? • Meaning of words is lost ● Distance(wine, beer) = Distance(wine, own) • Distributed representations can help ● Reduce the dimensionality footprint ● Semantics encoded as “proximity” • Word2Vec ● Start with “random” word representations with dimension d ● From the representation of a given word predict a randomly sampled context word I [0.8, 0.12, …] like … beer … a … lot …

Slide 70

Slide 70 text

What about semantics? • Meaning of words is lost ● Distance(wine, beer) = Distance(wine, own) • Distributed representations can help ● Reduce the dimensionality footprint ● Semantics encoded as “proximity” • Word2Vec ● Start with “random” word representations with dimension d ● From the representation of a given word predict a randomly sampled context word I [0.8, 0.12, …] like … beer … a … lot …

Slide 71

Slide 71 text

What about semantics? • Meaning of words is lost ● Distance(wine, beer) = Distance(wine, own) • Distributed representations can help ● Reduce the dimensionality footprint ● Semantics encoded as “proximity” • Word2Vec ● Start with “random” word representations with dimension d ● From the representation of a given word predict a randomly sampled context word I [0.8, 0.12, …] like … beer … a … lot … Shady Mathy Stuff

Slide 72

Slide 72 text

What about semantics? • Meaning of words is lost ● Distance(wine, beer) = Distance(wine, own) • Distributed representations can help ● Reduce the dimensionality footprint ● Semantics encoded as “proximity” • Word2Vec ● Start with “random” word representations with dimension d ● From the representation of a given word predict a randomly sampled context word I [0.8, 0.12, …] like … beer … a … lot … Shady Mathy Stuff Score(I) Score(like) Score(beer) … …

Slide 73

Slide 73 text

What about semantics? • Meaning of words is lost ● Distance(wine, beer) = Distance(wine, own) • Distributed representations can help ● Reduce the dimensionality footprint ● Semantics encoded as “proximity” • Word2Vec ● Start with “random” word representations with dimension d ● From the representation of a given word predict a randomly sampled context word ● Change the word representation in the direction that helps predicting the context word I [0.8, 0.12, …] like … beer … a … lot … Shady Mathy Stuff Score(I) Score(like) Score(beer) … …

Slide 74

Slide 74 text

What about semantics? • Meaning of words is lost ● Distance(wine, beer) = Distance(wine, own) • Distributed representations can help ● Reduce the dimensionality footprint ● Semantics encoded as “proximity” • Word2Vec ● Start with “random” word representations with dimension d ● From the representation of a given word predict a randomly sampled context word ● Change the word representation in the direction that helps predicting the context word I [0.7, 0.15, …] like … beer … a … lot … Shady Mathy Stuff Error(I) Error(like) Error(beer) … …
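A minimal numpy sketch of what the "Shady Mathy Stuff" box does for one training pair, assuming a full softmax over the vocabulary (real Word2Vec uses negative sampling or a hierarchical softmax for efficiency; the variable names are mine):

>>> import numpy as np
>>> np.random.seed(42)
>>> vocab = ['I', 'like', 'beer', 'a', 'lot']
>>> d = 2                                       # embedding dimension
>>> W_in = np.random.normal(size=(len(vocab), d))    # "random" word representations
>>> W_out = np.random.normal(size=(len(vocab), d))   # output-side (context) representations
>>> def step(center, context, lr=0.1):
...     c, o = vocab.index(center), vocab.index(context)
...     scores = W_out @ W_in[c]                         # Score(word) for every vocabulary word
...     probs = np.exp(scores) / np.exp(scores).sum()    # softmax over the vocabulary
...     error = probs.copy(); error[o] -= 1.0            # Error(word): prediction minus target
...     grad_in = W_out.T @ error                        # direction for the center word's vector
...     grad_out = np.outer(error, W_in[c])              # direction for the output vectors
...     W_in[c] -= lr * grad_in                          # nudge the representation so that
...     W_out[:] -= lr * grad_out                        # the context word becomes more likely
>>> step('beer', 'like')                                 # one update for the pair (beer, like)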

Slide 75

Slide 75 text

But I’m here to see Python code

Slide 76

Slide 76 text

But I’m here to see Python code

>>> jokes[20]
"Why do you never see elephants hiding in trees? 'Cause they are freaking good at it"

Slide 77

Slide 77 text

But I’m here to see Python code • Pre-Processing ● Transform into word lists

>>> import nltk
>>> nltk.download('punkt')
>>> tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
>>> jokes_sentences = tokenizer.tokenize_sents(jokes)
>>> jokes_sentences[20]
['Why do you never see elephants hiding in trees?', "'Cause they are freaking good at it"]

Slide 78

Slide 78 text

But I’m here to see Python code • Pre-Processing ● Transform into word lists

>>> import re
>>> def sentence_to_wordlist(raw):
...     clean = re.sub(r'[^a-zA-Z]', ' ', raw)
...     words = clean.split()
...     words_lower = [w.lower() for w in words]
...     return words_lower
>>> jokes_word_lists = [
...     [sentence_to_wordlist(str(s)) for s in joke if len(s) > 0]
...     for joke in jokes_sentences]
>>> jokes_word_lists[20]
[['why', 'do', 'you', 'never', 'see', 'elephants', 'hiding', 'in', 'trees'], ['cause', 'they', 'are', 'freaking', 'good', 'at', 'it']]

Slide 79

Slide 79 text

But I’m here to see Python code • Pre-Processing ● Transform into word lists ● Remove Stop Words

>>> from nltk.corpus import stopwords
>>> nltk.download('stopwords')
>>> stop_ws = set(stopwords.words('english'))
>>> jokes_non_stop_word_lists = [
...     [[word for word in s if word not in stop_ws] for s in joke]
...     for joke in jokes_word_lists]
>>> jokes_non_stop_word_lists[20]
[['never', 'see', 'elephants', 'hiding', 'trees'], ['cause', 'freaking', 'good']]

Slide 80

Slide 80 text

But I’m here to see Python code • Pre-Processing ● Transform into word lists ● Remove Stop Words ● Stemming

>>> from nltk.stem import PorterStemmer
>>> porter = PorterStemmer()
>>> jokes_stemmed = [
...     [[porter.stem(w) for w in s] for s in joke]
...     for joke in jokes_non_stop_word_lists]
>>> jokes_stemmed[20]
[['never', 'see', 'eleph', 'hide', 'tree'], ['caus', 'freak', 'good']]

Slide 81

Slide 81 text

But I’m here to see Python code • Pre-Processing ● Transform into word lists ● Remove Stop Words ● Stemming • Obtain word embeddings

>>> import multiprocessing
>>> import gensim.models.word2vec as w2v
>>> jokes_flatten = [s for j in jokes_stemmed for s in j]
>>> num_features = 300           # dimensionality of the resulting word vectors
>>> min_word_count = 3           # minimum word count threshold
>>> num_workers = multiprocessing.cpu_count()  # number of threads to run in parallel
>>> context_size = 7             # context window length
>>> downsampling = 1e-3          # downsample setting for frequent words
>>> seed = 42                    # seed for the rng, to make the results reproducible
>>> sg = 1                       # skip-gram (instead of CBOW)
>>> jokes2vec = w2v.Word2Vec(
...     sg=sg,
...     seed=seed,
...     workers=num_workers,
...     size=num_features,
...     min_count=min_word_count,
...     window=context_size,
...     sample=downsampling
... )
>>> jokes2vec.build_vocab(jokes_flatten)
>>> jokes2vec.train(jokes_flatten,
...                 total_examples=jokes2vec.corpus_count,
...                 epochs=jokes2vec.iter)

Slide 82

Slide 82 text

But I’m here to see Python code • Pre-Processing ● Transform into word lists ● Remove Stop Words ● Stemming • Obtain word embeddings • Visualize

>>> import pandas as pd
>>> import seaborn as sns
>>> sns.set(style="darkgrid")
>>> from sklearn.manifold import TSNE
>>> tsne = TSNE(n_components=2, random_state=seed)
>>> all_word_vectors_matrix = jokes2vec.wv.syn0
>>> all_word_vectors_matrix_2d = tsne.fit_transform(all_word_vectors_matrix)
>>> points = pd.DataFrame(
...     [
...         (word, coords[0], coords[1])
...         for word, coords in [
...             (word, all_word_vectors_matrix_2d[jokes2vec.wv.vocab[word].index])
...             for word in jokes2vec.wv.vocab
...         ]
...     ],
...     columns=['word', 'x', 'y']
... )
>>> points.plot.scatter('x', 'y', s=10, figsize=(20, 12), alpha=0.6)

Slide 83

Slide 83 text

But I’m here to see Python code • Pre-Processing ● Transform into word lists ● Remove Stop Words ● Stemming • Obtain word embeddings • Visualize

Slide 84

Slide 84 text

But I’m here to see Python code • Pre-Processing ● Transform into word lists ● Remove Stop Words ● Stemming • Obtain word embeddings • Visualize

Slide 85

Slide 85 text

But I’m here to see Python code • Pre-Processing ● Transform into word lists ● Remove Stop Words ● Stemming • Obtain word embeddings • Visualize

Slide 86

Slide 86 text

But I’m here to see Python code • Pre-Processing ● Transform into word lists ● Remove Stop Words ● Stemming • Obtain word embeddings • Visualize

Slide 87

Slide 87 text

But I’m here to see Python code • Pre-Processing ● Transform into word lists ● Remove Stop Words ● Stemming • Obtain word embeddings • Visualize

Slide 88

Slide 88 text

But I’m here to see Python code • Pre-Processing ● Transform into word lists ● Remove Stop Words ● Stemming • Obtain word embeddings • Visualize

Slide 89

Slide 89 text

But I’m here to see Python code • Pre-Processing ● Transform into word lists ● Remove Stop Words ● Stemming • Obtain word embeddings • Visualize

Slide 90

Slide 90 text

But I’m here to see Python code • Pre-Processing ● Transform into word lists ● Remove Stop Words ● Stemming • Obtain word embeddings • Visualize • Embeddings “algebra”

Slide 91

Slide 91 text

But I’m here to see Python code • Pre-Processing ● Transform into word lists ● Remove Stop Words ● Stemming • Obtain word embeddings • Visualize • Embeddings “algebra” ● Semantic similarity

>>> jokes2vec.most_similar('facebook')
[('fb', 0.7791515588760376),
 ('unfriend', 0.7512669563293457),
 ('status', 0.7433165907859802),
 ('myspac', 0.7160271406173706),
 ('notif', 0.6782281398773193),
 ('retweet', 0.6745551824569702),
 ('timelin', 0.672653079032898),
 ('twitter', 0.6709973812103271),
 ('privaci', 0.6695473194122314),
 ('linkedin', 0.6655823588371277)]

Slide 92

Slide 92 text

But I’m here to see Python code • Pre-Processing ● Transform into word lists ● Remove Stop Words ● Stemming • Obtain word embeddings • Visualize • Embeddings “algebra” ● Semantic similarity

>>> jokes2vec.most_similar('lol')
[('lmao', 0.7619101405143738),
 ('tho', 0.7015952467918396),
 ('haha', 0.6999001502990723),
 ('hahaha', 0.6714984178543091),
 ('omg', 0.6711198091506958),
 ('pl', 0.6587743163108826),
 ('bc', 0.6558701992034912),
 ('gona', 0.6529208421707153),
 ('ppl', 0.6476595401763916),
 ('yea', 0.6466178894042969)]

Slide 93

Slide 93 text

But I’m here to see Python code • Pre-Processing ● Transform into word lists ● Remove Stop Words ● Stemming • Obtain word embeddings • Visualize • Embeddings “algebra” ● Semantic similarity

>>> jokes2vec.most_similar('sex')
[('anal', 0.5785183906555176),
 ('unprotect', 0.5359092950820923),
 ('foreplay', 0.5343884825706482),
 ('brussel', 0.5324864387512207),
 ('foursom', 0.5289731025695801),
 ('twosom', 0.5187283158302307),
 ('threesom', 0.5119856595993042),
 ('geneticist', 0.5064876079559326),
 ('intercours', 0.5030955076217651),
 ('oral', 0.5015446543693542)]

Slide 94

Slide 94 text

But I’m here to see Python code • Pre-Processing ● Transform into word lists ● Remove Stop Words ● Stemming • Obtain word embeddings • Visualize • Embeddings “algebra” ● Semantic similarity ● Linear relationships

>>> def nearest_similarity_cosmul(start1, end1, end2):
...     similarities = jokes2vec.most_similar_cosmul(
...         positive=[end2, start1],
...         negative=[end1]
...     )
...     start2 = similarities[0][0]
...     print('{start1} is related to {end1}, as {start2} is related to {end2}'.format(**locals()))
>>> nearest_similarity_cosmul('dude', 'man', 'woman')
dude is related to man, as chick is related to woman

Slide 95

Slide 95 text

Does my computer have a better sense of humor than I do? “Comparing yourself to others is an act of violence against the self” Iyanla Vanzant

Slide 96

Slide 96 text

Traditional Neural Networks

Slide 97

Slide 97 text

Traditional Neural Networks • Input and output lengths are mostly pre-determined and static

Slide 98

Slide 98 text

Traditional Neural Networks ML Model Input Output • Input and output lengths are mostly pre-determined and static

Slide 99

Slide 99 text

Traditional Neural Networks • Input and output lengths are mostly pre-determined and static ● Weak for modelling sequences ML Model Input Output

Slide 100

Slide 100 text

Traditional Neural Networks • Input and output lengths are mostly pre-determined and static ● Weak for modelling sequences • Recurrent Neural Networks

Slide 101

Slide 101 text

Traditional Neural Networks • Input and output lengths are mostly pre-determined and static ● Weak for modelling sequences • Recurrent Neural Networks ● Enable the model to keep a memory between executions

Slide 102

Slide 102 text

Traditional Neural Networks • Input and output lengths are mostly pre-determined and static ● Weak for modelling sequences • Recurrent Neural Networks ● Enable the model to keep a memory between executions ML Model Input Output

Slide 103

Slide 103 text

Traditional Neural Networks • Input and output lengths are mostly pre-determined and static ● Weak for modelling sequences • Recurrent Neural Networks ● Enable the model to keep a memory between executions ML Model Input Output

Slide 104

Slide 104 text

Traditional Neural Networks • Input and output lengths are mostly pre-determined and static ● Weak for modelling sequences • Recurrent Neural Networks ● Enable the model to keep a memory between executions

Slide 105

Slide 105 text

Traditional Neural Networks • Input and output lengths are mostly pre-determined and static ● Weak for modelling sequences • Recurrent Neural Networks ● Enable the model to keep a memory between executions ML Model Input 1 Output 1

Slide 106

Slide 106 text

Traditional Neural Networks • Input and output lengths are mostly pre-determined and static ● Weak for modelling sequences • Recurrent Neural Networks ● Enable the model to keep a memory between executions ML Model Input 1 Output 1

Slide 107

Slide 107 text

Traditional Neural Networks • Input and output lengths are mostly pre-determined and static ● Weak for modelling sequences • Recurrent Neural Networks ● Enable the model to keep a memory between executions ML Model Input 1 Output 1 ML Model Input 2 Output 2

Slide 108

Slide 108 text

Traditional Neural Networks • Input and output lengths are mostly pre-determined and static ● Weak for modelling sequences • Recurrent Neural Networks ● Enable the model to keep a memory between executions ML Model Input 1 Output 1 ML Model Input 2 Output 2 ML Model Input 3 Output 3

Slide 109

Slide 109 text

Traditional Neural Networks • Input and output lengths are mostly pre-determined and static ● Weak for modelling sequences • Recurrent Neural Networks ● Enable the model to keep a memory between executions ML Model Ø I ML Model I like ML Model like beer

Slide 110

Slide 110 text

Traditional Neural Networks • Input and output lengths are mostly pre-determined and static ● Weak for modelling sequences • Recurrent Neural Networks ● Enable the model to keep a memory between executions ML Model ML Model ML Model • RNNs are very versatile

Slide 111

Slide 111 text

Traditional Neural Networks • Input and output lengths are mostly pre-determined and static ● Weak for modelling sequences • Recurrent Neural Networks ● Enable the model to keep a memory between executions ML Model Input 1 Output 1 ML Model Input 2 Output 2 ML Model Input 3 Output 3 • RNNs are very versatile ● Many to many

Slide 112

Slide 112 text

Traditional Neural Networks • Input and output lengths are mostly pre-determined and static ● Weak for modelling sequences • Recurrent Neural Networks ● Enable the model to keep a memory between executions ML Model Input 1 ML Model Input 2 ML Model Input 3 Output 1 • RNNs are very versatile ● Many to many ● Many to one

Slide 113

Slide 113 text

Traditional Neural Networks • Input and output lengths are mostly pre-determined and static ● Weak for modelling sequences • Recurrent Neural Networks ● Enable the model to keep a memory between executions ML Model Input 1 Output 1 ML Model Output 2 ML Model Output 3 • RNNs are very versatile ● Many to many ● Many to one ● One to many
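As a sketch of how the two wiring patterns look in Keras (the library used later in the talk; the 300-dimensional input size is an assumption matching the word-vector dimensionality used earlier):

>>> from keras.models import Sequential
>>> from keras.layers import LSTM, Dense, TimeDistributed
>>> # many to one: read a whole sequence, emit a single output
>>> many_to_one = Sequential([
...     LSTM(64, input_shape=(None, 300)),
...     Dense(1, activation='sigmoid')])
>>> # many to many: emit one output per time step
>>> many_to_many = Sequential([
...     LSTM(64, input_shape=(None, 300), return_sequences=True),
...     TimeDistributed(Dense(1, activation='sigmoid'))])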

Slide 114

Slide 114 text

AI-Powered Jokes

Slide 115

Slide 115 text

AI-Powered Jokes • How? Generate character by character

Slide 116

Slide 116 text

AI-Powered Jokes ML Model b ML Model e • How? Generate character by character

Slide 117

Slide 117 text

AI-Powered Jokes ML Model b ML Model e e • How? Generate character by character

Slide 118

Slide 118 text

AI-Powered Jokes ML Model b ML Model e ML Model e • How? Generate character by character

Slide 119

Slide 119 text

AI-Powered Jokes • How? Generate character by character ML Model b ML Model e ML Model e r

Slide 120

Slide 120 text

AI-Powered Jokes • How? Generate character by character ● A seed provides context for the network ML Model b ML Model e ML Model e r

Slide 121

Slide 121 text

Stop poking my brain and show me code!!! • Pre-Processing

Slide 122

Slide 122 text

Stop poking my brain and show me code!!! • Pre-Processing ● Character ↔ Integer Index

>>> jokes_concat = ''.join(jokes)
>>> chars = sorted(list(set(jokes_concat)))
>>> idx2char = chars
>>> char2idx = {c: i for i, c in enumerate(idx2char)}

Slide 123

Slide 123 text

Stop poking my brain and show me code!!! • Pre-Processing ● Character ↔ Integer Index ● Training set preparation

>>> import math
>>> import random
>>> jokes_sample = random.sample(jokes, math.floor(len(jokes) * 0.5))
>>> maxlen = 40
>>> step = 3
>>> sentences = []
>>> next_chars = []
>>> for joke in jokes_sample:
...     if len(joke) < (maxlen + 1):
...         continue
...     for i in range(0, len(joke) - maxlen, step):
...         sentences.append(joke[i:i+maxlen])
...         next_chars.append(joke[i+maxlen])

Slide 124

Slide 124 text

Stop poking my brain and show me code!!! • Pre-Processing ● Character ↔ Integer Index ● Training set preparation ● Convert into integers

>>> import numpy as np
>>> x = np.zeros((len(sentences), maxlen, len(chars)), dtype=np.bool)
>>> y = np.zeros((len(sentences), len(chars)), dtype=np.bool)
>>> for i, sentence in enumerate(sentences):
...     for j, char in enumerate(sentence):
...         x[i, j, char2idx[char]] = 1.
...     y[i, char2idx[next_chars[i]]] = 1.
>>> idx = np.random.permutation(len(x))
>>> x = x[idx]
>>> y = y[idx]

Slide 125

Slide 125 text

Stop poking my brain and show me code!!! • Pre-Processing ● Character ↔ Integer Index ● Training set preparation ● Convert into integers • Define the model

>>> from keras.models import Sequential
>>> from keras.layers import Dense, Dropout, LSTM
>>> model = Sequential()
>>> model.add(LSTM(256,
...                input_shape=(maxlen, len(chars)),
...                return_sequences=True))
>>> model.add(Dropout(0.2))
>>> model.add(LSTM(256))
>>> model.add(Dropout(0.2))
>>> model.add(Dense(len(chars), activation='softmax'))
>>> model.compile(loss='categorical_crossentropy',
...               optimizer='adam',
...               metrics=['categorical_crossentropy'])

Slide 126

Slide 126 text

Stop poking my brain and show me code!!! • Pre-Processing ● Character ↔ Integer Index ● Training set preparation ● Convert into integers • Define the model • Train the model

>>> model.fit(x, y, batch_size=128, epochs=8)

Slide 127

Slide 127 text

Stop poking my brain and show me code!!! • Pre-Processing ● Character ↔ Integer Index ● Training set preparation ● Convert into integers • Define the model • Train the model

>>> model.fit(x, y, batch_size=128, epochs=8)
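The slides do not show the sampling step that turns the trained model into joke text; below is a sketch of it, reusing model, chars, char2idx, idx2char, maxlen and numpy from the previous slides (the temperature knob is an addition of mine):

>>> def generate(seed, length=200, temperature=0.8):
...     text = seed
...     for _ in range(length):
...         window = text[-maxlen:]                          # last maxlen characters as context
...         x_pred = np.zeros((1, maxlen, len(chars)))
...         for t, char in enumerate(window):
...             x_pred[0, maxlen - len(window) + t, char2idx[char]] = 1.
...         preds = model.predict(x_pred, verbose=0)[0]
...         preds = np.log(preds + 1e-8) / temperature       # temperature scaling
...         preds = np.exp(preds) / np.sum(np.exp(preds))
...         next_idx = np.random.choice(len(chars), p=preds) # sample the next character
...         text += idx2char[next_idx]
...     return text
>>> generate("How many dead hookers does it take to sc")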

Slide 128

Slide 128 text

Are the jokes any good? • Using the start of an existing joke as seed

Slide 129

Slide 129 text

Are the jokes any good? • Using the start of an existing joke as seed Example: How many dead hookers does it take to sc

Slide 130

Slide 130 text

Are the jokes any good? • Using the start of an existing joke as seed Example: How many dead hookers does it take to screw in a light bulb? Two, but it was the same time.

Slide 131

Slide 131 text

Are the jokes any good? • Using the start of an existing joke as seed ● Sentence structure is surprisingly good Example: How many dead hookers does it take to screw in a light bulb? Two, but it was the same time.

Slide 132

Slide 132 text

Are the jokes any good? • Using the start of an existing joke as seed ● Sentence structure is surprisingly good ● Coherence with the answer is a bit lacking Example: How many dead hookers does it take to screw in a light bulb? Two, but it was the same time.

Slide 133

Slide 133 text

Are the jokes any good? • Using the start of an existing joke as seed ● Sentence structure is surprisingly good ● Coherence with the answer is a bit lacking • A weirder example

Slide 134

Slide 134 text

Are the jokes any good? • Using the start of an existing joke as seed ● Sentence structure is surprisingly good ● Coherence with the answer is a bit lacking • A weirder example Example: Am I the only one who closes the silverw

Slide 135

Slide 135 text

Are the jokes any good? • Using the start of an existing joke as seed ● Sentence structure is surprisingly good ● Coherence with the answer is a bit lacking • A weirder example Example: Am I the only one who closes the silverwine costume have sex with Dram the ground? Prostitute too scoring out wrenking out running into a teeth in other people? Are you surprised whereed? Chocolate Good!!!!!!!!!!!!? Don't you buy me? nit-te

Slide 136

Slide 136 text

Are the jokes any good? • Using the start of an existing joke as seed ● Sentence structure is surprisingly good ● Coherence with the answer is a bit lacking • A weirder example • Custom seed

Slide 137

Slide 137 text

Are the jokes any good? • Using the start of an existing joke as seed ● Sentence structure is surprisingly good ● Coherence with the answer is a bit lacking • A weirder example • Custom start ● A letter Example: D

Slide 138

Slide 138 text

Are the jokes any good? • Using the start of an existing joke as seed ● Sentence structure is surprisingly good ● Coherence with the answer is a bit lacking • A weirder example • Custom start ● A letter Example: Do you have a beer from starbucks? An asshole in the back of the back, but I leave the same time. I think I was all day. The barman says "I don't know what the fucking card move in the back of his life

Slide 139

Slide 139 text

Are the jokes any good? • Using the start of an existing joke as seed ● Sentence structure is surprisingly good ● Coherence with the answer is a bit lacking • A weirder example • Custom start ● A letter ● “What do you call” jokes

Slide 140

Slide 140 text

Are the jokes any good? • Using the start of an existing joke as seed ● Sentence structure is surprisingly good ● Coherence with the answer is a bit lacking • A weirder example • Custom start ● A letter ● “What do you call” jokes Examples: • What do you call a nut on house? A coint in his circus. • What do you call a bill on the oven? A condom. • What do you call a blowjob? A social storm. • What do you call a shit of country in a car? A lifetime experience. • What do you call a dog device? A garbanzo bean with a curry. • What do you call a disappointment on the bathroom? A pilot, you can't take a shit. • What do you call a dialogast? A farmer. • What do you call a dog first? A sandwich. • What do you call a dick on his shoes? A woman in the stairs.

Slide 141

Slide 141 text

What are the takeaways? “Art is never finished, only abandoned” Leonardo da Vinci

Slide 142

Slide 142 text

Today I learned

Topics covered
• Distributed representations are great for categorical fields
• Word2Vec and LSTM rules
• Python rules

Current trends
• Divide and Conquer
• Attention
• Generative Adversarial Networks
• Sequence-to-Sequence
• RNN-based Variational Auto Encoder

Potential paths of exploration
• Try using attention and longer spans of memory
• Active learning classifier for jokes quality
• Do the same for motivational quotes

Slide 143

Slide 143 text

And what now?

I’m a Software Engineer
• Fast.ai Machine Learning for Coders
• Seek problems in your surroundings and do a POC
• Python rules

I am a Data Scientist
• DeepLearning.ai Deep Learning Specialization
• Develop Probability and Statistics
  • Udacity Intro to Descriptive Statistics
  • Udacity Intro to Inferential Statistics

I went through the wrong door and noticed free beer
• Enjoy it :D

Slide 144

Slide 144 text

Bibliography
• https://en.wikipedia.org/wiki/Theories_of_humor
• https://en.wikipedia.org/wiki/Humor_research
• https://en.wikipedia.org/wiki/Computational_humor
• https://www.iflscience.com/technology/ais-attempts-at-oneliner-jokes-are-unintentionally-hilarious/
• https://motherboard.vice.com/en_us/article/z43nke/joke-telling-robots-are-the-final-frontier-of-artificial-intelligence
• https://medium.com/@davidolarinoye/will-ai-ever-be-able-to-make-a-joke-808a656b53a6
• https://journals.sagepub.com/doi/pdf/10.1177/147470490600400129
• https://towardsdatascience.com/word2vec-skip-gram-model-part-1-intuition-78614e4d6e0b
• https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/
• https://www.kaggle.com/abhinavmoudgil95/short-jokes
• https://arxiv.org/pdf/1708.02709.pdf
• http://blog.aylien.com/overview-word-embeddings-history-word2vec-cbow-glove/

Slide 145

Slide 145 text

Thank you @diogojapinto /in/diogojapinto/ [email protected]