Slide 1

Words, words, words: Reading Shakespeare with Python

Slide 2

Prologue

Slide 3

Motivation

How can we use Python to supplement our reading of Shakespeare? How can we get Python to read for us?

Slide 4

Act I

Slide 5

Why Shakespeare?

Polonius: What do you read, my lord?
Hamlet: Words, words, words.
P: What is the matter, my lord?
H: Between who?
P: I mean, the matter that you read, my lord.
(Hamlet, II.2.184)

Slide 6

Why Shakespeare? (Also the XML)

(thank you, https://github.com/severdia/PlayShakespeare.com-XML !!!)

Slide 7

Shakespeare XML

Slide 8

Shakespeare XML
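
(The XML slides aren’t reproduced in this transcript. As a rough sketch of working with play markup: the tag names <speech>, <speaker>, and <line> below are assumptions in the spirit of the PlayShakespeare format, not necessarily its exact schema.)

import xml.etree.ElementTree as ET

# Toy fragment in the spirit of the play XML; the tag names here
# (<speech>, <speaker>, <line>) are illustrative assumptions.
SNIPPET = """
<scene>
  <speech>
    <speaker>HAMLET</speaker>
    <line>Words, words, words.</line>
  </speech>
</scene>
"""

root = ET.fromstring(SNIPPET)
for speech in root.iter("speech"):
    speaker = speech.findtext("speaker")
    lines = [line.text for line in speech.iter("line")]
    print(speaker, "-", len(lines), "line(s)")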

Slide 9

Challenges

• Language, especially English, is messy
• Texts are usually unstructured
• Pronunciation is not standard
• Reading is pretty hard!

Slide 10

Humans and Computers

Humans are good at:
• Nuance
• Ambiguity
• Close reading

Computers are good at:
• Counting
• Repetitive tasks
• Making graphs

Slide 11

Act II

Slide 12

Who is the main character in _______?

(leveraging metadata)

Slide 13

Who is the main character in Hamlet?

[Chart: number of lines per character]
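
(The chart itself isn’t reproduced here. As a minimal sketch of how the counts behind it could be computed, with made-up stand-in data:)

from collections import Counter

# Hypothetical stand-in for (speaker, number of lines) pairs parsed from the XML
speeches = [("Hamlet", 4), ("Claudius", 2), ("Hamlet", 1), ("Ophelia", 3)]

line_counts = Counter()
for speaker, n_lines in speeches:
    line_counts[speaker] += n_lines

print(line_counts.most_common())
# [('Hamlet', 5), ('Ophelia', 3), ('Claudius', 2)]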

Slide 14

Who is the main character in King Lear?

[Chart: number of lines per character]

Slide 15

Who is the main character in Macbeth?

[Chart: number of lines per character]

Slide 16

Who is the main character in Othello?

[Chart: number of lines per character]

Slide 17

Iago and Othello, Detail

[Chart: number of lines, Iago vs. Othello]

Slide 18

Obligatory Social Network
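
(The network figure isn’t reproduced here. One common way to build such a character network, offered as a sketch rather than the talk’s actual method, is to link characters who appear in the same scene, e.g. with networkx:)

import itertools
import networkx as nx

# Hypothetical toy data: the set of characters appearing in each scene
scenes = [
    {"Hamlet", "Horatio", "Marcellus"},
    {"Hamlet", "Gertrude", "Claudius"},
    {"Hamlet", "Horatio"},
]

G = nx.Graph()
for scene in scenes:
    # Connect every pair of characters who share a scene
    for a, b in itertools.combinations(sorted(scene), 2):
        G.add_edge(a, b)

print(G.number_of_nodes(), "characters,", G.number_of_edges(), "connections")
# 5 characters, 6 connections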

Slide 19

Act III

Slide 20

First steps with natural language processing (NLP)

What are Shakespeare’s most interesting rhymes?

Slide 21

Shakespeare’s Sonnets

• A sonnet is a 14-line poem
• There are many different rhyme schemes a sonnet can have; Shakespeare was unusual in sticking to just one
• This is a huge win for us, since we can “hard code” his rhyme scheme in our analysis (see the sketch after Sonnet 18 below)

Slide 22

Sonnet 18

Shall I compare thee to a summer’s day? (a)
Thou art more lovely and more temperate: (b)
Rough winds do shake the darling buds of May, (a)
And summer’s lease hath all too short a date; (b)
Sometime too hot the eye of heaven shines, (c)
And often is his gold complexion dimm'd; (d)
And every fair from fair sometime declines, (c)
By chance or nature’s changing course untrimm'd; (d)
But thy eternal summer shall not fade, (e)
Nor lose possession of that fair thou ow’st; (f)
Nor shall death brag thou wander’st in his shade, (e)
When in eternal lines to time thou grow’st: (f)
So long as men can breathe or eyes can see, (g)
So long lives this, and this gives life to thee. (g)

http://www.poetryfoundation.org/poem/174354
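
(A minimal sketch of “hard coding” that scheme; last_word is a hypothetical helper with deliberately crude tokenization:)

# 0-based indices of the lines that the ababcdcdefefgg scheme pairs together
RHYME_SCHEME = [(0, 2), (1, 3), (4, 6), (5, 7), (8, 10), (9, 11), (12, 13)]

def last_word(line):
    # Crude tokenization: last whitespace-separated token, punctuation stripped
    return line.split()[-1].strip(".,;:?!").lower()

def rhyme_pairs(sonnet_lines):
    # Yield the (word, word) pairs the scheme says should rhyme
    for i, j in RHYME_SCHEME:
        yield last_word(sonnet_lines[i]), last_word(sonnet_lines[j])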

Slide 23

Rhyme Distribution

• Most common rhymes
○ nltk.FreqDist (Frequency Distribution)
• Given a word, what is the frequency distribution of the words that rhyme with it?
○ nltk.ConditionalFreqDist (Conditional Frequency Distribution)
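
(A minimal sketch of both distributions in action, using made-up rhyme pairs rather than the real extracted data:)

import nltk

# Hypothetical (word, rhyming word) pairs extracted from the sonnets
pairs = [("day", "may"), ("me", "thee"), ("thee", "usury"), ("me", "thee")]

# Most common rhyme words overall
fd = nltk.FreqDist(word for pair in pairs for word in pair)
print(fd.most_common(2))  # [('thee', 3), ('me', 2)]

# For a given word, which words rhyme with it, and how often?
cfd = nltk.ConditionalFreqDist(pairs)
print(cfd["me"].most_common())  # [('thee', 2)]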

Slide 24

Rhyme Distribution

Slide 25

Rhyme Distribution

Slide 26

Interesting Rhymes?

1) “Boring” rhymes: “me” and “thee”
2) “Lopsided” rhymes: “thee” and “usury”

Slide 27

Act IV

Slide 28

Classifiers 101

Writing code that reads

Slide 29

Our Classifier

Can we write code to tell if a given speech is from a tragedy or a comedy?

Slide 30

Classifiers: overview

● Requires labeled text
○ (in this case, speeches labeled by genre)
○ [(speech, genre), ...]
● Requires “training”
● Predicts labels of text

Slide 31

Classifiers: ingredients

● Classifier
● Vectorizer, or Feature Extractor
● Classifiers only interact with features, not the text itself

Slide 32

Vectorizers (or Feature Extractors)

● A vectorizer, or feature extractor, transforms a text into quantifiable information about the text.
● Theoretically, these features could be anything (a sketch follows below), e.g.:
○ How many capital letters does the text contain?
○ Does the text end with an exclamation point?
● In practice, a common model is “Bag of Words”.
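
(A minimal sketch of such a hand-rolled feature extractor; the function and feature names are illustrative, not from the talk:)

def extract_features(text):
    # Two toy features of the kind described above
    return {
        "num_capitals": sum(ch.isupper() for ch in text),
        "ends_with_exclamation": text.endswith("!"),
    }

print(extract_features("Words, words, words!"))
# {'num_capitals': 1, 'ends_with_exclamation': True}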

Slide 33

Bag of Words

Bag of Words is a kind of feature extraction where:
● The set of features is the set of all words in the text you’re analyzing
● A single text is represented by how many of each word appears in it

Slide 34

Bag of Words: Simple Example

Two texts:
● “Hello, Will!”
● “Hello, Globe!”

Slide 35

Bag of Words: Simple Example

Two texts:
● “Hello, Will!”
● “Hello, Globe!”

Bag: [“Hello”, “Will”, “Globe”]

Slide 36

Bag of Words: Simple Example

Two texts:
● “Hello, Will!”
● “Hello, Globe!”

Bag: [“Hello”, “Will”, “Globe”]

                  “Hello”  “Will”  “Globe”
“Hello, Will”        1       1       0
“Hello, Globe”       1       0       1

Slide 37

Bag of Words: Simple Example

Two texts:
● “Hello, Will!”
● “Hello, Globe!”

                  “Hello”  “Will”  “Globe”
“Hello, Will”        1       1       0
“Hello, Globe”       1       0       1

“Hello, Will” → “A text that contains one instance of the word ‘Hello’, contains one instance of the word ‘Will’, and does not contain the word ‘Globe’.”

(Less readable for us, more readable for computers!)

Slide 38

Live Vectorizer:
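
(The live demo isn’t captured in this transcript. As a stand-in, here is the same idea with scikit-learn’s CountVectorizer, run on the two example texts from the previous slides:)

from sklearn.feature_extraction.text import CountVectorizer

texts = ["Hello, Will!", "Hello, Globe!"]

vectorizer = CountVectorizer()
features = vectorizer.fit_transform(texts)

# get_feature_names() on older scikit-learn versions
print(vectorizer.get_feature_names_out())  # ['globe' 'hello' 'will']
print(features.toarray())
# [[0 1 1]
#  [1 1 0]]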

Slide 39

Why are these called “Vectorizers”?

text_1 = "words, words, words"
text_2 = "words, words, birds"

[Plot: text_1 and text_2 as points in the plane, with one axis counting how many times “words” is used and the other how many times “birds” is used; text_1 sits at 3 “words” / 0 “birds”, text_2 at 2 “words” / 1 “birds”]

Slide 40

Act V

Slide 41

Putting it all Together

Classifier Workflow

Slide 42

Classification: Steps

1) Split pre-labeled text into training and testing sets
2) Vectorize text (extract features)
3) Train classifier
4) Test classifier

Text → Features → Labels
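
(Step 1 isn’t shown in the code slides that follow, so here is a minimal sketch with a made-up toy corpus standing in for the real speeches:)

from sklearn.model_selection import train_test_split

# Made-up toy corpus standing in for speeches parsed from the plays
speeches = [
    "O, I die, Horatio...",             # Hamlet
    "If music be the food of love...",  # Twelfth Night
    "Out, damned spot!",                # Macbeth
    "All the world's a stage...",       # As You Like It
]
labels = ["tragedy", "comedy", "tragedy", "comedy"]

train_speeches, test_speeches, train_labels, test_labels = train_test_split(
    speeches, labels, test_size=0.25, random_state=0)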

Slide 43

Training

Slide 44

Classifier Training

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Learn the vocabulary, then turn each speech into word counts
vectorizer = CountVectorizer()
vectorizer.fit(train_speeches)
train_features = vectorizer.transform(train_speeches)

# Fit a Naive Bayes classifier on (features, labels)
classifier = MultinomialNB()
classifier.fit(train_features, train_labels)

Slide 45

Testing

Slide 46

Classifier Testing

test_speech = test_speeches[0]
print(test_speech)

Farewell, Andronicus, my noble father,
The woefull'st man that ever liv'd in Rome.
Farewell, proud Rome, till Lucius come again;
He loves his pledges dearer than his life.
...

(From Titus Andronicus, III.1.288-300)

Slide 47

Classifier Testing

test_speech = test_speeches[0]
test_label = test_labels[0]

test_features = vectorizer.transform([test_speech])
prediction = classifier.predict(test_features)[0]

print(prediction)
>>> 'tragedy'
print(test_label)
>>> 'tragedy'

Slide 48

Classifier Testing

test_features = vectorizer.transform(test_speeches)
print(classifier.score(test_features, test_labels))
>>> 0.75427682737169521

(score is the mean accuracy: the classifier labels about 75% of the held-out speeches correctly.)

Slide 49

Critiques

• “Bag of Words” assumes a correlation between word use and label. This correlation is stronger in some cases than in others.
• Beware of highly disproportionate training data: a classifier trained mostly on tragedies will tend to predict “tragedy”.

Slide 50

Epilogue

Slide 51

Thank you!

[email protected]
@adampalay
www.adampalay.com