Adam Palay - "Words, words, words": Reading Shakespeare with Python

Words, words, words Reading Shakespeare with Python

Prologue

Motivation
How can we use Python to supplement our reading of Shakespeare? How can we get Python to read for us?

Why Shakespeare?
Polonius: What do you read, my lord?
Hamlet: Words, words, words.
Polonius: What is the matter, my lord?
Hamlet: Between who?
Polonius: I mean, the matter that you read, my lord.
--II.2.184

Why Shakespeare?
(Also the XML)
(thank you, https://github.com/severdia/PlayShakespeare.com-XML !!!)

Shakespeare XML

Challenges
• Language, especially English, is messy
• Texts are usually unstructured
• Pronunciation is not standard
• Reading is pretty hard!

Humans and Computers
Humans are good at: Nuance, Ambiguity, Close reading
Computers are good at: Counting, Repetitive tasks, Making graphs

Act II

(leveraging metadata)
Who is the main character in _______?

Who is the main character in Hamlet?
[Chart: Number of Lines]

Who is the main character in King Lear?
[Chart: Number of Lines]

Who is the main character in Macbeth?
[Chart: Number of Lines]

Who is the main character in Othello?
[Chart: Number of Lines]

Iago and Othello, Detail
[Chart: Number of Lines]
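The "number of lines" question above can be sketched in a few lines of Python. This is only an illustration: the tiny XML sample and its element and attribute names (`speech`, `speaker`, `line`) are hypothetical, and the real PlayShakespeare.com XML may be structured differently.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical miniature play. The tag and attribute names here are
# assumptions for illustration, not the real PlayShakespeare.com schema.
SAMPLE_XML = """
<play title="Sample">
  <speech speaker="HAMLET">
    <line>Words, words, words.</line>
    <line>Between who?</line>
  </speech>
  <speech speaker="POLONIUS">
    <line>What do you read, my lord?</line>
  </speech>
  <speech speaker="HAMLET">
    <line>Slanders, sir.</line>
  </speech>
</play>
"""


def lines_per_speaker(xml_text):
    """Count how many lines each speaker has across all speeches."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for speech in root.iter("speech"):
        counts[speech.get("speaker")] += len(speech.findall("line"))
    return counts


counts = lines_per_speaker(SAMPLE_XML)
# Speakers sorted from most lines to fewest.
print(counts.most_common())
```

Sorting speakers by line count gives one rough, metadata-only answer to "who is the main character?" without reading any of the text itself.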

Obligatory Social Network

Act III

First steps with natural language processing (NLP)
What are Shakespeare’s most interesting rhymes?

Shakespeare’s Sonnets
• A sonnet is a 14-line poem
• There are many different rhyme schemes a sonnet can have; Shakespeare was unusual in sticking to a single one
• This is a huge win for us, since we can “hard code” his rhyme scheme in our analysis

Sonnet 18
Shall I compare thee to a summer’s day?
Thou art more lovely and more temperate:
Rough winds do shake the darling buds of May,
And summer’s lease hath all too short a date;
Sometime too hot the eye of heaven shines,
And often is his gold complexion dimm'd;
And every fair from fair sometime declines,
By chance or nature’s changing course untrimm'd;
But thy eternal summer shall not fade,
Nor lose possession of that fair thou ow’st;
Nor shall death brag thou wander’st in his shade,
When in eternal lines to time thou grow’st:
So long as men can breathe or eyes can see,
So long lives this, and this gives life to thee.
http://www.poetryfoundation.org/poem/174354
Rhyme scheme: a b a b c d c d e f e f g g
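As a rough sketch of what "hard coding" the rhyme scheme could look like, the snippet below pairs up line-ending words using the scheme a b a b c d c d e f e f g g. The helper names (`last_word`, `rhyme_pairs`) are made up for this illustration and are not taken from the talk's code.

```python
import string

SONNET_18 = """
Shall I compare thee to a summer's day?
Thou art more lovely and more temperate:
Rough winds do shake the darling buds of May,
And summer's lease hath all too short a date;
Sometime too hot the eye of heaven shines,
And often is his gold complexion dimm'd;
And every fair from fair sometime declines,
By chance or nature's changing course untrimm'd;
But thy eternal summer shall not fade,
Nor lose possession of that fair thou ow'st;
Nor shall death brag thou wander'st in his shade,
When in eternal lines to time thou grow'st:
So long as men can breathe or eyes can see,
So long lives this, and this gives life to thee.
""".strip().splitlines()

# One letter per line; lines sharing a letter are expected to rhyme.
RHYME_SCHEME = "ababcdcdefefgg"


def last_word(line):
    """Return the final word of a line, lowercased, without edge punctuation."""
    return line.split()[-1].strip(string.punctuation).lower()


def rhyme_pairs(lines, scheme=RHYME_SCHEME):
    """Group line-ending words by their letter in the rhyme scheme."""
    groups = {}
    for letter, line in zip(scheme, lines):
        groups.setdefault(letter, []).append(last_word(line))
    return [tuple(words) for words in groups.values()]


pairs = rhyme_pairs(SONNET_18)
for pair in pairs:
    print(pair)
```

Because the scheme is fixed, no pronunciation data is needed to find the rhyming words: position in the poem is enough.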

Rhyme Distribution
• Most common rhymes
  ◦ Frequency Distribution: nltk.FreqDist
• Given a word, what is the frequency distribution of the words that rhyme with it?
  ◦ Conditional Frequency Distribution: nltk.ConditionalFreqDist
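To make the two kinds of distribution concrete, here is a minimal sketch that uses only the standard library (`Counter` and `defaultdict`) as stand-ins for a frequency distribution and a conditional frequency distribution. The rhyme pairs are invented for illustration; they are not counts from the sonnets.

```python
from collections import Counter, defaultdict

# Hypothetical rhyme pairs, invented for this example.
pairs = [
    ("me", "thee"),
    ("thee", "me"),
    ("thee", "usury"),
    ("day", "may"),
    ("me", "thee"),
]

# Frequency distribution: how often does each (unordered) rhyme occur?
rhyme_counts = Counter(tuple(sorted(pair)) for pair in pairs)

# Conditional frequency distribution: given a word, how often does it
# rhyme with each of its partners?
rhymes_with = defaultdict(Counter)
for first, second in pairs:
    rhymes_with[first][second] += 1
    rhymes_with[second][first] += 1

print(rhyme_counts.most_common(1))
print(rhymes_with["thee"])
print(rhymes_with["usury"])
```

The first structure answers "which rhymes are most common?"; the second answers "what does this particular word tend to rhyme with?".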

Rhyme Distribution

Interesting Rhymes?
1) “Boring” rhymes: “me” and “thee”
2) “Lopsided” rhymes: “thee” and “usury”

Act IV

Classifiers 101
Writing code that reads

Our Classifier
Can we write code to tell if a given speech is from a tragedy or comedy?

Classifiers: overview
• Requires labeled text
  ◦ (in this case, speeches labeled by genre)
  ◦ [(<speech>, <genre>), ...]
• Requires “training”
• Predicts labels of text

Classifiers: ingredients
• Classifier
• Vectorizer, or Feature Extractor
• Classifiers only interact with features, not the text itself

Vectorizers (or Feature Extractors)
• A vectorizer, or feature extractor, transforms a text into quantifiable information about the text.
• Theoretically, these features could be anything, e.g.:
  ◦ How many capital letters does the text contain?
  ◦ Does the text end with an exclamation point?
• In practice, a common model is “Bag of Words”.
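The two example features above can be written as a tiny hand-rolled feature extractor. This is a sketch only; the function name and the feature names are made up for illustration.

```python
def extract_features(text):
    """Turn a text into a small dictionary of quantifiable features."""
    return {
        # How many capital letters does the text contain?
        "capital_letters": sum(1 for ch in text if ch.isupper()),
        # Does the text end with an exclamation point?
        "ends_with_exclamation": text.rstrip().endswith("!"),
    }


features = extract_features("Hello, Will!")
print(features)
```

A classifier would only ever see the returned dictionary, never the original string.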

Bag of Words
Bag of Words is a kind of feature extraction where:
• The set of features is the set of all words in the text you’re analyzing
• A single text is represented by how many times each word appears in it

Bag of Words: Simple Example
Two texts:
• “Hello, Will!”
• “Hello, Globe!”

Bag: [“Hello”, “Will”, “Globe”]

                 “Hello”   “Will”   “Globe”
“Hello, Will”       1         1        0
“Hello, Globe”      1         0        1

“Hello, Will” → “A text that contains one instance of the word “Hello”, contains one instance of the word “Will”, and does not contain the word “Globe”.” (Less readable for us, more readable for computers!)
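The table above can be reproduced with a few lines of plain Python. This is a simplified sketch of the Bag of Words idea, not scikit-learn's implementation; the helper names `tokenize` and `bag_of_words` are invented here.

```python
import re
from collections import Counter


def tokenize(text):
    """Split a text into words, dropping punctuation."""
    return re.findall(r"[A-Za-z']+", text)


def bag_of_words(texts):
    """Return the vocabulary (the "bag") and one row of counts per text."""
    vocabulary = []
    for text in texts:
        for word in tokenize(text):
            if word not in vocabulary:
                vocabulary.append(word)
    rows = []
    for text in texts:
        counts = Counter(tokenize(text))
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


vocabulary, rows = bag_of_words(["Hello, Will!", "Hello, Globe!"])
print(vocabulary)
for row in rows:
    print(row)
```

Unlike this sketch, scikit-learn's `CountVectorizer` lowercases text by default and orders its vocabulary alphabetically, so its columns would come out in a different order.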

Live Vectorizer:

Why are these called “Vectorizers”?
text_1 = "words, words, words"
text_2 = "words, words, birds"
[Plot: text_1 and text_2 drawn as points; axes: # times “words” is used, # times “birds” is used]
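One way to read the plot is that each text becomes a point whose coordinates are its word counts. A minimal sketch, assuming a fixed two-word vocabulary and a hypothetical helper named `count_vector`:

```python
import re
from collections import Counter


def count_vector(text, vocabulary):
    """Represent a text as a tuple of counts, one per vocabulary word."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return tuple(counts[word] for word in vocabulary)


# Axis order: (# times "words" is used, # times "birds" is used)
vocabulary = ("words", "birds")

text_1 = "words, words, words"
text_2 = "words, words, birds"

vector_1 = count_vector(text_1, vocabulary)
vector_2 = count_vector(text_2, vocabulary)
print(vector_1)
print(vector_2)
```

With two words in the vocabulary the vectors live in a plane; a real vocabulary simply adds one dimension per word.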

Putting it all Together
Classifier Workflow

Classification: Steps
1) Split pre-labeled text into training and testing sets
2) Vectorize text (extract features)
3) Train classifier
4) Test classifier
Text → Features → Labels
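Step 1 can be sketched with the standard library alone. The function name `split_labeled`, the 25% test fraction, and the placeholder data are all assumptions made for this example; the talk does not say how its split was done.

```python
import random


def split_labeled(labeled_speeches, test_fraction=0.25, seed=0):
    """Shuffle (speech, genre) pairs and split them into train and test sets."""
    shuffled = list(labeled_speeches)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical placeholder data in the [(<speech>, <genre>), ...] shape.
labeled = [("speech %d" % i, "tragedy" if i % 2 else "comedy") for i in range(8)]

train, test = split_labeled(labeled)

# Separate the texts from their labels for the later steps.
train_speeches, train_labels = zip(*train)
test_speeches, test_labels = zip(*test)
print(len(train_speeches), len(test_speeches))
```

Keeping the test speeches out of training is what makes the score in step 4 meaningful: the classifier is judged on text it has never seen.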

Training

Classifier Training
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Learn the vocabulary from the training speeches, then count words.
vectorizer = CountVectorizer()
vectorizer.fit(train_speeches)
train_features = vectorizer.transform(train_speeches)

# Train a Naive Bayes classifier on the features and their genre labels.
classifier = MultinomialNB()
classifier.fit(train_features, train_labels)

Testing

Classifier Testing
test_speech = test_speeches[0]
print(test_speech)

Farewell, Andronicus, my noble father,
The woefull'st man that ever liv'd in Rome.
Farewell, proud Rome, till Lucius come again;
He loves his pledges dearer than his life.
...
(From Titus Andronicus, III.1.288-300)

Classifier Testing
test_speech = test_speeches[0]
test_label = test_labels[0]
# transform() expects a list of texts, so wrap the single speech.
test_features = vectorizer.transform([test_speech])
prediction = classifier.predict(test_features)[0]
print(prediction)  # prints: tragedy
print(test_label)  # prints: tragedy

Classifier Testing
test_features = vectorizer.transform(test_speeches)
# score() returns the fraction of test speeches labeled correctly.
print(classifier.score(test_features, test_labels))  # output on the slide: 0.75427682737169521

Critiques
• "Bag of Words" assumes a correlation between word use and label. This correlation is stronger in some cases than in others.
• Beware of highly disproportionate training data.
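One way to see why disproportionate data deserves caution is to compare any score against a "majority baseline" that always predicts the most common training label. The label counts below are invented for illustration and say nothing about the actual data used in the talk.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    majority_label = Counter(train_labels).most_common(1)[0][0]
    correct = sum(1 for label in test_labels if label == majority_label)
    return majority_label, correct / len(test_labels)


# Hypothetical, deliberately lopsided labels.
train_labels = ["tragedy"] * 9 + ["comedy"] * 1
test_labels = ["tragedy"] * 8 + ["comedy"] * 2

label, accuracy = majority_baseline_accuracy(train_labels, test_labels)
print(label, accuracy)
```

If a trained classifier barely beats this baseline, its score mostly reflects the imbalance in the labels and not anything it learned from the words.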

Epilogue

[email protected] @adampalay www.adampalay.com Thank you!


More Decks by PyCon 2015
