Adam Palay - "Words, words, words": Reading Shakespeare with Python


This talk will give an introduction to text analysis with Python by asking some questions about Shakespeare and discussing the quantitative methods that go into answering them. While we’ll use Shakespeare to illustrate our methodologies, we’ll also discuss how they can be ported over to more 21st-century texts, like tweets or New York Times articles.


PyCon 2015

April 18, 2015


  1. Words, words, words Reading Shakespeare with Python

  2. Prologue

  3. Motivation How can we use Python to supplement our reading

    of Shakespeare? How can we get Python to read for us?
  4. Act I

  5. Why Shakespeare? Polonius: What do you read, my lord? Hamlet:

    Words, words, words. P: What is the matter, my lord? H: Between who? P: I mean, the matter that you read, my lord. --II.2.184
  6. Why Shakespeare? (Also the XML) (thank you, https://github.com/severdia/PlayShakespeare.com-XML)

  7. Shakespeare XML

  8. Shakespeare XML

  9. Challenges • Language, especially English, is messy • Texts are

    usually unstructured • Pronunciation is not standard • Reading is pretty hard!
  10. Humans and Computers Nuance Ambiguity Close reading Counting Repetitive tasks

    Making graphs Humans are good at: Computers are good at:
  11. Act II

  12. (leveraging metadata) Who is the main character in _______?

  13. Who is the main character in Hamlet? Number of Lines

  14. Who is the main character in King Lear? Number of Lines

  15. Who is the main character in Macbeth? Number of Lines

  16. Who is the main character in Othello? Number of Lines

  17. Iago and Othello, Detail Number of Lines

  18. Obligatory Social Network

  19. Act III

  20. First steps with natural language processing (NLP) What are Shakespeare’s

    most interesting rhymes?
  21. Shakespeare’s Sonnets • A sonnet is a 14-line poem •

    There are many different rhyme schemes a sonnet can have; Shakespeare was pretty unique in choosing one • This is a huge win for us, since we can “hard code” his rhyme scheme in our analysis
  22. Shall I compare thee to a summer’s day? Thou art

    more lovely and more temperate: Rough winds do shake the darling buds of May, And summer’s lease hath all too short a date; Sometime too hot the eye of heaven shines, And often is his gold complexion dimm'd; And every fair from fair sometime declines, By chance or nature’s changing course untrimm'd; But thy eternal summer shall not fade, Nor lose possession of that fair thou ow’st; Nor shall death brag thou wander’st in his shade, When in eternal lines to time thou grow’st: So long as men can breathe or eyes can see, So long lives this, and this gives life to thee. a b a b c d c d e f e f g g Sonnet 18
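Because Shakespeare’s sonnets all follow the fixed scheme abab cdcd efef gg, the rhyme pairs can be “hard coded” by line position. A minimal sketch (Python 3), using the line-ending words of Sonnet 18 quoted above:

```python
# The scheme letter for each of the 14 lines, hard-coded.
RHYME_SCHEME = "ababcdcdefefgg"

def rhyme_pairs(last_words):
    """Group the 14 line-ending words by their rhyme-scheme letter."""
    groups = {}
    for letter, word in zip(RHYME_SCHEME, last_words):
        groups.setdefault(letter, []).append(word)
    return list(groups.values())

# Line-ending words of Sonnet 18.
sonnet_18 = ["day", "temperate", "May", "date",
             "shines", "dimm'd", "declines", "untrimm'd",
             "fade", "ow'st", "shade", "grow'st",
             "see", "thee"]

print(rhyme_pairs(sonnet_18)[0])   # ['day', 'May']
```

With the pairs extracted this way, counting rhymes across all 154 sonnets becomes a straightforward frequency problem.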
  23. Rhyme Distribution • Most common rhymes • nltk.FreqDist Frequency Distribution

    • Given a word, what is the frequency distribution of the words that rhyme with it? • nltk.ConditionalFreqDist Conditional Frequency Distribution
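nltk’s FreqDist behaves much like `collections.Counter`, and ConditionalFreqDist like a dict of Counters, so both questions can be sketched with the standard library. The rhyme pairs below are made up for illustration:

```python
from collections import Counter, defaultdict

# Hypothetical rhyme pairs extracted from the sonnets.
rhyme_pairs = [("thee", "me"), ("thee", "usury"), ("thee", "me"),
               ("day", "May"), ("thee", "be")]

# Most common rhyming words overall (cf. nltk.FreqDist).
freq = Counter(word for pair in rhyme_pairs for word in pair)
print(freq.most_common(1))   # [('thee', 4)]

# Given a word, the distribution of words that rhyme with it
# (cf. nltk.ConditionalFreqDist).
cond_freq = defaultdict(Counter)
for first, second in rhyme_pairs:
    cond_freq[first][second] += 1
    cond_freq[second][first] += 1

print(cond_freq["thee"])     # me: 2, usury: 1, be: 1
```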
  24. Rhyme Distribution

  25. Rhyme Distribution

  26. Interesting Rhymes? 1) “Boring” rhymes: “me” and “thee”

    2) “Lopsided” rhymes: “thee” and “usury”
  27. Act IV

  28. Classifiers 101 Writing code that reads

  29. Our Classifier Can we write code to tell if a

    given speech is from a tragedy or comedy?
  30. • Requires labeled text ◦ (in this case, speeches labeled

    by genre) ◦ [(<speech>, <genre>), ...] • Requires “training” • Predicts labels of text Classifiers: overview
  31. Classifiers: ingredients • Classifier • Vectorizer, or Feature Extractor •

    Classifiers only interact with features, not the text itself
  32. Vectorizers (or Feature Extractors) • A vectorizer, or feature extractor,

    transforms a text into quantifiable information about the text. • Theoretically, these features could be anything. i.e.: ◦ How many capital letters does the text contain? ◦ Does the text end with an exclamation point? • In practice, a common model is “Bag of Words”.
  33. Bag of Words is a kind of feature extraction where:

    • The set of features is the set of all words in the text you’re analyzing • A single text is represented by how many of each word appears in it Bag of Words
  34. Bag of Words: Simple Example Two texts: • “Hello, Will!”

    • “Hello, Globe!”
  35. Bag of Words: Simple Example Two texts: • “Hello, Will!”

    • “Hello, Globe!” Bag: [“Hello”, “Will”, “Globe”] “Hello” “Will” “Globe”
  36. Bag of Words: Simple Example Two texts: • “Hello, Will!”

    • “Hello, Globe!” Bag: [“Hello”, “Will”, “Globe”] “Hello” “Will” “Globe” “Hello, Will” 1 1 0 “Hello, Globe” 1 0 1
  37. Bag of Words: Simple Example Two texts: • “Hello, Will!”

    • “Hello, Globe!” “Hello” “Will” “Globe” “Hello, Will” 1 1 0 “Hello, Globe” 1 0 1 “Hello, Will” → “A text that contains one instance of the word “Hello”, contains one instance of the word “Will”, and does not contain the word “Globe”. (Less readable for us, more readable for computers!)
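The table above can be reproduced in a few lines of plain Python; this is essentially what a bag-of-words vectorizer (such as sklearn’s CountVectorizer, used later) does, here sketched with `collections.Counter`:

```python
from collections import Counter
import re

def bag_of_words(texts):
    """Build a shared vocabulary, then represent each text as word counts."""
    tokenized = [re.findall(r"\w+", text.lower()) for text in texts]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    return vocab, [[Counter(tokens)[word] for word in vocab]
                   for tokens in tokenized]

vocab, vectors = bag_of_words(["Hello, Will!", "Hello, Globe!"])
print(vocab)    # ['globe', 'hello', 'will']
print(vectors)  # [[0, 1, 1], [1, 1, 0]]
```

Note that punctuation and capitalization are discarded: “Hello, Will!” and “hello will” produce the same row of counts.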
  38. Live Vectorizer:

  39. Why are these called “Vectorizers”? text_1 = "words, words, words"

    text_2 = "words, words, birds" [Scatter plot: each text plotted as a point, with axes “# times ‘words’ is used” and “# times ‘birds’ is used”]
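They are called vectorizers because each text becomes a vector: a point in a space with one axis per vocabulary word. A minimal sketch for the two texts on the slide:

```python
from collections import Counter

text_1 = "words, words, words"
text_2 = "words, words, birds"

# One axis per vocabulary word; each text becomes a point.
axes = ["words", "birds"]

def to_vector(text):
    counts = Counter(text.replace(",", "").split())
    return [counts[axis] for axis in axes]

print(to_vector(text_1))  # [3, 0]
print(to_vector(text_2))  # [2, 1]
```

Once texts are points, geometric notions like distance and angle between them become meaningful, which is what classifiers operate on.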
  40. Act V

  41. Putting it all Together Classifier Workflow

  42. Classification: Steps 1) Split pre-labeled text into training and testing

    sets 2) Vectorize text (extract features) 3) Train classifier 4) Test classifier Text → Features → Labels
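The four steps can be sketched end-to-end with scikit-learn. The speeches and split below are toy stand-ins invented for illustration, not the real labeled Shakespeare data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

speeches = [
    "O woeful day! Most lamentable day! O day, o hateful day!",
    "Death lies on her like an untimely frost.",
    "I wish thee merry laughter and merry days, my lord.",
    "What fools these mortals be! Lord, what merry fools!",
]
labels = ["tragedy", "tragedy", "comedy", "comedy"]

# 1) Split pre-labeled text into training and testing sets.
train_speeches, test_speeches = speeches[:3], speeches[3:]
train_labels, test_labels = labels[:3], labels[3:]

# 2) Vectorize text (extract bag-of-words features).
vectorizer = CountVectorizer()
train_features = vectorizer.fit_transform(train_speeches)

# 3) Train the classifier.
classifier = MultinomialNB()
classifier.fit(train_features, train_labels)

# 4) Test the classifier on unseen speeches.
test_features = vectorizer.transform(test_speeches)
print(classifier.predict(test_features)[0])
```

Note that the vectorizer is fit only on the training set; test speeches are transformed with the same vocabulary, and unseen words are simply dropped.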
  43. Training

  44. Classifier Training from sklearn.feature_extraction.text import CountVectorizer from sklearn.naive_bayes import MultinomialNB

    vectorizer = CountVectorizer() train_features = vectorizer.fit_transform(train_speeches) classifier = MultinomialNB() classifier.fit(train_features, train_labels)
  45. Testing

  46. test_speech = test_speeches[0] print test_speech Farewell, Andronicus, my noble father,

    The woefull'st man that ever liv'd in Rome. Farewell, proud Rome, till Lucius come again; He loves his pledges dearer than his life. ... (From Titus Andronicus, III.1.288-300) Classifier Testing
  47. Classifier Testing test_speech = test_speeches[0] test_label = test_labels[0] test_features =

    vectorizer.transform([test_speech]) prediction = classifier.predict(test_features)[0] print prediction >>> 'tragedy' print test_label >>> 'tragedy'
  48. test_features = vectorizer.transform(test_speeches) print classifier.score(test_features, test_labels) >>> 0.75427682737169521 Classifier Testing
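The score reported above is plain accuracy: the fraction of test speeches whose predicted label matches the true label. A sketch with hypothetical predictions:

```python
def accuracy(predictions, true_labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == t for p, t in zip(predictions, true_labels))
    return correct / len(true_labels)

predictions = ["tragedy", "comedy", "tragedy", "tragedy"]
true_labels = ["tragedy", "comedy", "comedy", "tragedy"]
print(accuracy(predictions, true_labels))  # 0.75
```

A caveat worth keeping in mind: accuracy alone can mislead when the training data is lopsided, which motivates the critiques on the next slide.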

  49. Critiques • "Bag of Words" assumes a correlation between word

    use and label. This correlation is stronger in some cases than in others. • Beware of highly-disproportionate training data.
  50. Epilogue

  51. @adampalay Thank you!