Data driven literary analysis: an unsupervised approach to text analysis and classification

Unsupervised document classification addresses the problem of assigning categories to documents without a training set or predefined categories. This is useful for improving information retrieval, the basic assumption being that documents with similar content are relevant to the same queries. Literature makes a similar assumption when it defines genres and sub-genres: works that share specific conventions of form and content are assigned to the same genre.

The talk gives an overview of document clustering and its challenges, with a focus on dimensionality reduction and how to achieve it with topic modelling techniques such as LDA (Latent Dirichlet Allocation). Using Shakespeare’s body of work as a case study, the talk describes how to use nltk, sklearn and gensim to process and analyse theatrical works, with the final goal of testing whether document clustering yields the same classification as that given by literature experts.

Deck as presented at PyData Amsterdam 2016

Serena Peruzzo

March 14, 2016

Transcript

  1. DATA DRIVEN LITERARY ANALYSIS: AN UNSUPERVISED APPROACH TO TEXT ANALYSIS AND CLASSIFICATION
     Serena Peruzzo, PhD candidate at TU/e
     @sereprz, s.peruzzo@tue.nl, github.com/sereprz
  2. WHY AND WHAT?
     ➤ Natural Language Processing (NLP)
     ➤ interaction between natural and artificial languages
     ➤ e.g., machine translators, spam filters
     CAN NLP IDENTIFY DIFFERENT GENRES?
  3. SHAKESPEARE ANALYSIS
     18 comedies
     10 tragedies
     11000+ words
     Two-stage unsupervised approach
     Trial and error
  4. SUPERVISED DOCUMENT CLASSIFICATION

  5. UNSUPERVISED APPROACH

  6. FEATURE EXTRACTION
     ➤ a lot of information needs to be compressed and represented in simple data types
     term frequency-inverse document frequency (tf-idf)
     tfidf(‘love’, ‘Romeo and Juliet’, ‘Shakespeare’s plays’) = 100 * ln(28/25) = 11.33
     tfidf(‘Juliet’, ‘Romeo and Juliet’, ‘Shakespeare’s plays’) = 100 * ln(28/1) = 333.22
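
The two tf-idf figures on the slide can be reproduced with a few lines of Python. This is only a sketch of one reading of the slide: it assumes the factor 100 is the term's count in the play, 28 is the number of plays in the corpus, and 25 and 1 are the numbers of plays containing each word. The function name `tfidf` and its parameter names are illustrative, not taken from the talk's code.

```python
import math

def tfidf(term_count, n_documents, n_documents_with_term):
    """Term count times the natural log of the inverse document frequency."""
    return term_count * math.log(n_documents / n_documents_with_term)

# Assumed reading of the slide: a count of 100 in 'Romeo and Juliet',
# 28 plays in the corpus.
love = tfidf(100, 28, 25)    # 'love' assumed to occur in 25 of the 28 plays
juliet = tfidf(100, 28, 1)   # 'Juliet' assumed to occur in only 1 play

print(round(love, 2))    # 11.33
print(round(juliet, 2))  # 333.22
```

A word that occurs in almost every play gets a low weight, while a word specific to one play gets a high one. Library implementations may use a different variant: scikit-learn's `TfidfVectorizer`, for example, smooths the idf term by default, so its numbers would not match the slide exactly.
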
  7. LATENT DIRICHLET ALLOCATION
     ➤ N documents
     ➤ K probability distributions over a collection of words (topics)
     ➤ Formal statistical relationship
     ➤ bag-of-words assumption
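
The bag-of-words assumption means a document is reduced to word counts and word order is discarded. A minimal sketch, assuming a simple regular-expression tokeniser rather than the nltk pipeline used in the talk (the helper name `bag_of_words` is hypothetical):

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lower-case the text, split it into word tokens and count them."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

a = bag_of_words("To be, or not to be")
b = bag_of_words("Not to be or to be")

print(a["be"])  # 2
print(a == b)   # True: same counts, so the two texts are indistinguishable
```
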
  8. LDA - GENERATIVE MODEL
     ➤ For each document:
       1. Select the number of words
       2. Draw a distribution of topics
       3. For each word in the document:
          i. Draw a specific topic
          ii. Draw a word from a multinomial probability conditioned on the topic
  9. LDA - EXAMPLE
     ➤ d is a 5-word document
     ➤ Decide d will be 1/2 about cute animals and 1/2 about food
     ➤ topic: food, word: ‘broccoli’
     ➤ topic: cute animals, word: ‘panda’
     ➤ topic: cute animals, word: ‘baby’
     ➤ topic: food, word: ‘apple’
     ➤ topic: food, word: ‘eating’
     ➤ d = {broccoli, panda, baby, apple, eating}
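
The generative process and the food / cute animals example can be sketched with the standard library alone. Everything numeric here is an assumption made for illustration: the word probabilities inside each topic, the symmetric Dirichlet parameter `alpha`, and the fixed document length of 5 (step 1 of the model would normally draw the length at random). The function names are hypothetical.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet by normalising independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, n_words, alpha, rng):
    """Generate one document following the steps on the slide."""
    topic_names = list(topics)
    # Step 2: draw the document's distribution over topics.
    theta = sample_dirichlet([alpha] * len(topic_names), rng)
    words = []
    for _ in range(n_words):
        # Step 3.i: draw a topic for this word position.
        topic = rng.choices(topic_names, weights=theta)[0]
        # Step 3.ii: draw a word from that topic's word distribution.
        vocabulary = list(topics[topic])
        weights = list(topics[topic].values())
        words.append(rng.choices(vocabulary, weights=weights)[0])
    return theta, words

# Made-up topic-word probabilities for the two topics in the example.
TOPICS = {
    "food": {"broccoli": 0.4, "apple": 0.3, "eating": 0.3},
    "cute animals": {"panda": 0.5, "baby": 0.5},
}

rng = random.Random(0)
theta, document = generate_document(TOPICS, n_words=5, alpha=1.0, rng=rng)
print(theta)
print(document)
```

Fitting LDA runs this story in reverse: given only the documents, it estimates the topic-word distributions and each document's topic mixture.
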
  10. [slide without text]

  11. K-MEANS CLUSTERING
      ➤ Unsupervised
      ➤ K groups
      ➤ minimise variability within each cluster
      ➤ maximise variability between clusters
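
K-means can be sketched in plain Python as the usual alternation between assigning points to the nearest centroid and recomputing the centroids. This is not the scikit-learn implementation used in the talk. The data points are hypothetical two-dimensional feature vectors, and the evenly spaced initialisation is a simplification chosen to keep the example deterministic; real implementations typically use random or k-means++ initialisation.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, n_iterations=100):
    # Simplified, deterministic initialisation: k points evenly spaced
    # through the input list (an assumption made for this sketch).
    centroids = [points[i * len(points) // k] for i in range(k)]
    labels = None
    for _ in range(n_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = centroid(members)
    return labels, centroids

# Hypothetical feature vectors forming two well-separated groups.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.3)]
labels, centroids = k_means(points, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Each update step lowers the within-cluster sum of squared distances, which is the "minimise variability within each cluster" criterion on the slide.
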
  12. Complex plot (twists)
      Mistaken identities
      Language (puns, creative insults)
      Love
      Happy ending
      Noble hero with a tragic flaw that leads to a tragic fall
      Supernatural element
      Death
  13. PRE-PROCESSING AND ANALYSIS
      nltk
      lda + scikit-learn

  14. [Figure; surviving label: play]

  15. [Figure; surviving labels: common words, death, love, hero]

  16. TOPIC AVERAGES WITHIN GROUPS
      [Figure; surviving labels: death, common, love, hero]

  17. K-MEANS GROUPING VS TRADITIONAL CLASSIFICATION
      Group 0: Twelfth Night, The Merchant of Venice, Love’s Labour’s Lost, Much Ado About Nothing, Taming of the Shrew, As You Like It, Merry Wives of Windsor, Midsummer Night’s Dream, Romeo and Juliet, Comedy of Errors, Two Gentlemen of Verona
      Group 1: Titus Andronicus, All’s Well That Ends Well, Macbeth, Hamlet, Antony and Cleopatra, King Lear, Julius Caesar, Tempest, Winter’s Tale, Timon of Athens, Coriolanus, Troilus and Cressida, Measure for Measure, Cymbeline, Othello, Pericles Prince of Tyre
  18. YEARS THE PLAYS WERE PERFORMED FOR THE FIRST TIME

  19. WRAP UP
      ➤ Can’t find comedies VS tragedies
      ➤ Can use NLP for literary analysis
      ➤ Let the data tell their story
  20. code: github.com/sereprz/ShakespeareTextAnalysis
      THANKS FOR LISTENING
      QUESTIONS?