Data driven literary analysis: an unsupervised approach to text analysis and classification

Unsupervised document classification addresses the problem of assigning categories to documents without a training set or predefined categories. This is useful for improving information retrieval, on the basic assumption that documents with similar content are relevant to the same queries. A similar assumption is made in literary studies to define genres and sub-genres: works that share specific conventions of form and content are assigned to the same genre.

The talk gives an overview of document clustering and its challenges, with a focus on dimensionality reduction and how to achieve it with topic modelling techniques such as LDA (Latent Dirichlet Allocation). Using Shakespeare’s body of work as a case study, the talk describes how to use nltk, sklearn and gensim to process and analyse theatrical works, with the final goal of testing whether document clustering yields the same classification given by literature experts.

Deck as presented at PyData Amsterdam 2016

Serena Peruzzo

March 14, 2016

  1. DATA DRIVEN LITERARY ANALYSIS: AN UNSUPERVISED APPROACH TO TEXT ANALYSIS AND CLASSIFICATION
    Serena Peruzzo, PhD candidate at TU/e
    @sereprz, [email protected], github.com/sereprz
  2. WHY AND WHAT?
    ➤ Natural Language Processing (NLP)
    ➤ interaction between natural and artificial languages
    ➤ e.g., machine translators, spam filters
    CAN NLP IDENTIFY DIFFERENT GENRES?
  3. FEATURE EXTRACTION
    ➤ a lot of information needs to be compressed and represented in simple data types
    term frequency, inverse document frequency (tf-idf):
    tfidf(‘love’, ‘Romeo and Juliet’, ‘Shakespeare’s plays’) = 100 * ln(28/25) = 11.33
    tfidf(‘Juliet’, ‘Romeo and Juliet’, ‘Shakespeare’s plays’) = 100 * ln(28/1) = 333.22
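
The two tf-idf values on this slide can be reproduced with a few lines of Python. This is a minimal sketch, not the talk's own code. It assumes the factor 100 is the raw count of the term in the play, that the corpus holds 28 plays, and that 'love' appears in 25 of them and 'Juliet' in 1. The function name and its parameters are hypothetical.

```python
import math


def tfidf(term_count, n_docs, doc_freq):
    """tf-idf with a raw term count and a natural-log inverse document frequency.

    term_count: occurrences of the term in the document (assumed to be 100 here)
    n_docs:     number of documents in the corpus (28 plays on the slide)
    doc_freq:   number of documents that contain the term
    """
    return term_count * math.log(n_docs / doc_freq)


# 'love' occurs in 25 of the 28 plays, so it says little about any one play.
love = tfidf(term_count=100, n_docs=28, doc_freq=25)

# 'Juliet' occurs in a single play, so it is highly specific to that play.
juliet = tfidf(term_count=100, n_docs=28, doc_freq=1)

print(round(love, 2))    # 11.33
print(round(juliet, 2))  # 333.22
```

With the same term count, the rarer term gets the larger weight, which is why tf-idf favours words that distinguish one play from the rest.
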
  4. LATENT DIRICHLET ALLOCATION
    ➤ N documents
    ➤ K probability distributions over a collection of words (topics)
    ➤ Formal statistical relationship
    ➤ bag-of-words assumption
  5. LDA - GENERATIVE MODEL
    ➤ For each document:
      1. Select the number of words
      2. Draw a distribution of topics
      3. For each word in the document:
        i. Draw a specific topic
        ii. Draw a word from a multinomial probability conditioned on the topic
  6. LDA - EXAMPLE
    ➤ d is a 5-word document
    ➤ Decide d will be 1/2 about cute animals and 1/2 about food
    ➤ topic: food, word: ‘broccoli’
    ➤ topic: cute animals, word: ‘panda’
    ➤ topic: cute animals, word: ‘baby’
    ➤ topic: food, word: ‘apple’
    ➤ topic: food, word: ‘eating’
    ➤ d = {broccoli, panda, baby, apple, eating}
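
The generative steps and the example above can be simulated with the standard library alone. This is an illustrative sketch, not the talk's implementation. The topic vocabularies and their probabilities are made up (including the word 'kitten', which is not on the slide). The document length is fixed at 5 rather than drawn at random. The topic mixture is drawn from a symmetric Dirichlet instead of being set to exactly 1/2 and 1/2.

```python
import random

# Hypothetical topics: each topic is a probability distribution over words.
TOPICS = {
    "food": {"broccoli": 0.4, "apple": 0.3, "eating": 0.3},
    "cute animals": {"panda": 0.5, "baby": 0.3, "kitten": 0.2},
}


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalising independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, n_words, rng):
    """Generate one document as a list of (topic, word) pairs."""
    names = list(topics)
    # Step 2: draw this document's distribution over topics.
    theta = sample_dirichlet(alpha, rng)
    document = []
    # Step 3: for each word, draw a topic, then draw a word given that topic.
    for _ in range(n_words):
        topic = rng.choices(names, weights=theta, k=1)[0]
        vocabulary = list(topics[topic])
        weights = list(topics[topic].values())
        word = rng.choices(vocabulary, weights=weights, k=1)[0]
        document.append((topic, word))
    return theta, document


rng = random.Random(0)
# Step 1 (the number of words) is fixed to 5 here, as in the slide's example.
theta, document = generate_document(TOPICS, alpha=[1.0, 1.0], n_words=5, rng=rng)
print("topic mixture:", theta)
print("document:", [word for _, word in document])
```

Fitting LDA runs this process in reverse: given only the words of each document, it estimates the topic mixtures and the per-topic word distributions that most plausibly generated them.
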
  7. [slide without transcribed text]

  8. K-MEANS CLUSTERING
    ➤ Unsupervised
    ➤ K groups
    ➤ minimise variability within each cluster
    ➤ maximise variability between clusters
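
The idea of minimising within-cluster variability can be illustrated with a from-scratch version of Lloyd's algorithm, the usual k-means procedure. This is a sketch only; the talk's abstract mentions sklearn, whose implementation is far more robust. The toy 2-D points are hypothetical stand-ins for document vectors such as per-document topic proportions, and all function names are made up for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise with k distinct data points
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = mean(members)
    return labels, centroids


# Two well-separated toy groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Each pass can only lower the total squared distance from points to their centroids, so the procedure converges, although possibly to a local optimum that depends on the initial centroids.
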
  9. Complex plot (twists); mistaken identities; language (puns, creative insults); love; happy ending

    Noble hero with a tragic flaw that leads to a tragic fall; supernatural element; death

  10. Twelfth Night, The Merchant of Venice, Love’s Labour’s Lost, Much Ado About Nothing, Taming of the Shrew, As You Like It, Merry Wives of Windsor, Midsummer Night’s Dream, Romeo and Juliet, Comedy of Errors, Two Gentlemen of Verona

    Titus Andronicus, All’s Well That Ends Well, Macbeth, Hamlet, Antony and Cleopatra, King Lear, Julius Caesar, Tempest, Winter’s Tale, Timon of Athens, Coriolanus, Troilus and Cressida, Measure for Measure, Cymbeline, Othello, Pericles, Prince of Tyre
  11. WRAP UP
    ➤ Can’t find comedies vs tragedies
    ➤ Can use NLP for literary analysis
    ➤ Let the data tell their story