Slide 1

Slide 1 text

DATA DRIVEN LITERARY ANALYSIS: AN UNSUPERVISED APPROACH TO TEXT ANALYSIS AND CLASSIFICATION
Serena Peruzzo
PhD candidate at TU/e
@sereprz
github.com/sereprz

Slide 2

Slide 2 text

WHY AND WHAT?
➤ Natural Language Processing (NLP)
➤ interaction between natural and artificial languages
➤ e.g., machine translators, spam filters
CAN NLP IDENTIFY DIFFERENT GENRES?

Slide 3

Slide 3 text

SHAKESPEARE ANALYSIS
18 comedies
10 tragedies
11000+ words
Two-stage unsupervised approach
Trial and error

Slide 4

Slide 4 text

SUPERVISED DOCUMENT CLASSIFICATION

Slide 5

Slide 5 text

UNSUPERVISED APPROACH

Slide 6

Slide 6 text

FEATURE EXTRACTION
➤ a lot of information needs to be compressed and represented in simple data types
➤ tfidf = term frequency × inverse document frequency
tfidf(‘love’, ‘Romeo and Juliet’, ‘Shakespeare’s plays’) = 100 * ln(28/25) = 11.33
tfidf(‘Juliet’, ‘Romeo and Juliet’, ‘Shakespeare’s plays’) = 100 * ln(28/1) = 333.22
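The two figures on this slide can be reproduced with a few lines of Python. This is a minimal sketch, assuming the form used on the slide (raw term count multiplied by the natural log of total documents over documents containing the term). The function name `tfidf` and its parameter names are illustrative, and the count of 100 occurrences is taken from the slide as given.

```python
import math

def tfidf(term_count, n_docs, docs_with_term):
    """Raw term count times ln(N / df), the form shown on the slide."""
    return term_count * math.log(n_docs / docs_with_term)

# 28 plays in total; 'love' appears in 25 of them, 'Juliet' in only 1.
love = tfidf(term_count=100, n_docs=28, docs_with_term=25)
juliet = tfidf(term_count=100, n_docs=28, docs_with_term=1)

print(round(love, 2))    # 11.33
print(round(juliet, 2))  # 333.22
```

A word that occurs in almost every play gets a low weight and a word specific to one play gets a high one. Library implementations often use a smoothed variant of the idf term, so their numbers may differ from these.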

Slide 7

Slide 7 text

LATENT DIRICHLET ALLOCATION
➤ N documents
➤ K probability distributions over a collection of words (topics)
➤ Formal statistical relationship
➤ bag-of-words assumption
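The bag-of-words assumption means that only word counts matter, not word order. A small sketch of that idea, assuming a very simple regex tokeniser (the helper `bag_of_words` is illustrative, not the pre-processing used in the talk):

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count them."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

original = bag_of_words("To be, or not to be")
shuffled = bag_of_words("Be not, or to be to")

# Same words in a different order give exactly the same representation.
print(original == shuffled)
```

Under this assumption the two sentences are indistinguishable to the model, which is what makes the representation simple enough to fit topics over whole plays.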

Slide 8

Slide 8 text

LDA - GENERATIVE MODEL
➤ For each document:
  1. Select the number of words
  2. Draw a distribution of topics
  3. For each word in the document:
     i. Draw a specific topic
     ii. Draw a word from a multinomial probability conditioned on the topic

Slide 9

Slide 9 text

LDA - EXAMPLE
➤ d is a 5-word document
➤ Decide d will be 1/2 about cute animals and 1/2 about food
➤ topic: food, word: ‘broccoli’
➤ topic: cute animals, word: ‘panda’
➤ topic: cute animals, word: ‘baby’
➤ topic: food, word: ‘apple’
➤ topic: food, word: ‘eating’
➤ d = {broccoli, panda, baby, apple, eating}
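The generative steps and the toy example can be sketched in plain Python. This is an illustration under stated assumptions, not the code from the talk: the word probabilities in `TOPICS` are made up, the word 'kitten' is added only to fill out the vocabulary, the document length is passed in rather than drawn, and the topic mixture is drawn from a symmetric Dirichlet instead of being fixed at 1/2 and 1/2.

```python
import random

# Hypothetical toy topics: each maps words to assumed probabilities.
TOPICS = {
    "food": {"broccoli": 0.4, "apple": 0.3, "eating": 0.3},
    "cute animals": {"panda": 0.5, "baby": 0.3, "kitten": 0.2},
}

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, topics, alpha, rng):
    """Follow the slide: draw a topic mixture, then a topic and a word per slot."""
    names = list(topics)
    theta = sample_dirichlet(alpha, rng)  # step 2: distribution of topics
    document = []
    for _ in range(n_words):              # step 3: for each word
        topic = rng.choices(names, weights=theta)[0]                    # 3.i
        vocab = topics[topic]
        word = rng.choices(list(vocab), weights=list(vocab.values()))[0]  # 3.ii
        document.append((topic, word))
    return theta, document

rng = random.Random(0)
theta, document = generate_document(5, TOPICS, alpha=[1.0, 1.0], rng=rng)
print(theta)
print(document)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topics and the per-document mixtures that most plausibly generated them.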

Slide 10

Slide 10 text


Slide 11

Slide 11 text

K-MEANS CLUSTERING
➤ Unsupervised
➤ K groups
➤ minimise variability within each cluster
➤ maximise variability between clusters
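A bare-bones version of the algorithm (Lloyd's iteration) shows how the grouping comes about. This is a pure-Python stand-in for illustration only; the talk lists scikit-learn for the analysis. The data points are hypothetical two-topic proportions, not values from the plays, and `kmeans` and `sq_dist` are illustrative names.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, n_iter=100, seed=0):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                  for p in points]
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return labels, centroids

# Hypothetical topic proportions for six documents (assumed values).
points = [(0.9, 0.1), (0.8, 0.2), (0.85, 0.15),
          (0.1, 0.9), (0.2, 0.8), (0.15, 0.85)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

The group numbers themselves are arbitrary; what matters is which documents end up together.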

Slide 12

Slide 12 text

Complex plot (twists)
Mistaken identities
Language (puns, creative insults)
Love
Happy ending

Noble hero with a tragic flaw that leads to a tragic fall
Supernatural element
Death

Slide 13

Slide 13 text

PRE-PROCESSING AND ANALYSIS
nltk
lda + scikit-learn

Slide 14

Slide 14 text

[Figure; axis label: play]

Slide 15

Slide 15 text

[Figure; topic labels: common words, death, love, hero]

Slide 16

Slide 16 text

TOPIC AVERAGES WITHIN GROUPS
[Figure; topic labels: death, common, love, hero]

Slide 17

Slide 17 text

K-MEANS GROUPING VS TRADITIONAL CLASSIFICATION
Group 0: Twelfth Night, The Merchant of Venice, Love’s Labour’s Lost, Much Ado About Nothing, Taming of the Shrew, As You Like It, Merry Wives of Windsor, Midsummer Night’s Dream, Romeo and Juliet, Comedy of Errors, Two Gentlemen of Verona
Group 1: Titus Andronicus, All’s Well That Ends Well, Macbeth, Hamlet, Antony and Cleopatra, King Lear, Julius Caesar, Tempest, Winter’s Tale, Timon of Athens, Coriolanus, Troilus and Cressida, Measure for Measure, Cymbeline, Othello, Pericles, Prince of Tyre

Slide 18

Slide 18 text

YEARS THE PLAYS WERE PERFORMED FOR THE FIRST TIME

Slide 19

Slide 19 text

WRAP UP
➤ Can’t find comedies VS tragedies
➤ Can use NLP for literary analysis
➤ Let the data tell their story

Slide 20

Slide 20 text

code: github.com/sereprz/ShakespeareTextAnalysis
THANKS FOR LISTENING
QUESTIONS?