Slide 1

Slide 1 text

Keyword Extraction ... or why NLP matters!

Slide 2

Slide 2 text

Hi, I'm… Konstantin Tennhard, Ruby Developer at flinc

Slide 3

Slide 3 text

Hi, I'm… Konstantin Tennhard, Ruby Developer at flinc: Ruby enthusiast, Bartender, Computer Science student, Photographer, Mountain bike addict, Computational Linguist

Slide 4

Slide 4 text

Problem What are we talking about?

Slide 5

Slide 5 text

The Occupy Wall Street movement began in Zuccotti Park on a glorious mid-September Saturday and, so far, many of its larger marches have taken place in the warmth of New York's Indian summer. But winter has been looming, and on Saturday, just a couple days before Halloween, the protesters got a preview of what they're in for.

Slide 6

Slide 6 text

The Occupy Wall Street movement began in Zuccotti Park on a glorious mid-September Saturday and, so far, many of its larger marches have taken place in the warmth of New York's Indian summer. But winter has been looming, and on Saturday, just a couple days before Halloween, the protesters got a preview of what they're in for.

Slide 7

Slide 7 text

The Occupy Wall Street movement began in Zuccotti Park on a glorious mid-September Saturday and, so far, many of its larger marches have taken place in the warmth of New York's Indian summer. But winter has been looming, and on Saturday, just a couple days before Halloween, the protesters got a preview of what they're in for.

Slide 8

Slide 8 text

The Occupy Wall Street movement began in Zuccotti Park on a glorious mid-September Saturday and, so far, many of its larger marches have taken place in the warmth of New York's Indian summer. But winter has been looming, and on Saturday, just a couple days before Halloween, the protesters got a preview of what they're in for.

Slide 9

Slide 9 text

The Occupy Wall Street movement began in Zuccotti Park on a glorious mid-September Saturday and, so far, many of its larger marches have taken place in the warmth of New York's Indian summer. But winter has been looming, and on Saturday, just a couple days before Halloween, the protesters got a preview of what they're in for.

Slide 10

Slide 10 text

The Occupy Wall Street movement began in Zuccotti Park on a glorious mid-September Saturday and, so far, many of its larger marches have taken place in the warmth of New York's Indian summer. But winter has been looming, and on Saturday, just a couple days before Halloween, the protesters got a preview of what they're in for.

Slide 11

Slide 11 text

The Occupy Wall Street movement began in Zuccotti Park on a glorious mid-September Saturday and, so far, many of its larger marches have taken place in the warmth of New York's Indian summer. But winter has been looming, and on Saturday, just a couple days before Halloween, the protesters got a preview of what they're in for.

Slide 12

Slide 12 text

The Occupy Wall Street movement began in Zuccotti Park on a glorious mid-September Saturday and, so far, many of its larger marches have taken place in the warmth of New York's Indian summer. But winter has been looming, and on Saturday, just a couple days before Halloween, the protesters got a preview of what they're in for. Named Entity Adjective Noun

Slide 13

Slide 13 text

Cooccurrence Don't be anxious ... ... it's not that bad

Slide 14

Slide 14 text

... protesters got a preview ... Word window Cooccurrence

Slide 15

Slide 15 text

What do we want?

Slide 16

Slide 16 text

What do we want? Well, how about extracting nouns and adjectives that cooccur in word windows of a certain size? Wouldn't that be something.
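A minimal sketch of what "cooccurring in a word window" could mean in code. The method name `cooccurrences` and the default window size are assumptions made for illustration; they are not taken from the talk or from the keyword_extractor library.

```ruby
# Hypothetical helper (not part of any library): counts unordered word pairs
# whose positions lie within the same window of `size` consecutive tokens.
def cooccurrences(tokens, size = 2)
  counts = Hash.new(0)
  tokens.each_with_index do |word, i|
    # Pair the word with the (size - 1) tokens that follow it, so every
    # positional pair is counted exactly once.
    tokens[(i + 1)...(i + size)].to_a.each do |other|
      next if other == word
      counts[[word, other].sort] += 1
    end
  end
  counts
end

pairs = cooccurrences(%w[protesters got a preview], 3)
pairs.each { |(a, b), n| puts "#{a} <-> #{b}: #{n}" }
```

In the setting described above, the token list would first be reduced to nouns and adjectives, so that only those words end up as pairs.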

Slide 17

Slide 17 text

But ... ... let's talk about something else first.

Slide 18

Slide 18 text

Natural Language Processing ... and the sad state of Ruby libraries

Slide 19

Slide 19 text

NLP What is it?

Slide 20

Slide 20 text

NLP Language analysis Compute intensive tasks Large data sets Machine learning State-of-the-art Science

Slide 21

Slide 21 text

NLP tasks What is NLP good for?

Slide 22

Slide 22 text

NLP tasks: Keyword extraction, Opinion mining, Text summarization, Text classification, Machine Translation, Named Entity Recognition

Slide 23

Slide 23 text

Sounds fancy, doesn't it?

Slide 24

Slide 24 text

But ... ... before you can do the fancy stuff, you need to do a couple of other things first!

Slide 25

Slide 25 text

NLP Tasks: POS Tagging, Stemming, Chunking, Segmentation, Lemmatization, Tokenization. The cranky ones.

Slide 26

Slide 26 text

NLP Pipeline: Sentence splitting → Tokenization → POS tagging → Stemming / Lemmatization → Clean up → Fancy stuff
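The first two pipeline stages can be approximated with plain regular expressions. This is a rough sketch only: `split_sentences` and `tokenize` are hypothetical helpers, and real sentence splitters and tokenizers handle abbreviations, quotes, and many other cases that these patterns ignore.

```ruby
# Naive sentence splitter: breaks at whitespace that follows ., ! or ?.
def split_sentences(text)
  text.split(/(?<=[.!?])\s+/)
end

# Naive tokenizer: keeps runs of letters and digits, allowing inner
# apostrophes and hyphens ("they're", "mid-September").
def tokenize(sentence)
  sentence.scan(/[[:alnum:]]+(?:['-][[:alnum:]]+)*/)
end

text = "But winter has been looming. The protesters got a preview."
split_sentences(text).each do |sentence|
  p tokenize(sentence)
end
```

POS tagging and stemming need trained models or rule sets and are not attempted here.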

Slide 27

Slide 27 text

Ruby Libraries Well, there isn't really much available!

Slide 28

Slide 28 text

Keyword Extraction Right, that's what we are actually talking about here.

Slide 29

Slide 29 text

Solution How can we solve this problem?

Slide 30

Slide 30 text

Three Steps
1. Generate Cooccurrence Graph (Part-of-Speech Tagging, Stemming)
2. Apply Weighted PageRank
3. Extract Nodes with Highest Rank

Slide 31

Slide 31 text

The algorithm is known as TextRank and was published by Rada Mihalcea and Paul Tarau in 2004 in their paper “TextRank: Bringing Order into Texts”: http://acl.ldc.upenn.edu/acl2004/emnlp/pdf/Mihalcea.pdf

Slide 32

Slide 32 text

... protesters got a preview ... Word window Cooccurrence

Slide 33

Slide 33 text

... protesters/NN got/VB a/DET preview/NN ... POS Tags
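Once tokens carry tags in the `word/TAG` form shown here, keeping only nouns and adjectives is a simple filter. The sketch below assumes Penn-Treebank-style tags, where noun tags start with `NN` and adjective tags with `JJ`; the `JJ` tag and the `content_words` helper are assumptions and do not appear in the slides.

```ruby
# Assumption: noun tags start with "NN", adjective tags with "JJ".
KEPT_TAG_PREFIXES = %w[NN JJ].freeze

# Hypothetical helper: returns the words whose tag marks a noun or adjective.
def content_words(tagged_text)
  tagged_text.split
             .map { |token| token.rpartition("/") } # => [word, "/", tag]
             .select { |_word, _sep, tag| KEPT_TAG_PREFIXES.any? { |prefix| tag.start_with?(prefix) } }
             .map(&:first)
end

p content_words("protesters/NN got/VB a/DET preview/NN")
```

Tokens without a slash, such as the ellipses in the slide, have no matching tag and are dropped.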

Slide 34

Slide 34 text

Street Wall movement Zuccotti Park gloriou mid-Septemb Saturdai larger mani march place warmth New york Indian summer winter coupl dai Halloween protest preview Cooccurrence Graph

Slide 35

Slide 35 text

Weighted PageRank
$$WS(V_i) = (1 - d) + d \cdot \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)$$
We will use Lexicographer's Pointwise Mutual Information as the weighting function.
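The recurrence can be evaluated by simple fixed-point iteration. The following is a sketch under stated assumptions, not the keyword_extractor implementation: the graph is a nested hash of edge weights, the damping factor defaults to 0.85, a fixed iteration count stands in for a convergence check, and the weights in the example are made up rather than computed from a weighting function.

```ruby
# graph: { node => { neighbour => weight } }. For an undirected cooccurrence
# graph every edge is stored in both directions, so In(V) equals Out(V).
def weighted_pagerank(graph, damping: 0.85, iterations: 30)
  scores = graph.keys.to_h { |node| [node, 1.0] }
  # Denominator of the formula: total weight of the edges leaving each node.
  out_weight = graph.transform_values { |edges| edges.values.sum.to_f }

  iterations.times do
    scores = graph.keys.to_h do |vi|
      incoming = graph.sum(0.0) do |vj, edges|
        w_ji = edges.fetch(vi, 0.0)
        w_ji.zero? ? 0.0 : w_ji / out_weight[vj] * scores[vj]
      end
      [vi, (1 - damping) + damping * incoming]
    end
  end
  scores
end

# Illustrative weights only.
graph = {
  "protest" => { "preview" => 2.0, "winter" => 1.0 },
  "preview" => { "protest" => 2.0 },
  "winter"  => { "protest" => 1.0 }
}
scores = weighted_pagerank(graph)
p scores.max_by(2) { |_node, score| score }.map(&:first)
```

Picking the highest-scoring nodes with `max_by` corresponds to the final extraction step of the algorithm.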

Slide 36

Slide 36 text

Code Well, how do we really solve this? Show me some code!

Slide 37

Slide 37 text

Code
•Simple interface
•Poor performance - this is just an academic example!

    text = "..." # your text
    count = 5 # number of words to extract
    KeywordExtractor.extract_most_important_words(text, count)

Slide 38

Slide 38 text

Demo Yeah, it actually works!

Slide 39

Slide 39 text

Questions?

Slide 40

Slide 40 text

github.com/t6d/keyword_extractor Developed purely for academic purposes. This library is far from production-ready.

Slide 41

Slide 41 text

Thanks! We're hiring! @t6d github.com/t6d [email protected]