Keywordfinder: automatic keyword extraction from text

lvsh
October 02, 2015

As an Insight Data Science Fellow, I completed a 3-week project that involved building a keyword extraction algorithm. Given a block of text as input, my algorithm selects keywords that describe what the text is about.

Transcript

  1. A machine learning model for keyword extraction

    Pipeline: Content → Candidate keywords → Feature extraction → Keyword classifier → Keywords
  3. A machine learning model for keyword extraction

    Pipeline: Content → Candidate keywords → Feature extraction → Keyword classifier → Keywords
    Example term: Brooklyn
  4. A machine learning model for keyword extraction

    Pipeline: Content → Candidate keywords → Feature extraction → Keyword classifier → Keywords
    Example term: Brooklyn
    Features: term length, Wikipedia frequency, TF-IDF score, ...
  5. A machine learning model for keyword extraction

    Pipeline: Content → Candidate keywords → Feature extraction → Keyword classifier → Keywords
    Example term: Brooklyn
    Features: term length, Wikipedia frequency, TF-IDF score, ...
    Classifier: logistic regression
  6. A machine learning model for keyword extraction

    Training data: Crowd500 (500 news articles, human-annotated keywords), 9:1 training-test split
    Pipeline: Content → Candidate keywords → Feature extraction → Keyword classifier → Keywords
    Example term: Brooklyn
    Features: term length, Wikipedia frequency, TF-IDF score, ...
    Classifier: logistic regression
  8. A machine learning model for keyword extraction

    Pipeline: Content → Candidate keywords → Feature extraction → Keyword classifier → Keywords
    Example term: Brooklyn
    Features: term length, Wikipedia frequency, TF-IDF score, ...
    Classifier: logistic regression
    Output: P(keyword) = 0.81
  10. A machine learning model for keyword extraction

    Pipeline: Content → Candidate keywords → Feature extraction → Keyword classifier → Keywords
    Example term: Brooklyn
    Features: term length, Wikipedia frequency, TF-IDF score, ...
    Classifier: logistic regression
    Output: P(keyword) = 0.81
    Library: nltk
  11. Keyword classifier

    Example term: “Brooklyn”
    Features: term frequency, TF-IDF score, Wikipedia frequency, term length, capitalized?, position in page, spread in page, named entity?, noun phrase?, n-gram?
    Logistic regression model: in-sample 65%, out-of-sample 65%, chance 50%
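
The feature-extraction and classifier stages named in the slides can be sketched in plain Python. This is a minimal illustration, not the deck's actual code: the function names, the regex tokenizer, the smoothed IDF formula, the background counts (`doc_freq`, `n_docs`), and every weight are assumptions. It handles single-word candidates only and covers just a subset of the slide 11 features.

```python
import math
import re


def tokenize(text):
    """Split text into word tokens, keeping the original capitalization."""
    return re.findall(r"[A-Za-z0-9']+", text)


def extract_features(term, text, doc_freq, n_docs):
    """Compute a few of the slide 11 features for a single-word candidate."""
    tokens = tokenize(text)
    lowered = [t.lower() for t in tokens]
    positions = [i for i, t in enumerate(lowered) if t == term.lower()]
    if not positions:
        raise ValueError(f"{term!r} does not occur in the text")
    tf = len(positions) / len(tokens)
    # Smoothed inverse document frequency. doc_freq is a hypothetical count
    # of background documents (out of n_docs) that contain the term.
    idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1
    return {
        "term_frequency": tf,
        "tfidf": tf * idf,
        "term_length": len(term),
        "capitalized": float(any(tokens[i][0].isupper() for i in positions)),
        # Relative position of the first occurrence (0.0 = start of text).
        "position": positions[0] / len(tokens),
        # Relative distance between the first and last occurrence.
        "spread": (positions[-1] - positions[0]) / len(tokens),
    }


def keyword_probability(features, weights, bias):
    """Logistic regression scoring: sigmoid of the weighted feature sum."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


# Illustrative weights only; a trained model would learn these from
# labeled data such as the Crowd500 articles mentioned in the slides.
WEIGHTS = {
    "term_frequency": 2.0,
    "tfidf": 1.5,
    "term_length": 0.05,
    "capitalized": 0.8,
    "position": -0.5,
    "spread": 0.7,
}

text = "Brooklyn is a borough of New York City. Many artists live in Brooklyn."
feats = extract_features("Brooklyn", text, doc_freq=10, n_docs=1000)
p = keyword_probability(feats, WEIGHTS, bias=-1.0)
print(feats)
print(f"P(keyword) = {p:.2f}")
```

A candidate would then be kept as a keyword when its probability clears a chosen threshold, for example 0.5. The remaining slide 11 features (Wikipedia frequency, named entity, noun phrase, n-gram) need external resources such as a Wikipedia word count table or a tagger, so they are left out of this sketch.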