Introduction to Natural Language Processing for Social Media

ianozsvald
March 07, 2013

Brief internal (but public) presentation on the state of the art for NLP on social media for sentiment classification and entity recognition. No speaker notes (sorry, that's all in my head). Originally given at https://www.adaptivelab.com/

Transcript

  1. (A very brief) Introduction to Natural Language Processing for Social Media
    @IanOzsvald • AdaptiveLab.com • Feb 2013
  2. History
    • 1950s Russian/English translation
    • Chomsky transformational grammar
    • 1960s+ ELIZA, PARRY, SHRDLU
    • 1980s+ machine learning
    • Chomsky vs Norvig
    • 2010s “Deep Learning ANNs” (vision)
    • Google's 10^12 (million million) word corpus
  3. Uses of NLP
    • Search
    • Classification (gender/spam/stories)
    • Recommendation (Netflix/Amazon)
    • Translation
    • Named Entity Recognition
    • Sentiment Analysis
    • Fake review detection
    • Author/duplicate/plagiarism detection
  4. Parsing
    • Rules of grammar (Chomsky)
    • Hand written rules
    • “human friendly” interfaces + translation
    • Machine learned
    • <img via Wikipedia>
  5. Stemming/Lemmatization
    • Runs/runner/running -> run (dictionary defn)
    • Hand written rules to e.g. suffix strip
    • Snowball (“strippergram”) multi-lang
    • Lets us reduce sparsity
    • “godly” -> “godli”
    • Add Part of Speech tags -> lemmatizer
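
The "hand written rules to suffix strip" idea can be sketched in a few lines. This is a toy, not the Porter or Snowball algorithm: the `SUFFIXES` list, the `strip_suffix` name and the `min_stem` guard are all assumptions made for illustration.

```python
# Toy suffix stripper: hand-written rules only, NOT Porter or Snowball.
# SUFFIXES is a hypothetical rule list, checked in order (longest first).
SUFFIXES = ("ning", "ner", "ing", "s")


def strip_suffix(word, min_stem=3):
    """Remove the first matching suffix, keeping at least min_stem letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for w in ["runs", "runner", "running", "run"]:
    print(w, "->", strip_suffix(w))
```

All four forms collapse to "run", which is the sparsity reduction the slide mentions. The same rules also turn "dinner" into "din", so a rule list this crude over-stems quickly.
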
  6. Part of Speech tags
    • [('I', 'PRP'), ('use', 'VBP'), ('my', 'PRP$'), ('apple', 'NN'), ('iphone', 'NN')]
    • <demo in NLTK>
    • Were hand written, now machine learned
    • How about these?
      – “I like to ski”
      – “I like my ski”
      – “I like the taste of ski”
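
To see why the "ski" sentences are awkward, here is a sketch of the simplest possible tagger: one tag per word, looked up with no context. It is not the NLTK demo from the talk; the `LEXICON` table and `lookup_tag` function are hypothetical, using Penn Treebank style tags like the example above.

```python
# Toy lookup tagger: one tag per word, no context. LEXICON is hypothetical.
LEXICON = {"i": "PRP", "like": "VBP", "to": "TO", "my": "PRP$",
           "the": "DT", "taste": "NN", "of": "IN", "ski": "NN"}


def lookup_tag(sentence):
    """Tag each whitespace-separated token by dictionary lookup (default NN)."""
    return [(tok, LEXICON.get(tok.lower(), "NN")) for tok in sentence.split()]


for s in ["I like to ski", "I like my ski", "I like the taste of ski"]:
    print(lookup_tag(s))
```

Every "ski" comes back as a noun, although it reads as a verb in the first sentence and arguably as a brand name in the third. Getting those right needs the surrounding words, which is what a machine-learned tagger uses.
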
  7. Classifiers
    • Rule based
      – Decision Trees (boolean)
    • Statistical relationships
      – Naive Bayes
    • Supervised
      – spam, story types
    • Unsupervised
      – distance metric (sex)
    • Word senses
      – Citroën (brand), Car (n)
      – Coke (snort, make, drink)
      – Revolution (attend? watch? enjoy?)
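
As a rough illustration of the supervised, statistical case, the sketch below trains a word-count Naive Bayes classifier with add-one smoothing on a spam example. The training sentences, labels and function names are made up for this example; the slides do not say which implementation was used.

```python
import math
from collections import Counter, defaultdict


def train(labelled_docs):
    """Count words per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labelled_docs:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts


def classify(text, word_counts, label_counts):
    """Return the label with the highest log-probability (add-one smoothing)."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)  # class prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.lower().split():
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training data, for illustration only.
DOCS = [("win cash now", "spam"), ("cheap pills win prize", "spam"),
        ("meeting at noon", "ham"), ("lunch at noon tomorrow", "ham")]
model = train(DOCS)
print(classify("win a prize", *model))   # spam
print(classify("noon meeting", *model))  # ham
```

The "naive" part is that each word contributes independently, so word order and word sense (the Coke and Revolution examples) are invisible to it.
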
  8. Named Entity Recognition
    • Part of Speech + hand made rules
    • Requires labelled corpus for ML e.g. Wikipedia
    • Bag of words:
      – “apple eat i my”
      – “apple i iphone my use”
      – “buying apple iphone then eat”
      – boolean rule with single class output?
    • Tweets don't look like Wikipedia articles...
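
The bags of words above can be reproduced with a one-line function. This is a minimal sketch: `bag_of_words` is a hypothetical helper, and "I eat my apple" is a guess at the sentence behind the first bag (the second comes from the Part of Speech slide).

```python
def bag_of_words(sentence):
    """Lowercase, split on whitespace, drop duplicates and word order."""
    return " ".join(sorted(set(sentence.lower().split())))


print(bag_of_words("I eat my apple"))         # apple eat i my
print(bag_of_words("I use my apple iphone"))  # apple i iphone my use
```

Both bags contain "apple", but only the neighbouring "iphone" or "eat" hints at brand versus fruit, which is why a single boolean rule on the bag is a shaky entity detector.
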
  9. Sentiment Analysis
    • Mainly for reviews & news
    • Approx. 80% human agreement
    • Lexicons & referents “Bought Blackberry. Love it”
    • Negations “I don't love my Blackberry”
    • “complex plot” vs “complex instructions”
    • Sarcasm/humour, nuanced sentences
    • :-) labels :-(
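
A lexicon scorer with crude negation handling might look like the sketch below. The word lists are tiny and invented for illustration, and `score` is a hypothetical function, not a method from the talk.

```python
# Hypothetical mini-lexicon and negator list, for illustration only.
LEXICON = {"love": 1, "good": 1, "great": 1, "hate": -1, "bad": -1}
NEGATORS = {"not", "don't", "never", "no"}


def score(text):
    """Sum word polarities; a negator flips the next sentiment word found."""
    total, negate = 0, False
    for tok in text.lower().replace(".", " ").split():
        if tok in NEGATORS:
            negate = True
            continue
        polarity = LEXICON.get(tok, 0)
        if polarity:
            total += -polarity if negate else polarity
            negate = False
    return total


print(score("Bought Blackberry. Love it"))  # 1
print(score("I don't love my Blackberry"))  # -1
```

This handles the two Blackberry examples but nothing harder: it does not resolve what "it" refers to, it cannot tell a "complex plot" from "complex instructions", and sarcasm scores as sincere praise.
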
  10. Complications in Tweets
    • Short/compressed communication
    • Poor grammar (PoS harder!)
    • Poor spelling (sparse refs)
    • Localised context (jargon, current events)
    • Capitalisation (autocorrect!) weak clues:
      – “That awkward moment when you playin I luv dem strippers on your iPod and the whole class can hear it”
  11. Possibilities
    • Custom brand classifier
    • Rewrite poor English:
      – “luuuving iph5, battery gd, scratched it lol” ->
      – “Loving iPhone 5 and battery is good and I scratched it LOL”
    • Fix repeated letters, expand references, add capitals, remove URLs etc.
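
The clean-up steps listed above (repeated letters, references, capitals, URLs) can be sketched with a few regular expressions. The `EXPANSIONS` table, the `normalise` function and the example URL are assumptions for illustration; a real abbreviation table would be much larger.

```python
import re

# Hypothetical abbreviation table; a real one would be far larger.
EXPANSIONS = {"iph5": "iPhone 5", "gd": "good", "luving": "loving", "lol": "LOL"}


def normalise(tweet):
    """Strip URLs, squeeze letter runs, expand known abbreviations, capitalise."""
    tweet = re.sub(r"https?://\S+", "", tweet)  # remove URLs
    # Squeeze runs of 3+ identical letters to one: luuuving -> luving
    tweet = re.sub(r"([a-z])\1{2,}", r"\1", tweet, flags=re.IGNORECASE)
    words = re.findall(r"[\w']+|[^\w\s]", tweet)  # words and punctuation marks
    text = " ".join(EXPANSIONS.get(w.lower(), w) for w in words)
    text = re.sub(r"\s+([,.!?])", r"\1", text)  # no space before punctuation
    return text[:1].upper() + text[1:]


# The URL is a placeholder added to exercise the URL-removal step.
print(normalise("luuuving iph5, battery gd, scratched it lol http://example.com/x"))
# Loving iPhone 5, battery good, scratched it LOL
```

This gets only part of the way to the target sentence: it does not insert the missing "and", "is" or "I". Squeezing letter runs to a single letter is also lossy ("cooool" becomes "col"), so a dictionary check would be needed in practice.
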
  12. Other human languages?
    • How do we proof read?
    • What slang/localisations will be used in compressed tweets?
    • Are there other ways to detect brands/people and sentiment? (Chinese -> emoticons?)
    • >20 dialects of Arabic