Introduction to Natural Language Processing for Social Media

[email protected] @IanOzsvald
(A very brief) Introduction to Natural Language Processing for Social Media
AdaptiveLab.com, Feb 2013

History
• 1950s Russian/English translation
• Chomsky transformational grammar
• 1960s+ ELIZA, PARRY, SHRDLU
• 1980s+ machine learning
• Chomsky vs Norvig
• 2010s “Deep Learning ANNs” (vision)
• Google's 10^12 (million million) word corpus

Uses of NLP
• Search
• Classification (gender/spam/stories)
• Recommendation (Netflix/Amazon)
• Translation
• Named Entity Recognition
• Sentiment Analysis
• Fake review detection
• Author/duplicate/plagiarism detection

Parsing
• Rules of grammar (Chomsky)
• Hand written rules
• “human friendly” interfaces + translation
• Machine learned
• <img via Wikipedia>

Stemming/Lemmatization
• Runs/runner/running -> run (dictionary defn)
• Hand written rules to e.g. suffix strip
• Snowball (“strippergram”) multi-lang
• Lets us reduce sparsity
• “godly” -> “godli”
• Add Part of Speech tags -> lemmatizer
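
The suffix-stripping idea can be sketched in a few lines of plain Python. This is a toy illustration only: the rule table below is an assumption made up for this example and is not the Porter or Snowball rule set, which are far larger and more careful.

```python
# Toy suffix-stripping stemmer: hand-written rules, tried in order.
# The rule table is illustrative only; it is NOT the Porter/Snowball rule set.
SUFFIX_RULES = [
    ("nning", "n"),  # running -> run (undo the doubled consonant)
    ("nner", "n"),   # runner -> run
    ("ing", ""),     # walking -> walk
    ("ly", "li"),    # godly -> godli (a stem need not be a dictionary word)
    ("s", ""),       # runs -> run
]


def stem(word):
    """Lowercase the word and apply the first matching suffix rule."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        # Keep at least two letters of stem so very short words survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: len(word) - len(suffix)] + replacement
    return word


words = ["Runs", "runner", "running", "run", "godly"]
stems = [stem(w) for w in words]
print(stems)
print(len(set(words)), "distinct forms ->", len(set(stems)), "distinct stems")
```

Mapping several surface forms onto one stem is what reduces sparsity: a downstream model sees one feature instead of four.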

Part of Speech tags
• [('I', 'PRP'), ('use', 'VBP'), ('my', 'PRP$'), ('apple', 'NN'), ('iphone', 'NN')]
• <demo in NLTK>
• Were hand written, now machine learned
• How about these?
  – “I like to ski”
  – “I like my ski”
  – “I like the taste of ski”
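
To see why hand-written tagging rules run out quickly, here is a minimal sketch that guesses a tag for the word “ski” from the previous word alone. The rule set and the `tag_ski` helper are hypothetical, invented for this illustration; they are not how NLTK tags text.

```python
# Hand-written contextual rules for one ambiguous word ("ski").
# Tag names follow the Penn Treebank style shown in the slide (VB, NN).
# The rules and word list are hypothetical, for illustration only.
NOUN_MARKERS = {"my", "your", "the", "a"}


def tag_ski(sentence):
    """Guess a tag for 'ski' using only the word that precedes it."""
    tokens = sentence.lower().split()
    i = tokens.index("ski")
    previous = tokens[i - 1] if i > 0 else ""
    if previous == "to":
        return "VB"       # "to ski": infinitive verb
    if previous in NOUN_MARKERS:
        return "NN"       # "my ski": common noun
    return "UNKNOWN"      # "taste of ski": the rules have nothing to say


for sentence in ["I like to ski", "I like my ski", "I like the taste of ski"]:
    print(sentence, "->", tag_ski(sentence))
```

The third sentence falls through every rule, which is the kind of gap that pushed taggers from hand-written rules towards models learned from labelled text.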

Classifiers
• Rule based – Decision Trees (boolean)
• Statistical relationships – Naive Bayes
• Supervised – spam, story types
• Unsupervised – distance metric (sex)
• Word senses
  – Citroën (brand), Car (n)
  – Coke (snort, make, drink)
  – Revolution (attend? watch? enjoy?)
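
As a rough sketch of the statistical, supervised case, the following implements a small multinomial Naive Bayes spam classifier with add-one smoothing using only the standard library. The four training messages and the labels "spam" and "ham" are made up for this example; a real classifier would need far more labelled data.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count documents per label and word occurrences per label."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return label_counts, word_counts


def classify(text, label_counts, word_counts):
    """Return the label with the highest log posterior (add-one smoothing)."""
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            count = word_counts[label][word] + 1  # add-one smoothing
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training data, for illustration only.
TRAINING = [
    ("win a free iphone now", "spam"),
    ("free prize click now", "spam"),
    ("lunch at noon tomorrow", "ham"),
    ("see you at the meeting tomorrow", "ham"),
]
model = train(TRAINING)
print(classify("free iphone prize", *model))
print(classify("meeting at noon", *model))
```

The “naive” part is the assumption that words are independent given the label, which is why the per-word log probabilities are simply added.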

Named Entity Recognition
• Part of Speech + hand made rules
• Requires labelled corpus for ML e.g. Wikipedia
• Bag of words:
  – “apple eat i my”
  – “apple i iphone my use”
  – “buying apple iphone then eat”
  – boolean rule with single class output?
• Tweets don't look like Wikipedia articles...
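
The bags of words above can be reproduced with a short sketch: lowercase the text, split it, and keep the sorted unique tokens. The input sentences and the `mentions_apple_brand` rule are assumptions invented for this example, chosen so that the first two bags match the ones listed.

```python
def bag_of_words(text):
    """Lowercase, split on whitespace, and return the sorted unique tokens."""
    return sorted(set(text.lower().split()))


def mentions_apple_brand(bag):
    """Hypothetical boolean rule: 'apple' alongside 'iphone' means the brand."""
    return "apple" in bag and "iphone" in bag


for sentence in ["I eat my apple", "I use my apple iphone"]:
    bag = bag_of_words(sentence)
    print(" ".join(bag), "->", mentions_apple_brand(bag))

# The third bag contains both a product word and "eat", yet the rule can
# only answer True or False.
tricky = "buying apple iphone then eat".split()
print(" ".join(tricky), "->", mentions_apple_brand(tricky))
```

Word order is discarded, so a single boolean rule with one class output cannot express that the third bag hints at both the brand and the fruit.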

Sentiment Analysis
• Mainly for reviews & news
• Approx. 80% human agreement
• Lexicons & referents: “Bought Blackberry. Love it”
• Negations: “I don't love my Blackberry”
• “complex plot” vs “complex instructions”
• Sarcasm/humour, nuanced sentences
• :-) labels :-(
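
A lexicon-plus-negation approach can be sketched as below. The tiny lexicon and the rule that a negator flips only the very next word are simplifying assumptions for this example; they will not cope with sarcasm, referents, or context-dependent words such as “complex”.

```python
import re

# Tiny illustrative lexicon; real lexicons hold thousands of scored words.
LEXICON = {"love": 1, "good": 1, "great": 1, "hate": -1, "bad": -1, "awful": -1}
NEGATORS = {"not", "don't", "never", "no"}


def sentiment_score(text):
    """Sum lexicon scores, flipping the sign of a word that follows a negator."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        value = LEXICON.get(token, 0)
        if i > 0 and tokens[i - 1] in NEGATORS:
            value = -value  # "don't love" counts as negative
        score += value
    return score


print(sentiment_score("Bought Blackberry. Love it"))
print(sentiment_score("I don't love my Blackberry"))
```

A positive total suggests positive sentiment and a negative total the opposite; a score of zero means the lexicon found nothing to go on.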

Complications in Tweets
• Short/compressed communication
• Poor grammar (PoS harder!)
• Poor spelling (sparse refs)
• Localised context (jargon, current events)
• Capitalisation (autocorrect!) weak clues:
  – “That awkward moment when you playin I luv dem strippers on your iPod and the whole class can hear it”

Possibilities
• Custom brand classifier
• Rewrite poor English:
  – “luuuving iph5, battery gd, scratched it lol” ->
  – “Loving iPhone 5 and battery is good and I scratched it LOL”
• Fix repeated letters, expand references, add capitals, remove URLs etc
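
Some of these clean-up steps can be sketched with regular expressions and a lookup table. The `EXPANSIONS` table and the example URL are hypothetical, and squeezing every run of three or more letters down to one is a crude assumption (it would turn “coool” into “col”); the sketch also does not insert missing words such as “and” or “I”.

```python
import re

# Hypothetical expansion table; a real system would need a much larger one.
EXPANSIONS = {"luving": "loving", "iph5": "iPhone 5", "gd": "good", "lol": "LOL"}


def normalise(tweet):
    """Strip URLs, squeeze letter runs, expand shorthand, capitalise the start."""
    tweet = re.sub(r"https?://\S+", "", tweet)      # remove URLs
    tweet = re.sub(r"([a-z])\1{2,}", r"\1", tweet)  # "luuuving" -> "luving"
    words = []
    for token in re.findall(r"\w+|[^\w\s]", tweet):
        words.append(EXPANSIONS.get(token.lower(), token))
    text = " ".join(words)
    text = re.sub(r"\s+([,.!?])", r"\1", text)      # no space before punctuation
    return text[:1].upper() + text[1:]


print(normalise("luuuving iph5, battery gd, scratched it lol http://example.com/x"))
```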

Other human languages?
• How do we proofread?
• What slang/localisations will be used in compressed tweets?
• Are there other ways to detect brands/people and sentiment? (Chinese -> emoticons?)
• >20 dialects of Arabic
