Introduction to Natural Language Processing for Social Media
Brief internal (but public) presentation on the state of the art for NLP on social media for sentiment classification and entity recognition. No speaker notes (sorry, that's all in my head). Originally given at https://www.adaptivelab.com/
written rules, e.g. to strip suffixes
• Snowball (“strippergram”), multi-language
• Lets us reduce sparsity
• “godly” -> “godli”
• Add Part of Speech tags -> lemmatizer
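A rough sketch of what hand written suffix rules look like and how they reduce sparsity. This is not the Snowball algorithm: the rule table, the `MIN_STEM_LEN` guard and the function names are invented for illustration, and a real stemmer has many more rules and conditions.

```python
# Toy suffix stripper: NOT Snowball/Porter, only an illustration of
# hand written suffix rules. The rule table is invented for this sketch.
SUFFIX_RULES = [
    ("ies", "i"),  # "ponies" -> "poni"
    ("ing", ""),   # "walking" -> "walk"
    ("ed", ""),    # "walked" -> "walk"
    ("s", ""),     # "walks" -> "walk"
    ("y", "i"),    # "godly" -> "godli"
]
MIN_STEM_LEN = 3  # assumed guard so short words such as "is" are left alone


def toy_stem(word: str) -> str:
    """Apply the first matching suffix rule (longer suffixes are listed first)."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: len(word) - len(suffix)] + replacement
    return word


def vocabulary_size(words) -> int:
    """Count distinct stems: fewer distinct stems means less sparsity."""
    return len({toy_stem(w) for w in words})


if __name__ == "__main__":
    print(toy_stem("godly"))
    print(vocabulary_size(["walk", "walks", "walked", "walking"]))
```

Four surface forms of "walk" collapse to a single stem here, which is the sparsity reduction the slide refers to. The stems need not be real words ("godli"), which is one reason to move to a lemmatizer once Part of Speech tags are available.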
• [('I', 'PRP'), ('use', 'VBP'), ('my', 'PRP$'), ('apple', 'NN'), ('iphone', 'NN')]
• <demo in NLTK>
• Were hand written, now machine learned
• How about these?
– “I like to ski”
– “I like my ski”
– “I like the taste of ski”
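To make the "ski" examples concrete, here is a minimal sketch of a hand written tagger: a small lexicon plus rules keyed on the previous tag. The lexicon, the rules and the `??` placeholder are all assumptions made for this sketch; it is not how NLTK's tagger works.

```python
# Hand written lexicon and context rules, invented for this sketch only.
# Tag names follow the Penn Treebank style used on the slide.
LEXICON = {
    "i": "PRP", "like": "VBP", "to": "TO", "my": "PRP$",
    "the": "DT", "of": "IN", "taste": "NN",
}


def toy_tag(sentence: str):
    """Tag each token by lexicon lookup, falling back to previous-tag rules."""
    tagged = []
    prev_tag = None
    for token in sentence.split():
        tag = LEXICON.get(token.lower())
        if tag is None:
            if prev_tag == "TO":
                tag = "VB"   # "to ski": verb after "to"
            elif prev_tag in ("PRP$", "DT"):
                tag = "NN"   # "my ski": noun after a possessive or determiner
            else:
                tag = "??"   # no rule fires, so the tagger cannot decide
        tagged.append((token, tag))
        prev_tag = tag
    return tagged


if __name__ == "__main__":
    for s in ("I like to ski", "I like my ski", "I like the taste of ski"):
        print(toy_tag(s))
```

The first two sentences are resolved by the neighbouring word, while the third falls through every rule. Covering cases like that by hand means writing ever more rules, which is part of why taggers moved to being machine learned.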
of Speech + hand made rules
• Requires labelled corpus for ML, e.g. Wikipedia
• Bag of words:
– “apple eat i my”
– “apple i iphone my use”
– “buying apple iphone then eat”
– boolean rule with single class output?
• Tweets don't look like Wikipedia articles...
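The bag of words idea and the "boolean rule with single class output" question can be sketched as below. The class labels (`brand`, `fruit`, `unknown`), the rule itself and the input sentence "I eat my apple" are assumptions for illustration; the slide gives only the bags.

```python
def bag_of_words(text: str) -> str:
    """Lower-case, split on whitespace and sort: word order is thrown away."""
    return " ".join(sorted(text.lower().split()))


def classify(text: str) -> str:
    """Hypothetical boolean rule that must return exactly one class."""
    words = set(text.lower().split())
    if "apple" in words and "iphone" in words:
        return "brand"
    if "apple" in words and "eat" in words:
        return "fruit"
    return "unknown"


if __name__ == "__main__":
    print(bag_of_words("I use my apple iphone"))
    print(bag_of_words("I eat my apple"))
    print(classify("buying apple iphone then eat"))
```

The third example contains both "iphone" and "eat", so both conditions hold and the answer depends only on which rule is checked first. That is the weakness of forcing a single class out of a boolean rule.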
communication
• Poor grammar (PoS harder!)
• Poor spelling (sparse refs)
• Localised context (jargon, current events)
• Capitalisation (autocorrect!) gives only weak clues:
– “That awkward moment when you playin I luv dem strippers on your iPod and the whole class can hear it”
poor English:
– “luuuving iph5, battery gd, scratched it lol” ->
– “Loving iPhone 5 and battery is good and I scratched it LOL”
• Fix repeated letters, expand references, add capitals, remove URLs etc.
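A minimal sketch of the clean-up steps listed above, using only the standard library. The `EXPANSIONS` table and the example URL are made up for this sketch, and the code does not attempt the grammar repair shown in the target sentence (inserting "and", "is", "I").

```python
import re

# Hypothetical expansion table, invented for this sketch; a real one would be
# far larger and specific to the domain being monitored.
EXPANSIONS = {"iph5": "iPhone 5", "gd": "good", "luving": "loving", "lol": "LOL"}

URL_RE = re.compile(r"https?://\S+")
# Three or more of the same letter in a row, e.g. "uuu". Collapsing to a
# single letter is a crude choice: it would also turn "coool" into "col".
REPEAT_RE = re.compile(r"([A-Za-z])\1{2,}")


def normalise(tweet: str) -> str:
    text = URL_RE.sub("", tweet)           # remove URLs
    text = REPEAT_RE.sub(r"\1", text)      # "luuuving" -> "luving"
    out = []
    for token in text.split():
        core = token.rstrip(",.!?")        # keep trailing punctuation aside
        tail = token[len(core):]
        out.append(EXPANSIONS.get(core.lower(), core) + tail)
    result = " ".join(out)
    return result[:1].upper() + result[1:]  # add a leading capital


if __name__ == "__main__":
    print(normalise("luuuving iph5, battery gd, scratched it lol http://t.co/x"))
```

The order of the steps matters: repeated letters are collapsed before the lookup so that the table only needs one spelling per word.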
do we proofread?
• What slang/localisations will be used in compressed tweets?
• Are there other ways to detect brands/people and sentiment? (Chinese -> emoticons?)
• >20 dialects of Arabic