A Crash Course in Natural Language Processing
Oliver Mason, Phrasys Ltd
http://phrasys.net
@ojmason
Basic introduction given at the BrightonSEO conference, September 2013

1980s: Statistical NLP
Resurgence of interest in NLP, as statistical approaches turn out to be quite successful in getting things done. For example, Jelinek’s work in speech recognition at IBM.

Rules vs Stats
Rules:
• hand-crafted
• accurate
• small coverage
• brittle
• language-specific
Stats:
• learned from data
• accurate enough
• broad coverage
• robust
• language-independent

Tokenisation
• Splits a stream of characters into tokens
• Deals with words and punctuation
• Problems: what is a word?
• don’t / cannot / gonna / ...
• hyphenation, multi-word units
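
A minimal sketch of a regex-based tokeniser may make the "what is a word?" problem concrete. The pattern and the function name `tokenise` are assumptions for illustration, not something from the talk. This version keeps "don't" and hyphenated words as single tokens, which is only one of several defensible choices.

```python
import re

# Hypothetical pattern: a run of word characters, optionally continued by
# an apostrophe or hyphen plus more word characters; otherwise a single
# punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:[-'’]\w+)*|[^\w\s]")

def tokenise(text):
    """Split a string into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

print(tokenise("I don't think state-of-the-art tools cannot cope, gonna try..."))
print(tokenise("New York"))
```

Note what the sketch does not decide: whether "don't" should be "do" + "n't", whether "cannot" is one word or two, whether "gonna" is "going to", and that the multi-word unit "New York" comes out as two separate tokens.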

Named Entities
• Look up names in a gazetteer
• people, places, organisations, ...
• Identify referential expressions
• Problems: opaque references
• The minister said that ...
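
Gazetteer lookup can be sketched roughly as follows. The gazetteer entries, the labels and the function name `find_entities` are invented for illustration. The sketch tries the longest candidate first so that multi-word names are found as one unit.

```python
import re

# Toy gazetteer; entries and labels are made up for this example.
GAZETTEER = {
    "oliver mason": "PERSON",
    "brighton": "PLACE",
    "ibm": "ORGANISATION",
    "sheffield university": "ORGANISATION",
}
MAX_ENTRY_LENGTH = max(len(name.split()) for name in GAZETTEER)

def find_entities(text):
    """Return (name, label) pairs for gazetteer entries found in the text."""
    words = re.findall(r"\w+", text)
    entities = []
    i = 0
    while i < len(words):
        # Try the longest possible candidate at this position first.
        for n in range(min(MAX_ENTRY_LENGTH, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            label = GAZETTEER.get(candidate.lower())
            if label:
                entities.append((candidate, label))
                i += n
                break
        else:
            # No entry starts at this word; move on.
            i += 1
    return entities

print(find_entities("Oliver Mason spoke in Brighton about IBM."))
print(find_entities("The minister said that taxes will rise."))
```

The second call returns an empty list: "The minister" refers to a specific person, but no lookup table can say which one. That is the opaque reference problem.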

Machine Translation
Problem: training data (translated texts) is contaminated by people putting automatically translated texts on the web, where Google picks them up and feeds them back into the system.
MT is quite successful, but still needs human post-editing, as its output is not of publication quality.

Sentiment Analysis
Sentiment analysis (SA) assigns positive or negative assessments to statements: language expresses not just facts, but also values.
Problems: most current approaches are based on single words and have difficulties with context, e.g. “This phone is not good” would be seen as positive due to “good”.
Also: the computer cannot do sarcasm. Or irony. “Phone X is brilliant when all you want to do is use it as a paperweight.”
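
The single-word problem is easy to reproduce. The sketch below uses a tiny made-up lexicon; the names `LEXICON`, `NEGATORS`, `word_score` and `negation_aware_score` are assumptions for illustration. The second function flips a word's polarity when the preceding word is a negator, which is a common quick fix and not a full treatment of context.

```python
import re

# Tiny hypothetical sentiment lexicon: +1 positive, -1 negative.
LEXICON = {"good": 1, "brilliant": 1, "great": 1,
           "bad": -1, "awful": -1, "poor": -1}
NEGATORS = {"not", "never", "no"}

def word_score(text):
    """Sum the polarity of each word, ignoring all context."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(word, 0) for word in words)

def negation_aware_score(text):
    """As word_score, but flip a word's polarity if a negator precedes it."""
    words = re.findall(r"[a-z']+", text.lower())
    total = 0
    for i, word in enumerate(words):
        value = LEXICON.get(word, 0)
        if i > 0 and words[i - 1] in NEGATORS:
            value = -value
        total += value
    return total

sarcastic = ("Phone X is brilliant when all you want to do "
             "is use it as a paperweight.")
print(word_score("This phone is not good"))            # positive
print(negation_aware_score("This phone is not good"))  # negative
print(negation_aware_score(sarcastic))                 # still positive
```

Handling negation fixes the first example, but the sarcastic sentence still scores as positive, because nothing in the individual words signals the intended meaning.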

Information Extraction
In information retrieval you want to find documents to read; here you just want to find the relevant information itself, such as cinema opening times.
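
As a rough illustration of the opening-times example, a pattern-based extractor might look like the sketch below. The pattern, the sample sentence and the function name `extract_opening_times` are all assumptions; real pages vary far more than a single regular expression can cover.

```python
import re

# Hypothetical pattern: a weekday name, then (after any non-digit text)
# a time range such as "10:00-23:30" or "09.30 to 01:00".
OPENING_PATTERN = re.compile(
    r"\b(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)\b"
    r"\D*?(\d{1,2}[:.]\d{2})\s*(?:-|to)\s*(\d{1,2}[:.]\d{2})"
)

def extract_opening_times(text):
    """Map each weekday found in the text to its (opens, closes) pair."""
    return {day: (opens, closes)
            for day, opens, closes in OPENING_PATTERN.findall(text)}

page = "The cinema is open Monday 10:00-23:30 and Saturday from 09.30 to 01:00."
print(extract_opening_times(page))
```

The result is a small structured record and not a document to read, which is the distinction the slide draws. The pattern is deliberately naive: a sentence like "Monday closed, Tuesday 10:00-12:00" would wrongly attach the times to Monday.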

Speech Synthesis / Speech Recognition
Somewhat related to NLP; usually NLP deals with written (or readily transcribed) texts. A familiar example would be Siri on iOS.

Tools
• you can write your own
• resources are the bottleneck
• many tools exist
• off-the-shelf frameworks available
• problem is suitability
Suitability for your actual project, that is. Algorithms are not the issue; language resources for training and testing are generally harder to get hold of.

Commonly used:
• GATE
• developed at Sheffield University
• plug-in architecture
• implemented in Java
This is not an exhaustive list, just two I am a bit familiar with!

Conclusion
• NLP is a useful addition to the tool set
• NLP has many applications
• Many tools are freely available
• Performance is generally ‘good enough’