A Crash Course in Natural Language Processing

Oliver Mason
September 13, 2013

A talk given in the technical strand of BrightonSEO, September 2013



  1. 1.

    A Crash Course in Natural Language Processing

    Oliver Mason, Phrasys Ltd
    http://phrasys.net, @ojmason, oliver@phrasys.net

    Basic introduction given at the BrightonSEO conference, September 2013
  2. 4.

    1966 ALPAC Report (http://www.nap.edu/openbook.php?record_id=9547)

    Funding for NLP is cut after evaluation shows little progress. Start of the “AI Winter”.
  3. 5.

    1980s: Statistical NLP

    Resurgence of interest in NLP as statistical approaches turn out to be quite successful in getting things done, for example Jelinek’s work on speech recognition at IBM.
  4. 6.

    Rules vs Stats

    Rules: • hand-crafted • accurate • small coverage • brittle • language-specific
    Stats: • learned from data • accurate enough • broad coverage • robust • language-independent
  5. 8.

    Tokenisation

    • Splits a stream of characters into tokens
    • Deals with words and punctuation
    • Problems: what is a word?
      • don’t / cannot / gonna / ...
      • hyphenation, multi-word units
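
As a rough illustration of the tokenisation step, here is a minimal regex-based sketch in Python. The pattern and the name `tokenise` are assumptions made for this example, not taken from the talk. It keeps contractions and hyphenated words together as single tokens, which is only one of several defensible answers to "what is a word?".

```python
import re

# Illustrative pattern (an assumption, not a standard).
TOKEN_PATTERN = re.compile(
    r"""
    \w+(?:[-'’]\w+)*   # a word, keeping internal hyphens and apostrophes
    | [^\w\s]          # otherwise, any single punctuation character
    """,
    re.VERBOSE,
)


def tokenise(text):
    """Split a stream of characters into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


print(tokenise("Don't panic, it's a well-known problem."))
# ["Don't", 'panic', ',', "it's", 'a', 'well-known', 'problem', '.']
```

Other tokenisers split "don't" into "do" + "n't", or "cannot" into "can" + "not". Which choice is right depends on what the later processing steps expect.
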
  6. 9.

    Part-of-Speech Tagging

    • assign word categories to tokens
    • dictionary look-up or guessing
    • rules/probabilities for disambiguation
    • Problems: multiple possibilities
      • light: noun / verb / adjective
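
The look-up, guessing and disambiguation steps can be sketched as a toy tagger. Everything here is hypothetical and for illustration only: the tiny `LEXICON`, the suffix guesses and the three hand-written rules. A real tagger would use a large dictionary and rules or probabilities learned from data.

```python
# Toy lexicon (hypothetical data): each word maps to its possible categories.
LEXICON = {
    "the": ["DET"],
    "a": ["DET"],
    "light": ["NOUN", "VERB", "ADJ"],
    "is": ["VERB"],
    "they": ["PRON"],
    "candle": ["NOUN"],
    "bright": ["ADJ"],
    "box": ["NOUN"],
}


def guess(word):
    """Guess categories for a word missing from the lexicon, by suffix."""
    if word.endswith("ly"):
        return ["ADV"]
    if word.endswith("ing") or word.endswith("ed"):
        return ["VERB"]
    return ["NOUN"]


def tag(tokens):
    """Assign one category per token using look-up, guessing and simple rules."""
    tags = []
    for i, token in enumerate(tokens):
        word = token.lower()
        options = LEXICON.get(word) or guess(word)
        prev = tags[-1] if tags else None
        # Possible categories of the following word, if it is in the lexicon.
        nxt = LEXICON.get(tokens[i + 1].lower(), []) if i + 1 < len(tokens) else []
        choice = options[0]  # default: first listed category
        if len(options) > 1:
            if prev == "DET" and "NOUN" in nxt and "ADJ" in options:
                choice = "ADJ"   # determiner + ? + noun, as in "a light box"
            elif prev == "DET" and "NOUN" in options:
                choice = "NOUN"  # as in "the light is ..."
            elif prev == "PRON" and "VERB" in options:
                choice = "VERB"  # as in "they light ..."
        tags.append(choice)
    return list(zip(tokens, tags))


print(tag("they light the candle".split()))
# [('they', 'PRON'), ('light', 'VERB'), ('the', 'DET'), ('candle', 'NOUN')]
```

The same word "light" comes out as a noun in "the light is bright" and as an adjective in "a light box", which is the disambiguation problem in miniature.
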
  7. 10.

    Named Entities

    • Look up names in a gazetteer
      • people, places, organisations, ...
    • Identify referential expressions
    • Problems: opaque references
      • The minister said that ...
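
A gazetteer look-up can be sketched as a longest-match search over token sequences. The `GAZETTEER` entries and the function name `find_entities` below are made up for this example; real gazetteers are far larger and usually combined with context rules.

```python
# Hypothetical gazetteer: lower-cased token tuples mapped to entity types.
GAZETTEER = {
    ("oliver", "mason"): "PERSON",
    ("brighton",): "PLACE",
    ("sheffield", "university"): "ORGANISATION",
    ("sheffield",): "PLACE",
    ("ibm",): "ORGANISATION",
}
MAX_LEN = max(len(key) for key in GAZETTEER)


def find_entities(tokens):
    """Return (text, type) pairs, preferring the longest gazetteer match."""
    entities = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in GAZETTEER:
                entities.append((" ".join(tokens[i:i + n]), GAZETTEER[key]))
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return entities


print(find_entities("The minister said that IBM is in Brighton".split()))
# [('IBM', 'ORGANISATION'), ('Brighton', 'PLACE')]
```

Note that "The minister" is not found: it refers to a person, but no list of names can tell you which one. That is the opaque-reference problem.
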
  8. 12.

    Semantic Analysis

    • extract meaning from text
    • using words or phrases
    • single words often don’t have meaning
    • usually the goal of NLP applications
  9. 14.

    Machine Translation

    Problem: training data (translated texts) is contaminated by people putting automatically translated texts on the web, where Google picks them up and feeds them back into the system. MT is quite successful, but still needs human post-editing as it is not of publication quality.
  10. 15.

    Information Retrieval

    One of the earlier applications of NLP, and still improving through the use of more advanced techniques.
  11. 16.

    Sentiment Analysis

    SA is assigning positive/negative assessments to statements: language expresses not just facts, but also values.

    Problems: most current approaches are based on single words and have difficulties with context, e.g. “This phone is not good” would be seen as positive due to “good”.

    Also: the computer cannot do sarcasm. Or irony. “Phone X is brilliant when all you want to do is use it as a paperweight.”
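
The single-word problem is easy to reproduce. The sketch below is a deliberately naive word-counting classifier. The word lists and the name `word_sentiment` are assumptions for illustration, and it is meant to show the failure rather than to be a usable method.

```python
# Tiny illustrative word lists (hypothetical, far from complete).
POSITIVE = {"good", "great", "brilliant", "excellent"}
NEGATIVE = {"bad", "poor", "terrible", "awful"}


def word_sentiment(text):
    """Classify text by counting positive and negative words, ignoring context."""
    words = [w.strip(".,!?\"'“”").lower() for w in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(word_sentiment("This phone is not good"))
# positive  (wrong: the negation is ignored)
print(word_sentiment("Phone X is brilliant when all you want to do is use it as a paperweight."))
# positive  (wrong: the sarcasm is missed)
```

Handling "not" needs at least some notion of scope over the following words. Sarcasm needs knowledge about the world that is not in the sentence at all.
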
  12. 17.

    Information Extraction

    In information retrieval you want to find documents to read; here you just want to find the relevant information itself, such as cinema opening times.
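
For the opening-times example, a very simple extractor might look like the sketch below. The pattern, the function name `opening_times` and the sample sentence are all invented for illustration. Real pages express times in many more ways than this handles.

```python
import re

# Matches ranges such as "10:30 to 23:00" or "18.00-22.30" (illustrative only).
TIME_RANGE = re.compile(
    r"\b(\d{1,2}[:.]\d{2})\s*(?:-|to|until)\s*(\d{1,2}[:.]\d{2})\b"
)


def opening_times(text):
    """Return (start, end) pairs for every time range found in the text."""
    return [(m.group(1), m.group(2)) for m in TIME_RANGE.finditer(text)]


print(opening_times("The cinema is open from 10:30 to 23:00 on Saturdays."))
# [('10:30', '23:00')]
```

The result is a piece of structured data rather than a document to read, which is the difference from information retrieval.
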
  13. 18.

    Question Answering

    Here a user poses a question in English, the computer works out what they are asking for, and goes to find the answer.
  14. 19.

    Speech Synthesis / Speech Recognition

    Somewhat related to NLP; usually NLP deals with written (or readily transcribed) texts. A familiar example would be Siri on iOS.
  15. 20.

    Tools

    • you can write your own
    • resources are the bottleneck
    • many tools exist
    • off-the-shelf frameworks available
    • problem is suitability (for your actual project, that is)

    Algorithms are not the issue, but language resources for training and testing are generally harder to get hold of.
  16. 21.

    Commonly used:

    • GATE
      • developed at Sheffield University
      • plug-in architecture
      • implemented in Java

    This is not an exhaustive list, just two I am a bit familiar with!
  17. 23.

    Whisk (http://www.whisk.co.uk)

    • extract recipes from the web
    • identify ingredients and quantities
    • using GATE for structural analysis
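
To give a feel for the "ingredients and quantities" step, here is a small regex-based sketch. It is not how Whisk does it: the slide says they use GATE for structural analysis. The unit list, the pattern and the name `parse_ingredient` are assumptions made up for this example.

```python
import re

# Hypothetical, very incomplete list of units.
UNITS = r"(?:g|kg|ml|l|tsp|tbsp|cups?)"
INGREDIENT = re.compile(
    rf"^\s*(?P<amount>\d+(?:[./]\d+)?)\s*(?P<unit>{UNITS})?\s+(?P<name>.+?)\s*$",
    re.IGNORECASE,
)


def parse_ingredient(line):
    """Split an ingredient line into amount, optional unit, and name."""
    m = INGREDIENT.match(line)
    if m is None:
        return None  # no leading quantity, e.g. "Salt to taste"
    return {
        "amount": m.group("amount"),
        "unit": m.group("unit"),
        "name": m.group("name"),
    }


print(parse_ingredient("200g plain flour"))
# {'amount': '200', 'unit': 'g', 'name': 'plain flour'}
print(parse_ingredient("2 garlic cloves"))
# {'amount': '2', 'unit': None, 'name': 'garlic cloves'}
```

A unit only counts when whitespace follows it, so the "g" in "garlic" is not mistaken for grams. Lines without a leading number come back as `None` and would need other handling.
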
  18. 24.

    Conclusion

    • NLP is a useful addition to the tool set
    • NLP has many applications
    • Many tools are freely available
    • Performance is generally ‘good enough’