A Crash Course in Natural Language Processing

Oliver Mason, Phrasys Ltd
http://phrasys.net @ojmason [email protected]
Basic introduction given at the BrightonSEO conference, September 2013

NLP is a very interdisciplinary subject, drawing on:
• Engineering
• Artificial Intelligence
• Computer Science
• Philosophy
• Linguistics

Machine Translation: 1954 Georgetown Experiment
More than 60 Russian sentences are automatically translated into English.

1966 ALPAC Report
http://www.nap.edu/openbook.php?record_id=9547
Funding for NLP is cut after the evaluation shows little progress. This marks the start of the “AI Winter”.

1980s Statistical NLP
Resurgence of interest in NLP as statistical approaches turn out to be quite successful in getting things done, for example Jelinek’s work in speech recognition at IBM.

Rules vs Stats
Rules:
• hand-crafted
• accurate
• small coverage
• brittle
• language-specific
Stats:
• learned from data
• accurate enough
• broad coverage
• robust
• language-independent

General Procedure
• pre-process text
• classify words (look-up)
• group words into phrases
• apply application logic

Tokenisation
• Splits a stream of characters into tokens
• Deals with words and punctuation
• Problems: what is a word?
• don’t / cannot / gonna / ...
• hyphenation, multi-word units
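To make the "what is a word?" problem concrete, here is a minimal tokeniser sketch using only Python's standard `re` module. The pattern and the name `tokenise` are assumptions chosen for illustration; real tokenisers make many more decisions than this.

```python
import re

# Hypothetical pattern: a word, optionally continued by an apostrophe or
# hyphen plus more word characters, or else a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+(?:['’-]\w+)*|[^\w\s]")

def tokenise(text):
    """Split a string into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

print(tokenise("Don't panic: it's a well-known problem, isn't it?"))
```

This sketch keeps "Don't" and "well-known" as single tokens. Other conventions split contractions into two tokens, and nothing here recognises multi-word units, so the choice of pattern already encodes an answer to the question of what counts as a word.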

Part-of-Speech Tagging
• assign word categories to tokens
• dictionary look-up or guessing
• rules/probabilities for disambiguation
• Problems: multiple possibilities
• light: noun / verb / adjective
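The look-up-then-disambiguate idea can be sketched as follows. The lexicon, the tag names and the three context rules are all invented for this example; a real tagger would use a large dictionary and learned probabilities rather than a handful of hand-written rules.

```python
# Toy lexicon (an assumption for illustration): word -> possible tags.
LEXICON = {
    "the": ["DET"],
    "a": ["DET"],
    "we": ["PRON"],
    "is": ["VERB"],
    "bright": ["ADJ"],
    "box": ["NOUN"],
    "fire": ["NOUN", "VERB"],
    "light": ["NOUN", "VERB", "ADJ"],
}

def lookup(token):
    """Dictionary look-up; unknown words are guessed to be nouns."""
    return LEXICON.get(token.lower(), ["NOUN"])

def tag(tokens):
    """Assign one tag per token, using simple context rules for ambiguous words."""
    tagged = []
    for i, token in enumerate(tokens):
        candidates = lookup(token)
        choice = candidates[0]
        if len(candidates) > 1:
            prev = tagged[-1][1] if tagged else None
            nxt = lookup(tokens[i + 1]) if i + 1 < len(tokens) else []
            if prev == "PRON" and "VERB" in candidates:
                choice = "VERB"   # "we light ..."
            elif prev == "DET" and "ADJ" in candidates and "NOUN" in nxt:
                choice = "ADJ"    # "a light box"
            elif prev == "DET" and "NOUN" in candidates:
                choice = "NOUN"   # "the light is ..."
        tagged.append((token, choice))
    return tagged

print(tag("we light the fire".split()))
print(tag("a light box".split()))
print(tag("the light is bright".split()))
```

The same word "light" comes out as verb, adjective and noun in the three sentences, purely because of its neighbours.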

Named Entities
• Look up names in a gazetteer
• people, places, organisations, ...
• Identify referential expressions
• Problems: opaque references
• The minister said that ...
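A gazetteer look-up can be sketched as a longest-match search over a list of known names. The entries and the function name `find_entities` below are made up for the example; a real gazetteer would hold many thousands of entries.

```python
# Toy gazetteer (an assumption for illustration): lower-cased token tuples -> type.
GAZETTEER = {
    ("oliver", "mason"): "PERSON",
    ("brighton",): "PLACE",
    ("sheffield",): "PLACE",
    ("sheffield", "university"): "ORGANISATION",
}
MAX_LEN = max(len(key) for key in GAZETTEER)

def find_entities(tokens):
    """Return (text, type) pairs, preferring the longest gazetteer match."""
    entities = []
    i = 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in GAZETTEER:
                entities.append((" ".join(tokens[i:i + n]), GAZETTEER[key]))
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return entities

print(find_entities("GATE was developed at Sheffield University".split()))
print(find_entities("The minister visited Brighton".split()))
```

Longest match makes "Sheffield University" an organisation rather than a place followed by an unknown word. The second sentence shows the opaque-reference problem: "The minister" refers to a person, but no list of names will find out who.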

Parsing
• combine words into larger phrases
• based on categories/word classes
• uses rules or patterns (chunking)
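As a sketch of pattern-based chunking, the function below groups an optional determiner, any adjectives and one or more nouns into a noun phrase, working only from the word classes. The tag names and the single pattern are assumptions for illustration, not a full grammar.

```python
def chunk_noun_phrases(tagged):
    """Group DET? ADJ* NOUN+ sequences into NP chunks; other tokens stay single."""
    chunks = []
    i = 0
    while i < len(tagged):
        j = i
        if tagged[j][1] == "DET":
            j += 1
        while j < len(tagged) and tagged[j][1] == "ADJ":
            j += 1
        noun_start = j
        while j < len(tagged) and tagged[j][1] == "NOUN":
            j += 1
        if j > noun_start:
            # At least one noun was found, so the pattern matched.
            chunks.append(("NP", [word for word, _ in tagged[i:j]]))
            i = j
        else:
            # Pattern failed: emit the current token on its own.
            chunks.append((tagged[i][1], [tagged[i][0]]))
            i += 1
    return chunks

sentence = [("the", "DET"), ("bright", "ADJ"), ("light", "NOUN"),
            ("is", "VERB"), ("on", "PREP"), ("the", "DET"), ("table", "NOUN")]
print(chunk_noun_phrases(sentence))
```

The input is assumed to be already tagged, so any tagging error is carried straight into the chunks.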

Semantic Analysis
• extract meaning from text
• using words or phrases
• single words often don’t have meaning
• usually the goal of NLP applications

Applications

Machine Translation
MT is quite successful, but the output is not publication quality and still needs human post-editing.
Problem: training data (translated texts) is contaminated by people putting automatically translated texts on the web, where Google picks them up and feeds them back into the system.

Information Retrieval
One of the earlier applications of NLP; it is still improving through the use of more advanced techniques.

Sentiment Analysis
SA assigns positive/negative assessments to statements: language expresses not just facts, but also values.
Problems: most current approaches are based on single words and have difficulties with context, e.g. “This phone is not good” would be seen as positive due to “good”.
Also: the computer cannot do sarcasm. Or irony. “Phone X is brilliant when all you want to do is use it as a paperweight.”
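The single-word problem is easy to reproduce. The sketch below compares a plain word-counting scorer with a variant that flips the next sentiment word after a negator. The word lists and function names are invented for the example, and the negation handling is deliberately crude.

```python
# Toy word lists (assumptions for illustration).
POSITIVE = {"good", "brilliant", "great"}
NEGATIVE = {"bad", "poor", "terrible"}
NEGATORS = {"not", "never", "no"}

def word_score(tokens):
    """Single-word approach: +1 per positive word, -1 per negative word."""
    return sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)

def negation_aware_score(tokens):
    """Same counts, but a negator flips the next sentiment word."""
    score = 0
    negate = False
    for t in tokens:
        if t in NEGATORS:
            negate = True
            continue
        polarity = (t in POSITIVE) - (t in NEGATIVE)
        if polarity:
            score += -polarity if negate else polarity
            negate = False
    return score

plain = "this phone is not good".split()
sarcastic = "phone x is brilliant when all you want to do is use it as a paperweight".split()

print(word_score(plain), negation_aware_score(plain))
print(word_score(sarcastic), negation_aware_score(sarcastic))
```

The negation-aware variant gets "not good" right, but both scorers still rate the paperweight sentence as positive: sarcasm needs knowledge about the world, not a better word list.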

Information Extraction
In information retrieval you want to find documents to read; here you just want to find the relevant information itself, such as cinema opening times.

Question Answering
Here a user poses a question in English; the computer works out what they are asking for and goes to find the answer.

Speech Synthesis / Speech Recognition
Somewhat related to NLP; usually NLP deals with written (or readily transcribed) texts. A familiar example would be Siri on iOS.

Tools
• you can write your own
• resources are the bottleneck
• many tools exist
• off-the-shelf frameworks available
• problem is suitability (for your actual project, that is)
Algorithms are not the issue, but language resources for training and testing are generally harder to get hold of.

Commonly used:
This is not an exhaustive list, just two I am a bit familiar with!
• GATE
• developed at Sheffield University
• plug-in architecture
• implemented in Java

• NLTK
• implemented in Python
• easy to use
• very extensive range of tools

Whisk http://www.whisk.co.uk
• extract recipes from web
• identify ingredients and quantities
• using GATE for structural analysis
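To give a feel for what "identify ingredients and quantities" involves, here is a small regular-expression sketch in plain Python. It is not how Whisk or GATE does it; the pattern, the unit list and the name `parse_ingredient` are assumptions made up for this example.

```python
import re

# Hypothetical pattern: a number, an optional unit, then the ingredient name.
INGREDIENT = re.compile(
    r"^\s*(?P<quantity>\d+(?:\.\d+)?)\s*"
    r"(?P<unit>g|kg|ml|l|tsp|tbsp|cups?)?\s+"
    r"(?P<name>.+?)\s*$"
)

def parse_ingredient(line):
    """Return quantity, unit and name for one ingredient line, or None."""
    match = INGREDIENT.match(line)
    if match is None:
        return None
    return {
        "quantity": float(match.group("quantity")),
        "unit": match.group("unit"),  # None when no unit is given
        "name": match.group("name"),
    }

for line in ["200g plain flour", "2 large eggs", "1 tsp salt", "a pinch of salt"]:
    print(line, "->", parse_ingredient(line))
```

The last line returns None because it has no numeric quantity, which is exactly the kind of case where hand-written patterns run out and language resources become the bottleneck.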

Conclusion
• NLP is a useful addition to the tool set
• NLP has many applications
• Many tools are freely available
• Performance is generally ‘good enough’

Thank you! @ojmason Images: Wikipedia
