Slide 1

Slide 1 text

A Crash Course in Natural Language Processing
Oliver Mason, Phrasys Ltd
http://phrasys.net
@ojmason
Basic introduction given at the BrightonSEO conference, September 2013

Slide 2

Slide 2 text

Natural Language Processing
NLP is a highly interdisciplinary subject, drawing on:
• Engineering
• Artificial Intelligence
• Computer Science
• Philosophy
• Linguistics

Slide 3

Slide 3 text

Machine Translation
1954: Georgetown Experiment
More than 60 Russian sentences are translated into English.

Slide 4

Slide 4 text

1966: ALPAC Report
http://www.nap.edu/openbook.php?record_id=9547
Funding for NLP is cut after an evaluation shows little progress: the start of the “AI Winter”.

Slide 5

Slide 5 text

1980s: Statistical NLP
Resurgence of interest in NLP as statistical approaches turn out to be quite successful at getting things done, for example Jelinek’s work on speech recognition at IBM.

Slide 6

Slide 6 text

Rules vs Stats
Rules:
• hand-crafted
• accurate
• small coverage
• brittle
• language-specific
Stats:
• learned from data
• accurate enough
• broad coverage
• robust
• language-independent

Slide 7

Slide 7 text

General Procedure
• pre-process text
• classify words (look-up)
• group words into phrases
• apply application logic
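The four steps can be sketched as a chain of small functions. This is a minimal illustration only: the lexicon, the tag names and the "count noun phrases" application logic are all assumptions made up for the example, not something from the talk.

```python
import re

# Hypothetical toy lexicon; a real system would use a far larger resource.
LEXICON = {"the": "DET", "a": "DET", "cat": "NOUN", "mat": "NOUN",
           "sat": "VERB", "on": "PREP"}


def preprocess(text):
    """Step 1: lower-case the text and split it into words and punctuation."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def classify(tokens):
    """Step 2: look each token up in the lexicon; unknown tokens get 'UNK'."""
    return [(tok, LEXICON.get(tok, "UNK")) for tok in tokens]


def group(tagged):
    """Step 3: join a determiner with the noun that directly follows it."""
    phrases, i = [], 0
    while i < len(tagged):
        word, tag = tagged[i]
        if tag == "DET" and i + 1 < len(tagged) and tagged[i + 1][1] == "NOUN":
            phrases.append(("NP", [word, tagged[i + 1][0]]))
            i += 2
        else:
            phrases.append((tag, [word]))
            i += 1
    return phrases


def count_noun_phrases(phrases):
    """Step 4: stand-in application logic, here just counting noun phrases."""
    return sum(1 for label, _ in phrases if label == "NP")


def run_pipeline(text):
    return count_noun_phrases(group(classify(preprocess(text))))


print(run_pipeline("The cat sat on the mat."))
```

Each stage only depends on the output of the previous one, so any of them could be swapped for a statistical component without touching the rest.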

Slide 8

Slide 8 text

Tokenisation
• Splits a stream of characters into tokens
• Deals with words and punctuation
• Problems: what is a word?
• don’t / cannot / gonna / ...
• hyphenation, multi-word units
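A regular expression is often enough for a first tokeniser. The sketch below is one possible set of decisions, not the only correct one: it assumes that contractions and hyphenated words should stay together as single tokens, which is exactly the kind of choice the "what is a word?" problem forces on you.

```python
import re

TOKEN_PATTERN = re.compile(r"""
    \w+(?:[-'’]\w+)*   # a word, optionally with internal hyphens or apostrophes
  | [^\w\s]            # or any single punctuation character
""", re.VERBOSE)


def tokenise(text):
    """Split a string into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


print(tokenise("Don't panic, it's a well-known problem."))
```

With these rules "Don't" and "well-known" each come out as one token. Other tokenisers split "don't" into two parts, and none of this helps with multi-word units such as "New York", which need a separate look-up step.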

Slide 9

Slide 9 text

Part-of-Speech Tagging
• assign word categories to tokens
• dictionary look-up or guessing
• rules/probabilities for disambiguation
• Problems: multiple possibilities
• light: noun / verb / adjective
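The look-up, guessing and disambiguation steps can be sketched with a tiny rule-based tagger. Everything here is an assumption for illustration: the mini-dictionary, the suffix heuristics and the three context rules are invented, and a statistical tagger would learn such preferences from data instead.

```python
# Hypothetical mini-dictionary: each word maps to its possible categories.
LEXICON = {
    "the": ["DET"],
    "a": ["DET"],
    "light": ["NOUN", "VERB", "ADJ"],
    "is": ["VERB"],
    "to": ["TO"],
    "candle": ["NOUN"],
    "bright": ["ADJ"],
}


def guess(word):
    """Guess categories for unknown words from their endings."""
    if word.endswith("ly"):
        return ["ADV"]
    if word.endswith("ing") or word.endswith("ed"):
        return ["VERB"]
    return ["NOUN"]


def tag(tokens):
    """Assign one category per token, using the previous tag and next word."""
    options = [LEXICON.get(t.lower()) or guess(t.lower()) for t in tokens]
    tags = []
    for i, candidates in enumerate(options):
        previous = tags[-1] if tags else None
        next_can_be_noun = i + 1 < len(options) and "NOUN" in options[i + 1]
        if len(candidates) == 1:
            choice = candidates[0]
        elif previous == "TO" and "VERB" in candidates:
            choice = "VERB"            # "to light ..."
        elif previous == "DET" and "ADJ" in candidates and next_can_be_noun:
            choice = "ADJ"             # "a light candle"
        elif previous == "DET" and "NOUN" in candidates:
            choice = "NOUN"            # "the light is ..."
        else:
            choice = candidates[0]     # fall back to the first listed category
        tags.append(choice)
    return list(zip(tokens, tags))


for sentence in ("the light is bright", "to light a candle", "a light candle"):
    print(tag(sentence.split()))
```

The three example sentences tag "light" as noun, verb and adjective respectively, but only because the rules were written with those sentences in mind, which illustrates why hand-crafted rules tend to be brittle.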

Slide 10

Slide 10 text

Named Entities
• Look up names in a gazetteer
• people, places, organisations, ...
• Identify referential expressions
• Problems: opaque references
• The minister said that ...
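A gazetteer look-up can be sketched as a longest-match search over token sequences. The three entries below are a made-up gazetteer for illustration; a real one would hold many thousands of names.

```python
# Hypothetical gazetteer: lower-cased token sequences mapped to entity types.
GAZETTEER = {
    ("brighton",): "PLACE",
    ("oliver", "mason"): "PERSON",
    ("sheffield", "university"): "ORGANISATION",
}
MAX_LEN = max(len(key) for key in GAZETTEER)


def find_entities(tokens):
    """Return (text, type) pairs, preferring the longest match at each position."""
    lowered = [t.lower() for t in tokens]
    entities, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(lowered[i:i + n])
            if key in GAZETTEER:
                entities.append((" ".join(tokens[i:i + n]), GAZETTEER[key]))
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return entities


print(find_entities("Oliver Mason spoke in Brighton".split()))
print(find_entities("The minister said that".split()))
```

The second call finds nothing: "the minister" refers to a specific person, but no list of names can tell you which one, so resolving such references needs context beyond the gazetteer.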

Slide 11

Slide 11 text

Parsing
• combine words into larger phrases
• based on categories/word classes
• uses rules or patterns (chunking)
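Pattern-based chunking can be sketched by matching a regular expression against the sequence of word classes rather than the words themselves. The tag names and the single noun-phrase pattern (optional determiner, any adjectives, one or more nouns) are assumptions chosen for this example.

```python
import re

# Map each tag to a single letter so that string offsets equal token indices.
TAG_CODES = {"DET": "D", "ADJ": "A", "NOUN": "N"}

# Noun phrase: optional determiner, any number of adjectives, one or more nouns.
NP_PATTERN = re.compile(r"D?A*N+")


def chunk_noun_phrases(tagged):
    """Return the noun phrases found in a list of (word, tag) pairs."""
    codes = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged)
    return [
        " ".join(word for word, _ in tagged[m.start():m.end()])
        for m in NP_PATTERN.finditer(codes)
    ]


tagged = [("the", "DET"), ("quick", "ADJ"), ("brown", "ADJ"), ("fox", "NOUN"),
          ("saw", "VERB"), ("a", "DET"), ("dog", "NOUN")]
print(chunk_noun_phrases(tagged))
```

Because the pattern works purely on categories, it handles any words the tagger can label, but it produces flat chunks rather than a full parse tree.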

Slide 12

Slide 12 text

Semantic Analysis
• extract meaning from text
• using words or phrases
• single words often don’t have meaning
• usually the goal of NLP applications

Slide 13

Slide 13 text

Applications

Slide 14

Slide 14 text

Machine Translation
Problem: the training data (translated texts) is contaminated by people putting automatically translated texts on the web, where Google picks them up and feeds them back into the system.
MT is quite successful, but the output is not publication quality and still needs human post-editing.

Slide 15

Slide 15 text

Information Retrieval
One of the earlier applications of NLP, and still improving through the use of more advanced techniques.

Slide 16

Slide 16 text

Sentiment Analysis
SA assigns positive/negative assessments to statements: language expresses not just facts, but also values.
Problems: most current approaches are based on single words and have difficulties with context, e.g. “This phone is not good” would be seen as positive because of “good”.
Also: the computer cannot do sarcasm. Or irony. “Phone X is brilliant when all you want to do is use it as a paperweight.”
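The single-word problem is easy to reproduce. The sketch below uses a made-up five-word polarity lexicon and compares a plain word-counting scorer with a variant that flips the next sentiment word after a negator; both are illustrative assumptions, not a description of any particular product.

```python
import re

# Hypothetical polarity lexicon: +1 for positive words, -1 for negative ones.
POLARITY = {"good": 1, "brilliant": 1, "great": 1, "bad": -1, "terrible": -1}
NEGATORS = {"not", "never", "no"}


def words(text):
    return re.findall(r"[a-z']+", text.lower())


def word_score(text):
    """Single-word approach: add up the polarity of every known word."""
    return sum(POLARITY.get(w, 0) for w in words(text))


def negation_aware_score(text):
    """Crude fix: a negator flips the polarity of the next sentiment word."""
    score, negate = 0, False
    for w in words(text):
        if w in NEGATORS:
            negate = True
            continue
        value = POLARITY.get(w, 0)
        if value:
            score += -value if negate else value
            negate = False
    return score


negated = "This phone is not good"
sarcastic = ("Phone X is brilliant when all you want to do "
             "is use it as a paperweight.")
print(word_score(negated), negation_aware_score(negated))
print(word_score(sarcastic), negation_aware_score(sarcastic))
```

Handling "not" repairs the first sentence, but both scorers still rate the paperweight sentence as positive, since nothing in the individual words signals the sarcasm.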

Slide 17

Slide 17 text

Information Extraction
In information retrieval you want to find documents to read; here you just want to find the relevant information itself, such as cinema opening times.
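For something as regular as opening times, a pattern can pull the facts straight out of the text. The sketch below assumes times are written in a 12-hour style such as "10am to 11:30pm"; the pattern and the sample sentence are invented for illustration, and real pages use many more formats.

```python
import re

# A clock time such as "10am" or "11:30 pm".
CLOCK = r"\d{1,2}(?::\d{2})?\s?(?:am|pm)"

# Two clock times joined by "-" or "to", captured as opening and closing time.
TIME_RANGE = re.compile(
    rf"\b({CLOCK})\s?(?:-|to)\s?({CLOCK})", re.IGNORECASE
)


def extract_opening_times(text):
    """Return a list of (opens, closes) pairs found in the text."""
    return TIME_RANGE.findall(text)


print(extract_opening_times("The cinema is open daily from 10am to 11:30pm."))
```

The result is structured data (an opening and a closing time) rather than a document to read, which is the difference from information retrieval.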

Slide 18

Slide 18 text

Question Answering
Here a user poses a question in English; the computer works out what is being asked and goes to find the answer.

Slide 19

Slide 19 text

Speech Synthesis / Speech Recognition
Somewhat related to NLP, which usually deals with written (or readily transcribed) texts. A familiar example would be Siri on iOS.

Slide 20

Slide 20 text

Tools
• you can write your own
• resources are the bottleneck
• many tools exist
• off-the-shelf frameworks available
• problem is suitability (for your actual project, that is)
Algorithms are not the issue; language resources for training and testing are generally harder to get hold of.

Slide 21

Slide 21 text

Commonly used:
• GATE
• developed at Sheffield University
• plug-in architecture
• implemented in Java
This is not an exhaustive list, just two I am a bit familiar with!

Slide 22

Slide 22 text

• NLTK
• implemented in Python
• easy to use
• very extensive range of tools

Slide 23

Slide 23 text

Whisk
http://www.whisk.co.uk
• extract recipes from the web
• identify ingredients and quantities
• using GATE for structural analysis

Slide 24

Slide 24 text

Conclusion
• NLP is a useful addition to the tool set
• NLP has many applications
• Many tools are freely available
• Performance is generally ‘good enough’

Slide 25

Slide 25 text

Thank you!
@ojmason
Images: Wikipedia