A Crash Course in Natural Language Processing

Oliver Mason
September 13, 2013

A talk given in the technical strand of BrightonSEO, September 2013

Transcript

  1. A Crash Course in
    Natural Language
    Processing
    Oliver Mason
    Phrasys Ltd
    http://phrasys.net @ojmason [email protected]
    Basic introduction given at
    BrightonSEO conference
    September 2013

  2. Natural Language Processing
    Related fields: Engineering, Artificial Intelligence,
    Computer Science, Philosophy, Linguistics
    NLP is a very
    interdisciplinary subject

  3. 1954 Machine Translation Experiment
    Georgetown Experiment:
    more than 60 Russian sentences
    are automatically translated
    into English.

  4. 1966 ALPAC
    Report
    http://www.nap.edu/openbook.php?record_id=9547
    Funding for NLP is cut
    after evaluation shows
    little progress – Start of
    “AI Winter”

  5. 1980s
    Statistical NLP
    Resurgence of interest in
    NLP as statistical
    approaches turn out to be
    quite successful in getting
    things done, for example
    Jelinek’s work in speech
    recognition at IBM.

  6. Rules vs Stats
    Rules:
    • hand-crafted
    • accurate
    • small coverage
    • brittle
    • language-specific
    Stats:
    • learned from data
    • accurate enough
    • broad coverage
    • robust
    • language-independent

  7. General Procedure
    • pre-process text
    • classify words (look-up)
    • group words into phrases
    • apply application logic
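
The four steps above can be sketched as a chain of small functions. This is only a toy illustration, not code from the talk: the lexicon, the tag names and the "determiner + noun" grouping rule are all assumptions made for the example.

```python
import re

# Hypothetical toy lexicon; a real system would use a far larger resource.
LEXICON = {"the": "DET", "cat": "NOUN", "sat": "VERB", "on": "PREP", "mat": "NOUN"}

def preprocess(text):
    """Step 1: lower-case the text and split it into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def classify(tokens):
    """Step 2: look each token up in the lexicon; unknown tokens get 'UNK'."""
    return [(tok, LEXICON.get(tok, "UNK")) for tok in tokens]

def group(tagged):
    """Step 3: group a determiner followed by a noun into a noun phrase (NP)."""
    phrases, i = [], 0
    while i < len(tagged):
        if i + 1 < len(tagged) and tagged[i][1] == "DET" and tagged[i + 1][1] == "NOUN":
            phrases.append(("NP", [tagged[i][0], tagged[i + 1][0]]))
            i += 2
        else:
            phrases.append((tagged[i][1], [tagged[i][0]]))
            i += 1
    return phrases

def apply_logic(phrases):
    """Step 4: application logic, here simply listing the noun phrases found."""
    return [" ".join(words) for label, words in phrases if label == "NP"]

def run_pipeline(text):
    return apply_logic(group(classify(preprocess(text))))

print(run_pipeline("The cat sat on the mat."))  # ['the cat', 'the mat']
```

Each stage only consumes the output of the previous one, so any single stage can be swapped for a statistical or off-the-shelf component without touching the others.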

  8. Tokenisation
    • Splits a stream of characters into tokens
    • Deals with words and punctuation
    • Problems: what is a word?
    • don’t / cannot / gonna / ...
    • hyphenation, multi-word units
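
A minimal regular-expression tokeniser shows how these decisions end up baked into the pattern. The pattern below is an assumption chosen for illustration: it keeps contractions and hyphenated words together as single tokens, which is only one of several defensible answers to "what is a word?".

```python
import re

# A word may contain internal apostrophes or hyphens ("don't", "well-known");
# anything else that is neither a word character nor whitespace is one punctuation token.
TOKEN = re.compile(r"\w+(?:['’-]\w+)*|[^\w\s]")

def tokenise(text):
    """Split a stream of characters into word and punctuation tokens."""
    return TOKEN.findall(text)

print(tokenise("Don't panic: it's a well-known problem, I'm gonna fix it."))
```

Other schemes would split "don't" into "do" + "n't" or "gonna" into "gon" + "na". Multi-word units such as "New York" are still split into two tokens here and would need a later look-up step.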

  9. Part-of-Speech Tagging
    • assign word categories to tokens
    • dictionary look-up or guessing
    • rules/probabilities for disambiguation
    • Problems: multiple possibilities
    • light: noun / verb / adjective
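
The look-up, guessing and disambiguation steps might be sketched as follows. The mini-dictionary, the suffix heuristics and the three context rules are hypothetical and hand-picked so that the "light" example resolves differently in different contexts; a statistical tagger would learn such preferences from annotated data instead.

```python
# Hypothetical mini-dictionary: each word maps to its possible categories,
# with the default choice listed first.
DICTIONARY = {
    "the": ["DET"],
    "a": ["DET"],
    "light": ["NOUN", "VERB", "ADJ"],
    "is": ["VERB"],
    "on": ["PREP"],
    "they": ["PRON"],
    "candle": ["NOUN"],
    "bag": ["NOUN"],
}

def guess(word):
    """Guess categories for an unknown word from a crude suffix heuristic."""
    if word.endswith("ly"):
        return ["ADV"]
    if word.endswith("ing") or word.endswith("ed"):
        return ["VERB"]
    return ["NOUN"]

def tag(tokens):
    """Assign one category per token using look-up, guessing and context rules."""
    tagged = []
    for i, word in enumerate(tokens):
        options = DICTIONARY.get(word.lower()) or guess(word.lower())
        choice = options[0]
        prev = tagged[-1][1] if tagged else None
        nxt = tokens[i + 1].lower() if i + 1 < len(tokens) else None
        next_options = DICTIONARY.get(nxt, []) if nxt else []
        # Illustrative hand-written disambiguation rules:
        if prev == "PRON" and "VERB" in options:
            choice = "VERB"   # "they light ..."
        elif prev == "DET" and "ADJ" in options and "NOUN" in next_options:
            choice = "ADJ"    # "a light bag"
        elif prev == "DET" and "NOUN" in options:
            choice = "NOUN"   # "the light is on"
        tagged.append((word, choice))
    return tagged

for sentence in ("the light is on", "they light a candle", "a light bag"):
    print(tag(sentence.split()))
```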

  10. Named Entities
    • Look up names in a gazetteer
    • people, places, organisations, ...
    • Identify referential expressions
    • Problems: opaque references
    • The minister said that ...
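
Gazetteer look-up can be sketched as a longest-match search over the token sequence. The gazetteer entries and type labels below are made up for the example; real gazetteers are large lists compiled from external sources.

```python
# Hypothetical gazetteer: lower-cased name (as a tuple of tokens) -> entity type.
GAZETTEER = {
    ("brighton",): "PLACE",
    ("new", "york"): "PLACE",
    ("oliver", "mason"): "PERSON",
    ("ibm",): "ORGANISATION",
}
MAX_LEN = max(len(name) for name in GAZETTEER)

def find_entities(tokens):
    """Return (entity text, type) pairs, preferring the longest match at each position."""
    found, i = [], 0
    lowered = [t.lower() for t in tokens]
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(lowered[i:i + n])
            if key in GAZETTEER:
                found.append((" ".join(tokens[i:i + n]), GAZETTEER[key]))
                i += n
                break
        else:
            # No entry starts at this token; move on to the next one.
            i += 1
    return found

print(find_entities("Oliver Mason spoke in Brighton about IBM".split()))
print(find_entities("The minister said that".split()))  # [] - nothing to look up
```

The second call illustrates the opaque-reference problem: "the minister" refers to a specific person, but no list of names can reveal which one.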

  11. Parsing
    • combine words into larger phrases
    • based on categories/word classes
    • uses rules or patterns (chunking)
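
Pattern-based chunking can be illustrated by matching a regular expression against the sequence of word categories rather than against the words themselves. The tag names and the noun-phrase pattern (optional determiner, any adjectives, one or more nouns) are assumptions for this sketch.

```python
import re

# Pattern over the tag sequence: optional determiner, any adjectives, one or more nouns.
# The leading \b stops the pattern from matching in the middle of a longer tag name.
NP_PATTERN = re.compile(r"\b(?:DET )?(?:ADJ )*(?:NOUN )+")

def chunk_noun_phrases(tagged):
    """Find noun phrases in a list of (word, tag) pairs by matching tag patterns."""
    tags = "".join(tag + " " for _, tag in tagged)
    phrases = []
    for match in NP_PATTERN.finditer(tags):
        # Every tag is followed by one space, so counting spaces before an offset
        # converts a character position back into a token position.
        start = tags[:match.start()].count(" ")
        end = tags[:match.end()].count(" ")
        phrases.append(" ".join(word for word, _ in tagged[start:end]))
    return phrases

tagged = [("the", "DET"), ("small", "ADJ"), ("phone", "NOUN"),
          ("has", "VERB"), ("a", "DET"), ("battery", "NOUN")]
print(chunk_noun_phrases(tagged))  # ['the small phone', 'a battery']
```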

  12. Semantic Analysis
    • extract meaning from text
    • using words or phrases
    • single words often don’t have meaning
    • usually the goal of NLP applications

  13. Applications

  14. Machine Translation
    Problem: training data
    (translated texts) is
    contaminated by people
    putting automatically
    translated texts on the
    web, where Google picks
    them up and feeds them
    back into the system.
    MT is quite successful,
    but still needs human
    post-editing as it is not
    publication quality.

  15. Information Retrieval
    One of the earlier
    applications of NLP is still
    improving through use of
    more advanced techniques.

  16. Sentiment Analysis
    SA is assigning positive/
    negative assessments to
    statements – language
    expresses more than just
    facts, but also values.
    Problems: most current
    approaches are based on
    single words, and have
    difficulties with context,
    e.g. “This phone is not
    good” would be seen as
    positive due to “good”.
    Also: the computer cannot
    do sarcasm. Or irony.
    “Phone X is brilliant when
    all you want to do is use it
    as a paperweight.”
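
The context problem can be reproduced in a few lines. The sketch below contrasts a purely word-based scorer with a variant that flips the polarity of a word directly preceded by a negator; the word lists are hypothetical, and the negation rule is a deliberately crude stand-in for real context handling.

```python
# Hypothetical sentiment word lists, for illustration only.
POSITIVE = {"good", "great", "brilliant"}
NEGATIVE = {"bad", "poor", "terrible"}
NEGATORS = {"not", "never", "no"}

def word_score(tokens):
    """Single-word approach: count positive words minus negative words."""
    return sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)

def negation_aware_score(tokens):
    """Same counts, but flip the polarity of a word directly preceded by a negator."""
    score = 0
    for i, t in enumerate(tokens):
        polarity = (t in POSITIVE) - (t in NEGATIVE)
        if i > 0 and tokens[i - 1] in NEGATORS:
            polarity = -polarity
        score += polarity
    return score

tokens = "this phone is not good".split()
print(word_score(tokens))            # 1: misread as positive
print(negation_aware_score(tokens))  # -1

sarcasm = "phone x is brilliant when all you want to do is use it as a paperweight".split()
print(negation_aware_score(sarcasm))  # 1: sarcasm is still scored as praise
```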

  17. Information Extraction
    In information retrieval
    you want to find
    documents to read; here
    you just want to find the
    relevant information itself,
    such as cinema opening
    times.
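
For the opening-times example, a pattern-based extractor might look like the sketch below. The text format it expects (a day name followed by a time range such as "12:00-23:30") and the sample page text are assumptions; real pages vary far more, which is why such patterns tend to be brittle.

```python
import re

# Matches spans such as "Friday 12:00-23:30" or "Sat 10:30 – 23:30".
OPENING = re.compile(
    r"\b(Mon|Tue|Wed|Thu|Fri|Sat|Sun)[a-z]*\.?\s+"
    r"(\d{1,2}:\d{2})\s*[-–]\s*(\d{1,2}:\d{2})"
)

def extract_opening_times(text):
    """Return {day abbreviation: (opens, closes)} for every match in the text."""
    return {day: (opens, closes) for day, opens, closes in OPENING.findall(text)}

# Hypothetical page text.
page = ("Welcome to the cinema! Open Friday 12:00-23:30 and Sat 10:30 – 23:30. "
        "Tickets from 5 pounds.")
print(extract_opening_times(page))
```

The result is structured data (day, opening time, closing time) rather than a document to read, which is the distinction from information retrieval drawn above.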

  18. Question Answering
    Here a user poses a
    question in English, the
    computer extracts what
    it is they are asking for,
    and goes to find the
    answer.

  19. Speech Synthesis
    Speech Recognition
    Somewhat related to
    NLP; usually NLP deals
    with written (or readily
    transcribed) texts.
    A familiar example would
    be Siri on iOS.

  20. Tools
    • you can write your own
    • resources are the bottleneck
    • many tools exist
    • off-the-shelf frameworks available
    • problem is suitability
    (suitability for your
    actual project, that is)
    Algorithms are not the
    issue, but language
    resources for training and
    testing are generally
    harder to get hold of.

  21. Commonly used:
    • GATE
    • developed at Sheffield University
    • plug-in architecture
    • implemented in Java
    This is not an exhaustive
    list – just two I am a bit
    familiar with!

  22. • NLTK
    • implemented in Python
    • easy to use
    • very extensive range of tools

  23. Whisk
    http://www.whisk.co.uk
    • extract recipes from web
    • identify ingredients and
    quantities
    • using GATE for structural
    analysis

  24. Conclusion
    • NLP is a useful addition to the tool set
    • NLP has many applications
    • Many tools are freely available
    • Performance is generally ‘good enough’

  25. Thank you!
    @ojmason
    Images: Wikipedia
