
NLTK 3.0


Natural Language Toolkit — NLTK 3.0
NLTK is a leading platform for building Python programs to work with human language data.

Code :: https://github.com/Shokr/nltk_tutorial


Mohammed Shokr

March 16, 2016


Transcript

  1. Natural Language Toolkit (NLTK) ▪ A collection of Python programs, modules, data sets, and tutorials to support research and development in Natural Language Processing (NLP) ▪ Written by Steven Bird, Edward Loper, and Ewan Klein ▪ NLTK is – Free and open source – Easy to use – Modular – Well documented – Simple and extensible
  2. Python REPL • Read-Eval-Print Loop • A REPL is a procedure that simply loops: it accepts one command at a time, executes it, and prints the result. • Available as a GUI or a CLI
  3. Installation of NLTK 1. Start a Command Prompt as an Administrator (Windows users): 1. Click Start. 2. In the Start Search box, type cmd, and then press CTRL+SHIFT+ENTER. 3. If the User Account Control dialog box appears, confirm that the action it displays is what you want, and then click Continue. 2. Change from user to superuser (Linux users): sudo su 3. Install NLTK: pip install nltk
  4. Installing NLTK Data ▪ NLTK comes with many corpora, toy grammars, trained models, etc. A complete list is posted at: http://nltk.org/nltk_data/ ▪ Run the Python REPL and type the commands: >>> import nltk >>> nltk.download() ▪ A new window should open, showing the NLTK Downloader. Click on the File menu and select Change Download Directory. For central installation, set this to C:\nltk_data (Windows), /usr/local/share/nltk_data (Mac), or /usr/share/nltk_data (Unix). Next, select the packages or collections you want to download.
  5. Sentence splitter from nltk.tokenize import sent_tokenize input_string = "Hello Everyone. How are you? Life is not easy." all_sent = sent_tokenize(input_string) print(all_sent)
  6. Tokenization sent = "Hi Everyone! How do you do?" # split() built-in string method print(sent.split()) # word_tokenize from nltk.tokenize import word_tokenize print(word_tokenize(sent))
  7. Part of Speech Tagging ▪ Stanford tagger ▪ N-gram tagger

    ▪ Regex tagger ▪ Brill tagger ▪ Machine learning based tagger ▪ NER tagger – Named Entity Recognition (NER) – NLTK provides the ne_chunk() method
  8. Shallow vs. deep parsing ▪ In deep or full parsing, grammar formalisms such as context-free grammar (CFG) or probabilistic context-free grammar (PCFG) are typically combined with a search strategy to give a complete syntactic structure to a sentence. ▪ Shallow parsing is the task of parsing only a limited part of the syntactic information from the given text.
  9. The two approaches in parsing
     The rule-based approach:
     – Based on hand-written rules/grammar
     – Manual grammatical rules are coded in CFG, and so on
     – Top-down
     – Includes CFG and regex-based parsers
     The probabilistic approach:
     – Rules/grammar are learned by using probabilistic models
     – Uses observed probabilities of linguistic features
     – Bottom-up
     – Includes PCFG and the Stanford parser
  10. Different types of parsers ▪ Recursive descent parser ▪ Shift-reduce

    parser ▪ Chart parser ▪ Regex parser ▪ Dependency parsing
  11. Different types of parsers Recursive descent parser One of the most straightforward forms of parsing is recursive descent parsing. This is a top-down process in which the parser attempts to verify that the syntax of the input stream is correct as it is read from left to right. Shift-reduce parser The shift-reduce parser is a simple kind of bottom-up parser. Chart parser A chart parser applies the algorithm-design technique of dynamic programming to the parsing problem. Regex parser A regex parser uses a regular expression, defined in the form of a grammar, on top of a POS-tagged string. Dependency parsing Dependency parsing (DP) is a modern parsing mechanism. The main concept of DP is that each linguistic unit (word) is connected to another by a directed link.
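To make the top-down idea concrete, here is a small sketch using NLTK's RecursiveDescentParser; the grammar itself is a toy, invented only for this example:

```python
import nltk

# A toy context-free grammar, written only for this example.
grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det N
    VP -> V NP
    Det -> 'the'
    N -> 'dog' | 'cat'
    V -> 'chased'
""")

# Recursive descent: expand S top-down and check each expansion against
# the input tokens, read left to right.
parser = nltk.RecursiveDescentParser(grammar)
trees = list(parser.parse("the dog chased the cat".split()))
for tree in trees:
    print(tree)
```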
  12. Chunking ▪ Chunking is shallow parsing: instead of working out the deep structure of the sentence, we group contiguous pieces of the sentence into chunks that carry meaning on their own. ▪ For example, the sentence "the President speaks about the health care reforms"
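As a sketch, the chunking idea can be tried on that very sentence with NLTK's RegexpParser; the POS tags are written out by hand so the example does not depend on a downloaded tagger model:

```python
import nltk

# The example sentence, POS-tagged by hand (Penn Treebank tag set).
tagged = [("the", "DT"), ("President", "NNP"), ("speaks", "VBZ"),
          ("about", "IN"), ("the", "DT"), ("health", "NN"),
          ("care", "NN"), ("reforms", "NNS")]

# Chunk an optional determiner followed by one or more nouns into an NP.
chunker = nltk.RegexpParser("NP: {<DT>?<NN.*>+}")
tree = chunker.parse(tagged)
print(tree)   # "the President" and "the health care reforms" become NP chunks
```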
  13. Display a parse tree # import treebank corpus (requires the data package: nltk.download('treebank')) from nltk.corpus import treebank t = treebank.parsed_sents('wsj_0001.mrg')[0] t.draw()
  14. Resources ▪ NLTK 3.0 documentation – http://www.nltk.org/ ▪ NLTK Essentials

    – https://www.packtpub.com/big-data-and-business-intelligence/nltk-essentials ▪ nltk_tutorial_repo [Code] – https://git.io/vaRIR