
NLTK 3.0


Natural Language Toolkit — NLTK 3.0
NLTK is a leading platform for building Python programs to work with human language data.

Code :: https://github.com/Shokr/nltk_tutorial


Mohammed Shokr

March 16, 2016


Transcript

  1. Natural Language Toolkit (NLTK) ▪ A collection of Python programs, modules, data sets, and tutorials to support research and development in Natural Language Processing (NLP) ▪ Written by Steven Bird, Edward Loper, and Ewan Klein ▪ NLTK is – Free and open source – Easy to use – Modular – Well documented – Simple and extensible
  2. Python REPL • Read-Eval-Print Loop • A REPL is a procedure that simply loops: it accepts one command at a time, executes it, and prints the result. • Available as a GUI or a CLI
  3. Installation of NLTK 1. Start a Command Prompt as an Administrator (Windows users): 1. Click Start. 2. In the Start Search box, type cmd, and then press CTRL+SHIFT+ENTER. 3. If the User Account Control dialog box appears, confirm that the action it displays is what you want, and then click Continue. 2. Change from user to superuser (Linux users): sudo su 3. Install NLTK: pip install nltk
  4. Installing NLTK Data ▪ NLTK comes with many corpora, toy grammars, trained models, etc. A complete list is posted at: http://nltk.org/nltk_data/ ▪ Run the Python REPL and type the commands: >>> import nltk >>> nltk.download() ▪ A new window should open, showing the NLTK Downloader. Click on the File menu and select Change Download Directory. For central installation, set this to C:\nltk_data (Windows), /usr/local/share/nltk_data (Mac), or /usr/share/nltk_data (Unix). Next, select the packages or collections you want to download.
  5. Sentence splitter from nltk.tokenize import sent_tokenize input_string = "Hello Everyone. How are you? Life is not easy." all_sent = sent_tokenize(input_string) print(all_sent)
  6. Tokenization sent = "Hi Everyone! How do you do?" # split() built-in string method print(sent.split()) # word_tokenize from nltk.tokenize import word_tokenize print(word_tokenize(sent))
  7. Part of Speech Tagging ▪ Stanford tagger ▪ N-gram tagger

    ▪ Regex tagger ▪ Brill tagger ▪ Machine learning based tagger ▪ NER tagger – Named Entity Recognition (NER) – NLTK provides the ne_chunk() method
  8. Shallow vs. deep parsing ▪ In deep or full parsing, grammar formalisms such as context-free grammar (CFG) or probabilistic context-free grammar (PCFG) are typically combined with a search strategy to give a complete syntactic structure to a sentence. ▪ Shallow parsing is the task of parsing only a limited part of the syntactic information from the given text.
  9. The two approaches in parsing
     The rule-based approach:
     – Based on hand-written rules/grammar
     – Manual grammatical rules are coded in CFG, and so on
     – Top-down
     – Includes CFG and regex-based parsers
     The probabilistic approach:
     – Rules/grammar are learned by using probabilistic models
     – Uses observed probabilities of linguistic features
     – Bottom-up
     – Includes PCFG and the Stanford parser
  10. Different types of parsers ▪ Recursive descent parser ▪ Shift-reduce

    parser ▪ Chart parser ▪ Regex parser ▪ Dependency parsing
  11. Different types of parsers Recursive descent parser One of the most straightforward forms of parsing is recursive descent parsing. This is a top-down process in which the parser attempts to verify that the syntax of the input stream is correct as it is read from left to right. Shift-reduce parser The shift-reduce parser is a simple kind of bottom-up parser. Chart parser A chart parser applies the algorithm-design technique of dynamic programming to the parsing problem. Regex parser A regex parser uses a regular expression, defined in the form of a grammar, on top of a POS-tagged string. Dependency parsing Dependency parsing (DP) is a modern parsing mechanism. The main concept of DP is that each linguistic unit (word) is connected to another by a directed link.
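To make the top-down idea concrete, here is a small sketch using NLTK's RecursiveDescentParser; the grammar itself is a toy, invented only for this example:

```python
import nltk

# A toy context-free grammar, written only for this example.
grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det N
    VP -> V NP
    Det -> 'the'
    N -> 'dog' | 'cat'
    V -> 'chased'
""")

# Recursive descent: expand S top-down and check each expansion against
# the input tokens, read left to right.
parser = nltk.RecursiveDescentParser(grammar)
trees = list(parser.parse("the dog chased the cat".split()))
for tree in trees:
    print(tree)
```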
  12. Chunking ▪ Chunking is shallow parsing: instead of working out the deep structure of the sentence, we group contiguous pieces of the sentence into chunks that carry meaning on their own. ▪ For example, the sentence "the President speaks about the health care reforms"
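As a sketch, the chunking idea can be tried on that very sentence with NLTK's RegexpParser; the POS tags are written out by hand so the example does not depend on a downloaded tagger model:

```python
import nltk

# The example sentence, POS-tagged by hand (Penn Treebank tag set).
tagged = [("the", "DT"), ("President", "NNP"), ("speaks", "VBZ"),
          ("about", "IN"), ("the", "DT"), ("health", "NN"),
          ("care", "NN"), ("reforms", "NNS")]

# Chunk an optional determiner followed by one or more nouns into an NP.
chunker = nltk.RegexpParser("NP: {<DT>?<NN.*>+}")
tree = chunker.parse(tagged)
print(tree)   # "the President" and "the health care reforms" become NP chunks
```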
  13. Display a parse tree # import treebank corpus (requires the data package: nltk.download('treebank')) from nltk.corpus import treebank t = treebank.parsed_sents('wsj_0001.mrg')[0] t.draw()
  14. Resources ▪ NLTK 3.0 documentation – http://www.nltk.org/ ▪ NLTK Essentials

    – https://www.packtpub.com/big-data-and-business-intelligence/nltk-essentials ▪ nltk_tutorial_repo [Code] – https://git.io/vaRIR