Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
NLTK Intro for PUGS
Search
Victor Neo
March 27, 2012
Programming
7
590
NLTK Intro for PUGS
Slides for the NLTK talk given on March 2012 for Python User Group SG Meetup.
Victor Neo
March 27, 2012
Tweet
Share
More Decks by Victor Neo
See All by Victor Neo
Django - The Next Steps
victorneo
5
670
DevOps: Python tools to get started
victorneo
9
13k
Git and Python workshop
victorneo
2
800
Other Decks in Programming
See All in Programming
PHPのバージョンアップ時にも役立ったAST(2026年版)
matsuo_atsushi
0
270
Feature Toggle は捨てやすく使おう
gennei
0
380
我々はなぜ「層」を分けるのか〜「関心の分離」と「抽象化」で手に入れる変更に強いシンプルな設計〜 #phperkaigi / PHPerKaigi 2026
shogogg
2
680
Cyrius ーLinux非依存にコンテナをネイティブ実行する専用OSー
n4mlz
0
250
Agentic AI: Evolution oder Revolution
mobilelarson
PRO
0
210
Nostalgia Meets Technology: Super Mario with TypeScript
manfredsteyer
PRO
0
110
Kubernetesでセルフホストが簡単なNewSQLを求めて / Seeking a NewSQL Database That's Simple to Self-Host on Kubernetes
nnaka2992
0
180
The free-lunch guide to idea circularity
hollycummins
0
380
へんな働き方
yusukebe
6
2.9k
Tamach-sre-3_ANDPAD-shimaison93
mane12yurks38
0
130
モダンOBSプラグイン開発
umireon
0
180
存在論的プログラミング: 時間と存在を記述する
koriym
5
560
Featured
See All Featured
The Illustrated Children's Guide to Kubernetes
chrisshort
51
52k
Raft: Consensus for Rubyists
vanstee
141
7.4k
KATA
mclloyd
PRO
35
15k
Sam Torres - BigQuery for SEOs
techseoconnect
PRO
0
230
Code Reviewing Like a Champion
maltzj
528
40k
Tell your own story through comics
letsgokoyo
1
870
Done Done
chrislema
186
16k
技術選定の審美眼(2025年版) / Understanding the Spiral of Technologies 2025 edition
twada
PRO
118
110k
How to Align SEO within the Product Triangle To Get Buy-In & Support - #RIMC
aleyda
1
1.5k
Visualizing Your Data: Incorporating Mongo into Loggly Infrastructure
mongodb
49
9.9k
A Guide to Academic Writing Using Generative AI - A Workshop
ks91
PRO
0
250
[RailsConf 2023] Rails as a piece of cake
palkan
59
6.4k
Transcript
Natural Language Toolkit @victorneo
Natural Language Processing
"the process of a computer extracting meaningful information from natural
language input and/or producing natural language output"
None
Getting started with NLTK
Open source Python modules, linguistic data and documentation for research
and development in natural language processing and text analytics, with distributions for Windows, Mac OSX and Linux. NLTK
None
installatio n # you might need numpy pip install nltk
# enter Python shell import nltk nltk.download()
None
packages # For Part of Speech tagging maxent_treebank_pos_tagger # Get
a list of stopwords stopwords # Brown corpus to play around brown
Preparing data / corpus
tokens NLTK works on Tokens, for example, "Hello World!" will
be tokenized to: ['Hello', 'World', '!'] The built-in tokenizer for most use cases: nltk.word_tokenize("Hello World!")
text processing HTML text: raw = nltk.clean_html(html_text) tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens) Use BeautifulSoup for preprocessing of the HTML text to discard unnecessary data.
Part-of-speech tagging
pos tagging text = "Run away!" nltk.word_tokenize(text) nltk.pos_tag(tokens) [('Run', 'NNP'),
('away', 'RB'), ('!', '.')]
pos tagging [('Run', 'NNP'), ('away', 'RB'), ('!', '.')] NNP: Proper
Noun, Singular RB : Adverb http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos. html
pos tagging "The sailor dogs the barmaid." [('The', 'DT'), ('sailor',
'NN'), ('dogs', 'NNS'), ('the', 'DT'), ('barmaid', 'NN'), ('.', '.')]
Sentiment Analysis Code: http://bit.ly/GLu2Q9
Differentiate between "happy" and "sad" tweets. Teach the classifier the
"features" of happy & sad tweets and test how good it is.
Happy: "Looking through old pics and realizing everything happens for
a reason. So happy with where I am right now" Sad: "So sad I have 8 AM class tomorrow"
Process data (tweets) Extract Features Train classifier Test classifer accuracy
Tokenize tweets extract_features Naive Bayes Classifier
Process data (tweets) Extract Features Train classifier Test classifer accuracy
Tokenize tweets extract_features Naive Bayes Classifier
happy.txt sad.txt happy_test.txt sad_test.txt } training data } testing data
Tweets obtained from Twitter Search API
Process data (tweets) Extract Features Train classifier Test classifer accuracy
Tokenize tweets extract_features Naive Bayes Classifier
Happy tweets usually contain the following words: "am happy", "great
day" etc. Sad tweets usually contain the following: "not happy", "am sad" etc. features
{'contains(not)': False, 'contains(view)': False, 'contains(best)': False, 'contains(excited)': False, 'contains(morning)': False,
'contains(about)': False, 'contains(horrible)': True, 'contains(like)': False, ... } output of extract_features()
Process data (tweets) Extract Features Train classifier Test classifer accuracy
Tokenize tweets extract_features Naive Bayes Classifier
training_set = \ nltk.classify.util.\ apply_features(extract_features, tweets) classifier = \ NaiveBayesClassifier.train
(training_set) training the classifer training classifer
Process data (tweets) Extract Features Train classifier Test classifer accuracy
Tokenize tweets extract_features Naive Bayes Classifier
def classify_tweet(tweet): return \ classifier.classify(extract_features (tweet)) testing classifer
$ python classification.py Total accuracy: 90.00% (18/20) 18 tweets got
classified correctly.
Where to go from here.
http://www.nltk.org/book
https://class.coursera.org/nlp/auth/welcome
http://www.slideshare.net/shanbady/nltk-boston-text-analytics
[('Thank', 'NNP'), ('you', 'PRP'), ('.', '.')] @victorneo