NLTK Intro for PUGS
Victor Neo
March 27, 2012
Slides for the NLTK talk given in March 2012 at the Python User Group SG meetup.
Transcript
Natural Language Toolkit @victorneo
Natural Language Processing
"the process of a computer extracting meaningful information from natural
language input and/or producing natural language output"
Getting started with NLTK
NLTK: open source Python modules, linguistic data and documentation for research and development in natural language processing and text analytics, with distributions for Windows, Mac OS X and Linux.
installation

# you might need numpy
pip install nltk

# enter Python shell
import nltk
nltk.download()
packages

# For Part of Speech tagging
maxent_treebank_pos_tagger

# Get a list of stopwords
stopwords

# Brown corpus to play around with
brown
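If you prefer to skip the download GUI, the same data packages can be fetched by name. A minimal sketch (package IDs taken from the slide; newer NLTK releases may ship different default models):

import nltk

# Non-interactive alternative to the nltk.download() GUI.
for package in ["maxent_treebank_pos_tagger",  # Part-of-Speech tagging model
                "stopwords",                   # list of common stopwords
                "brown"]:                      # Brown corpus to play around with
    nltk.download(package)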
Preparing data / corpus
tokens

NLTK works on tokens; for example, "Hello World!" will be tokenized to:
['Hello', 'World', '!']

The built-in tokenizer covers most use cases:
nltk.word_tokenize("Hello World!")
text processing

HTML text:
raw = nltk.clean_html(html_text)
tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens)

Use BeautifulSoup to preprocess the HTML text and discard unnecessary data.
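Newer NLTK releases dropped nltk.clean_html in favour of external HTML parsers, so here is a sketch of the same pipeline with BeautifulSoup doing the stripping (assumes the bs4 package and the punkt tokenizer data are installed):

import nltk
from bs4 import BeautifulSoup

def html_to_text(html_text):
    soup = BeautifulSoup(html_text, "html.parser")
    # Drop script/style blocks that carry no natural-language content.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator=" ")

raw = html_to_text("<html><body><p>Hello World!</p></body></html>")
tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens)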
Part-of-speech tagging
pos tagging

text = "Run away!"
tokens = nltk.word_tokenize(text)
nltk.pos_tag(tokens)
[('Run', 'NNP'), ('away', 'RB'), ('!', '.')]
pos tagging

[('Run', 'NNP'), ('away', 'RB'), ('!', '.')]

NNP: Proper Noun, Singular
RB: Adverb

http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
pos tagging "The sailor dogs the barmaid." [('The', 'DT'), ('sailor',
'NN'), ('dogs', 'NNS'), ('the', 'DT'), ('barmaid', 'NN'), ('.', '.')]
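Putting the tagging steps together, a small runnable sketch for the two example sentences (assumes the tokenizer and tagger data packages have been downloaded; the tag lookup also needs the 'tagsets' package):

import nltk

for sentence in ["Run away!", "The sailor dogs the barmaid."]:
    tokens = nltk.word_tokenize(sentence)   # tokenize first
    print(nltk.pos_tag(tokens))             # then tag each token

# Look up what a tag means:
nltk.help.upenn_tagset("NNP")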
Sentiment Analysis
Code: http://bit.ly/GLu2Q9
Differentiate between "happy" and "sad" tweets. Teach the classifier the
"features" of happy & sad tweets and test how good it is.
Happy: "Looking through old pics and realizing everything happens for
a reason. So happy with where I am right now" Sad: "So sad I have 8 AM class tomorrow"
Process data (tweets) → Extract Features → Train classifier → Test classifier accuracy
Tokenize tweets → extract_features → Naive Bayes Classifier
happy.txt, sad.txt → training data
happy_test.txt, sad_test.txt → testing data
Tweets obtained from Twitter Search API
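One way the four files might be read into labelled, tokenized lists before feature extraction; the one-tweet-per-line layout and the load_tweets helper are assumptions for illustration, not shown in the deck:

import nltk

def load_tweets(filename, label):
    # Hypothetical loader: assumes one tweet per line in each file.
    tweets = []
    with open(filename) as f:
        for line in f:
            line = line.strip()
            if line:
                tweets.append((nltk.word_tokenize(line), label))
    return tweets

training_tweets = load_tweets("happy.txt", "happy") + load_tweets("sad.txt", "sad")
testing_tweets = load_tweets("happy_test.txt", "happy") + load_tweets("sad_test.txt", "sad")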
features

Happy tweets usually contain words like "am happy", "great day", etc.
Sad tweets usually contain words like "not happy", "am sad", etc.
output of extract_features()

{'contains(not)': False,
 'contains(view)': False,
 'contains(best)': False,
 'contains(excited)': False,
 'contains(morning)': False,
 'contains(about)': False,
 'contains(horrible)': True,
 'contains(like)': False,
 ...}
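A sketch of an extract_features() that produces the 'contains(word)' dictionary above, following the usual NLTK bag-of-words pattern; word_features, the vocabulary collected from the training tweets, is assumed to exist:

def extract_features(tweet_words):
    # Mark, for every known vocabulary word, whether this tweet contains it.
    words = set(tweet_words)
    features = {}
    for word in word_features:
        features["contains(%s)" % word] = (word in words)
    return features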
training the classifier

training_set = nltk.classify.util.apply_features(extract_features, tweets)
classifier = NaiveBayesClassifier.train(training_set)
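Once trained, NLTK's NaiveBayesClassifier can report which features drive its decisions, a quick sanity check on the training data; a short sketch:

import nltk

# Show the features that most strongly separate happy from sad tweets.
classifier.show_most_informative_features(10)

# Classify a single new tweet (tokenized the same way as the training data).
print(classifier.classify(extract_features(nltk.word_tokenize("So happy today"))))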
testing the classifier

def classify_tweet(tweet):
    return classifier.classify(extract_features(tweet))
$ python classification.py
Total accuracy: 90.00% (18/20)

18 tweets were classified correctly.
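The accuracy figure can also be computed with NLTK's built-in helper instead of counting by hand; a sketch assuming the labelled testing_tweets list from the loader above:

import nltk

# Apply the same feature extractor to the labelled test tweets.
test_set = nltk.classify.util.apply_features(extract_features, testing_tweets)

# Fraction of test tweets the classifier labels correctly.
accuracy = nltk.classify.util.accuracy(classifier, test_set)
print("Total accuracy: %.2f%%" % (accuracy * 100))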
Where to go from here.
http://www.nltk.org/book
https://class.coursera.org/nlp/auth/welcome
http://www.slideshare.net/shanbady/nltk-boston-text-analytics
[('Thank', 'NNP'), ('you', 'PRP'), ('.', '.')] @victorneo