Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
NLTK Intro for PUGS
Search
Victor Neo
March 27, 2012
Programming
600
7
Share
Embed
Copy iframe code
Copy JS code
Copy link
Start on current slide
NLTK Intro for PUGS
Slides for the NLTK talk given on March 2012 for Python User Group SG Meetup.
Victor Neo
March 27, 2012
More Decks by Victor Neo
See All by Victor Neo
Django - The Next Steps
victorneo
5
700
DevOps: Python tools to get started
victorneo
9
13k
Git and Python workshop
victorneo
2
820
Other Decks in Programming
See All in Programming
エンジニアと一緒にテストコードの設計と実装を改善した話
mototakatsu
0
220
TSKaigi Night Talks 2026_TypeScriptでサプライチェーンの整合性を型に閉じ込める
geekplus_tech
0
400
Spring Security 実践 ─ GraphQL APIで実務に役立つ 認証・認可 を学ぶ
wagyu
0
260
The NotImplementedError Problem in Ruby
koic
1
920
Dataformのリポジトリを立ち上げるときにまずやること / dataform-day0-2026
snhryt
0
180
AIだと陥りがちなJakarta EE最新技術への移行時の落とし穴と解決策
tnagao7
0
120
正しくソフトウェアを作る、前提を疑うための認知の視点 / doubt-premise
minodriven
21
7k
才能?センス?知らん、 続けたもん勝ちだ。-- 結婚・出産・癌を越えてなお、私がプロダクトを創り続ける理由
16bitidol
1
200
Snowflake Summitでの新機能 CoCo / CoWork / snowflake-summit-2026-overall-what-new-coco
tatsuhiro
1
180
キャリア迷子上等 ─ "ない道"は自分で作ればいい
16bitidol
3
2.3k
技術記事、 専門家としてのプログラマ、 言語化
mizchi
13
6.5k
さぁV100、メモリをお食べ・・・
nilpe
0
150
Featured
See All Featured
Designing Powerful Visuals for Engaging Learning
tmiket
1
420
Paper Plane
katiecoart
PRO
1
52k
Scaling GitHub
holman
464
140k
The Impact of AI in SEO - AI Overviews June 2024 Edition
aleyda
5
1.1k
HTML-Aware ERB: The Path to Reactive Rendering @ RubyCon 2026, Rimini, Italy
marcoroth
1
230
Collaborative Software Design: How to facilitate domain modelling decisions
baasie
1
250
Building Adaptive Systems
keathley
44
3.1k
Leveraging Curiosity to Care for An Aging Population
cassininazir
1
270
Game over? The fight for quality and originality in the time of robots
wayneb77
1
200
The Straight Up "How To Draw Better" Workshop
denniskardys
239
140k
Making Projects Easy
brettharned
120
6.7k
Git: the NoSQL Database
bkeepers
PRO
432
67k
Transcript
Natural Language Toolkit @victorneo
Natural Language Processing
"the process of a computer extracting meaningful information from natural
language input and/or producing natural language output"
None
Getting started with NLTK
Open source Python modules, linguistic data and documentation for research
and development in natural language processing and text analytics, with distributions for Windows, Mac OSX and Linux. NLTK
None
installatio n # you might need numpy pip install nltk
# enter Python shell import nltk nltk.download()
None
packages # For Part of Speech tagging maxent_treebank_pos_tagger # Get
a list of stopwords stopwords # Brown corpus to play around brown
Preparing data / corpus
tokens NLTK works on Tokens, for example, "Hello World!" will
be tokenized to: ['Hello', 'World', '!'] The built-in tokenizer for most use cases: nltk.word_tokenize("Hello World!")
text processing HTML text: raw = nltk.clean_html(html_text) tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens) Use BeautifulSoup for preprocessing of the HTML text to discard unnecessary data.
Part-of-speech tagging
pos tagging text = "Run away!" nltk.word_tokenize(text) nltk.pos_tag(tokens) [('Run', 'NNP'),
('away', 'RB'), ('!', '.')]
pos tagging [('Run', 'NNP'), ('away', 'RB'), ('!', '.')] NNP: Proper
Noun, Singular RB : Adverb http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos. html
pos tagging "The sailor dogs the barmaid." [('The', 'DT'), ('sailor',
'NN'), ('dogs', 'NNS'), ('the', 'DT'), ('barmaid', 'NN'), ('.', '.')]
Sentiment Analysis Code: http://bit.ly/GLu2Q9
Differentiate between "happy" and "sad" tweets. Teach the classifier the
"features" of happy & sad tweets and test how good it is.
Happy: "Looking through old pics and realizing everything happens for
a reason. So happy with where I am right now" Sad: "So sad I have 8 AM class tomorrow"
Process data (tweets) Extract Features Train classifier Test classifer accuracy
Tokenize tweets extract_features Naive Bayes Classifier
Process data (tweets) Extract Features Train classifier Test classifer accuracy
Tokenize tweets extract_features Naive Bayes Classifier
happy.txt sad.txt happy_test.txt sad_test.txt } training data } testing data
Tweets obtained from Twitter Search API
Process data (tweets) Extract Features Train classifier Test classifer accuracy
Tokenize tweets extract_features Naive Bayes Classifier
Happy tweets usually contain the following words: "am happy", "great
day" etc. Sad tweets usually contain the following: "not happy", "am sad" etc. features
{'contains(not)': False, 'contains(view)': False, 'contains(best)': False, 'contains(excited)': False, 'contains(morning)': False,
'contains(about)': False, 'contains(horrible)': True, 'contains(like)': False, ... } output of extract_features()
Process data (tweets) Extract Features Train classifier Test classifer accuracy
Tokenize tweets extract_features Naive Bayes Classifier
training_set = \ nltk.classify.util.\ apply_features(extract_features, tweets) classifier = \ NaiveBayesClassifier.train
(training_set) training the classifer training classifer
Process data (tweets) Extract Features Train classifier Test classifer accuracy
Tokenize tweets extract_features Naive Bayes Classifier
def classify_tweet(tweet): return \ classifier.classify(extract_features (tweet)) testing classifer
$ python classification.py Total accuracy: 90.00% (18/20) 18 tweets got
classified correctly.
Where to go from here.
http://www.nltk.org/book
https://class.coursera.org/nlp/auth/welcome
http://www.slideshare.net/shanbady/nltk-boston-text-analytics
[('Thank', 'NNP'), ('you', 'PRP'), ('.', '.')] @victorneo