Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
NLTK Intro for PUGS
Search
Victor Neo
March 27, 2012
Programming
7
590
NLTK Intro for PUGS
Slides for the NLTK talk given on March 2012 for Python User Group SG Meetup.
Victor Neo
March 27, 2012
Tweet
Share
More Decks by Victor Neo
See All by Victor Neo
Django - The Next Steps
victorneo
5
670
DevOps: Python tools to get started
victorneo
9
13k
Git and Python workshop
victorneo
2
800
Other Decks in Programming
See All in Programming
AI駆動開発の本音 〜Claude Code並列開発で見えたエンジニアの新しい役割〜
hisuzuya
4
530
DevinとClaude Code、SREの現場で使い倒してみた件
karia
1
1.1k
Symfony + NelmioApiDocBundle を使った スキーマ駆動開発 / Schema Driven Development with NelmioApiDocBundle
okashoi
0
210
Vuetify 3 → 4 何が変わった?差分と移行ポイント10分まとめ
koukimiura
0
170
ふつうの Rubyist、ちいさなデバイス、大きな一年
bash0c7
0
1.1k
ベクトル検索のフィルタを用いた機械学習モデルとの統合 / python-meetup-fukuoka-06-vector-attr
monochromegane
2
500
S3ストレージクラスの「見える」「ある」「使える」は全部違う ─ 体験から見た、仕様の深淵を覗く
ya_ma23
0
880
へんな働き方
yusukebe
5
2.8k
米国のサイバーセキュリティタイムラインと見る Goの暗号パッケージの進化
tomtwinkle
2
640
go directiveを最新にしすぎないで欲しい話──あるいは、Go 1.26からgo mod initで作られるgo directiveの値が変わる話 / Go 1.26 リリースパーティ
arthur1
2
580
野球解説AI Agentを開発してみた - 2026/02/27 LayerX社内LT会資料
shinyorke
PRO
0
350
Everything Claude Code OSS詳細 — 5層構造の中身と導入方法
targe
0
150
Featured
See All Featured
Context Engineering - Making Every Token Count
addyosmani
9
770
How to Grow Your eCommerce with AI & Automation
katarinadahlin
PRO
1
150
Impact Scores and Hybrid Strategies: The future of link building
tamaranovitovic
0
240
The Power of CSS Pseudo Elements
geoffreycrofte
82
6.2k
The untapped power of vector embeddings
frankvandijk
2
1.6k
SEO for Brand Visibility & Recognition
aleyda
0
4.4k
Designing for humans not robots
tammielis
254
26k
From π to Pie charts
rasagy
0
150
How to Align SEO within the Product Triangle To Get Buy-In & Support - #RIMC
aleyda
1
1.5k
Pawsitive SEO: Lessons from My Dog (and Many Mistakes) on Thriving as a Consultant in the Age of AI
davidcarrasco
0
92
XXLCSS - How to scale CSS and keep your sanity
sugarenia
249
1.3M
The Mindset for Success: Future Career Progression
greggifford
PRO
0
280
Transcript
Natural Language Toolkit @victorneo
Natural Language Processing
"the process of a computer extracting meaningful information from natural
language input and/or producing natural language output"
None
Getting started with NLTK
Open source Python modules, linguistic data and documentation for research
and development in natural language processing and text analytics, with distributions for Windows, Mac OSX and Linux. NLTK
None
installatio n # you might need numpy pip install nltk
# enter Python shell import nltk nltk.download()
None
packages # For Part of Speech tagging maxent_treebank_pos_tagger # Get
a list of stopwords stopwords # Brown corpus to play around brown
Preparing data / corpus
tokens NLTK works on Tokens, for example, "Hello World!" will
be tokenized to: ['Hello', 'World', '!'] The built-in tokenizer for most use cases: nltk.word_tokenize("Hello World!")
text processing HTML text: raw = nltk.clean_html(html_text) tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens) Use BeautifulSoup for preprocessing of the HTML text to discard unnecessary data.
Part-of-speech tagging
pos tagging text = "Run away!" nltk.word_tokenize(text) nltk.pos_tag(tokens) [('Run', 'NNP'),
('away', 'RB'), ('!', '.')]
pos tagging [('Run', 'NNP'), ('away', 'RB'), ('!', '.')] NNP: Proper
Noun, Singular RB : Adverb http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos. html
pos tagging "The sailor dogs the barmaid." [('The', 'DT'), ('sailor',
'NN'), ('dogs', 'NNS'), ('the', 'DT'), ('barmaid', 'NN'), ('.', '.')]
Sentiment Analysis Code: http://bit.ly/GLu2Q9
Differentiate between "happy" and "sad" tweets. Teach the classifier the
"features" of happy & sad tweets and test how good it is.
Happy: "Looking through old pics and realizing everything happens for
a reason. So happy with where I am right now" Sad: "So sad I have 8 AM class tomorrow"
Process data (tweets) Extract Features Train classifier Test classifer accuracy
Tokenize tweets extract_features Naive Bayes Classifier
Process data (tweets) Extract Features Train classifier Test classifer accuracy
Tokenize tweets extract_features Naive Bayes Classifier
happy.txt sad.txt happy_test.txt sad_test.txt } training data } testing data
Tweets obtained from Twitter Search API
Process data (tweets) Extract Features Train classifier Test classifer accuracy
Tokenize tweets extract_features Naive Bayes Classifier
Happy tweets usually contain the following words: "am happy", "great
day" etc. Sad tweets usually contain the following: "not happy", "am sad" etc. features
{'contains(not)': False, 'contains(view)': False, 'contains(best)': False, 'contains(excited)': False, 'contains(morning)': False,
'contains(about)': False, 'contains(horrible)': True, 'contains(like)': False, ... } output of extract_features()
Process data (tweets) Extract Features Train classifier Test classifer accuracy
Tokenize tweets extract_features Naive Bayes Classifier
training_set = \ nltk.classify.util.\ apply_features(extract_features, tweets) classifier = \ NaiveBayesClassifier.train
(training_set) training the classifer training classifer
Process data (tweets) Extract Features Train classifier Test classifer accuracy
Tokenize tweets extract_features Naive Bayes Classifier
def classify_tweet(tweet): return \ classifier.classify(extract_features (tweet)) testing classifer
$ python classification.py Total accuracy: 90.00% (18/20) 18 tweets got
classified correctly.
Where to go from here.
http://www.nltk.org/book
https://class.coursera.org/nlp/auth/welcome
http://www.slideshare.net/shanbady/nltk-boston-text-analytics
[('Thank', 'NNP'), ('you', 'PRP'), ('.', '.')] @victorneo