NLTK Intro for PUGS
Victor Neo
March 27, 2012
Slides for the NLTK talk given in March 2012 at the Python User Group SG Meetup.
Transcript
Natural Language Toolkit @victorneo
Natural Language Processing
"the process of a computer extracting meaningful information from natural
language input and/or producing natural language output"
Getting started with NLTK
NLTK: Open source Python modules, linguistic data and documentation for research
and development in natural language processing and text analytics, with distributions for Windows, Mac OS X and Linux.
Installation

    # you might need numpy
    pip install nltk

    # enter Python shell
    import nltk
    nltk.download()
Packages

    # For Part of Speech tagging
    maxent_treebank_pos_tagger

    # Get a list of stopwords
    stopwords

    # Brown corpus to play around
    brown
Preparing data / corpus
Tokens

NLTK works on tokens. For example, "Hello World!" will be tokenized to:

    ['Hello', 'World', '!']

The built-in tokenizer for most use cases:

    nltk.word_tokenize("Hello World!")
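To make the idea of a token concrete, here is a minimal standard-library sketch that reproduces the slide's example. The function name simple_tokenize is hypothetical, and the regular expression is only a rough approximation for illustration; it is not how nltk.word_tokenize is implemented and will differ from it on contractions, abbreviations and similar cases.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens.

    Hypothetical helper for illustration only; not NLTK's tokenizer.
    """
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Hello World!")
print(tokens)  # ['Hello', 'World', '!']
```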
Text processing

HTML text:

    raw = nltk.clean_html(html_text)
    tokens = nltk.word_tokenize(raw)
    text = nltk.Text(tokens)

Use BeautifulSoup to preprocess the HTML text and discard unnecessary data.
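The slide reflects NLTK as of 2012. In NLTK 3 and later, nltk.clean_html no longer strips markup and instead points users to BeautifulSoup. As a dependency-free illustration of the same step, the sketch below uses Python's standard html.parser module. The class TextExtractor and the function html_to_text are hypothetical names, and a real pipeline would more likely use BeautifulSoup as the slide suggests.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element
        if self._skip_depth == 0:
            self.parts.append(data)

def html_to_text(html_text):
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    # Collapse the runs of whitespace left behind by the markup
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()

html_text = ("<html><head><style>p {color: red}</style></head>"
             "<body><p>Hello <b>World</b>!</p></body></html>")
raw = html_to_text(html_text)
print(raw)
```

The resulting raw string can then be passed to a tokenizer in place of the output of nltk.clean_html.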
Part-of-speech tagging
POS tagging

    text = "Run away!"
    tokens = nltk.word_tokenize(text)
    nltk.pos_tag(tokens)
    [('Run', 'NNP'), ('away', 'RB'), ('!', '.')]
POS tagging

    [('Run', 'NNP'), ('away', 'RB'), ('!', '.')]

    NNP: Proper noun, singular
    RB: Adverb

http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
POS tagging

"The sailor dogs the barmaid."

    [('The', 'DT'), ('sailor', 'NN'), ('dogs', 'NNS'), ('the', 'DT'), ('barmaid', 'NN'), ('.', '.')]
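The tag abbreviations in these outputs come from the Penn Treebank tag set linked above. The sketch below is a small lookup covering only the tags that appear on these slides, plus the two verb tags for comparison. The names PENN_TAGS and describe are hypothetical, and the table is a partial excerpt, not the full tag set.

```python
# Partial excerpt of the Penn Treebank tag set (illustrative, not complete)
PENN_TAGS = {
    "DT": "determiner",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "PRP": "personal pronoun",
    "RB": "adverb",
    "VB": "verb, base form",
    "VBZ": "verb, 3rd person singular present",
    ".": "sentence-final punctuation",
}

def describe(tagged):
    """Attach a human-readable description to each (word, tag) pair."""
    return [(word, tag, PENN_TAGS.get(tag, "unknown tag")) for word, tag in tagged]

# Tagger output copied from the slide
tagged = [('The', 'DT'), ('sailor', 'NN'), ('dogs', 'NNS'),
          ('the', 'DT'), ('barmaid', 'NN'), ('.', '.')]
for word, tag, meaning in describe(tagged):
    print("%-8s %-4s %s" % (word, tag, meaning))
```

Reading the descriptions makes it easier to see where the tagger's choice differs from the intended reading: in this sentence "dogs" acts as a verb, which would correspond to VBZ, yet it is tagged NNS.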
Sentiment Analysis

Code: http://bit.ly/GLu2Q9
Differentiate between "happy" and "sad" tweets. Teach the classifier the
"features" of happy & sad tweets and test how good it is.
Happy: "Looking through old pics and realizing everything happens for a reason. So happy with where I am right now"

Sad: "So sad I have 8 AM class tomorrow"
1. Process data (tweets): tokenize tweets
2. Extract features: extract_features
3. Train classifier: Naive Bayes classifier
4. Test classifier accuracy
Training data: happy.txt, sad.txt
Testing data: happy_test.txt, sad_test.txt
Tweets obtained from Twitter Search API
Features

Happy tweets usually contain words such as "am happy", "great day", etc.
Sad tweets usually contain words such as "not happy", "am sad", etc.
Output of extract_features():

    {'contains(not)': False, 'contains(view)': False, 'contains(best)': False,
     'contains(excited)': False, 'contains(morning)': False, 'contains(about)': False,
     'contains(horrible)': True, 'contains(like)': False, ... }
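The transcript shows only the output of extract_features(), not its body. A minimal sketch that would produce a dictionary of this shape is given below. The word_features list here is a hypothetical, hard-coded vocabulary taken from the keys on the slide; the talk's actual code (linked above) presumably builds its vocabulary from the training tweets and may differ in detail.

```python
# Hypothetical vocabulary, copied from the keys visible on the slide
word_features = ["not", "view", "best", "excited",
                 "morning", "about", "horrible", "like"]

def extract_features(tokens):
    """Map a tokenized tweet to boolean 'contains(word)' features."""
    token_set = set(tokens)  # set membership keeps each lookup cheap
    return {"contains(%s)" % word: (word in token_set) for word in word_features}

features = extract_features(["what", "a", "horrible", "day"])
print(features["contains(horrible)"])  # True
print(features["contains(not)"])       # False
```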
Training the classifier

    training_set = nltk.classify.util.apply_features(extract_features, tweets)
    classifier = NaiveBayesClassifier.train(training_set)
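To show roughly what training a Naive Bayes classifier on such feature dictionaries involves, here is a small standard-library sketch. It is not NLTK's implementation: the function names train_naive_bayes and classify are hypothetical, the smoothing is plain add-one over the two boolean values (an assumption chosen for simplicity), and the four training examples are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_featuresets):
    """Count labels and per-label feature values.

    labeled_featuresets is a list of (feature_dict, label) pairs.
    """
    label_counts = Counter()
    # value_counts[label][feature_name][feature_value] -> count
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for features, label in labeled_featuresets:
        label_counts[label] += 1
        for name, value in features.items():
            value_counts[label][name][value] += 1
    return label_counts, value_counts

def classify(model, features):
    """Return the label with the highest log prior plus log likelihoods."""
    label_counts, value_counts = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior
        for name, value in features.items():
            seen = value_counts[label][name][value]
            # Add-one smoothing over the two possible boolean values
            score += math.log((seen + 1) / (count + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training data for illustration
training_set = [
    ({"contains(happy)": True, "contains(sad)": False}, "happy"),
    ({"contains(happy)": True, "contains(sad)": False}, "happy"),
    ({"contains(happy)": False, "contains(sad)": True}, "sad"),
    ({"contains(happy)": False, "contains(sad)": True}, "sad"),
]
model = train_naive_bayes(training_set)
print(classify(model, {"contains(happy)": True, "contains(sad)": False}))  # happy
```

Working in log space avoids multiplying many small probabilities together, and the smoothing keeps a feature value never seen with a label from driving that label's score to negative infinity.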
Testing the classifier

    def classify_tweet(tweet):
        return classifier.classify(extract_features(tweet))
    $ python classification.py
    Total accuracy: 90.00% (18/20)

18 tweets were classified correctly.
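The accuracy figure is simply the number of correctly classified test tweets divided by the total. A minimal sketch of that calculation follows. The names count_correct, format_accuracy and toy_classify are hypothetical, and the keyword-based toy classifier only stands in for the trained classifier from the talk so that the example runs on its own.

```python
def count_correct(classify_fn, labeled_tweets):
    """Return (correct, total) for a list of (tweet, expected_label) pairs."""
    correct = sum(1 for tweet, label in labeled_tweets
                  if classify_fn(tweet) == label)
    return correct, len(labeled_tweets)

def format_accuracy(correct, total):
    """Format a result line in the same style as the slide's output."""
    return "Total accuracy: %.2f%% (%d/%d)" % (100.0 * correct / total, correct, total)

def toy_classify(tweet):
    # Stand-in for the trained classifier: a naive keyword rule
    return "sad" if "sad" in tweet.lower() else "happy"

test_tweets = [
    ("So happy with where I am right now", "happy"),
    ("So sad I have 8 AM class tomorrow", "sad"),
    ("not happy today", "sad"),  # the keyword rule gets this one wrong
]
correct, total = count_correct(toy_classify, test_tweets)
print(format_accuracy(correct, total))  # Total accuracy: 66.67% (2/3)
print(format_accuracy(18, 20))          # Total accuracy: 90.00% (18/20)
```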
Where to go from here.
http://www.nltk.org/book
https://class.coursera.org/nlp/auth/welcome
http://www.slideshare.net/shanbady/nltk-boston-text-analytics
[('Thank', 'NNP'), ('you', 'PRP'), ('.', '.')] @victorneo