Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
NLTK Intro for PUGS
Search
Victor Neo
March 27, 2012
Programming
7
550
NLTK Intro for PUGS
Slides for the NLTK talk given on March 2012 for Python User Group SG Meetup.
Victor Neo
March 27, 2012
Tweet
Share
More Decks by Victor Neo
See All by Victor Neo
Django - The Next Steps
victorneo
5
590
DevOps: Python tools to get started
victorneo
9
13k
Git and Python workshop
victorneo
2
780
Other Decks in Programming
See All in Programming
競技プログラミングへのお誘い@阪大BOOSTセミナー
kotamanegi
0
360
Kaigi on Railsに初参加したら、その日にLT登壇が決定した件について
tama50505
0
100
SymfonyCon Vienna 2025: Twig, still relevant in 2025?
fabpot
3
1.2k
毎日13時間もかかるバッチ処理をたった3日で60%短縮するためにやったこと
sho_ssk_
1
240
rails statsで大解剖 🔍 “B/43流” のRailsの育て方を歴史とともに振り返ります
shoheimitani
2
950
これでLambdaが不要に?!Step FunctionsのJSONata対応について
iwatatomoya
2
3.7k
今年のアップデートで振り返るCDKセキュリティのシフトレフト/2024-cdk-security-shift-left
tomoki10
0
210
From Translations to Multi Dimension Entities
alexanderschranz
2
140
return文におけるstd::moveについて
onihusube
1
1.2k
nekko cloudにおけるProxmox VE利用事例
irumaru
3
440
CQRS+ES の力を使って効果を感じる / Feel the effects of using the power of CQRS+ES
seike460
PRO
0
150
htmxって知っていますか?次世代のHTML
hiro_ghap1
0
350
Featured
See All Featured
The Art of Delivering Value - GDevCon NA Keynote
reverentgeek
8
1.2k
[RailsConf 2023] Rails as a piece of cake
palkan
53
5k
Statistics for Hackers
jakevdp
796
220k
Improving Core Web Vitals using Speculation Rules API
sergeychernyshev
1
100
Building Flexible Design Systems
yeseniaperezcruz
327
38k
Imperfection Machines: The Place of Print at Facebook
scottboms
266
13k
XXLCSS - How to scale CSS and keep your sanity
sugarenia
248
1.3M
Fight the Zombie Pattern Library - RWD Summit 2016
marcelosomers
232
17k
Code Review Best Practice
trishagee
65
17k
GraphQLの誤解/rethinking-graphql
sonatard
67
10k
RailsConf 2023
tenderlove
29
940
Thoughts on Productivity
jonyablonski
68
4.4k
Transcript
Natural Language Toolkit @victorneo
Natural Language Processing
"the process of a computer extracting meaningful information from natural
language input and/or producing natural language output"
None
Getting started with NLTK
Open source Python modules, linguistic data and documentation for research
and development in natural language processing and text analytics, with distributions for Windows, Mac OSX and Linux. NLTK
None
installatio n # you might need numpy pip install nltk
# enter Python shell import nltk nltk.download()
None
packages # For Part of Speech tagging maxent_treebank_pos_tagger # Get
a list of stopwords stopwords # Brown corpus to play around brown
Preparing data / corpus
tokens NLTK works on Tokens, for example, "Hello World!" will
be tokenized to: ['Hello', 'World', '!'] The built-in tokenizer for most use cases: nltk.word_tokenize("Hello World!")
text processing HTML text: raw = nltk.clean_html(html_text) tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens) Use BeautifulSoup for preprocessing of the HTML text to discard unnecessary data.
Part-of-speech tagging
pos tagging text = "Run away!" nltk.word_tokenize(text) nltk.pos_tag(tokens) [('Run', 'NNP'),
('away', 'RB'), ('!', '.')]
pos tagging [('Run', 'NNP'), ('away', 'RB'), ('!', '.')] NNP: Proper
Noun, Singular RB : Adverb http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos. html
pos tagging "The sailor dogs the barmaid." [('The', 'DT'), ('sailor',
'NN'), ('dogs', 'NNS'), ('the', 'DT'), ('barmaid', 'NN'), ('.', '.')]
Sentiment Analysis Code: http://bit.ly/GLu2Q9
Differentiate between "happy" and "sad" tweets. Teach the classifier the
"features" of happy & sad tweets and test how good it is.
Happy: "Looking through old pics and realizing everything happens for
a reason. So happy with where I am right now" Sad: "So sad I have 8 AM class tomorrow"
Process data (tweets) Extract Features Train classifier Test classifer accuracy
Tokenize tweets extract_features Naive Bayes Classifier
Process data (tweets) Extract Features Train classifier Test classifer accuracy
Tokenize tweets extract_features Naive Bayes Classifier
happy.txt sad.txt happy_test.txt sad_test.txt } training data } testing data
Tweets obtained from Twitter Search API
Process data (tweets) Extract Features Train classifier Test classifer accuracy
Tokenize tweets extract_features Naive Bayes Classifier
Happy tweets usually contain the following words: "am happy", "great
day" etc. Sad tweets usually contain the following: "not happy", "am sad" etc. features
{'contains(not)': False, 'contains(view)': False, 'contains(best)': False, 'contains(excited)': False, 'contains(morning)': False,
'contains(about)': False, 'contains(horrible)': True, 'contains(like)': False, ... } output of extract_features()
Process data (tweets) Extract Features Train classifier Test classifer accuracy
Tokenize tweets extract_features Naive Bayes Classifier
training_set = \ nltk.classify.util.\ apply_features(extract_features, tweets) classifier = \ NaiveBayesClassifier.train
(training_set) training the classifer training classifer
Process data (tweets) Extract Features Train classifier Test classifer accuracy
Tokenize tweets extract_features Naive Bayes Classifier
def classify_tweet(tweet): return \ classifier.classify(extract_features (tweet)) testing classifer
$ python classification.py Total accuracy: 90.00% (18/20) 18 tweets got
classified correctly.
Where to go from here.
http://www.nltk.org/book
https://class.coursera.org/nlp/auth/welcome
http://www.slideshare.net/shanbady/nltk-boston-text-analytics
[('Thank', 'NNP'), ('you', 'PRP'), ('.', '.')] @victorneo