NLTK Intro for PUGS
Victor Neo
March 27, 2012
Slides for the NLTK talk given in March 2012 at the Python User Group SG meetup.
Transcript
Natural Language Toolkit @victorneo
Natural Language Processing
"the process of a computer extracting meaningful information from natural
language input and/or producing natural language output"
Getting started with NLTK
"Open source Python modules, linguistic data and documentation for research and development in natural language processing and text analytics, with distributions for Windows, Mac OSX and Linux." (NLTK)
installation
# you might need numpy
pip install nltk

# enter Python shell
import nltk
nltk.download()
packages
# For Part-of-Speech tagging
maxent_treebank_pos_tagger
# Get a list of stopwords
stopwords
# Brown corpus to play around
brown
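A minimal sketch of fetching those packages programmatically with the NLTK downloader (the package names above are from the NLTK 2.x era; newer releases ship averaged_perceptron_tagger for POS tagging instead):

import nltk

# Fetch each corpus/model named on the slide without opening the GUI downloader.
for package in ("maxent_treebank_pos_tagger", "stopwords", "brown"):
    nltk.download(package)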
Preparing data / corpus
tokens
NLTK works on tokens; for example, "Hello World!" will be tokenized to:
['Hello', 'World', '!']
The built-in tokenizer for most use cases:
nltk.word_tokenize("Hello World!")
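A runnable form of that call; it assumes the punkt tokenizer models have already been downloaded via nltk.download():

import nltk

# Split the string into word and punctuation tokens.
tokens = nltk.word_tokenize("Hello World!")
print(tokens)  # ['Hello', 'World', '!']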
text processing
HTML text:
raw = nltk.clean_html(html_text)
tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens)
Use BeautifulSoup for preprocessing of the HTML text to discard unnecessary data.
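Note that nltk.clean_html() was removed in NLTK 3, which points users to an HTML parser instead; a minimal sketch of the same step with BeautifulSoup, as the slide suggests (html_text here is a placeholder input):

import nltk
from bs4 import BeautifulSoup

html_text = "<html><body><p>Hello World!</p></body></html>"  # placeholder input
# Strip tags and keep only the visible text before tokenizing.
raw = BeautifulSoup(html_text, "html.parser").get_text()
tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens)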
Part-of-speech tagging
pos tagging
text = "Run away!"
tokens = nltk.word_tokenize(text)
nltk.pos_tag(tokens)
[('Run', 'NNP'), ('away', 'RB'), ('!', '.')]
pos tagging
[('Run', 'NNP'), ('away', 'RB'), ('!', '.')]
NNP: Proper Noun, Singular
RB: Adverb
http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
pos tagging "The sailor dogs the barmaid." [('The', 'DT'), ('sailor',
'NN'), ('dogs', 'NNS'), ('the', 'DT'), ('barmaid', 'NN'), ('.', '.')]
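A self-contained sketch of the tagging calls above; it assumes a tokenizer and a tagger model have been downloaded (the slide's maxent_treebank_pos_tagger, or averaged_perceptron_tagger on current NLTK), and the exact tags can differ depending on which tagger is installed:

import nltk

# Tag both example sentences from the slides.
for sentence in ("Run away!", "The sailor dogs the barmaid."):
    tokens = nltk.word_tokenize(sentence)
    print(nltk.pos_tag(tokens))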
Sentiment Analysis
Code: http://bit.ly/GLu2Q9
Differentiate between "happy" and "sad" tweets. Teach the classifier the
"features" of happy & sad tweets and test how good it is.
Happy: "Looking through old pics and realizing everything happens for
a reason. So happy with where I am right now" Sad: "So sad I have 8 AM class tomorrow"
Process data (tweets) → Extract Features → Train classifier → Test classifier accuracy
Tokenize tweets → extract_features → Naive Bayes Classifier
happy.txt, sad.txt } training data
happy_test.txt, sad_test.txt } testing data
Tweets obtained from Twitter Search API
Process data (tweets) → Extract Features → Train classifier → Test classifier accuracy
Tokenize tweets → extract_features → Naive Bayes Classifier
features
Happy tweets usually contain the following words: "am happy", "great day", etc.
Sad tweets usually contain the following: "not happy", "am sad", etc.
output of extract_features()
{'contains(not)': False,
 'contains(view)': False,
 'contains(best)': False,
 'contains(excited)': False,
 'contains(morning)': False,
 'contains(about)': False,
 'contains(horrible)': True,
 'contains(like)': False,
 ... }
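The talk's full code is behind the bit.ly link above; as a hypothetical sketch, a word-presence extractor that produces dictionaries of that shape could look like the following (word_features, the vocabulary gathered from the training tweets, is an assumption, not something shown on the slides):

# Assumed vocabulary collected from the training tweets.
word_features = ['not', 'view', 'best', 'excited',
                 'morning', 'about', 'horrible', 'like']

def extract_features(tweet_words):
    # True for every vocabulary word that appears in the tweet.
    words = set(tweet_words)
    return {'contains(%s)' % word: (word in words) for word in word_features}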
Process data (tweets) → Extract Features → Train classifier → Test classifier accuracy
Tokenize tweets → extract_features → Naive Bayes Classifier
training the classifier
training_set = nltk.classify.util.apply_features(extract_features, tweets)
classifier = NaiveBayesClassifier.train(training_set)
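A minimal, self-contained sketch of that training step, assuming (as in the linked code) that each training tweet is a (token_list, label) pair; the tiny stand-in extract_features and the example tweets are placeholders so the snippet runs on its own:

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier

def extract_features(tweet_words):
    # Stand-in for the word-presence extractor sketched earlier.
    words = set(tweet_words)
    return {'contains(%s)' % w: (w in words) for w in ('happy', 'sad', 'great', 'not')}

# Placeholder labeled tweets: (token_list, label) pairs.
tweets = [(['am', 'happy', 'today'], 'happy'),
          (['so', 'sad', 'about', 'tomorrow'], 'sad')]

training_set = nltk.classify.util.apply_features(extract_features, tweets)
classifier = NaiveBayesClassifier.train(training_set)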
Process data (tweets) → Extract Features → Train classifier → Test classifier accuracy
Tokenize tweets → extract_features → Naive Bayes Classifier
testing the classifier
def classify_tweet(tweet):
    return classifier.classify(extract_features(tweet))
$ python classification.py
Total accuracy: 90.00% (18/20)
18 tweets were classified correctly.
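The 90% figure comes from the talk's classification.py; an equivalent check can be sketched with NLTK's built-in accuracy helper, assuming test_tweets is labeled the same way as the training data and classifier and extract_features come from the training sketch above:

import nltk.classify.util

# Placeholder test tweets, labeled like the training data.
test_tweets = [(['so', 'happy', 'right', 'now'], 'happy'),
               (['feeling', 'sad', 'today'], 'sad')]

test_set = nltk.classify.util.apply_features(extract_features, test_tweets)
accuracy = nltk.classify.util.accuracy(classifier, test_set)
print('Total accuracy: %.2f%%' % (accuracy * 100))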
Where to go from here.
http://www.nltk.org/book
https://class.coursera.org/nlp/auth/welcome
http://www.slideshare.net/shanbady/nltk-boston-text-analytics
[('Thank', 'NNP'), ('you', 'PRP'), ('.', '.')] @victorneo