NLTK Intro for PUGS
Victor Neo
March 27, 2012
Slides for the NLTK talk given in March 2012 at the Python User Group SG Meetup.
Transcript
Natural Language Toolkit @victorneo
Natural Language Processing
"the process of a computer extracting meaningful information from natural
language input and/or producing natural language output"
Getting started with NLTK
NLTK: Open source Python modules, linguistic data and documentation for research and development in natural language processing and text analytics, with distributions for Windows, Mac OSX and Linux.
installation
# you might need numpy
pip install nltk
# enter Python shell
import nltk
nltk.download()
packages
# For Part of Speech tagging
maxent_treebank_pos_tagger
# Get a list of stopwords
stopwords
# Brown corpus to play around with
brown
Preparing data / corpus
tokens
NLTK works on tokens. For example, "Hello World!" will be tokenized to:
['Hello', 'World', '!']
The built-in tokenizer for most use cases:
nltk.word_tokenize("Hello World!")
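To make the idea of tokens concrete without installing NLTK, here is a minimal sketch of a regular-expression tokenizer. It is not how `nltk.word_tokenize` is implemented and it will disagree with it on harder inputs (contractions, abbreviations, URLs); the name `simple_tokenize` and the pattern are assumptions made for illustration only.

```python
import re

# A token is either a run of letters/digits/apostrophes,
# or a single character that is neither whitespace nor part of a word.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']")


def simple_tokenize(text):
    """Split text into word and punctuation tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(text)


print(simple_tokenize("Hello World!"))  # ['Hello', 'World', '!']
```

For real text, the built-in tokenizer from the slide is the better choice; this sketch only shows what "splitting into tokens" means.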
text processing
HTML text:
raw = nltk.clean_html(html_text)
tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens)
Use BeautifulSoup to preprocess the HTML text and discard unnecessary data.
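As far as I know, `nltk.clean_html` stopped working in NLTK 3.x, where it points users to BeautifulSoup instead, so the first line of the slide may fail on a current install. The sketch below shows the same preprocessing step using only the standard library's `html.parser`. The class `TextExtractor` and the function `html_to_text` are hypothetical names, and this is a stand-in for BeautifulSoup, not a replacement for it on messy real-world pages.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped tags.
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html_text):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    return " ".join(parser.parts)


html_text = ("<html><head><style>p {color: red}</style></head>"
             "<body><p>Hello <b>World</b>!</p></body></html>")
raw = html_to_text(html_text)
print(raw)
```

The resulting `raw` string can then be passed to a tokenizer in place of the output of `nltk.clean_html`.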
Part-of-speech tagging
pos tagging
text = "Run away!"
tokens = nltk.word_tokenize(text)
nltk.pos_tag(tokens)
[('Run', 'NNP'), ('away', 'RB'), ('!', '.')]
pos tagging
[('Run', 'NNP'), ('away', 'RB'), ('!', '.')]
NNP: Proper noun, singular
RB: Adverb
http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
pos tagging
"The sailor dogs the barmaid."
[('The', 'DT'), ('sailor', 'NN'), ('dogs', 'NNS'), ('the', 'DT'), ('barmaid', 'NN'), ('.', '.')]
Here "dogs" is used as a verb, but the tagger labels it NNS (plural noun).
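The tagger output is an ordinary list of (word, tag) tuples, so it can be filtered with plain Python. The sketch below copies the tagged sentence from the slide and picks out every word whose tag starts with NN. The helper name `words_with_tag_prefix` is an assumption for illustration. Because "dogs" carries the NNS tag in the slide output, it is picked up as a noun here too.

```python
# Tagged output copied from the slide above.
tagged = [('The', 'DT'), ('sailor', 'NN'), ('dogs', 'NNS'),
          ('the', 'DT'), ('barmaid', 'NN'), ('.', '.')]


def words_with_tag_prefix(tagged_tokens, prefix):
    """Return the words whose POS tag starts with the given prefix."""
    return [word for word, tag in tagged_tokens if tag.startswith(prefix)]


nouns = words_with_tag_prefix(tagged, 'NN')
print(nouns)  # ['sailor', 'dogs', 'barmaid']
```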
Sentiment Analysis
Code: http://bit.ly/GLu2Q9
Differentiate between "happy" and "sad" tweets. Teach the classifier the
"features" of happy & sad tweets and test how good it is.
Happy: "Looking through old pics and realizing everything happens for a reason. So happy with where I am right now"
Sad: "So sad I have 8 AM class tomorrow"
1. Process data (tweets): tokenize tweets
2. Extract features: extract_features
3. Train classifier: Naive Bayes Classifier
4. Test classifier accuracy
Training data: happy.txt, sad.txt
Testing data: happy_test.txt, sad_test.txt
Tweets obtained from Twitter Search API
features
Happy tweets usually contain the following words: "am happy", "great day" etc.
Sad tweets usually contain the following: "not happy", "am sad" etc.
output of extract_features()
{'contains(not)': False, 'contains(view)': False, 'contains(best)': False, 'contains(excited)': False, 'contains(morning)': False, 'contains(about)': False, 'contains(horrible)': True, 'contains(like)': False, ... }
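The slides show the output of `extract_features()` but not its body. A minimal sketch that would produce a dictionary of this shape is given below. The vocabulary list `WORD_FEATURES` is an assumption built from the keys visible in the slide, and the real code at the link above may build its vocabulary differently, for example from the most frequent words in the training tweets.

```python
# Hypothetical vocabulary, taken from the keys visible in the slide output.
WORD_FEATURES = ['not', 'view', 'best', 'excited',
                 'morning', 'about', 'horrible', 'like']


def extract_features(tweet_tokens, word_features=WORD_FEATURES):
    """Map a tokenized tweet to {'contains(word)': bool} for each vocabulary word."""
    tokens = set(token.lower() for token in tweet_tokens)
    return {'contains(%s)' % word: (word in tokens) for word in word_features}


features = extract_features(['What', 'a', 'horrible', 'day'])
print(features['contains(horrible)'])  # True
print(features['contains(not)'])       # False
```

Each tweet becomes a fixed-size set of yes/no answers, which is the form of input a Naive Bayes classifier can be trained on.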
training the classifier
training_set = nltk.classify.util.apply_features(extract_features, tweets)
classifier = NaiveBayesClassifier.train(training_set)
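To show what training on feature dictionaries involves, here is a small pure-Python Naive Bayes sketch. It is not NLTK's `NaiveBayesClassifier` and makes its own simplifying assumptions: features are boolean, and probabilities use add-one smoothing. The function names and the four-item toy training set are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_featuresets):
    """Count labels and per-label feature values from (features, label) pairs."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)  # (label, feature name) -> value counts
    for features, label in labeled_featuresets:
        label_counts[label] += 1
        for name, value in features.items():
            value_counts[(label, name)][value] += 1
    return label_counts, value_counts


def classify(model, features):
    """Return the label with the highest log-probability for the features."""
    label_counts, value_counts = model
    total = sum(label_counts.values())
    best_label, best_score = None, float('-inf')
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior
        for name, value in features.items():
            seen = value_counts[(label, name)][value]
            # Add-one smoothing over the two possible boolean values.
            score += math.log((seen + 1) / (count + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data, invented for this example.
training_set = [
    ({'contains(happy)': True, 'contains(sad)': False}, 'happy'),
    ({'contains(happy)': True, 'contains(sad)': False}, 'happy'),
    ({'contains(happy)': False, 'contains(sad)': True}, 'sad'),
    ({'contains(happy)': False, 'contains(sad)': True}, 'sad'),
]
model = train_naive_bayes(training_set)
print(classify(model, {'contains(happy)': True, 'contains(sad)': False}))  # happy
```

In practice the NLTK classifier from the slide is preferable; the sketch only makes the counting behind it visible.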
testing the classifier
def classify_tweet(tweet):
    return classifier.classify(extract_features(tweet))
$ python classification.py
Total accuracy: 90.00% (18/20)
18 tweets got classified correctly.
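The accuracy figure is the number of correct predictions divided by the number of test tweets. The sketch below shows that arithmetic and reproduces the format of the output line. The helper name `accuracy_report` is hypothetical, and the label lists are invented purely so that 18 of 20 match; they are not the talk's actual test results.

```python
def accuracy_report(predicted, expected):
    """Format the share of predictions that match the expected labels."""
    correct = sum(1 for p, e in zip(predicted, expected) if p == e)
    total = len(expected)
    percent = 100.0 * correct / total
    return 'Total accuracy: %.2f%% (%d/%d)' % (percent, correct, total)


# Invented labels: 10 happy and 10 sad test tweets, one mistake in each group.
expected = ['happy'] * 10 + ['sad'] * 10
predicted = ['happy'] * 9 + ['sad'] + ['sad'] * 9 + ['happy']

print(accuracy_report(predicted, expected))  # Total accuracy: 90.00% (18/20)
```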
Where to go from here.
http://www.nltk.org/book
https://class.coursera.org/nlp/auth/welcome
http://www.slideshare.net/shanbady/nltk-boston-text-analytics
[('Thank', 'NNP'), ('you', 'PRP'), ('.', '.')] @victorneo