Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Filtering n-grams using Machine Learning
Search
vorushin
April 06, 2012
Programming
2
580
Filtering n-grams using Machine Learning
My lightning talk from first Kiev AI/NLP group meeting.
vorushin
April 06, 2012
Tweet
Share
Other Decks in Programming
See All in Programming
Domain-centric? Why Hexagonal, Onion, and Clean Architecture Are Answers to the Wrong Question
olivergierke
2
640
Conquering Massive Traffic Spikes in Ruby Applications with Pitchfork
riseshia
0
150
dynamic!
moro
9
6.9k
monorepo の Go テストをはやくした〜い!~最小の依存解決への道のり~ / faster-testing-of-monorepos
convto
2
430
技術的負債の正体を知って向き合う / Facing Technical Debt
irof
0
120
Swift Concurrency - 状態監視の罠
objectiveaudio
2
480
10年もののAPIサーバーにおけるCI/CDの改善の奮闘
mbook
0
790
iOSエンジニア向けの英語学習アプリを作る!
yukawashouhei
0
180
ネイティブ製ガントチャートUIを作って学ぶUICollectionViewLayoutの威力
jrsaruo
0
140
CSC509 Lecture 02
javiergs
PRO
0
410
CSC305 Lecture 01
javiergs
PRO
1
400
Reduxモダナイズ 〜コードのモダン化を通して、将来のライブラリ移行に備える〜
pvcresin
2
690
Featured
See All Featured
Scaling GitHub
holman
463
140k
Distributed Sagas: A Protocol for Coordinating Microservices
caitiem20
333
22k
Being A Developer After 40
akosma
91
590k
Embracing the Ebb and Flow
colly
88
4.8k
No one is an island. Learnings from fostering a developers community.
thoeni
21
3.5k
Understanding Cognitive Biases in Performance Measurement
bluesmoon
29
2.6k
10 Git Anti Patterns You Should be Aware of
lemiorhan
PRO
657
61k
Building Adaptive Systems
keathley
43
2.8k
StorybookのUI Testing Handbookを読んだ
zakiyama
31
6.2k
Stop Working from a Prison Cell
hatefulcrawdad
271
21k
Designing for Performance
lara
610
69k
Into the Great Unknown - MozCon
thekraken
40
2.1k
Transcript
Filtering n-‐grams using Machine Learning
Unsorted unigrams. 13M closetohome CMX309FLC AZ3
Lehanga indexterm.endofrang NIC3 N1NB Mirabadi phantomd ANOTHER.EXAMPLE awful63 Zabolotsky Dispencer cremonesi kind.The ECOOP'97 4.499E OrbitzSaver jellying ENr313 paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot getPluginPreferencesF lag backgroundCorrect DEDeutschland at'ai
Filtered with regexps, 10M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot DEDeutschland at'ai
Filtered with SVM, 2.5M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs nomalized Profesje Blogzerk imnot
Data • Good data: wikaonary words •
Bad data: words filtered out by regexps • Features – length of word – count of uppercase chars (excluding first one) – count of non-‐alpha chars – probability of word given 2-‐char n-‐grams – unigram frequency
Details • scikit-‐learn – python library for machine
learning • SVM with Gaussian kernel • O(# of features * N2) – O(# of features * N3) • 100k items in training data => 5 min on 2 Ghz • F1 = 0.98
Thank you! Roman Vorushin, Grammarly Inc.
hZp://vorushin.ru hZp://twiZer.com/vorushin