Filtering n-grams using Machine Learning
vorushin
April 06, 2012
My lightning talk from the first Kiev AI/NLP group meeting.
Transcript
Filtering n-grams using Machine Learning
Unsorted unigrams, 13M
closetohome CMX309FLC AZ3 Lehanga indexterm.endofrang NIC3 N1NB Mirabadi phantomd ANOTHER.EXAMPLE awful63 Zabolotsky Dispencer cremonesi kind.The ECOOP'97 4.499E OrbitzSaver jellying ENr313 paulxcs Campaoré überschreibt Püttmann nomalized Profesje Blogzerk imnot getPluginPreferencesFlag backgroundCorrect DEDeutschland at'ai
Filtered with regexps, 10M
closetohome lehanga Mirabadi phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs Campaoré überschreibt Püttmann nomalized Profesje Blogzerk imnot DEDeutschland at'ai
Filtered with SVM, 2.5M
closetohome lehanga Mirabadi phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs nomalized Profesje Blogzerk imnot
Data
• Good data: Wiktionary words
• Bad data: words filtered out by regexps
• Features
  – length of word
  – count of uppercase chars (excluding the first one)
  – count of non-alpha chars
  – probability of word given 2-char n-grams
  – unigram frequency
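The five features listed above could be computed roughly as follows. This is only a sketch under assumptions: the slides do not say how the 2-char n-gram probability was calculated, so the add-one-smoothed mean log-probability, the `^`/`$` boundary markers, the `vocab_size` constant, and all function names here are hypothetical choices, as is the log scaling of the unigram frequency.

```python
import math
from collections import Counter


def train_bigram_model(words):
    """Count character bigrams over a list of known-good words."""
    counts = Counter()
    for w in words:
        padded = f"^{w.lower()}$"  # assumed word-boundary markers
        counts.update(padded[i:i + 2] for i in range(len(padded) - 1))
    return counts, sum(counts.values())


def bigram_log_prob(word, model, vocab_size=10_000):
    """Mean log-probability of the word's character bigrams.

    Uses add-one smoothing; vocab_size is an arbitrary assumed constant.
    """
    counts, total = model
    padded = f"^{word.lower()}$"
    grams = [padded[i:i + 2] for i in range(len(padded) - 1)]
    logp = sum(math.log((counts[g] + 1) / (total + vocab_size)) for g in grams)
    return logp / len(grams)


def extract_features(word, model, unigram_count):
    """Return the five features named on the slide, in slide order."""
    return [
        len(word),                            # length of word
        sum(c.isupper() for c in word[1:]),   # uppercase chars, first excluded
        sum(not c.isalpha() for c in word),   # non-alpha chars
        bigram_log_prob(word, model),         # plausibility under 2-char n-grams
        math.log1p(unigram_count),            # unigram frequency (log-scaled, assumed)
    ]


if __name__ == "__main__":
    model = train_bigram_model(["close", "home", "jelly"])
    print(extract_features("CMX309FLC", model, unigram_count=3))
```

In practice the bigram model would be trained on the full set of good words rather than the three-word toy list used here.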
Details
• scikit-learn: Python library for machine learning
• SVM with Gaussian kernel
• O(# of features * N^2) to O(# of features * N^3)
• 100k items in training data => 5 min on 2 GHz
• F1 = 0.98
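A minimal sketch of the setup described above, using scikit-learn's `SVC` with `kernel="rbf"` (the Gaussian kernel). The toy feature rows, the feature scaling step, and the `C` and `gamma` values are assumptions made for illustration; the slides do not give the actual hyperparameters or preprocessing.

```python
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy rows: [length, uppercase chars after the first, non-alpha chars].
# Label 1 = good word, 0 = bad word. Values are illustrative only.
X_train = [
    [11, 0, 0], [8, 0, 0], [7, 0, 0], [10, 0, 0],   # dictionary-like words
    [9, 5, 3], [3, 1, 1], [4, 2, 1], [6, 1, 3],     # junk tokens
]
y_train = [1, 1, 1, 1, 0, 0, 0, 0]

# Scaling is an assumed preprocessing step; RBF kernels are sensitive
# to feature ranges, so it is commonly paired with SVC.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_train, y_train)

X_test = [[9, 0, 0], [4, 2, 1]]
y_test = [1, 0]
y_pred = clf.predict(X_test)
print("F1 on toy held-out rows:", f1_score(y_test, y_pred))
```

Training time for kernel SVMs grows between quadratically and cubically with the number of samples, which matches the complexity range on the slide and explains why the training set was limited to 100k items.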
Thank you!
Roman Vorushin, Grammarly Inc.
http://vorushin.ru
http://twitter.com/vorushin