Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Filtering n-grams using Machine Learning
Search
vorushin
April 06, 2012
Programming
2
580
Filtering n-grams using Machine Learning
My lightning talk from first Kiev AI/NLP group meeting.
vorushin
April 06, 2012
Tweet
Share
Other Decks in Programming
See All in Programming
CSC307 Lecture 05
javiergs
PRO
0
500
AIによるイベントストーミング図からのコード生成 / AI-powered code generation from Event Storming diagrams
nrslib
2
1.9k
高速開発のためのコード整理術
sutetotanuki
1
410
CSC307 Lecture 06
javiergs
PRO
0
690
要求定義・仕様記述・設計・検証の手引き - 理論から学ぶ明確で統一された成果物定義
orgachem
PRO
1
210
AIエージェントのキホンから学ぶ「エージェンティックコーディング」実践入門
masahiro_nishimi
6
600
プロダクトオーナーから見たSOC2 _SOC2ゆるミートアップ#2
kekekenta
0
220
CSC307 Lecture 08
javiergs
PRO
0
670
QAフローを最適化し、品質水準を満たしながらリリースまでの期間を最短化する #RSGT2026
shibayu36
2
4.4k
CSC307 Lecture 01
javiergs
PRO
0
690
izumin5210のプロポーザルのネタ探し #tskaigi_msup
izumin5210
1
140
Raku Raku Notion 20260128
hareyakayuruyaka
0
360
Featured
See All Featured
Design and Strategy: How to Deal with People Who Don’t "Get" Design
morganepeng
133
19k
What's in a price? How to price your products and services
michaelherold
247
13k
Efficient Content Optimization with Google Search Console & Apps Script
katarinadahlin
PRO
1
330
Beyond borders and beyond the search box: How to win the global "messy middle" with AI-driven SEO
davidcarrasco
1
57
WENDY [Excerpt]
tessaabrams
9
36k
The Spectacular Lies of Maps
axbom
PRO
1
530
Done Done
chrislema
186
16k
The Mindset for Success: Future Career Progression
greggifford
PRO
0
240
Sharpening the Axe: The Primacy of Toolmaking
bcantrill
46
2.7k
Fantastic passwords and where to find them - at NoRuKo
philnash
52
3.6k
Color Theory Basics | Prateek | Gurzu
gurzu
0
200
Learning to Love Humans: Emotional Interface Design
aarron
275
41k
Transcript
Filtering n-‐grams using Machine Learning
Unsorted unigrams. 13M closetohome CMX309FLC AZ3
Lehanga indexterm.endofrang NIC3 N1NB Mirabadi phantomd ANOTHER.EXAMPLE awful63 Zabolotsky Dispencer cremonesi kind.The ECOOP'97 4.499E OrbitzSaver jellying ENr313 paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot getPluginPreferencesF lag backgroundCorrect DEDeutschland at'ai
Filtered with regexps, 10M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot DEDeutschland at'ai
Filtered with SVM, 2.5M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs nomalized Profesje Blogzerk imnot
Data • Good data: wikaonary words •
Bad data: words filtered out by regexps • Features – length of word – count of uppercase chars (excluding first one) – count of non-‐alpha chars – probability of word given 2-‐char n-‐grams – unigram frequency
Details • scikit-‐learn – python library for machine
learning • SVM with Gaussian kernel • O(# of features * N2) – O(# of features * N3) • 100k items in training data => 5 min on 2 Ghz • F1 = 0.98
Thank you! Roman Vorushin, Grammarly Inc.
hZp://vorushin.ru hZp://twiZer.com/vorushin