Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Filtering n-grams using Machine Learning
Search
vorushin
April 06, 2012
Programming
2
580
Filtering n-grams using Machine Learning
My lightning talk from first Kiev AI/NLP group meeting.
vorushin
April 06, 2012
Tweet
Share
Other Decks in Programming
See All in Programming
AtCoder Conference 2025
shindannin
0
860
まだ間に合う!Claude Code元年をふりかえる
nogu66
5
920
tparseでgo testの出力を見やすくする
utgwkk
2
330
開発に寄りそう自動テストの実現
goyoki
2
1.7k
Rubyで鍛える仕組み化プロヂュース力
muryoimpl
0
270
はじめてのカスタムエージェント【GitHub Copilot Agent Mode編】
satoshi256kbyte
0
140
Combinatorial Interview Problems with Backtracking Solutions - From Imperative Procedural Programming to Declarative Functional Programming - Part 2
philipschwarz
PRO
0
130
re:Invent 2025 のイケてるサービスを紹介する
maroon1st
0
160
Implementation Patterns
denyspoltorak
0
140
AtCoder Conference 2025「LLM時代のAHC」
imjk
2
620
ゲームの物理 剛体編
fadis
0
390
HTTPプロトコル正しく理解していますか? 〜かわいい猫と共に学ぼう。ฅ^•ω•^ฅ ニャ〜
hekuchan
2
580
Featured
See All Featured
How to Get Subject Matter Experts Bought In and Actively Contributing to SEO & PR Initiatives.
livdayseo
0
37
Designing Powerful Visuals for Engaging Learning
tmiket
0
190
Color Theory Basics | Prateek | Gurzu
gurzu
0
160
Sam Torres - BigQuery for SEOs
techseoconnect
PRO
0
150
Why You Should Never Use an ORM
jnunemaker
PRO
61
9.7k
The Psychology of Web Performance [Beyond Tellerrand 2023]
tammyeverts
49
3.3k
Max Prin - Stacking Signals: How International SEO Comes Together (And Falls Apart)
techseoconnect
PRO
0
58
Crafting Experiences
bethany
0
25
AI Search: Where Are We & What Can We Do About It?
aleyda
0
6.8k
Raft: Consensus for Rubyists
vanstee
141
7.3k
A Tale of Four Properties
chriscoyier
162
23k
Stop Working from a Prison Cell
hatefulcrawdad
273
21k
Transcript
Filtering n-‐grams using Machine Learning
Unsorted unigrams. 13M closetohome CMX309FLC AZ3
Lehanga indexterm.endofrang NIC3 N1NB Mirabadi phantomd ANOTHER.EXAMPLE awful63 Zabolotsky Dispencer cremonesi kind.The ECOOP'97 4.499E OrbitzSaver jellying ENr313 paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot getPluginPreferencesF lag backgroundCorrect DEDeutschland at'ai
Filtered with regexps, 10M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot DEDeutschland at'ai
Filtered with SVM, 2.5M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs nomalized Profesje Blogzerk imnot
Data • Good data: wikaonary words •
Bad data: words filtered out by regexps • Features – length of word – count of uppercase chars (excluding first one) – count of non-‐alpha chars – probability of word given 2-‐char n-‐grams – unigram frequency
Details • scikit-‐learn – python library for machine
learning • SVM with Gaussian kernel • O(# of features * N2) – O(# of features * N3) • 100k items in training data => 5 min on 2 Ghz • F1 = 0.98
Thank you! Roman Vorushin, Grammarly Inc.
hZp://vorushin.ru hZp://twiZer.com/vorushin