Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Filtering n-grams using Machine Learning
Search
vorushin
April 06, 2012
Programming
2
580
Filtering n-grams using Machine Learning
My lightning talk from first Kiev AI/NLP group meeting.
vorushin
April 06, 2012
Tweet
Share
Other Decks in Programming
See All in Programming
Ruby x Terminal
a_matsuda
7
590
DevinとClaude Code、SREの現場で使い倒してみた件
karia
1
990
ふつうの Rubyist、ちいさなデバイス、大きな一年
bash0c7
0
780
AIとペアプロして処理時間を97%削減した話 #pyconshizu
kashewnuts
1
210
grapheme_strrev関数が採択されました(あと雑感)
youkidearitai
PRO
1
210
Claude Code の Skill で複雑な既存仕様をすっきり整理しよう
yuichirokato
1
360
Go 1.26でのsliceのメモリアロケーション最適化 / Go 1.26 リリースパーティ #go126party
mazrean
1
370
S3ストレージクラスの「見える」「ある」「使える」は全部違う ─ 体験から見た、仕様の深淵を覗く
ya_ma23
0
230
AIコーディングの理想と現実 2026 | AI Coding: Expectations vs. Reality 2026
tomohisa
0
1.2k
What Spring Developers Should Know About Jakarta EE
ivargrimstad
0
340
受け入れテスト駆動開発(ATDD)×AI駆動開発 AI時代のATDDの取り組み方を考える
kztakasaki
2
550
Unity6.3 AudioUpdate
cova8bitdots
0
120
Featured
See All Featured
The Curse of the Amulet
leimatthew05
1
9.8k
Facilitating Awesome Meetings
lara
57
6.8k
Breaking role norms: Why Content Design is so much more than writing copy - Taylor Woolridge
uxyall
0
200
Money Talks: Using Revenue to Get Sh*t Done
nikkihalliwell
0
180
GraphQLの誤解/rethinking-graphql
sonatard
75
11k
How Fast Is Fast Enough? [PerfNow 2025]
tammyeverts
3
480
個人開発の失敗を避けるイケてる考え方 / tips for indie hackers
panda_program
122
21k
CoffeeScript is Beautiful & I Never Want to Write Plain JavaScript Again
sstephenson
162
16k
Designing for humans not robots
tammielis
254
26k
Helping Users Find Their Own Way: Creating Modern Search Experiences
danielanewman
31
3.1k
First, design no harm
axbom
PRO
2
1.1k
Evolving SEO for Evolving Search Engines
ryanjones
0
150
Transcript
Filtering n-‐grams using Machine Learning
Unsorted unigrams. 13M closetohome CMX309FLC AZ3
Lehanga indexterm.endofrang NIC3 N1NB Mirabadi phantomd ANOTHER.EXAMPLE awful63 Zabolotsky Dispencer cremonesi kind.The ECOOP'97 4.499E OrbitzSaver jellying ENr313 paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot getPluginPreferencesF lag backgroundCorrect DEDeutschland at'ai
Filtered with regexps, 10M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot DEDeutschland at'ai
Filtered with SVM, 2.5M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs nomalized Profesje Blogzerk imnot
Data • Good data: wikaonary words •
Bad data: words filtered out by regexps • Features – length of word – count of uppercase chars (excluding first one) – count of non-‐alpha chars – probability of word given 2-‐char n-‐grams – unigram frequency
Details • scikit-‐learn – python library for machine
learning • SVM with Gaussian kernel • O(# of features * N2) – O(# of features * N3) • 100k items in training data => 5 min on 2 Ghz • F1 = 0.98
Thank you! Roman Vorushin, Grammarly Inc.
hZp://vorushin.ru hZp://twiZer.com/vorushin