Filtering n-grams using Machine Learning
vorushin
April 06, 2012
My lightning talk from the first Kiev AI/NLP group meeting.
Transcript
Filtering n-grams using Machine Learning
Unsorted unigrams, 13M
closetohome CMX309FLC AZ3 Lehanga indexterm.endofrang NIC3 N1NB Mirabadi phantomd ANOTHER.EXAMPLE awful63 Zabolotsky Dispencer cremonesi kind.The ECOOP'97 4.499E OrbitzSaver jellying ENr313 paulxcs Campaoré überschreibt Püttmann nomalized Profesje Blogzerk imnot getPluginPreferencesFlag backgroundCorrect DEDeutschland at'ai
Filtered with regexps, 10M
closetohome lehanga Mirabadi phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs Campaoré überschreibt Püttmann nomalized Profesje Blogzerk imnot DEDeutschland at'ai
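The slides do not show the regexps themselves. As a rough sketch only, the pattern below is a guess reverse-engineered from which example tokens disappear between the unsorted and regexp-filtered lists: it rejects any token containing a digit, a dot, or a lowercase ASCII letter directly followed by an uppercase one. The names `NOISE` and `passes_regex_filter` are hypothetical.

```python
import re

# Hypothetical rejection pattern, inferred from the slide examples only
# (not the talk's actual regexps): a digit or a dot anywhere in the token,
# or a lowercase ASCII letter immediately followed by an uppercase one.
NOISE = re.compile(r"[0-9.]|[a-z][A-Z]")


def passes_regex_filter(token):
    """Return True if the token shows none of the assumed noise patterns."""
    return NOISE.search(token) is None


sample = [
    "closetohome", "CMX309FLC", "indexterm.endofrang", "awful63",
    "OrbitzSaver", "jellying", "backgroundCorrect", "DEDeutschland",
]
kept = [t for t in sample if passes_regex_filter(t)]
print(kept)
```

Rules like these are cheap to run over millions of unigrams, but they still let misspellings and fused words such as "closetohome" through, which is where a learned classifier can help.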
Filtered with SVM, 2.5M
closetohome lehanga Mirabadi phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs nomalized Profesje Blogzerk imnot
Data
• Good data: Wiktionary words
• Bad data: words filtered out by regexps
• Features
  – length of word
  – count of uppercase chars (excluding the first one)
  – count of non-alpha chars
  – probability of word given 2-char n-grams
  – unigram frequency
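The five features listed above could be computed along these lines. This is a minimal sketch, not the talk's code: the function names, the `^`/`$` boundary markers, the add-one smoothing, the `alphabet_size` constant, and the log scaling of the unigram count are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_bigram_model(good_words):
    """Count character bigrams over known-good words (e.g. dictionary entries)."""
    bigrams, contexts = Counter(), Counter()
    for w in good_words:
        padded = "^" + w.lower() + "$"  # assumed boundary markers
        for a, b in zip(padded, padded[1:]):
            bigrams[a + b] += 1
            contexts[a] += 1
    return bigrams, contexts


def bigram_log_prob(word, bigrams, contexts, alphabet_size=64):
    """Average per-bigram log-probability with add-one smoothing (an assumption)."""
    padded = "^" + word.lower() + "$"
    pairs = list(zip(padded, padded[1:]))
    total = sum(
        math.log((bigrams[a + b] + 1) / (contexts[a] + alphabet_size))
        for a, b in pairs
    )
    return total / len(pairs)


def extract_features(word, bigrams, contexts, unigram_counts):
    """Return the five features from the slide as a list of numbers."""
    return [
        len(word),                                   # length of word
        sum(1 for c in word[1:] if c.isupper()),     # uppercase chars after the first
        sum(1 for c in word if not c.isalpha()),     # non-alpha chars
        bigram_log_prob(word, bigrams, contexts),    # 2-char n-gram score
        math.log1p(unigram_counts.get(word, 0)),     # unigram frequency, log-scaled
    ]


bigrams, contexts = train_bigram_model(["jelly", "yelling", "telling", "home", "close"])
counts = {"jellying": 12}  # made-up corpus counts
print(extract_features("jellying", bigrams, contexts, counts))
print(extract_features("CMX309FLC", bigrams, contexts, counts))
```

Averaging the log-probability over bigrams keeps the score comparable across word lengths, since length is already a separate feature.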
Details
• scikit-learn: Python library for machine learning
• SVM with Gaussian kernel
• Training cost: between O(# of features * N^2) and O(# of features * N^3)
• 100k items in training data => 5 min on 2 GHz
• F1 = 0.98
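In scikit-learn, an SVM with a Gaussian kernel corresponds to `SVC(kernel="rbf")`. The sketch below shows the general shape of such a classifier on a handful of made-up feature rows; the toy data, the `StandardScaler` step, and the `C` and `gamma` settings are assumptions for illustration and are not taken from the talk.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Made-up toy rows: [length, uppercase count after first char, non-alpha count].
good = [[4, 0, 0], [5, 0, 0], [6, 0, 0], [7, 0, 0],
        [8, 0, 0], [9, 0, 0], [11, 0, 0], [13, 0, 0]]
bad = [[9, 5, 3], [3, 2, 1], [4, 2, 2], [18, 3, 1],
       [6, 3, 2], [7, 1, 3], [5, 4, 0], [24, 6, 2]]

X = np.array(good + bad, dtype=float)
y = np.array([1] * len(good) + [0] * len(bad))  # 1 = keep, 0 = filter out

# Scaling puts the features on comparable ranges before the RBF kernel
# measures distances between them (an assumed preprocessing step).
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma="scale"))
model.fit(X, y)

train_f1 = f1_score(y, model.predict(X))
print(f"F1 on the toy training set: {train_f1:.2f}")
print(model.predict(np.array([[8.0, 0.0, 0.0], [9.0, 5.0, 3.0]])))
```

A score measured on the training rows, as here, only checks that the pipeline runs; a figure like the F1 on the slide would normally be measured on held-out data.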
Thank you!
Roman Vorushin, Grammarly Inc.
http://vorushin.ru
http://twitter.com/vorushin