Filtering n-grams using Machine Learning
vorushin
April 06, 2012
My lightning talk from first Kiev AI/NLP group meeting.
Transcript
Filtering n-grams using Machine Learning
Unsorted unigrams, 13M
closetohome CMX309FLC AZ3 Lehanga indexterm.endofrang NIC3 N1NB Mirabadi phantomd ANOTHER.EXAMPLE awful63 Zabolotsky Dispencer cremonesi kind.The ECOOP'97 4.499E OrbitzSaver jellying ENr313 paulxcs Campaoré überschreibt Püttmann nomalized Profesje Blogzerk imnot getPluginPreferencesFlag backgroundCorrect DEDeutschland at'ai
Filtered with regexps, 10M
closetohome lehanga Mirabadi phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs Campaoré überschreibt Püttmann nomalized Profesje Blogzerk imnot DEDeutschland at'ai
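The slides do not show the regexps themselves. A minimal sketch of this kind of filter follows. It assumes two hypothetical rules, chosen only because they agree with the example above: a token must be letters with optional inner apostrophes, and must not contain a lowercase letter directly followed by an uppercase one. The names `WORD_SHAPE`, `CAMEL_CASE` and `keep_word` are illustrative, not from the talk.

```python
import re

# Hypothetical rules, not the patterns used in the talk.
# Letters only (Unicode-aware), optionally joined by inner apostrophes.
WORD_SHAPE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")
# A lowercase ASCII letter directly followed by an uppercase one (camelCase).
CAMEL_CASE = re.compile(r"[a-z][A-Z]")


def keep_word(token):
    """Return True if the token passes both regexp rules."""
    if WORD_SHAPE.fullmatch(token) is None:
        return False  # digits, periods or other non-letter characters
    if CAMEL_CASE.search(token):
        return False  # identifier-like tokens such as backgroundCorrect
    return True


sample = [
    "closetohome", "CMX309FLC", "AZ3", "Lehanga", "indexterm.endofrang",
    "NIC3", "kind.The", "ECOOP'97", "OrbitzSaver", "überschreibt",
    "backgroundCorrect", "DEDeutschland", "at'ai",
]
kept = [token for token in sample if keep_word(token)]
print(kept)
```

Rules like these are cheap but coarse: tokens such as DEDeutschland and imnot still pass, which is the gap the SVM step is meant to close.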
Filtered with SVM, 2.5M
closetohome lehanga Mirabadi phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs nomalized Profesje Blogzerk imnot
Data
• Good data: Wiktionary words
• Bad data: words filtered out by regexps
• Features
  – length of word
  – count of uppercase chars (excluding first one)
  – count of non-alpha chars
  – probability of word given 2-char n-grams
  – unigram frequency
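The five features could be computed roughly as below. This is a sketch under assumptions: the slides do not say how "probability of word given 2-char n-grams" was calculated, so here it is read as the average log probability under a character bigram model with add-one smoothing, trained on the good words. The function names, the `alphabet_size` constant and the tiny word list are all hypothetical.

```python
import math
from collections import Counter


def train_bigram_model(words):
    """Count character bigrams, using ^ and $ as word boundary marks."""
    pair_counts = Counter()
    first_counts = Counter()
    for word in words:
        padded = "^" + word.lower() + "$"
        for a, b in zip(padded, padded[1:]):
            pair_counts[a + b] += 1
            first_counts[a] += 1
    return pair_counts, first_counts


def bigram_logprob(word, model, alphabet_size=64):
    """Average log P(next char | char), add-one smoothed (assumed reading)."""
    pair_counts, first_counts = model
    padded = "^" + word.lower() + "$"
    pairs = list(zip(padded, padded[1:]))
    total = sum(
        math.log((pair_counts[a + b] + 1) / (first_counts[a] + alphabet_size))
        for a, b in pairs
    )
    return total / len(pairs)


def extract_features(word, model, unigram_counts):
    """Return the five features from the slide as a list of numbers."""
    return [
        len(word),                               # length of word
        sum(ch.isupper() for ch in word[1:]),    # uppercase, excluding first
        sum(not ch.isalpha() for ch in word),    # non-alpha chars
        bigram_logprob(word, model),             # 2-char n-gram score
        unigram_counts.get(word, 0),             # unigram frequency
    ]


# Toy stand-ins for the Wiktionary list and the corpus counts.
model = train_bigram_model(["house", "mouse", "horse", "home", "some"])
feats = extract_features("ENr313", model, {"house": 120})
print(feats)
```

Averaging the log probability over bigrams keeps the score comparable between short and long words; a plain product would penalize length twice, since length is already a separate feature.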
Details
• scikit-learn – Python library for machine learning
• SVM with Gaussian kernel
• O(# of features * N^2) – O(# of features * N^3), where N is the number of training items
• 100k items in training data => 5 min on 2 GHz
• F1 = 0.98
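In scikit-learn, an SVM with a Gaussian (RBF) kernel is `SVC(kernel="rbf")`. The sketch below shows the general shape of training and scoring such a classifier. The rows are made-up toy data using three of the features (length, inner uppercase count, non-alpha count), and the scaling step, `C` and `gamma` values are assumptions, not settings from the talk.

```python
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Made-up toy rows: [length, inner uppercase count, non-alpha count].
# Label 1 = good word, 0 = bad word.
X_train = [
    [11, 0, 0], [7, 0, 0], [8, 0, 0], [10, 0, 0],   # good
    [9, 2, 3], [6, 1, 2], [8, 0, 2], [6, 1, 3],     # bad
]
y_train = [1, 1, 1, 1, 0, 0, 0, 0]

# Two held-out rows, also made up.
X_test = [[9, 0, 0], [7, 1, 3]]
y_test = [1, 0]

# Scaling is an assumption: RBF kernels are sensitive to feature ranges.
clf = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", C=10.0, gamma="scale"),
)
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
score = f1_score(y_test, pred)
print(list(pred), score)
```

Because training cost grows between quadratically and cubically in N, fitting on a 100k sample and then calling `predict` on the remaining millions of unigrams is far cheaper than training on everything.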
Thank you!
Roman Vorushin, Grammarly Inc.
http://vorushin.ru
http://twitter.com/vorushin