$30 off During Our Annual Pro Sale. View Details »
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Filtering n-grams using Machine Learning
Search
vorushin
April 06, 2012
Programming
2
580
Filtering n-grams using Machine Learning
My lightning talk from first Kiev AI/NLP group meeting.
vorushin
April 06, 2012
Tweet
Share
Other Decks in Programming
See All in Programming
認証・認可の基本を学ぼう後編
kouyuume
0
240
Why Kotlin? 電子カルテを Kotlin で開発する理由 / Why Kotlin? at Henry
agatan
2
7.2k
Rediscover the Console - SymfonyCon Amsterdam 2025
chalasr
2
160
ゲームの物理 剛体編
fadis
0
350
令和最新版Android Studioで化石デバイス向けアプリを作る
arkw
0
410
SwiftUIで本格音ゲー実装してみた
hypebeans
0
380
React Native New Architecture 移行実践報告
taminif
1
150
AIエージェントを活かすPM術 AI駆動開発の現場から
gyuta
0
420
なあ兄弟、 余白の意味を考えてから UI実装してくれ!
ktcryomm
11
11k
tparseでgo testの出力を見やすくする
utgwkk
2
220
Giselleで作るAI QAアシスタント 〜 Pull Requestレビューに継続的QAを
codenote
0
190
WebRTC と Rust と8K 60fps
tnoho
2
2k
Featured
See All Featured
JavaScript: Past, Present, and Future - NDC Porto 2020
reverentgeek
52
5.8k
Keith and Marios Guide to Fast Websites
keithpitt
413
23k
A designer walks into a library…
pauljervisheath
210
24k
Build your cross-platform service in a week with App Engine
jlugia
234
18k
Cheating the UX When There Is Nothing More to Optimize - PixelPioneers
stephaniewalter
285
14k
Six Lessons from altMBA
skipperchong
29
4.1k
jQuery: Nuts, Bolts and Bling
dougneiner
65
8.3k
Bash Introduction
62gerente
615
210k
CSS Pre-Processors: Stylus, Less & Sass
bermonpainter
359
30k
Sharpening the Axe: The Primacy of Toolmaking
bcantrill
46
2.6k
Optimising Largest Contentful Paint
csswizardry
37
3.5k
Leading Effective Engineering Teams in the AI Era
addyosmani
8
1.3k
Transcript
Filtering n-‐grams using Machine Learning
Unsorted unigrams. 13M closetohome CMX309FLC AZ3
Lehanga indexterm.endofrang NIC3 N1NB Mirabadi phantomd ANOTHER.EXAMPLE awful63 Zabolotsky Dispencer cremonesi kind.The ECOOP'97 4.499E OrbitzSaver jellying ENr313 paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot getPluginPreferencesF lag backgroundCorrect DEDeutschland at'ai
Filtered with regexps, 10M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot DEDeutschland at'ai
Filtered with SVM, 2.5M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs nomalized Profesje Blogzerk imnot
Data • Good data: wikaonary words •
Bad data: words filtered out by regexps • Features – length of word – count of uppercase chars (excluding first one) – count of non-‐alpha chars – probability of word given 2-‐char n-‐grams – unigram frequency
Details • scikit-‐learn – python library for machine
learning • SVM with Gaussian kernel • O(# of features * N2) – O(# of features * N3) • 100k items in training data => 5 min on 2 Ghz • F1 = 0.98
Thank you! Roman Vorushin, Grammarly Inc.
hZp://vorushin.ru hZp://twiZer.com/vorushin