Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Filtering n-grams using Machine Learning
Search
vorushin
April 06, 2012
Programming
590
2
Share
Filtering n-grams using Machine Learning
My lightning talk from first Kiev AI/NLP group meeting.
vorushin
April 06, 2012
Other Decks in Programming
See All in Programming
Java × distroless で 軽量なコンテナイメージを / Java on Distroless
contour_gara
0
210
CSC307 Lecture 17
javiergs
PRO
0
240
Old Dog, New Tricks: The Java 25 Reinvention - JNation
bazlur_rahman
0
130
UaaL×Androidアプリのメモリ計測 — Memory Profilerの先へ
rio432
0
180
ふつうのFeature Flag実践入門
irof
6
2.5k
技術記事、AIに書かせるか、自分で書くか? 〜それでも私が自分の手で書く理由〜 / #QiitaConference
jnchito
2
520
LLM Plugin for Node-REDの利用方法と開発について
404background
0
110
プラグインで拡張される Context をtype-safe にする難しさと設計判断
kazupon
2
300
TSKaigi2026-静的解析への投資がAI時代のコード品質を支える ── カスタムESLintルールの設計と運用
hayatokudou
6
1.1k
ReactとSvelteのその先、Ripple-TS / Beyond React and Svelte: Ripple-TS
ssssota
3
830
Why Laravel apps break—Mastering the fundamentals to keep them maintainable
kentaroutakeda
1
210
要はバランスからの卒業 #yumemi_grow
kajitack
0
190
Featured
See All Featured
Paper Plane (Part 1)
katiecoart
PRO
0
7.9k
For a Future-Friendly Web
brad_frost
183
10k
Creating an realtime collaboration tool: Agile Flush - .NET Oxford
marcduiker
35
2.4k
The browser strikes back
jonoalderson
0
1.1k
We Analyzed 250 Million AI Search Results: Here's What I Found
joshbly
1
1.3k
Being A Developer After 40
akosma
91
590k
B2B Lead Gen: Tactics, Traps & Triumph
marketingsoph
0
120
The State of eCommerce SEO: How to Win in Today's Products SERPs - #SEOweek
aleyda
2
11k
The World Runs on Bad Software
bkeepers
PRO
72
12k
How to train your dragon (web standard)
notwaldorf
97
6.6k
The Straight Up "How To Draw Better" Workshop
denniskardys
239
140k
SEO in 2025: How to Prepare for the Future of Search
ipullrank
3
3.5k
Transcript
Filtering n-‐grams using Machine Learning
Unsorted unigrams. 13M closetohome CMX309FLC AZ3
Lehanga indexterm.endofrang NIC3 N1NB Mirabadi phantomd ANOTHER.EXAMPLE awful63 Zabolotsky Dispencer cremonesi kind.The ECOOP'97 4.499E OrbitzSaver jellying ENr313 paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot getPluginPreferencesF lag backgroundCorrect DEDeutschland at'ai
Filtered with regexps, 10M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot DEDeutschland at'ai
Filtered with SVM, 2.5M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs nomalized Profesje Blogzerk imnot
Data • Good data: wikaonary words •
Bad data: words filtered out by regexps • Features – length of word – count of uppercase chars (excluding first one) – count of non-‐alpha chars – probability of word given 2-‐char n-‐grams – unigram frequency
Details • scikit-‐learn – python library for machine
learning • SVM with Gaussian kernel • O(# of features * N2) – O(# of features * N3) • 100k items in training data => 5 min on 2 Ghz • F1 = 0.98
Thank you! Roman Vorushin, Grammarly Inc.
hZp://vorushin.ru hZp://twiZer.com/vorushin