Filtering n-grams using Machine Learning
vorushin
April 06, 2012
My lightning talk from the first Kiev AI/NLP group meeting.
Transcript
Filtering n-grams using Machine Learning
Unsorted unigrams, 13M
closetohome CMX309FLC AZ3 Lehanga indexterm.endofrang NIC3 N1NB Mirabadi phantomd ANOTHER.EXAMPLE awful63 Zabolotsky Dispencer cremonesi kind.The ECOOP'97 4.499E OrbitzSaver jellying ENr313 paulxcs Campaoré überschreibt Püttmann nomalized Profesje Blogzerk imnot getPluginPreferencesFlag backgroundCorrect DEDeutschland at'ai
Filtered with regexps, 10M
closetohome lehanga Mirabadi phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs Campaoré überschreibt Püttmann nomalized Profesje Blogzerk imnot DEDeutschland at'ai
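The talk does not show the regexps themselves. As a rough sketch only, the three rejection patterns below (digits, embedded dots, lowercase-to-uppercase transitions) are guesses that happen to match which sample tokens disappear between the two slides. The names `REJECT_PATTERNS`, `looks_like_word`, and `split_unigrams` are hypothetical.

```python
import re

# Hypothetical rejection patterns; the talk does not list the real ones.
REJECT_PATTERNS = [
    re.compile(r"\d"),          # any digit, e.g. "awful63", "NIC3"
    re.compile(r"\."),          # embedded dot, e.g. "kind.The"
    re.compile(r"[a-z][A-Z]"),  # camelCase transition, e.g. "OrbitzSaver"
]


def looks_like_word(token):
    """Return True if no rejection pattern matches the token."""
    return not any(p.search(token) for p in REJECT_PATTERNS)


def split_unigrams(tokens):
    """Split tokens into (kept, rejected) lists, preserving input order."""
    kept, rejected = [], []
    for token in tokens:
        (kept if looks_like_word(token) else rejected).append(token)
    return kept, rejected


sample = ["closetohome", "CMX309FLC", "Mirabadi", "kind.The",
          "OrbitzSaver", "überschreibt", "at'ai"]
kept, rejected = split_unigrams(sample)
```

Under these assumed patterns the rejected tokens can double as negative training examples, which is how the Data slide describes the "bad data".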
Filtered with SVM, 2.5M
closetohome lehanga Mirabadi phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs nomalized Profesje Blogzerk imnot
Data
• Good data: Wiktionary words
• Bad data: words filtered out by regexps
• Features
  – length of word
  – count of uppercase chars (excluding first one)
  – count of non-alpha chars
  – probability of word given 2-char n-grams
  – unigram frequency
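The five features listed above could be computed roughly as follows. This is a sketch under assumptions, not the code from the talk: the slide does not say how the "probability of word given 2-char n-grams" was estimated, so a character-bigram model with add-one smoothing and an averaged log-probability is assumed here. The function names and the `unigram_counts` dictionary are hypothetical.

```python
import math
from collections import Counter


def train_bigram_model(words):
    """Count character bigrams (with ^ and $ boundary markers) over good words."""
    pair_counts, first_counts = Counter(), Counter()
    for w in words:
        padded = "^" + w.lower() + "$"
        for a, b in zip(padded, padded[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
    # Number of distinct symbols seen, used as the smoothing denominator.
    alphabet_size = max(1, len({c for pair in pair_counts for c in pair}))
    return pair_counts, first_counts, alphabet_size


def bigram_log_prob(word, model):
    """Average log-probability per character transition (assumed formulation)."""
    pair_counts, first_counts, v = model
    padded = "^" + word.lower() + "$"
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        total += math.log((pair_counts[(a, b)] + 1) / (first_counts[a] + v))
    return total / (len(padded) - 1)


def extract_features(word, model, unigram_counts):
    """Return the five features from the slide as a list of numbers."""
    return [
        len(word),                                 # length of word
        sum(1 for c in word[1:] if c.isupper()),   # uppercase, excluding first char
        sum(1 for c in word if not c.isalpha()),   # non-alpha chars
        bigram_log_prob(word, model),              # 2-char n-gram score
        unigram_counts.get(word, 0),               # unigram frequency
    ]


model = train_bigram_model(["hello", "help", "held"])  # toy stand-in for a word list
feats = extract_features("ENr313", model, {"ENr313": 2})
```

Averaging the log-probability over transitions keeps the bigram score from simply duplicating the length feature; that is a design choice made for this sketch.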
Details
• scikit-learn: Python library for machine learning
• SVM with Gaussian kernel
• Training cost: between O(# of features * N^2) and O(# of features * N^3)
• 100k items in training data => 5 min on a 2 GHz CPU
• F1 = 0.98
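A minimal sketch of the setup on this slide, assuming a recent scikit-learn: `SVC(kernel="rbf")` is its Gaussian-kernel SVM and `f1_score` computes F1. The feature scaling step, the 75/25 split, and the synthetic three-feature toy data are assumptions added for illustration; none of them come from the talk, and the score on toy data says nothing about the reported 0.98.

```python
import random

from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def train_filter(X, y, seed=0):
    """Fit an RBF-kernel SVM and return (model, F1 on a held-out split)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    # Scaling is an assumption: it is common practice for RBF kernels.
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    model.fit(X_train, y_train)
    return model, f1_score(y_test, model.predict(X_test))


# Synthetic toy data: [length, inner uppercase count, non-alpha count].
rng = random.Random(0)
good = [[rng.gauss(6, 1.5), rng.gauss(0, 0.3), rng.gauss(0, 0.3)] for _ in range(200)]
bad = [[rng.gauss(8, 2.0), rng.gauss(2, 0.5), rng.gauss(3, 0.5)] for _ in range(200)]
X = good + bad
y = [1] * len(good) + [0] * len(bad)  # 1 = real word, 0 = junk token

model, score = train_filter(X, y)
```

Because training cost grows at least quadratically in the number of samples, training on a sample of the data (100k items in the talk) and then predicting over the full unigram list is the natural way to apply such a model.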
Thank you!
Roman Vorushin, Grammarly Inc.
http://vorushin.ru
http://twitter.com/vorushin