Filtering n-grams using Machine Learning
vorushin
April 06, 2012
My lightning talk from the first Kiev AI/NLP group meeting.
Transcript
Filtering n-grams using Machine Learning
Unsorted unigrams, 13M
closetohome CMX309FLC AZ3 Lehanga indexterm.endofrang NIC3 N1NB Mirabadi phantomd ANOTHER.EXAMPLE awful63 Zabolotsky Dispencer cremonesi kind.The ECOOP'97 4.499E OrbitzSaver jellying ENr313 paulxcs Campaoré überschreibt Püttmann nomalized Profesje Blogzerk imnot getPluginPreferencesFlag backgroundCorrect DEDeutschland at'ai
Filtered with regexps, 10M
closetohome lehanga Mirabadi phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs Campaoré überschreibt Püttmann nomalized Profesje Blogzerk imnot DEDeutschland at'ai
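The slides do not show the regexps themselves. As a rough sketch only, the pattern below is a hypothetical rule set (an assumption inferred from the examples on the slides, not the talk's actual expressions): it rejects any token containing a digit, a dot, or a lowercase letter immediately followed by an uppercase one.

```python
import re

# Hypothetical noise pattern (an assumption, not the author's regexps):
#   [\d.]       any digit or dot          e.g. CMX309FLC, indexterm.endofrang
#   [a-z][A-Z]  lower-to-upper case flip  e.g. OrbitzSaver, backgroundCorrect
NOISE = re.compile(r"[\d.]|[a-z][A-Z]")


def looks_like_word(token: str) -> bool:
    """Return True when the token contains none of the noise patterns."""
    return NOISE.search(token) is None


# A few tokens taken from the "Unsorted unigrams" slide.
tokens = [
    "closetohome", "CMX309FLC", "AZ3", "Lehanga", "indexterm.endofrang",
    "OrbitzSaver", "Campaoré", "backgroundCorrect", "DEDeutschland", "at'ai",
]
kept = [t for t in tokens if looks_like_word(t)]
print(kept)
```

A cheap rule like this only removes the obvious junk; misspellings and made-up names such as "nomalized" or "Blogzerk" pass through, which is where the classifier comes in.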
Filtered with SVM, 2.5M
closetohome lehanga Mirabadi phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs nomalized Profesje Blogzerk imnot
Data
• Good data: Wiktionary words
• Bad data: words filtered out by regexps
• Features
  – length of word
  – count of uppercase chars (excluding the first one)
  – count of non-alpha chars
  – probability of word given 2-char n-grams
  – unigram frequency
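The five features listed above could be computed along these lines. This is a minimal sketch, not the talk's code: the function names, the "^"/"$" boundary markers, the add-one smoothing, the assumed alphabet size of 64, and the length-normalised log probability are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_char_bigrams(words):
    """Count character bigrams and their left-context characters."""
    pair_counts, context_counts = Counter(), Counter()
    for w in words:
        padded = "^" + w.lower() + "$"  # boundary markers (assumption)
        for a, b in zip(padded, padded[1:]):
            pair_counts[a + b] += 1
            context_counts[a] += 1
    return pair_counts, context_counts


def bigram_log_prob(word, pair_counts, context_counts, alphabet_size=64):
    """Average add-one-smoothed log P(next char | previous char).

    alphabet_size is an assumed smoothing constant, not a value from the talk.
    """
    padded = "^" + word.lower() + "$"
    pairs = list(zip(padded, padded[1:]))
    total = sum(
        math.log((pair_counts[a + b] + 1) / (context_counts[a] + alphabet_size))
        for a, b in pairs
    )
    return total / len(pairs)


def extract_features(word, pair_counts, context_counts, unigram_freq):
    """Return the five features from the slide, in slide order."""
    return [
        len(word),                                   # length of word
        sum(ch.isupper() for ch in word[1:]),        # uppercase chars, first excluded
        sum(not ch.isalpha() for ch in word),        # non-alpha chars
        bigram_log_prob(word, pair_counts, context_counts),
        unigram_freq.get(word, 0),                   # unigram frequency
    ]


# Toy stand-ins for the Wiktionary word list and the unigram counts.
good_words = ["home", "close", "homes", "closet"]
pair_counts, context_counts = train_char_bigrams(good_words)
print(extract_features("ENr313", pair_counts, context_counts, {"ENr313": 2}))
print(extract_features("home", pair_counts, context_counts, {"home": 900}))
```

Averaging the log probability over the number of bigrams keeps long words from being penalised simply for their length, which the first feature already captures.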
Details
• scikit-learn: Python library for machine learning
• SVM with Gaussian kernel
• Training cost between O(# of features * N^2) and O(# of features * N^3), where N is the number of training items
• 100k items in training data => 5 min on 2 GHz
• F1 = 0.98
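A setup like the one described might look as follows in scikit-learn. This is a sketch under stated assumptions: the data is synthetic (two random clusters standing in for five-feature vectors of good and bad tokens), and the feature scaling, `C`, `gamma`, and the 75/25 split are illustrative choices, not values from the talk.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for the five features per token (NOT the talk's data).
n = 400
good = rng.normal(loc=[7, 0, 0, -2.5, 50], scale=[2, 0.3, 0.3, 0.4, 20], size=(n, 5))
bad = rng.normal(loc=[9, 2, 2, -4.0, 2], scale=[3, 1.0, 1.0, 0.6, 2], size=(n, 5))
X = np.vstack([good, bad])
y = np.array([1] * n + [0] * n)  # 1 = good word, 0 = junk

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# kernel="rbf" is scikit-learn's Gaussian kernel; scaling first is an added
# assumption, since the features live on very different ranges.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale", C=1.0))
model.fit(X_train, y_train)

score = f1_score(y_test, model.predict(X_test))
print(f"F1 on held-out synthetic data: {score:.2f}")
```

Because training cost grows at least quadratically with the number of items, the classifier is fitted on a sample (100k items in the talk) and then applied to the full unigram list.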
Thank you! Roman Vorushin, Grammarly Inc.
http://vorushin.ru http://twitter.com/vorushin