Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Filtering n-grams using Machine Learning
Search
Sponsored
·
Ship Features Fearlessly
Turn features on and off without deploys. Used by thousands of Ruby developers.
→
vorushin
April 06, 2012
Programming
2
580
Filtering n-grams using Machine Learning
My lightning talk from first Kiev AI/NLP group meeting.
vorushin
April 06, 2012
Tweet
Share
Other Decks in Programming
See All in Programming
20260127_試行錯誤の結晶を1冊に。著者が解説 先輩データサイエンティストからの指南書 / author's_commentary_ds_instructions_guide
nash_efp
1
960
AtCoder Conference 2025
shindannin
0
1.1k
Grafana:建立系統全知視角的捷徑
blueswen
0
330
QAフローを最適化し、品質水準を満たしながらリリースまでの期間を最短化する #RSGT2026
shibayu36
2
4.4k
フルサイクルエンジニアリングをAI Agentで全自動化したい 〜構想と現在地〜
kamina_zzz
0
400
ぼくの開発環境2026
yuzneri
0
220
例外処理とどう使い分ける?Result型を使ったエラー設計 #burikaigi
kajitack
16
6.1k
AWS re:Invent 2025参加 直前 Seattle-Tacoma Airport(SEA)におけるハードウェア紛失インシデントLT
tetutetu214
2
110
【卒業研究】会話ログ分析によるユーザーごとの関心に応じた話題提案手法
momok47
0
200
プロダクトオーナーから見たSOC2 _SOC2ゆるミートアップ#2
kekekenta
0
210
AI Schema Enrichment for your Oracle AI Database
thatjeffsmith
0
280
AIによる高速開発をどう制御するか? ガードレール設置で開発速度と品質を両立させたチームの事例
tonkotsuboy_com
7
2.3k
Featured
See All Featured
The World Runs on Bad Software
bkeepers
PRO
72
12k
[RailsConf 2023 Opening Keynote] The Magic of Rails
eileencodes
31
9.9k
Agile Leadership in an Agile Organization
kimpetersen
PRO
0
80
Crafting Experiences
bethany
1
48
Self-Hosted WebAssembly Runtime for Runtime-Neutral Checkpoint/Restore in Edge–Cloud Continuum
chikuwait
0
320
Color Theory Basics | Prateek | Gurzu
gurzu
0
200
How to optimise 3,500 product descriptions for ecommerce in one day using ChatGPT
katarinadahlin
PRO
0
3.4k
Building Applications with DynamoDB
mza
96
6.9k
sira's awesome portfolio website redesign presentation
elsirapls
0
150
Pawsitive SEO: Lessons from My Dog (and Many Mistakes) on Thriving as a Consultant in the Age of AI
davidcarrasco
0
64
Context Engineering - Making Every Token Count
addyosmani
9
650
The Limits of Empathy - UXLibs8
cassininazir
1
210
Transcript
Filtering n-‐grams using Machine Learning
Unsorted unigrams. 13M closetohome CMX309FLC AZ3
Lehanga indexterm.endofrang NIC3 N1NB Mirabadi phantomd ANOTHER.EXAMPLE awful63 Zabolotsky Dispencer cremonesi kind.The ECOOP'97 4.499E OrbitzSaver jellying ENr313 paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot getPluginPreferencesF lag backgroundCorrect DEDeutschland at'ai
Filtered with regexps, 10M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot DEDeutschland at'ai
Filtered with SVM, 2.5M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs nomalized Profesje Blogzerk imnot
Data • Good data: wikaonary words •
Bad data: words filtered out by regexps • Features – length of word – count of uppercase chars (excluding first one) – count of non-‐alpha chars – probability of word given 2-‐char n-‐grams – unigram frequency
Details • scikit-‐learn – python library for machine
learning • SVM with Gaussian kernel • O(# of features * N2) – O(# of features * N3) • 100k items in training data => 5 min on 2 Ghz • F1 = 0.98
Thank you! Roman Vorushin, Grammarly Inc.
hZp://vorushin.ru hZp://twiZer.com/vorushin