Filtering n-grams using Machine Learning
vorushin
April 06, 2012
My lightning talk from the first Kiev AI/NLP group meeting.
Transcript
Filtering n-grams using Machine Learning
Unsorted unigrams, 13M
closetohome CMX309FLC AZ3 Lehanga indexterm.endofrang NIC3 N1NB Mirabadi phantomd ANOTHER.EXAMPLE awful63 Zabolotsky Dispencer cremonesi kind.The ECOOP'97 4.499E OrbitzSaver jellying ENr313 paulxcs Campaoré überschreibt Püttmann nomalized Profesje Blogzerk imnot getPluginPreferencesFlag backgroundCorrect DEDeutschland at'ai
Filtered with regexps, 10M
closetohome lehanga Mirabadi phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs Campaoré überschreibt Püttmann nomalized Profesje Blogzerk imnot DEDeutschland at'ai
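The talk does not show the regexps themselves. The sketch below is one guess at the kind of rules that would produce the filtering seen on this slide: reject tokens with digits, embedded dots, or a camelCase boundary. The pattern list and the name `passes_regexps` are assumptions for illustration, not the author's actual filters. The sketch also does not lowercase tokens, although the slide shows "lehanga" in lowercase.

```python
import re

# Hypothetical rejection patterns; the talk does not list the real ones.
REJECT_PATTERNS = [
    re.compile(r"\d"),          # any digit: AZ3, awful63, ECOOP'97
    re.compile(r"\."),          # embedded dot: kind.The, ANOTHER.EXAMPLE
    re.compile(r"[a-z][A-Z]"),  # camelCase boundary: OrbitzSaver
]


def passes_regexps(token):
    """Return True if no rejection pattern matches the token."""
    return not any(p.search(token) for p in REJECT_PATTERNS)


# The unsorted sample from the first slide.
SAMPLE = [
    "closetohome", "CMX309FLC", "AZ3", "Lehanga", "indexterm.endofrang",
    "NIC3", "N1NB", "Mirabadi", "phantomd", "ANOTHER.EXAMPLE", "awful63",
    "Zabolotsky", "Dispencer", "cremonesi", "kind.The", "ECOOP'97",
    "4.499E", "OrbitzSaver", "jellying", "ENr313", "paulxcs", "Campaoré",
    "überschreibt", "Püttmann", "nomalized", "Profesje", "Blogzerk",
    "imnot", "getPluginPreferencesFlag", "backgroundCorrect",
    "DEDeutschland", "at'ai",
]

kept = [t for t in SAMPLE if passes_regexps(t)]
rejected = [t for t in SAMPLE if not passes_regexps(t)]

print(" ".join(kept))
```

Rules like these are cheap but blunt: they let through tokens such as "DEDeutschland" and "paulxcs", which is the gap the SVM stage is meant to close.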
Filtered with SVM, 2.5M
closetohome lehanga Mirabadi phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs nomalized Profesje Blogzerk imnot
Data
• Good data: Wiktionary words
• Bad data: words filtered out by regexps
• Features
  – length of word
  – count of uppercase chars (excluding the first one)
  – count of non-alpha chars
  – probability of the word given 2-char n-grams
  – unigram frequency
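The five features above could be computed roughly as follows. This is a minimal sketch under several assumptions that the slide does not spell out: the 2-char n-gram probability is taken to be an average log-probability under a character bigram model with add-one smoothing, trained on the good words, and the unigram frequency is a raw count looked up in a dictionary. All function names and the tiny `GOOD_WORDS` list are hypothetical.

```python
import math
from collections import Counter


def train_char_bigrams(words):
    """Count character bigrams, with ^ and $ as word-boundary markers."""
    pair_counts = Counter()
    first_counts = Counter()
    for w in words:
        padded = "^" + w.lower() + "$"
        for a, b in zip(padded, padded[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
    # Vocabulary size for smoothing, plus one slot for unseen characters.
    vocab = len({c for pair in pair_counts for c in pair}) + 1
    return pair_counts, first_counts, vocab


def bigram_log_prob(word, model):
    """Average per-bigram log-probability with add-one smoothing."""
    pair_counts, first_counts, vocab = model
    padded = "^" + word.lower() + "$"
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        total += math.log((pair_counts[(a, b)] + 1) / (first_counts[a] + vocab))
    return total / (len(padded) - 1)


def extract_features(word, model, unigram_counts):
    """Return the five features from the slide as a list of numbers."""
    return [
        len(word),                               # length of word
        sum(c.isupper() for c in word[1:]),      # uppercase chars after the first
        sum(not c.isalpha() for c in word),      # non-alpha chars
        bigram_log_prob(word, model),            # 2-char n-gram probability (log)
        unigram_counts.get(word, 0),             # unigram frequency
    ]


# Toy stand-in for the Wiktionary word list (assumption, for illustration only).
GOOD_WORDS = ["jelly", "yelling", "selling", "close", "home"]
MODEL = train_char_bigrams(GOOD_WORDS)

print(extract_features("ECOOP'97", MODEL, {}))
```

Averaging the log-probability over the number of bigrams keeps the fourth feature from simply duplicating word length.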
Details
• scikit-learn: Python library for machine learning
• SVM with Gaussian kernel
• O(# of features * N^2) to O(# of features * N^3)
• 100k items in training data => 5 min on 2 GHz
• F1 = 0.98
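A Gaussian-kernel SVM of the kind named on this slide can be set up in scikit-learn as sketched below. The toy tokens, the three simplified features, the feature scaling step, and the hyperparameters are all assumptions made for illustration. The score computed here is on the training tokens themselves, so it says nothing about the F1 = 0.98 reported in the talk.

```python
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def simple_features(token):
    """Three of the slide's features; the n-gram and frequency ones are omitted."""
    return [
        len(token),
        sum(c.isupper() for c in token[1:]),
        sum(not c.isalpha() for c in token),
    ]


# Toy labeled data taken from the sample slides (1 = good word, 0 = bad token).
GOOD = ["closetohome", "Mirabadi", "jellying", "phantomd", "cremonesi", "imnot"]
BAD = ["CMX309FLC", "AZ3", "awful63", "kind.The", "OrbitzSaver", "ENr313"]

tokens = GOOD + BAD
X = [simple_features(t) for t in tokens]
y = [1] * len(GOOD) + [0] * len(BAD)

# kernel="rbf" is scikit-learn's name for the Gaussian kernel.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X, y)

predictions = model.predict(X)
train_f1 = f1_score(y, predictions)
print(dict(zip(tokens, predictions)), train_f1)
```

Because kernel SVM training cost grows between quadratically and cubically with the number of samples, training on a labeled subset and then predicting over the full list is the usual way to apply such a model to millions of tokens.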
Thank you!
Roman Vorushin, Grammarly Inc.
http://vorushin.ru
http://twitter.com/vorushin