Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Filtering n-grams using Machine Learning
Search
Sponsored
·
Ship Features Fearlessly
Turn features on and off without deploys. Used by thousands of Ruby developers.
→
vorushin
April 06, 2012
Programming
2
580
Filtering n-grams using Machine Learning
My lightning talk from first Kiev AI/NLP group meeting.
vorushin
April 06, 2012
Tweet
Share
Other Decks in Programming
See All in Programming
今から始めるClaude Code超入門
448jp
8
8.6k
【卒業研究】会話ログ分析によるユーザーごとの関心に応じた話題提案手法
momok47
0
200
izumin5210のプロポーザルのネタ探し #tskaigi_msup
izumin5210
1
110
Fluid Templating in TYPO3 14
s2b
0
130
Apache Iceberg V3 and migration to V3
tomtanaka
0
160
それ、本当に安全? ファイルアップロードで見落としがちなセキュリティリスクと対策
penpeen
7
3.9k
AI時代の認知負荷との向き合い方
optfit
0
160
FOSDEM 2026: STUNMESH-go: Building P2P WireGuard Mesh Without Self-Hosted Infrastructure
tjjh89017
0
160
LLM Observabilityによる 対話型音声AIアプリケーションの安定運用
gekko0114
2
430
CSC307 Lecture 02
javiergs
PRO
1
780
AgentCoreとHuman in the Loop
har1101
5
230
OCaml 5でモダンな並列プログラミングを Enjoyしよう!
haochenx
0
140
Featured
See All Featured
Reflections from 52 weeks, 52 projects
jeffersonlam
356
21k
The browser strikes back
jonoalderson
0
370
Save Time (by Creating Custom Rails Generators)
garrettdimon
PRO
32
2.1k
Responsive Adventures: Dirty Tricks From The Dark Corners of Front-End
smashingmag
254
22k
Navigating Team Friction
lara
192
16k
Darren the Foodie - Storyboard
khoart
PRO
2
2.4k
Visualizing Your Data: Incorporating Mongo into Loggly Infrastructure
mongodb
49
9.9k
Organizational Design Perspectives: An Ontology of Organizational Design Elements
kimpetersen
PRO
1
180
Are puppies a ranking factor?
jonoalderson
1
2.7k
Stop Working from a Prison Cell
hatefulcrawdad
273
21k
16th Malabo Montpellier Forum Presentation
akademiya2063
PRO
0
49
Jess Joyce - The Pitfalls of Following Frameworks
techseoconnect
PRO
1
64
Transcript
Filtering n-‐grams using Machine Learning
Unsorted unigrams. 13M closetohome CMX309FLC AZ3
Lehanga indexterm.endofrang NIC3 N1NB Mirabadi phantomd ANOTHER.EXAMPLE awful63 Zabolotsky Dispencer cremonesi kind.The ECOOP'97 4.499E OrbitzSaver jellying ENr313 paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot getPluginPreferencesF lag backgroundCorrect DEDeutschland at'ai
Filtered with regexps, 10M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot DEDeutschland at'ai
Filtered with SVM, 2.5M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs nomalized Profesje Blogzerk imnot
Data • Good data: wikaonary words •
Bad data: words filtered out by regexps • Features – length of word – count of uppercase chars (excluding first one) – count of non-‐alpha chars – probability of word given 2-‐char n-‐grams – unigram frequency
Details • scikit-‐learn – python library for machine
learning • SVM with Gaussian kernel • O(# of features * N2) – O(# of features * N3) • 100k items in training data => 5 min on 2 Ghz • F1 = 0.98
Thank you! Roman Vorushin, Grammarly Inc.
hZp://vorushin.ru hZp://twiZer.com/vorushin