Filtering n-grams using Machine Learning
vorushin
April 06, 2012
Programming
My lightning talk from the first Kiev AI/NLP group meeting.
Transcript
Filtering n-grams using Machine Learning
Unsorted unigrams, 13M
closetohome CMX309FLC AZ3 Lehanga indexterm.endofrang NIC3 N1NB Mirabadi phantomd ANOTHER.EXAMPLE awful63 Zabolotsky Dispencer cremonesi kind.The ECOOP'97 4.499E OrbitzSaver jellying ENr313 paulxcs Campaoré überschreibt Püttmann nomalized Profesje Blogzerk imnot getPluginPreferencesFlag backgroundCorrect DEDeutschland at'ai
Filtered with regexps, 10M
closetohome lehanga Mirabadi phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs Campaoré überschreibt Püttmann nomalized Profesje Blogzerk imnot DEDeutschland at'ai
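The slides do not show the regular expressions themselves. As a rough sketch only, the three hypothetical patterns below (digits, punctuation other than an apostrophe, and a lowercase-to-uppercase transition) are guessed from which sample tokens disappear between the unsorted list and the regexp-filtered list. The names `DIGIT`, `NON_WORD`, `CAMEL` and `passes_regexps` are illustrative, not the author's.

```python
import re

# Hypothetical patterns inferred from the slide's examples; the talk's
# actual regexps are not shown.
DIGIT = re.compile(r"\d")             # awful63, NIC3, 4.499E
NON_WORD = re.compile(r"[^\w']|_")    # indexterm.endofrang, kind.The
CAMEL = re.compile(r"[a-z][A-Z]")     # OrbitzSaver, backgroundCorrect


def passes_regexps(token):
    """Return True if none of the rejection patterns match the token."""
    return not (
        DIGIT.search(token) or NON_WORD.search(token) or CAMEL.search(token)
    )


sample = [
    "closetohome", "CMX309FLC", "AZ3", "indexterm.endofrang", "Mirabadi",
    "ANOTHER.EXAMPLE", "awful63", "kind.The", "ECOOP'97", "OrbitzSaver",
    "jellying", "überschreibt", "backgroundCorrect", "DEDeutschland", "at'ai",
]
kept = [t for t in sample if passes_regexps(t)]
print(kept)
```

Tokens rejected at this stage are what the talk later uses as "bad data" for training the classifier.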
Filtered with SVM, 2.5M
closetohome lehanga Mirabadi phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs nomalized Profesje Blogzerk imnot
Data
• Good data: Wiktionary words
• Bad data: words filtered out by regexps
• Features
  – length of word
  – count of uppercase chars (excluding first one)
  – count of non-alpha chars
  – probability of word given 2-char n-grams
  – unigram frequency
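The five features listed above could be computed roughly as follows. This is a minimal sketch under stated assumptions: the slides do not say how the 2-char n-gram probability is estimated, so the add-one-smoothed character bigram model, the `^`/`$` boundary markers, the `alphabet_size` constant, and all function names here are hypothetical choices for illustration.

```python
import math
from collections import Counter


def train_char_bigrams(words):
    """Count character bigrams and left-context chars over a word list."""
    pair_counts, context_counts = Counter(), Counter()
    for w in words:
        padded = "^" + w.lower() + "$"   # assumed boundary markers
        for a, b in zip(padded, padded[1:]):
            pair_counts[a + b] += 1
            context_counts[a] += 1
    return pair_counts, context_counts


def bigram_log_prob(word, pair_counts, context_counts, alphabet_size=64):
    """Average log-probability per transition, with add-one smoothing.

    alphabet_size is an assumed smoothing constant, not from the talk.
    """
    padded = "^" + word.lower() + "$"
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        p = (pair_counts[a + b] + 1) / (context_counts[a] + alphabet_size)
        total += math.log(p)
    return total / (len(padded) - 1)


def extract_features(word, pair_counts, context_counts, unigram_counts):
    """Build the five-element feature vector named on the slide."""
    return [
        len(word),                                    # length of word
        sum(c.isupper() for c in word[1:]),           # uppercase, excluding first
        sum(not c.isalpha() for c in word),           # non-alpha chars
        bigram_log_prob(word, pair_counts, context_counts),
        unigram_counts.get(word, 0),                  # unigram frequency
    ]


# Tiny stand-in for the "good" word list; the talk uses Wiktionary words.
pairs, contexts = train_char_bigrams(["close", "home", "phantom", "jelly"])
print(extract_features("ECOOP'97", pairs, contexts, {"closetohome": 12}))
```

Averaging the log-probability over transitions keeps the bigram feature from simply duplicating word length, which is already a separate feature.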
Details
• scikit-learn: Python library for machine learning
• SVM with Gaussian kernel
• O(# of features * N^2) to O(# of features * N^3)
• 100k items in training data => 5 min on 2 GHz
• F1 = 0.98
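A setup like the one described might look as follows in scikit-learn. This is a sketch, not the author's code: the feature matrix is synthetic random data standing in for the five word features, and the scaling step, `C`, `gamma`, and the train/test split are assumptions the slides do not mention. The score it prints says nothing about the F1 = 0.98 reported in the talk.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-ins for five features per word (made-up distributions).
good = rng.normal(loc=[7, 0, 0, -2.5, 50], scale=[2, 0.3, 0.3, 0.4, 20], size=(200, 5))
bad = rng.normal(loc=[9, 3, 2, -4.0, 2], scale=[3, 1.5, 1.0, 0.5, 2], size=(200, 5))
X = np.vstack([good, bad])
y = np.array([1] * 200 + [0] * 200)   # 1 = good word, 0 = bad word

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# kernel="rbf" is scikit-learn's name for the Gaussian kernel.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X_train, y_train)

pred = model.predict(X_test)
score = f1_score(y_test, pred)
print(f"F1 on synthetic held-out data: {score:.2f}")
```

Because kernel SVM training time grows between quadratically and cubically in the number of training items, as the slide notes, training on a sample (100k items in the talk) and then predicting over the full unigram list is the practical route.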
Thank you!
Roman Vorushin, Grammarly Inc.
http://vorushin.ru
http://twitter.com/vorushin