Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Filtering n-grams using Machine Learning
Search
vorushin
April 06, 2012
Programming
590
2
Share
Embed
Copy iframe code
Copy JS code
Copy link
Start on current slide
Filtering n-grams using Machine Learning
My lightning talk from first Kiev AI/NLP group meeting.
vorushin
April 06, 2012
Other Decks in Programming
See All in Programming
ふつうのFeature Flag実践入門
irof
7
3.9k
AIとASP.NET Coreで雑Webアプリを作った話
mayuki
0
610
Inside Stream API
skrb
1
710
コンテキストの使い捨てをやめる — ビジネスルール駆動開発と miko —
ioki
0
200
その問い、本当に正しいですか?AI時代のエンジニアに必要な哲学と認知科学 / ai-philosophy-cognitive-science
minodriven
7
4.4k
Oxcを導入して開発体験が向上した話
yug1224
4
310
Language Server 使ってる? 〜VSCode と Zed の場合〜 / Are you using a Language Server? ~For VS Code and Zed~
handlename
0
780
脅威をエンジニアリングの糧にして――現場編 / Turning Threats into Engineering Fuel — Field Edition
nrslib
0
280
TSKaigi Night Talks 2026_TypeScriptでサプライチェーンの整合性を型に閉じ込める
geekplus_tech
0
350
代数的データ型って何が嬉しいの? #frontend_phpcon_do
kajitack
8
3.7k
jQueryをバージョンアップする前に使いたいjQuery Migrate
matsuo_atsushi
0
490
エージェンティックRAGにAWSで入門しよう!
har1101
8
1.5k
Featured
See All Featured
Being A Developer After 40
akosma
91
590k
Mobile First: as difficult as doing things right
swwweet
225
10k
Are puppies a ranking factor?
jonoalderson
1
3.5k
Bridging the Design Gap: How Collaborative Modelling removes blockers to flow between stakeholders and teams @FastFlow conf
baasie
0
580
What's in a price? How to price your products and services
michaelherold
247
13k
Navigating Team Friction
lara
192
16k
We Analyzed 250 Million AI Search Results: Here's What I Found
joshbly
1
1.4k
The Cult of Friendly URLs
andyhume
79
6.9k
New Earth Scene 8
popppiees
3
2.3k
Fashionably flexible responsive web design (full day workshop)
malarkey
408
66k
The Success of Rails: Ensuring Growth for the Next 100 Years
eileencodes
47
8.2k
[RailsConf 2023] Rails as a piece of cake
palkan
59
6.7k
Transcript
Filtering n-‐grams using Machine Learning
Unsorted unigrams. 13M closetohome CMX309FLC AZ3
Lehanga indexterm.endofrang NIC3 N1NB Mirabadi phantomd ANOTHER.EXAMPLE awful63 Zabolotsky Dispencer cremonesi kind.The ECOOP'97 4.499E OrbitzSaver jellying ENr313 paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot getPluginPreferencesF lag backgroundCorrect DEDeutschland at'ai
Filtered with regexps, 10M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot DEDeutschland at'ai
Filtered with SVM, 2.5M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs nomalized Profesje Blogzerk imnot
Data • Good data: wikaonary words •
Bad data: words filtered out by regexps • Features – length of word – count of uppercase chars (excluding first one) – count of non-‐alpha chars – probability of word given 2-‐char n-‐grams – unigram frequency
Details • scikit-‐learn – python library for machine
learning • SVM with Gaussian kernel • O(# of features * N2) – O(# of features * N3) • 100k items in training data => 5 min on 2 Ghz • F1 = 0.98
Thank you! Roman Vorushin, Grammarly Inc.
hZp://vorushin.ru hZp://twiZer.com/vorushin