Filtering n-grams using Machine Learning
vorushin
April 06, 2012
My lightning talk from the first Kiev AI/NLP group meeting.
Transcript
Filtering n-grams using Machine Learning
Unsorted unigrams, 13M
closetohome CMX309FLC AZ3 Lehanga indexterm.endofrang NIC3 N1NB Mirabadi phantomd ANOTHER.EXAMPLE awful63 Zabolotsky Dispencer cremonesi kind.The ECOOP'97 4.499E OrbitzSaver jellying ENr313 paulxcs Campaoré überschreibt Püttmann nomalized Profesje Blogzerk imnot getPluginPreferencesFlag backgroundCorrect DEDeutschland at'ai
Filtered with regexps, 10M
closetohome lehanga Mirabadi phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs Campaoré überschreibt Püttmann nomalized Profesje Blogzerk imnot DEDeutschland at'ai
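The talk does not show the regular expressions it used. As a rough sketch, the pattern below is a guess that happens to reproduce the sample on the slides: it rejects tokens containing a digit, a dot, or an ASCII lowercase letter directly followed by an uppercase one. The names `REJECT` and `looks_like_word` are hypothetical.

```python
import re

# Hypothetical rejection pattern (an assumption, not the talk's actual regexps):
# any digit, any dot, or an ASCII lowercase letter directly followed by an
# uppercase one (camelCase-style identifiers).
REJECT = re.compile(r"\d|\.|[a-z][A-Z]")


def looks_like_word(token):
    """Return True if the token survives the regexp filter."""
    return REJECT.search(token) is None


# A subset of the unigrams shown on the slides.
sample = [
    "closetohome", "CMX309FLC", "AZ3", "Lehanga", "indexterm.endofrang",
    "Mirabadi", "ANOTHER.EXAMPLE", "awful63", "kind.The", "ECOOP'97",
    "OrbitzSaver", "jellying", "backgroundCorrect", "DEDeutschland", "at'ai",
]

kept = [t for t in sample if looks_like_word(t)]
rejected = [t for t in sample if not looks_like_word(t)]
print("kept:", kept)
print("rejected:", rejected)
```

Tokens such as DEDeutschland and at'ai pass this pattern, which matches the slides: they survive the regexp stage and are only removed by the SVM stage.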
Filtered with SVM, 2.5M
closetohome lehanga Mirabadi phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs nomalized Profesje Blogzerk imnot
Data
• Good data: Wiktionary words
• Bad data: words filtered out by regexps
• Features
– length of word
– count of uppercase chars (excluding the first one)
– count of non-alpha chars
– probability of word given 2-char n-grams
– unigram frequency
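The five features listed above could be computed roughly as follows. This is a minimal sketch, not the talk's code: the character-bigram model with add-one smoothing, the per-bigram averaging, the log scaling of the frequency, and every function name here are assumptions made for illustration.

```python
import math
from collections import Counter


def train_bigram_model(words):
    """Count character bigrams over known-good words ('^' and '$' mark word boundaries)."""
    pair_counts = Counter()
    first_counts = Counter()
    for w in words:
        padded = "^" + w.lower() + "$"
        for a, b in zip(padded, padded[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
    alphabet = {c for pair in pair_counts for c in pair}
    return pair_counts, first_counts, len(alphabet)


def bigram_log_prob(word, model):
    """Average log-probability per character bigram, with add-one smoothing (an assumption)."""
    pair_counts, first_counts, alphabet_size = model
    padded = "^" + word.lower() + "$"
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        total += math.log((pair_counts[(a, b)] + 1) / (first_counts[a] + alphabet_size))
    return total / (len(padded) - 1)


def extract_features(word, model, unigram_counts):
    """Build the feature vector in the order the slide lists the features."""
    return [
        len(word),                                   # length of word
        sum(1 for c in word[1:] if c.isupper()),     # uppercase chars, excluding the first
        sum(1 for c in word if not c.isalpha()),     # non-alpha chars
        bigram_log_prob(word, model),                # probability given 2-char n-grams
        math.log1p(unigram_counts.get(word, 0)),     # unigram frequency (log-scaled here)
    ]


# Toy stand-ins for the Wiktionary word list and the unigram counts.
good_words = ["jelly", "yelling", "telling", "lying", "home", "close"]
model = train_bigram_model(good_words)
counts = {"jellying": 40, "CMX309FLC": 2}

print(extract_features("jellying", model, counts))
print(extract_features("CMX309FLC", model, counts))
```

With a model trained on real dictionary words, a token such as CMX309FLC scores a much lower bigram probability than jellying, which is what makes that feature useful for separating words from noise.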
Details
• scikit-learn – Python library for machine learning
• SVM with Gaussian kernel
• Training time: between O(# of features * N^2) and O(# of features * N^3)
• 100k items in training data => 5 min on 2 GHz
• F1 = 0.98
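In scikit-learn, an SVM with a Gaussian (RBF) kernel is available as `sklearn.svm.SVC(kernel="rbf")`. The sketch below trains one on a handful of made-up feature rows (length, inner uppercase count, non-alpha count). The toy data, the feature scaling step, and the `C` and `gamma` values are assumptions for illustration, so it says nothing about the F1 reported in the talk.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Made-up feature rows: [length, uppercase chars after the first, non-alpha chars].
X_good = [[8, 0, 0], [7, 0, 0], [5, 0, 0], [10, 0, 0], [6, 0, 0], [9, 0, 0]]
X_bad = [[9, 8, 3], [3, 1, 1], [19, 0, 1], [4, 2, 1], [7, 0, 2], [15, 1, 1]]

X = np.array(X_good + X_bad, dtype=float)
y = np.array([1] * len(X_good) + [0] * len(X_bad))  # 1 = real word, 0 = noise

# Scaling is added here because RBF kernels are sensitive to feature ranges;
# whether the talk scaled its features is not stated.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X, y)

pred = clf.predict(X)
f1 = f1_score(y, pred)
print("training-set F1 on toy data:", f1)
```

For a real measurement, the F1 score would be computed on held-out data rather than on the training rows used here.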
Thank you!
Roman Vorushin, Grammarly Inc.
http://vorushin.ru
http://twitter.com/vorushin