Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Filtering n-grams using Machine Learning
Search
vorushin
April 06, 2012
Programming
2
520
Filtering n-grams using Machine Learning
My lightning talk from first Kiev AI/NLP group meeting.
vorushin
April 06, 2012
Tweet
Share
Other Decks in Programming
See All in Programming
0→1と1→10の狭間で Javaという技術選定を振り返る/Reflecting on the Decision to Choose Java Between Scaling from 0 to 1 and 1 to 10
jaguar_imo
2
380
SwiftUIで使いやすいToastの作り方 / How to build a Toast system which is easy to use in SwiftUI
lovee
3
130
MetricKitで予期せぬ終了を検知する話 / Detect unexpected termination with MetricKit
nekowen
0
180
Changed Rules: Architectures with Lightweight Stores
manfredsteyer
PRO
0
240
はてなにおける CSS Modules、及び CSS Modules に足りないもの / CSS Modules in Hatena, and CSS Modules missing parts
mizdra
6
890
Azure OpenAI Serviceのプロンプトエンジニアリング入門
tomokusaba
3
620
if constexpr文はテンプレート世界のラムダ式である
faithandbrave
3
630
try! Swift Tokyo 2024 参加報告 / try! Swift Tokyo 2024 Report
hironytic
0
200
Git Lint
bkuhlmann
4
750
Prepare for Jakarta EE 11 - Performance and Developer Productivity
ivargrimstad
0
680
入門 AWS Amplify Gen2 / Introduction to AWS Amplify Gen2
genkiogasawara
1
320
FigmaとPHPで作る1ミリたりとも表示崩れしない最強の帳票印刷ソリューション
ttskch
43
18k
Featured
See All Featured
Making Projects Easy
brettharned
108
5.5k
Understanding Cognitive Biases in Performance Measurement
bluesmoon
6
990
Happy Clients
brianwarren
91
6.4k
Navigating Team Friction
lara
177
13k
Six Lessons from altMBA
skipperchong
20
3k
CSS Pre-Processors: Stylus, Less & Sass
bermonpainter
352
28k
Building Adaptive Systems
keathley
30
1.9k
Into the Great Unknown - MozCon
thekraken
10
990
Rebuilding a faster, lazier Slack
samanthasiow
72
8.2k
Thoughts on Productivity
jonyablonski
57
3.8k
The Psychology of Web Performance [Beyond Tellerrand 2023]
tammyeverts
5
1.5k
Scaling GitHub
holman
457
140k
Transcript
Filtering n-‐grams using Machine Learning
Unsorted unigrams. 13M closetohome CMX309FLC AZ3
Lehanga indexterm.endofrang NIC3 N1NB Mirabadi phantomd ANOTHER.EXAMPLE awful63 Zabolotsky Dispencer cremonesi kind.The ECOOP'97 4.499E OrbitzSaver jellying ENr313 paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot getPluginPreferencesF lag backgroundCorrect DEDeutschland at'ai
Filtered with regexps, 10M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot DEDeutschland at'ai
Filtered with SVM, 2.5M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs nomalized Profesje Blogzerk imnot
Data • Good data: wikaonary words •
Bad data: words filtered out by regexps • Features – length of word – count of uppercase chars (excluding first one) – count of non-‐alpha chars – probability of word given 2-‐char n-‐grams – unigram frequency
Details • scikit-‐learn – python library for machine
learning • SVM with Gaussian kernel • O(# of features * N2) – O(# of features * N3) • 100k items in training data => 5 min on 2 Ghz • F1 = 0.98
Thank you! Roman Vorushin, Grammarly Inc.
hZp://vorushin.ru hZp://twiZer.com/vorushin