Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Filtering n-grams using Machine Learning
Search
vorushin
April 06, 2012
Programming
2
580
Filtering n-grams using Machine Learning
My lightning talk from first Kiev AI/NLP group meeting.
vorushin
April 06, 2012
Tweet
Share
Other Decks in Programming
See All in Programming
How Software Deployment tools have changed in the past 20 years
geshan
0
14k
Herb to ReActionView: A New Foundation for the View Layer @ San Francisco Ruby Conference 2025
marcoroth
0
200
チーム開発の “地ならし"
konifar
8
6.2k
モダンJSフレームワークのビルドプロセス 〜なぜReactは503行、Svelteは12行なのか〜
fuuki12
0
120
dnx で実行できるコマンド、作ってみました
tomohisa
0
110
Stay Hacker 〜九州で生まれ、Perlに出会い、コミュニティで育つ〜
pyama86
2
2.7k
OSS開発者の憂鬱
yusukebe
14
11k
なあ兄弟、 余白の意味を考えてから UI実装してくれ!
ktcryomm
5
1.5k
AWS CDKの推しポイントN選
akihisaikeda
1
210
GraalVM Native Image トラブルシューティング機能の最新状況(2025年版)
ntt_dsol_java
0
170
Chart.jsで長い項目を表示するときのハマりどころ
yumechi
0
160
Combinatorial Interview Problems with Backtracking Solutions - From Imperative Procedural Programming to Declarative Functional Programming - Part 1
philipschwarz
PRO
0
110
Featured
See All Featured
Build your cross-platform service in a week with App Engine
jlugia
234
18k
Dealing with People You Can't Stand - Big Design 2015
cassininazir
367
27k
[RailsConf 2023] Rails as a piece of cake
palkan
57
6.1k
Building Flexible Design Systems
yeseniaperezcruz
329
39k
Building a Modern Day E-commerce SEO Strategy
aleyda
45
8.1k
Chrome DevTools: State of the Union 2024 - Debugging React & Beyond
addyosmani
9
980
How To Stay Up To Date on Web Technology
chriscoyier
791
250k
Speed Design
sergeychernyshev
33
1.3k
Creating an realtime collaboration tool: Agile Flush - .NET Oxford
marcduiker
34
2.3k
Writing Fast Ruby
sferik
630
62k
Six Lessons from altMBA
skipperchong
29
4.1k
Agile that works and the tools we love
rasmusluckow
331
21k
Transcript
Filtering n-‐grams using Machine Learning
Unsorted unigrams. 13M closetohome CMX309FLC AZ3
Lehanga indexterm.endofrang NIC3 N1NB Mirabadi phantomd ANOTHER.EXAMPLE awful63 Zabolotsky Dispencer cremonesi kind.The ECOOP'97 4.499E OrbitzSaver jellying ENr313 paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot getPluginPreferencesF lag backgroundCorrect DEDeutschland at'ai
Filtered with regexps, 10M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot DEDeutschland at'ai
Filtered with SVM, 2.5M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs nomalized Profesje Blogzerk imnot
Data • Good data: wikaonary words •
Bad data: words filtered out by regexps • Features – length of word – count of uppercase chars (excluding first one) – count of non-‐alpha chars – probability of word given 2-‐char n-‐grams – unigram frequency
Details • scikit-‐learn – python library for machine
learning • SVM with Gaussian kernel • O(# of features * N2) – O(# of features * N3) • 100k items in training data => 5 min on 2 Ghz • F1 = 0.98
Thank you! Roman Vorushin, Grammarly Inc.
hZp://vorushin.ru hZp://twiZer.com/vorushin