Filtering n-grams using Machine Learning
vorushin
April 06, 2012
My lightning talk from first Kiev AI/NLP group meeting.
Transcript
Filtering n-grams using Machine Learning
Unsorted unigrams, 13M
closetohome CMX309FLC AZ3 Lehanga indexterm.endofrang NIC3 N1NB Mirabadi phantomd ANOTHER.EXAMPLE awful63 Zabolotsky Dispencer cremonesi kind.The ECOOP'97 4.499E OrbitzSaver jellying ENr313 paulxcs Campaoré überschreibt Püttmann nomalized Profesje Blogzerk imnot getPluginPreferencesFlag backgroundCorrect DEDeutschland at'ai
Filtered with regexps, 10M
closetohome lehanga Mirabadi phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs Campaoré überschreibt Püttmann nomalized Profesje Blogzerk imnot DEDeutschland at'ai
Filtered with SVM, 2.5M
closetohome lehanga Mirabadi phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs nomalized Profesje Blogzerk imnot
Data
• Good data: Wiktionary words
• Bad data: words filtered out by regexps
• Features
  – length of word
  – count of uppercase chars (excluding first one)
  – count of non-alpha chars
  – probability of word given 2-char n-grams
  – unigram frequency
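The five features listed on the Data slide could be computed roughly as follows. This is a minimal sketch, not the code from the talk: the function names, the `^`/`$` boundary padding, the add-one smoothing, and the use of an average log-probability for the "probability of word given 2-char n-grams" feature are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_bigram_model(good_words):
    """Count character bigrams over known-good words (hypothetical helper)."""
    counts = Counter()
    for word in good_words:
        padded = f"^{word.lower()}$"  # mark word start and end (assumption)
        counts.update(padded[i:i + 2] for i in range(len(padded) - 1))
    return counts, sum(counts.values())


def bigram_log_prob(word, counts, total, vocab_size=27 * 27):
    """Average add-one-smoothed log-probability of the word's character bigrams."""
    padded = f"^{word.lower()}$"
    bigrams = [padded[i:i + 2] for i in range(len(padded) - 1)]
    log_p = sum(math.log((counts[b] + 1) / (total + vocab_size)) for b in bigrams)
    return log_p / len(bigrams)


def extract_features(word, counts, total, unigram_freq):
    """Return the five slide features as a list of numbers."""
    return [
        len(word),                               # length of word
        sum(c.isupper() for c in word[1:]),      # uppercase chars, excluding first
        sum(not c.isalpha() for c in word),      # non-alpha chars
        bigram_log_prob(word, counts, total),    # 2-char n-gram score
        unigram_freq.get(word, 0),               # unigram frequency
    ]


counts, total = train_bigram_model(["hello", "help", "yellow"])
print(extract_features("ECOOP'97", counts, total, {"ECOOP'97": 3}))
```

Averaging the log-probability over bigrams keeps the score comparable between short and long words; a plain product of probabilities would penalize long words regardless of how word-like they are.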
Details
• scikit-learn – Python library for machine learning
• SVM with Gaussian kernel
• O(# of features * N^2) – O(# of features * N^3)
• 100k items in training data => 5 min on 2 GHz
• F1 = 0.98
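The setup on the Details slide (an SVM with a Gaussian, i.e. RBF, kernel in scikit-learn, evaluated with F1) might look like the sketch below. The talk's actual code and hyperparameters are not given, so `train_word_filter`, the `StandardScaler` step, the `C` and `gamma` values, and the 80/20 split are assumptions. The feature matrix here is synthetic placeholder data, not the talk's word data, and the printed score says nothing about the reported F1 = 0.98.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def train_word_filter(X, y, seed=0):
    """Fit an RBF-kernel SVM and return it with its held-out F1 score."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    # Scaling is an assumption: RBF kernels are sensitive to feature ranges,
    # and word length and unigram frequency live on very different scales.
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
    model.fit(X_train, y_train)
    return model, f1_score(y_test, model.predict(X_test))


# Synthetic stand-in for a (n_words, 5) feature matrix; 1 = good word, 0 = bad.
rng = np.random.default_rng(0)
good = rng.normal(loc=0.0, scale=1.0, size=(200, 5))
bad = rng.normal(loc=3.0, scale=1.0, size=(200, 5))
X = np.vstack([good, bad])
y = np.array([1] * 200 + [0] * 200)

model, f1 = train_word_filter(X, y)
print(f"F1 on held-out synthetic data: {f1:.2f}")
```

Because kernel SVM training cost grows between quadratically and cubically with the number of samples, as the slide notes, training on a sample of roughly 100k items and then applying `model.predict` to the full unigram list is the practical way to use a model like this.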
Thank you! Roman Vorushin, Grammarly Inc.
http://vorushin.ru http://twitter.com/vorushin