Filtering n-grams using Machine Learning

My lightning talk from the first Kiev AI/NLP group meeting.

vorushin

April 06, 2012

Transcript

  1. Unsorted unigrams, 13M: closetohome CMX309FLC AZ3 Lehanga indexterm.endofrang NIC3 N1NB Mirabadi phantomd ANOTHER.EXAMPLE awful63 Zabolotsky Dispencer cremonesi kind.The ECOOP'97 4.499E OrbitzSaver jellying ENr313 paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot getPluginPreferencesFlag backgroundCorrect DEDeutschland at'ai
  2. Filtered with regexps, 10M: closetohome lehanga Mirabadi phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot DEDeutschland at'ai
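The deck does not show the actual regexps used for this first pass, but a sketch of such a filter might look like the following. The patterns here are illustrative assumptions, chosen so that tokens with digits or internal dots (as in slide 1) are rejected while accented words survive:

```python
import re

# Hypothetical reject patterns -- the talk does not list the real ones.
PATTERNS = [
    re.compile(r"\d"),    # any digit in the token
    re.compile(r"[._]"),  # dots or underscores inside the token
]

def keep(word):
    """Keep a unigram only if no reject pattern matches it."""
    return not any(p.search(word) for p in PATTERNS)

words = ["closetohome", "CMX309FLC", "indexterm.endofrang",
         "4.499E", "überschreibt", "at'ai"]
filtered = [w for w in words if keep(w)]
# filtered -> ["closetohome", "überschreibt", "at'ai"]
```

A pass like this removes obvious non-words cheaply, but as slide 2 shows, many junk tokens still slip through, which motivates the SVM stage.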
  3. Filtered with SVM, 2.5M: closetohome lehanga Mirabadi phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs nomalized Profesje Blogzerk imnot
  4. Data
     • Good data: Wiktionary words
     • Bad data: words filtered out by regexps
     • Features
       – length of word
       – count of uppercase chars (excluding the first one)
       – count of non-alpha chars
       – probability of word given 2-char n-grams
       – unigram frequency
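The five features above can be sketched as a small extraction function. The bigram log-probability and unigram-frequency lookups are stand-ins here, since the talk does not show how those corpus statistics are computed:

```python
def features(word, bigram_logprob, unigram_freq):
    """Feature vector for one unigram, following the five features
    on the slide. bigram_logprob and unigram_freq are assumed
    lookup functions (toy stand-ins, not from the talk)."""
    return [
        len(word),                           # length of word
        sum(c.isupper() for c in word[1:]),  # uppercase chars after the first
        sum(not c.isalpha() for c in word),  # non-alpha chars
        bigram_logprob(word),                # log P(word | 2-char n-grams)
        unigram_freq(word),                  # unigram frequency
    ]

# Toy stand-ins for the corpus statistics:
vec = features("PüZmann", lambda w: -12.3, lambda w: 4)
# vec -> [7, 1, 0, -12.3, 4]
```

Note how "PüZmann" scores 1 on the internal-uppercase feature; features like this let the classifier learn patterns that fixed regexps miss.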
  5. Details
     • scikit-learn – Python library for machine learning
     • SVM with Gaussian kernel
     • O(# of features × N²) – O(# of features × N³)
     • 100k items in training data => 5 min on 2 GHz
     • F1 = 0.98
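In scikit-learn, the setup described above amounts to fitting an `SVC` with the RBF (Gaussian) kernel. The snippet below uses random synthetic features in place of the real word features, so the score it produces is not the F1 = 0.98 reported in the talk:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import f1_score

# Synthetic stand-in for the word-feature matrix: two well-separated
# clusters play the roles of "good" and "bad" unigrams.
rng = np.random.default_rng(0)
X_good = rng.normal(0.0, 1.0, size=(200, 5))
X_bad = rng.normal(3.0, 1.0, size=(200, 5))
X = np.vstack([X_good, X_bad])
y = np.array([1] * 200 + [0] * 200)

clf = SVC(kernel="rbf")  # Gaussian (RBF) kernel, as in the talk
clf.fit(X, y)
score = f1_score(y, clf.predict(X))
```

The quadratic-to-cubic growth in N quoted on the slide is why the training set was capped at around 100k items; kernel SVM training does not scale gracefully to the full 10M candidates, which are only scored at prediction time.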
  6. Thank you! Roman Vorushin, Grammarly Inc.
     http://vorushin.ru
     http://twitter.com/vorushin