Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Filtering n-grams using Machine Learning
Search
vorushin
April 06, 2012
Programming
2
580
Filtering n-grams using Machine Learning
My lightning talk from first Kiev AI/NLP group meeting.
vorushin
April 06, 2012
Tweet
Share
Other Decks in Programming
See All in Programming
Building AI with AI
inesmontani
PRO
1
260
仕様がそのままテストになる!Javaで始める振る舞い駆動開発
ohmori_yusuke
8
4.7k
スタートアップを支える技術戦略と組織づくり
pospome
8
12k
開発15年のAIネイティブでない 巨大サービスのAI最適化
rapicro
0
100
『実践MLOps』から学ぶ DevOps for ML
nsakki55
2
480
競馬で学ぶ機械学習の基本と実践 / Machine Learning with Horse Racing
shoheimitani
14
14k
最新のDirectX12で使えるレイトレ周りの機能追加について
projectasura
0
300
関数の挙動書き換える
takatofukui
4
750
30分でDoctrineの仕組みと使い方を完全にマスターする / phpconkagawa 2025 Doctrine
ttskch
3
560
「正規表現をつくる」をつくる / make "make regex"
makenowjust
1
860
Vueで学ぶデータ構造入門 リンクリストとキューでリアクティビティを捉える / Vue Data Structures: Linked Lists and Queues for Reactivity
konkarin
1
350
GraalVM Native Image トラブルシューティング機能の最新状況(2025年版)
ntt_dsol_java
0
170
Featured
See All Featured
GitHub's CSS Performance
jonrohan
1032
470k
Faster Mobile Websites
deanohume
310
31k
A Modern Web Designer's Workflow
chriscoyier
697
190k
Designing for Performance
lara
610
69k
KATA
mclloyd
PRO
32
15k
Rebuilding a faster, lazier Slack
samanthasiow
84
9.3k
Testing 201, or: Great Expectations
jmmastey
46
7.8k
Java REST API Framework Comparison - PWX 2021
mraible
34
9k
jQuery: Nuts, Bolts and Bling
dougneiner
65
8k
4 Signs Your Business is Dying
shpigford
186
22k
I Don’t Have Time: Getting Over the Fear to Launch Your Podcast
jcasabona
34
2.5k
Six Lessons from altMBA
skipperchong
29
4.1k
Transcript
Filtering n-‐grams using Machine Learning
Unsorted unigrams. 13M closetohome CMX309FLC AZ3
Lehanga indexterm.endofrang NIC3 N1NB Mirabadi phantomd ANOTHER.EXAMPLE awful63 Zabolotsky Dispencer cremonesi kind.The ECOOP'97 4.499E OrbitzSaver jellying ENr313 paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot getPluginPreferencesF lag backgroundCorrect DEDeutschland at'ai
Filtered with regexps, 10M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot DEDeutschland at'ai
Filtered with SVM, 2.5M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs nomalized Profesje Blogzerk imnot
Data • Good data: wikaonary words •
Bad data: words filtered out by regexps • Features – length of word – count of uppercase chars (excluding first one) – count of non-‐alpha chars – probability of word given 2-‐char n-‐grams – unigram frequency
Details • scikit-‐learn – python library for machine
learning • SVM with Gaussian kernel • O(# of features * N2) – O(# of features * N3) • 100k items in training data => 5 min on 2 Ghz • F1 = 0.98
Thank you! Roman Vorushin, Grammarly Inc.
hZp://vorushin.ru hZp://twiZer.com/vorushin