Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Filtering n-grams using Machine Learning
Search
vorushin
April 06, 2012
Programming
2
570
Filtering n-grams using Machine Learning
My lightning talk from first Kiev AI/NLP group meeting.
vorushin
April 06, 2012
Tweet
Share
Other Decks in Programming
See All in Programming
チームで開発し事業を加速するための"良い"設計の考え方 @ サポーターズCoLab 2025-07-08
agatan
1
470
テスターからテストエンジニアへ ~新米テストエンジニアが歩んだ9ヶ月振り返り~
non0113
2
220
オンコール⼊⾨〜ページャーが鳴る前に、あなたが備えられること〜 / Before The Pager Rings
yktakaha4
2
980
Android 16KBページサイズ対応をはじめからていねいに
mine2424
0
430
LT 2025-06-30: プロダクトエンジニアの役割
yamamotok
0
870
ペアプロ × 生成AI 現場での実践と課題について / generative-ai-in-pair-programming
codmoninc
2
21k
フロントエンドのパフォーマンスチューニング
koukimiura
5
2k
脱Riverpod?fqueryで考える、TanStack Queryライクなアーキテクチャの可能性
ostk0069
0
500
AWS Summit Japan 2024と2025の比較/はじめてのKiro、今あなたは岐路に立つ
satoshi256kbyte
0
110
レベル1の開発生産性向上に取り組む − 日々の作業の効率化・自動化を通じた改善活動
kesoji
0
290
dbt民主化とLLMによる開発ブースト ~ AI Readyな分析サイクルを目指して ~
yoshyum
3
1.1k
初学者でも今すぐできる、Claude Codeの生産性を10倍上げるTips
s4yuba
16
13k
Featured
See All Featured
[RailsConf 2023 Opening Keynote] The Magic of Rails
eileencodes
29
9.6k
10 Git Anti Patterns You Should be Aware of
lemiorhan
PRO
656
60k
How to Create Impact in a Changing Tech Landscape [PerfNow 2023]
tammyeverts
53
2.9k
KATA
mclloyd
30
14k
Scaling GitHub
holman
460
140k
The Cost Of JavaScript in 2023
addyosmani
51
8.6k
[RailsConf 2023] Rails as a piece of cake
palkan
55
5.7k
Visualizing Your Data: Incorporating Mongo into Loggly Infrastructure
mongodb
47
9.6k
Being A Developer After 40
akosma
90
590k
個人開発の失敗を避けるイケてる考え方 / tips for indie hackers
panda_program
108
19k
Responsive Adventures: Dirty Tricks From The Dark Corners of Front-End
smashingmag
251
21k
GitHub's CSS Performance
jonrohan
1031
460k
Transcript
Filtering n-‐grams using Machine Learning
Unsorted unigrams. 13M closetohome CMX309FLC AZ3
Lehanga indexterm.endofrang NIC3 N1NB Mirabadi phantomd ANOTHER.EXAMPLE awful63 Zabolotsky Dispencer cremonesi kind.The ECOOP'97 4.499E OrbitzSaver jellying ENr313 paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot getPluginPreferencesF lag backgroundCorrect DEDeutschland at'ai
Filtered with regexps, 10M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot DEDeutschland at'ai
Filtered with SVM, 2.5M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs nomalized Profesje Blogzerk imnot
Data • Good data: wikaonary words •
Bad data: words filtered out by regexps • Features – length of word – count of uppercase chars (excluding first one) – count of non-‐alpha chars – probability of word given 2-‐char n-‐grams – unigram frequency
Details • scikit-‐learn – python library for machine
learning • SVM with Gaussian kernel • O(# of features * N2) – O(# of features * N3) • 100k items in training data => 5 min on 2 Ghz • F1 = 0.98
Thank you! Roman Vorushin, Grammarly Inc.
hZp://vorushin.ru hZp://twiZer.com/vorushin