Filtering n-grams using Machine Learning
vorushin
April 06, 2012
Programming
My lightning talk from the first Kiev AI/NLP group meeting.
Transcript
Filtering n-grams using Machine Learning
Unsorted unigrams, 13M
closetohome CMX309FLC AZ3 Lehanga indexterm.endofrang NIC3 N1NB Mirabadi phantomd ANOTHER.EXAMPLE awful63 Zabolotsky Dispencer cremonesi kind.The ECOOP'97 4.499E OrbitzSaver jellying ENr313 paulxcs Campaoré überschreibt Püttmann nomalized Profesje Blogzerk imnot getPluginPreferencesFlag backgroundCorrect DEDeutschland at'ai
Filtered with regexps, 10M
closetohome lehanga Mirabadi phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs Campaoré überschreibt Püttmann nomalized Profesje Blogzerk imnot DEDeutschland at'ai
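The talk does not show the regexps themselves, so the following is only a sketch of what such a first-pass filter might look like. The two rules (letters and inner apostrophes only, no lowercase-to-uppercase transition inside a token) are assumptions inferred from which sample tokens survive on the slide; `looks_like_word` and `has_inner_camel_case` are hypothetical names. The slide also lowercases "Lehanga" and shows a stray "0", neither of which this sketch reproduces.

```python
import re

# Assumed rule 1: only letters (any script), optionally joined by apostrophes.
# [^\W\d_] matches a word character that is neither a digit nor an underscore.
LETTERS_ONLY = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def has_inner_camel_case(token):
    """Assumed rule 2: a lowercase letter directly followed by an uppercase one."""
    return any(a.islower() and b.isupper() for a, b in zip(token, token[1:]))


def looks_like_word(token):
    """Return True if the token passes both assumed rules."""
    return bool(LETTERS_ONLY.fullmatch(token)) and not has_inner_camel_case(token)


# Sample tokens from the "Unsorted unigrams" slide.
unigrams = [
    "closetohome", "CMX309FLC", "AZ3", "Lehanga", "indexterm.endofrang",
    "NIC3", "N1NB", "Mirabadi", "phantomd", "ANOTHER.EXAMPLE", "awful63",
    "Zabolotsky", "Dispencer", "cremonesi", "kind.The", "ECOOP'97", "4.499E",
    "OrbitzSaver", "jellying", "ENr313", "paulxcs", "Campaoré", "überschreibt",
    "Püttmann", "nomalized", "Profesje", "Blogzerk", "imnot",
    "getPluginPreferencesFlag", "backgroundCorrect", "DEDeutschland", "at'ai",
]

kept = [token for token in unigrams if looks_like_word(token)]
print(kept)
```

Rules like these are cheap to run but still let through misspellings and foreign words, which is the gap the SVM step addresses.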
Filtered with SVM, 2.5M
closetohome lehanga Mirabadi phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs nomalized Profesje Blogzerk imnot
Data
• Good data: Wiktionary words
• Bad data: words filtered out by regexps
• Features
  - length of word
  - count of uppercase chars (excluding the first one)
  - count of non-alpha chars
  - probability of word given 2-char n-grams
  - unigram frequency
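The five features listed above could be computed roughly as follows. This is a sketch under assumptions, not the code from the talk: the slide does not say how the 2-char n-gram probability was estimated, so the character bigram model with boundary markers and add-one smoothing is a guess, and `train_bigram_model`, `bigram_log_prob`, `extract_features`, and the tiny `good_words` list are all hypothetical.

```python
import math
from collections import Counter


def train_bigram_model(words):
    """Count character bigrams over known-good words, with ^ and $ as boundaries."""
    pair_counts, first_counts, alphabet = Counter(), Counter(), set()
    for word in words:
        padded = "^" + word.lower() + "$"
        alphabet.update(padded)
        for a, b in zip(padded, padded[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
    return pair_counts, first_counts, len(alphabet)


def bigram_log_prob(word, model):
    """Average log-probability per character transition (add-one smoothing,
    with one extra slot reserved for characters never seen in training)."""
    pair_counts, first_counts, alphabet_size = model
    padded = "^" + word.lower() + "$"
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        total += math.log((pair_counts[(a, b)] + 1) / (first_counts[a] + alphabet_size + 1))
    return total / (len(padded) - 1)


def extract_features(word, model, unigram_counts):
    """Return the five features from the slide as a flat list."""
    return [
        len(word),                                # length of word
        sum(ch.isupper() for ch in word[1:]),     # uppercase chars, excluding the first
        sum(not ch.isalpha() for ch in word),     # non-alpha chars
        bigram_log_prob(word, model),             # score under 2-char n-grams
        unigram_counts.get(word, 0),              # unigram frequency
    ]


# Hypothetical stand-in for a dictionary word list.
good_words = ["close", "home", "normal", "jelly", "word", "filter"]
model = train_bigram_model(good_words)

print(extract_features("ECOOP'97", model, {}))
print(extract_features("home", model, {"home": 120}))
```

Averaging the log-probability over transitions keeps the score comparable across words of different lengths; the word length is already available to the classifier as its own feature.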
Details
• scikit-learn: Python library for machine learning
• SVM with Gaussian kernel
• Training cost between O(# of features * N^2) and O(# of features * N^3), where N is the number of training items
• 100k items in training data => 5 min on 2 GHz
• F1 = 0.98
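Two of the terms on this slide have short standard definitions, sketched here in plain Python for reference. The Gaussian (RBF) kernel is K(x, y) = exp(-gamma * ||x - y||^2), and F1 is the harmonic mean of precision and recall. The `gamma` values and the toy labels below are illustrative assumptions, not numbers from the talk.

```python
import math


def gaussian_kernel(x, y, gamma):
    """Gaussian (RBF) kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def f1_score(y_true, y_pred, positive=1):
    """F1 = 2 * precision * recall / (precision + recall) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Identical feature vectors have similarity 1; it decays with distance.
print(gaussian_kernel([3, 0, 0], [3, 0, 0], gamma=0.1))
print(gaussian_kernel([0, 0], [1, 1], gamma=0.5))

# Toy labels: 1 = good word, 0 = bad word.
print(f1_score([1, 1, 1, 0, 0], [1, 1, 0, 1, 0]))
```

In scikit-learn this kernel is selected with `kernel="rbf"` on `sklearn.svm.SVC`, and the same metric is available as `sklearn.metrics.f1_score`.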
Thank you!
Roman Vorushin, Grammarly Inc.
http://vorushin.ru
http://twitter.com/vorushin