Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Filtering n-grams using Machine Learning
Search
Sponsored
·
SiteGround - Reliable hosting with speed, security, and support you can count on.
→
vorushin
April 06, 2012
Programming
2
580
Filtering n-grams using Machine Learning
My lightning talk from first Kiev AI/NLP group meeting.
vorushin
April 06, 2012
Tweet
Share
Other Decks in Programming
See All in Programming
MUSUBIXとは
nahisaho
0
110
OCaml 5でモダンな並列プログラミングを Enjoyしよう!
haochenx
0
100
Fragmented Architectures
denyspoltorak
0
140
TerraformとStrands AgentsでAmazon Bedrock AgentCoreのSSO認証付きエージェントを量産しよう!
neruneruo
4
2.6k
Package Management Learnings from Homebrew
mikemcquaid
0
150
Claude Codeの「Compacting Conversation」を体感50%減! CLAUDE.md + 8 Skills で挑むコンテキスト管理術
kmurahama
1
830
今から始めるClaude Code超入門
448jp
7
7.3k
[KNOTS 2026登壇資料]AIで拡張‧交差する プロダクト開発のプロセス および携わるメンバーの役割
hisatake
0
200
gunshi
kazupon
1
150
Grafana:建立系統全知視角的捷徑
blueswen
0
310
OSSとなったswift-buildで Xcodeのビルドを差し替えられるため 自分でXcodeを直せる時代になっている ダイアモンド問題編
yimajo
3
590
AIフル活用時代だからこそ学んでおきたい働き方の心得
shinoyu
0
120
Featured
See All Featured
The Impact of AI in SEO - AI Overviews June 2024 Edition
aleyda
5
720
Taking LLMs out of the black box: A practical guide to human-in-the-loop distillation
inesmontani
PRO
3
2k
コードの90%をAIが書く世界で何が待っているのか / What awaits us in a world where 90% of the code is written by AI
rkaga
59
42k
Hiding What from Whom? A Critical Review of the History of Programming languages for Music
tomoyanonymous
2
390
Optimising Largest Contentful Paint
csswizardry
37
3.6k
Technical Leadership for Architectural Decision Making
baasie
1
230
First, design no harm
axbom
PRO
2
1.1k
Exploring anti-patterns in Rails
aemeredith
2
230
Ruling the World: When Life Gets Gamed
codingconduct
0
130
B2B Lead Gen: Tactics, Traps & Triumph
marketingsoph
0
48
Data-driven link building: lessons from a $708K investment (BrightonSEO talk)
szymonslowik
1
900
AI in Enterprises - Java and Open Source to the Rescue
ivargrimstad
0
1.1k
Transcript
Filtering n-‐grams using Machine Learning
Unsorted unigrams. 13M closetohome CMX309FLC AZ3
Lehanga indexterm.endofrang NIC3 N1NB Mirabadi phantomd ANOTHER.EXAMPLE awful63 Zabolotsky Dispencer cremonesi kind.The ECOOP'97 4.499E OrbitzSaver jellying ENr313 paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot getPluginPreferencesF lag backgroundCorrect DEDeutschland at'ai
Filtered with regexps, 10M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot DEDeutschland at'ai
Filtered with SVM, 2.5M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs nomalized Profesje Blogzerk imnot
Data • Good data: wikaonary words •
Bad data: words filtered out by regexps • Features – length of word – count of uppercase chars (excluding first one) – count of non-‐alpha chars – probability of word given 2-‐char n-‐grams – unigram frequency
Details • scikit-‐learn – python library for machine
learning • SVM with Gaussian kernel • O(# of features * N2) – O(# of features * N3) • 100k items in training data => 5 min on 2 Ghz • F1 = 0.98
Thank you! Roman Vorushin, Grammarly Inc.
hZp://vorushin.ru hZp://twiZer.com/vorushin