Lock in $30 Savings on PRO—Offer Ends Soon! ⏳
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Filtering n-grams using Machine Learning
Search
vorushin
April 06, 2012
Programming
2
580
Filtering n-grams using Machine Learning
My lightning talk from first Kiev AI/NLP group meeting.
vorushin
April 06, 2012
Tweet
Share
Other Decks in Programming
See All in Programming
AIエージェントを活かすPM術 AI駆動開発の現場から
gyuta
0
430
愛される翻訳の秘訣
kishikawakatsumi
3
330
LLMで複雑な検索条件アセットから脱却する!! 生成的検索インタフェースの設計論
po3rin
3
810
AIコーディングエージェント(Gemini)
kondai24
0
230
MAP, Jigsaw, Code Golf 振り返り会 by 関東Kaggler会|Jigsaw 15th Solution
hasibirok0
0
250
AWS CDKの推しポイントN選
akihisaikeda
1
240
Tinkerbellから学ぶ、Podで DHCPをリッスンする手法
tomokon
0
130
LLM Çağında Backend Olmak: 10 Milyon Prompt'u Milisaniyede Sorgulamak
selcukusta
0
120
ID管理機能開発の裏側 高速にSaaS連携を実現したチームのAI活用編
atzzcokek
0
230
マスタデータ問題、マイクロサービスでどう解くか
kts
0
110
Canon EOS R50 V と R5 Mark II 購入でみえてきた最近のデジイチ VR180 事情、そして VR180 静止画に活路を見出すまで
karad
0
120
251126 TestState APIってなんだっけ?Step Functionsテストどう変わる?
east_takumi
0
320
Featured
See All Featured
Building a Modern Day E-commerce SEO Strategy
aleyda
45
8.3k
What's in a price? How to price your products and services
michaelherold
246
13k
Building Adaptive Systems
keathley
44
2.9k
The Success of Rails: Ensuring Growth for the Next 100 Years
eileencodes
47
7.9k
Refactoring Trust on Your Teams (GOTO; Chicago 2020)
rmw
35
3.3k
The Straight Up "How To Draw Better" Workshop
denniskardys
239
140k
Raft: Consensus for Rubyists
vanstee
141
7.2k
Building Applications with DynamoDB
mza
96
6.8k
Designing for Performance
lara
610
69k
Thoughts on Productivity
jonyablonski
73
5k
[RailsConf 2023 Opening Keynote] The Magic of Rails
eileencodes
31
9.8k
Balancing Empowerment & Direction
lara
5
800
Transcript
Filtering n-‐grams using Machine Learning
Unsorted unigrams. 13M closetohome CMX309FLC AZ3
Lehanga indexterm.endofrang NIC3 N1NB Mirabadi phantomd ANOTHER.EXAMPLE awful63 Zabolotsky Dispencer cremonesi kind.The ECOOP'97 4.499E OrbitzSaver jellying ENr313 paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot getPluginPreferencesF lag backgroundCorrect DEDeutschland at'ai
Filtered with regexps, 10M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot DEDeutschland at'ai
Filtered with SVM, 2.5M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs nomalized Profesje Blogzerk imnot
Data • Good data: wikaonary words •
Bad data: words filtered out by regexps • Features – length of word – count of uppercase chars (excluding first one) – count of non-‐alpha chars – probability of word given 2-‐char n-‐grams – unigram frequency
Details • scikit-‐learn – python library for machine
learning • SVM with Gaussian kernel • O(# of features * N2) – O(# of features * N3) • 100k items in training data => 5 min on 2 Ghz • F1 = 0.98
Thank you! Roman Vorushin, Grammarly Inc.
hZp://vorushin.ru hZp://twiZer.com/vorushin