Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Filtering n-grams using Machine Learning
Search
Sponsored
·
Ship Features Fearlessly
Turn features on and off without deploys. Used by thousands of Ruby developers.
→
vorushin
April 06, 2012
Programming
2
580
Filtering n-grams using Machine Learning
My lightning talk from first Kiev AI/NLP group meeting.
vorushin
April 06, 2012
Tweet
Share
Other Decks in Programming
See All in Programming
NOT A HOTEL - 建築や人と融合し、自由を創り出すソフトウェア
not_a_hokuts
2
970
米国のサイバーセキュリティタイムラインと見る Goの暗号パッケージの進化
tomtwinkle
2
540
Agent Skills Workshop - AIへの頼み方を仕組み化する
gotalab555
15
8.3k
20260228_JAWS_Beginner_Kansai
takuyay0ne
5
470
どんと来い、データベース信頼性エンジニアリング / Introduction to DBRE
nnaka2992
1
260
AI主導でFastAPIのWebサービスを作るときに 人間が構造化すべき境界線
okajun35
0
670
Ruby x Terminal
a_matsuda
7
590
Go Conference mini in Sendai 2026 : Goに新機能を提案し実装されるまでのフロー徹底解説
yamatoya
0
550
go directiveを最新にしすぎないで欲しい話──あるいは、Go 1.26からgo mod initで作られるgo directiveの値が変わる話 / Go 1.26 リリースパーティ
arthur1
2
520
AI時代でも変わらない技術コミュニティの力~10年続く“ゆるい”つながりが生み出す価値
n_takehata
2
690
Railsの気持ちを考えながらコントローラとビューを整頓する/tidying-rails-controllers-and-views-as-rails-think
moro
5
390
AI活用のコスパを最大化する方法
ochtum
0
130
Featured
See All Featured
Facilitating Awesome Meetings
lara
57
6.8k
Neural Spatial Audio Processing for Sound Field Analysis and Control
skoyamalab
0
200
Noah Learner - AI + Me: how we built a GSC Bulk Export data pipeline
techseoconnect
PRO
0
130
The Cost Of JavaScript in 2023
addyosmani
55
9.8k
The AI Revolution Will Not Be Monopolized: How open-source beats economies of scale, even for LLMs
inesmontani
PRO
3
3.1k
CoffeeScript is Beautiful & I Never Want to Write Plain JavaScript Again
sstephenson
162
16k
What's in a price? How to price your products and services
michaelherold
247
13k
Cheating the UX When There Is Nothing More to Optimize - PixelPioneers
stephaniewalter
287
14k
Testing 201, or: Great Expectations
jmmastey
46
8.1k
What the history of the web can teach us about the future of AI
inesmontani
PRO
1
460
Producing Creativity
orderedlist
PRO
348
40k
10 Git Anti Patterns You Should be Aware of
lemiorhan
PRO
659
61k
Transcript
Filtering n-‐grams using Machine Learning
Unsorted unigrams. 13M closetohome CMX309FLC AZ3
Lehanga indexterm.endofrang NIC3 N1NB Mirabadi phantomd ANOTHER.EXAMPLE awful63 Zabolotsky Dispencer cremonesi kind.The ECOOP'97 4.499E OrbitzSaver jellying ENr313 paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot getPluginPreferencesF lag backgroundCorrect DEDeutschland at'ai
Filtered with regexps, 10M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot DEDeutschland at'ai
Filtered with SVM, 2.5M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs nomalized Profesje Blogzerk imnot
Data • Good data: wikaonary words •
Bad data: words filtered out by regexps • Features – length of word – count of uppercase chars (excluding first one) – count of non-‐alpha chars – probability of word given 2-‐char n-‐grams – unigram frequency
Details • scikit-‐learn – python library for machine
learning • SVM with Gaussian kernel • O(# of features * N2) – O(# of features * N3) • 100k items in training data => 5 min on 2 Ghz • F1 = 0.98
Thank you! Roman Vorushin, Grammarly Inc.
hZp://vorushin.ru hZp://twiZer.com/vorushin