Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Filtering n-grams using Machine Learning
Search
vorushin
April 06, 2012
Programming
2
580
Filtering n-grams using Machine Learning
My lightning talk from first Kiev AI/NLP group meeting.
vorushin
April 06, 2012
Tweet
Share
Other Decks in Programming
See All in Programming
【第4回】関東Kaggler会「Kaggleは執筆に役立つ」
mipypf
0
690
Honoアップデート 2025年夏
yusukebe
1
830
Jakarta EE Core Profile and Helidon - Speed, Simplicity, and AI Integration
ivargrimstad
0
140
LLMは麻雀を知らなすぎるから俺が教育してやる
po3rin
3
2.2k
「リーダーは意思決定する人」って本当?~ 学びを現場で活かす、リーダー4ヶ月目の試行錯誤 ~
marina1017
0
230
大規模FlutterプロジェクトのCI実行時間を約8割削減した話
teamlab
PRO
0
490
Flutterと Vibe Coding で個人開発!
hyshu
1
260
兎に角、コードレビュー
mitohato14
0
140
サーバーサイドのビルド時間87倍高速化
plaidtech
PRO
0
370
STUNMESH-go: Wireguard NAT穿隧工具的源起與介紹
tjjh89017
0
380
ゲームの物理
fadis
5
1.5k
TanStack DB ~状態管理の新しい考え方~
bmthd
2
260
Featured
See All Featured
XXLCSS - How to scale CSS and keep your sanity
sugarenia
248
1.3M
YesSQL, Process and Tooling at Scale
rocio
173
14k
Why Our Code Smells
bkeepers
PRO
338
57k
Creating an realtime collaboration tool: Agile Flush - .NET Oxford
marcduiker
31
2.2k
Git: the NoSQL Database
bkeepers
PRO
431
65k
Templates, Plugins, & Blocks: Oh My! Creating the theme that thinks of everything
marktimemedia
31
2.5k
Gamification - CAS2011
davidbonilla
81
5.4k
Automating Front-end Workflow
addyosmani
1370
200k
Docker and Python
trallard
45
3.5k
No one is an island. Learnings from fostering a developers community.
thoeni
21
3.4k
Raft: Consensus for Rubyists
vanstee
140
7.1k
[RailsConf 2023] Rails as a piece of cake
palkan
56
5.8k
Transcript
Filtering n-‐grams using Machine Learning
Unsorted unigrams. 13M closetohome CMX309FLC AZ3
Lehanga indexterm.endofrang NIC3 N1NB Mirabadi phantomd ANOTHER.EXAMPLE awful63 Zabolotsky Dispencer cremonesi kind.The ECOOP'97 4.499E OrbitzSaver jellying ENr313 paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot getPluginPreferencesF lag backgroundCorrect DEDeutschland at'ai
Filtered with regexps, 10M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot DEDeutschland at'ai
Filtered with SVM, 2.5M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs nomalized Profesje Blogzerk imnot
Data • Good data: wikaonary words •
Bad data: words filtered out by regexps • Features – length of word – count of uppercase chars (excluding first one) – count of non-‐alpha chars – probability of word given 2-‐char n-‐grams – unigram frequency
Details • scikit-‐learn – python library for machine
learning • SVM with Gaussian kernel • O(# of features * N2) – O(# of features * N3) • 100k items in training data => 5 min on 2 Ghz • F1 = 0.98
Thank you! Roman Vorushin, Grammarly Inc.
hZp://vorushin.ru hZp://twiZer.com/vorushin