Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Filtering n-grams using Machine Learning
Search
Sponsored
·
SiteGround - Reliable hosting with speed, security, and support you can count on.
→
vorushin
April 06, 2012
Programming
2
580
Filtering n-grams using Machine Learning
My lightning talk from first Kiev AI/NLP group meeting.
vorushin
April 06, 2012
Tweet
Share
Other Decks in Programming
See All in Programming
AIと一緒にレガシーに向き合ってみた
nyafunta9858
0
250
なぜSQLはAIぽく見えるのか/why does SQL look AI like
florets1
0
480
並行開発のためのコードレビュー
miyukiw
0
1.1k
Honoを使ったリモートMCPサーバでAIツールとの連携を加速させる!
tosuri13
1
180
QAフローを最適化し、品質水準を満たしながらリリースまでの期間を最短化する #RSGT2026
shibayu36
2
4.4k
Data-Centric Kaggle
isax1015
2
780
【卒業研究】会話ログ分析によるユーザーごとの関心に応じた話題提案手法
momok47
0
200
それ、本当に安全? ファイルアップロードで見落としがちなセキュリティリスクと対策
penpeen
7
4k
そのAIレビュー、レビューしてますか? / Are you reviewing those AI reviews?
rkaga
6
4.6k
Gemini for developers
meteatamel
0
100
React 19でつくる「気持ちいいUI」- 楽観的UIのすすめ
himorishige
11
7.5k
AIによる開発の民主化を支える コンテキスト管理のこれまでとこれから
mulyu
3
470
Featured
See All Featured
Keith and Marios Guide to Fast Websites
keithpitt
413
23k
Avoiding the “Bad Training, Faster” Trap in the Age of AI
tmiket
0
79
Visual Storytelling: How to be a Superhuman Communicator
reverentgeek
2
440
How To Stay Up To Date on Web Technology
chriscoyier
791
250k
Claude Code どこまでも/ Claude Code Everywhere
nwiizo
61
52k
Navigating Team Friction
lara
192
16k
職位にかかわらず全員がリーダーシップを発揮するチーム作り / Building a team where everyone can demonstrate leadership regardless of position
madoxten
58
50k
SEO Brein meetup: CTRL+C is not how to scale international SEO
lindahogenes
0
2.4k
Kristin Tynski - Automating Marketing Tasks With AI
techseoconnect
PRO
0
150
"I'm Feeling Lucky" - Building Great Search Experiences for Today's Users (#IAC19)
danielanewman
231
22k
A brief & incomplete history of UX Design for the World Wide Web: 1989–2019
jct
1
300
Templates, Plugins, & Blocks: Oh My! Creating the theme that thinks of everything
marktimemedia
31
2.7k
Transcript
Filtering n-‐grams using Machine Learning
Unsorted unigrams. 13M closetohome CMX309FLC AZ3
Lehanga indexterm.endofrang NIC3 N1NB Mirabadi phantomd ANOTHER.EXAMPLE awful63 Zabolotsky Dispencer cremonesi kind.The ECOOP'97 4.499E OrbitzSaver jellying ENr313 paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot getPluginPreferencesF lag backgroundCorrect DEDeutschland at'ai
Filtered with regexps, 10M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot DEDeutschland at'ai
Filtered with SVM, 2.5M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs nomalized Profesje Blogzerk imnot
Data • Good data: wikaonary words •
Bad data: words filtered out by regexps • Features – length of word – count of uppercase chars (excluding first one) – count of non-‐alpha chars – probability of word given 2-‐char n-‐grams – unigram frequency
Details • scikit-‐learn – python library for machine
learning • SVM with Gaussian kernel • O(# of features * N2) – O(# of features * N3) • 100k items in training data => 5 min on 2 Ghz • F1 = 0.98
Thank you! Roman Vorushin, Grammarly Inc.
hZp://vorushin.ru hZp://twiZer.com/vorushin