Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Filtering n-grams using Machine Learning
Search
Sponsored
·
SiteGround - Reliable hosting with speed, security, and support you can count on.
→
vorushin
April 06, 2012
Programming
2
580
Filtering n-grams using Machine Learning
My lightning talk from first Kiev AI/NLP group meeting.
vorushin
April 06, 2012
Tweet
Share
Other Decks in Programming
See All in Programming
16年目のピクシブ百科事典を支える最新の技術基盤 / The Modern Tech Stack Powering Pixiv Encyclopedia in its 16th Year
ahuglajbclajep
5
950
AIエージェントの設計で注意するべきポイント6選
har1101
7
3.3k
CSC307 Lecture 03
javiergs
PRO
1
490
20260127_試行錯誤の結晶を1冊に。著者が解説 先輩データサイエンティストからの指南書 / author's_commentary_ds_instructions_guide
nash_efp
0
800
Python札幌 LT資料
t3tra
7
1.1k
CSC307 Lecture 05
javiergs
PRO
0
490
Apache Iceberg V3 and migration to V3
tomtanaka
0
120
AIエージェント、”どう作るか”で差は出るか? / AI Agents: Does the "How" Make a Difference?
rkaga
4
1.9k
Vibe codingでおすすめの言語と開発手法
uyuki234
0
210
Architectural Extensions
denyspoltorak
0
250
CSC307 Lecture 04
javiergs
PRO
0
650
AI Agent の開発と運用を支える Durable Execution #AgentsInProd
izumin5210
7
2.2k
Featured
See All Featured
Prompt Engineering for Job Search
mfonobong
0
150
Fireside Chat
paigeccino
41
3.8k
The Power of CSS Pseudo Elements
geoffreycrofte
80
6.1k
Responsive Adventures: Dirty Tricks From The Dark Corners of Front-End
smashingmag
254
22k
It's Worth the Effort
3n
188
29k
Art, The Web, and Tiny UX
lynnandtonic
304
21k
First, design no harm
axbom
PRO
2
1.1k
Building Experiences: Design Systems, User Experience, and Full Site Editing
marktimemedia
0
400
AI: The stuff that nobody shows you
jnunemaker
PRO
2
220
The MySQL Ecosystem @ GitHub 2015
samlambert
251
13k
Testing 201, or: Great Expectations
jmmastey
46
8k
The Web Performance Landscape in 2024 [PerfNow 2024]
tammyeverts
12
1k
Transcript
Filtering n-‐grams using Machine Learning
Unsorted unigrams. 13M closetohome CMX309FLC AZ3
Lehanga indexterm.endofrang NIC3 N1NB Mirabadi phantomd ANOTHER.EXAMPLE awful63 Zabolotsky Dispencer cremonesi kind.The ECOOP'97 4.499E OrbitzSaver jellying ENr313 paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot getPluginPreferencesF lag backgroundCorrect DEDeutschland at'ai
Filtered with regexps, 10M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot DEDeutschland at'ai
Filtered with SVM, 2.5M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs nomalized Profesje Blogzerk imnot
Data • Good data: wikaonary words •
Bad data: words filtered out by regexps • Features – length of word – count of uppercase chars (excluding first one) – count of non-‐alpha chars – probability of word given 2-‐char n-‐grams – unigram frequency
Details • scikit-‐learn – python library for machine
learning • SVM with Gaussian kernel • O(# of features * N2) – O(# of features * N3) • 100k items in training data => 5 min on 2 Ghz • F1 = 0.98
Thank you! Roman Vorushin, Grammarly Inc.
hZp://vorushin.ru hZp://twiZer.com/vorushin