Classifying the Web
with GoGuardian
Christopher Rivera
Slide 2
Slide 2 text
The Company: GoGuardian
GoGuardian builds software to help schools and teachers
monitor student web usage.
Slide 3
Slide 3 text
The Problem: Classify Websites
The web is huge, and students monitored by GoGuardian have already visited over 14
million different host sites.
Slide 4
Slide 4 text
The “Data”:
GoGuardian provided 2.6 million previously classified URLs.
I collected 50,000 and created a set of 22,752 for modeling.
Slide 5
Slide 5 text
Developing the Classifier:
I trained Random Forests, Naive Bayes, and SVM classifiers.
Pipeline: URLs → HTML Processing → Parse Text → Vectorize (TF-IDF) → Chi2 → Random Forest
Data split: 75% Train (17,064) with 10-Fold Cross Validation; 25% Test (5,688)
Slide 6
Slide 6 text
Random Forest Performance
Test accuracy: 62.5%. Random accuracy: 7.01%. Lift: 7.4x.
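The slide does not say how the random-accuracy baseline or the lift was computed. One common convention, shown here purely as an assumption, takes the baseline to be the expected accuracy of guessing labels at random in proportion to their class frequencies (the sum of squared class proportions) and the lift to be accuracy divided by that baseline. The labels and the function names `accuracy`, `random_baseline`, and `lift` are made up for illustration, and the slide's own figures may follow a different definition.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def random_baseline(y_true):
    """Expected accuracy of guessing in proportion to class frequency.

    Assumed definition: sum over classes of p_k squared.
    """
    n = len(y_true)
    return sum((count / n) ** 2 for count in Counter(y_true).values())


def lift(acc, baseline):
    """Assumed definition: ratio of model accuracy to baseline accuracy."""
    return acc / baseline


# Hypothetical labels: 5 "games", 3 "news", 2 "education".
y_true = ["games"] * 5 + ["news"] * 3 + ["education"] * 2
# Hypothetical predictions with 8 of 10 correct.
y_pred = ["games"] * 5 + ["news"] * 2 + ["games"] + ["education"] + ["news"]

acc = accuracy(y_true, y_pred)
base = random_baseline(y_true)
model_lift = lift(acc, base)
```

With many categories of uneven size, this kind of baseline can sit well below 1 / (number of classes) or above it, which is why it is worth reporting next to the raw accuracy.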
Slide 7
Slide 7 text
The classifier classifies my favorites:
To validate, I looked at several websites manually.
Slide 8
Slide 8 text
The classifier classifies my favorites:
To validate, I looked at several websites manually.
Slide 9
Slide 9 text
The classifier classifies my favorites:
To validate, I looked at several websites manually.
Slide 10
Slide 10 text
Sites the Students Visit:
2 weeks of records for 600 students.
Slide 11
Slide 11 text
Christopher Rivera
Biochemist/Systems Biologist
[Figure: "Phyla over Time in Digester 1" and "Phyla over Time in Digester 2". x-axis: Day; y-axis: Relative Abundance; legend: Phylum (Bacteroidetes, Caldiserica, Chloroflexi, Euryarchaeota, Firmicutes, Proteobacteria, Spirochaetes, Tenericutes, Thermotogae, WS6)]
Slide 12
Slide 12 text
Sites the Students Visit:
2 weeks of records for 600 students.
Slide 13
Slide 13 text
Random Forest Performance
Test accuracy: 62.5%. Random accuracy: 7.01%. Lift: 7.4x.