Slide 1

Slide 1 text

Classifying the Web with GoGuardian, by Christopher Rivera

Slide 2

Slide 2 text

The Company: GoGuardian. GoGuardian builds software to help schools and teachers monitor student web usage.

Slide 3

Slide 3 text

The Problem: Classify Websites. The web is huge, and students monitored by GoGuardian have already visited over 14 million different host sites.

Slide 4

Slide 4 text

The “Data”: GoGuardian provided 2.6 million previously classified URLs. I collected 50,000 of these and created a set of 22,752 for modeling.

Slide 5

Slide 5 text

Developing the Classifier: I trained random forest, naive Bayes, and SVM classifiers.
Pipeline (from the slide diagram): URLs, HTML processing, text parsing, TF-IDF vectorization, Chi2 feature selection, random forest.
Data split: 75% train (17,064) with 10-fold cross-validation; 25% test (5,688).
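The modeling stages named on this slide can be sketched with scikit-learn. This is a minimal illustration under stated assumptions, not the code behind the talk: the toy corpus, the three category names, the number of selected features, the number of trees, and the stratified split are all hypothetical stand-ins chosen so the example runs quickly.

```python
import random

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy vocabulary standing in for text parsed from fetched pages.
VOCAB = {
    "games": ["play", "score", "level", "arcade", "puzzle"],
    "news": ["report", "election", "weather", "headline", "editor"],
    "education": ["lesson", "homework", "teacher", "quiz", "algebra"],
}


def make_corpus(docs_per_class=40, words_per_doc=8, seed=0):
    """Build short synthetic documents, each drawn from one category's words."""
    rng = random.Random(seed)
    texts, labels = [], []
    for label, words in VOCAB.items():
        for _ in range(docs_per_class):
            texts.append(" ".join(rng.choice(words) for _ in range(words_per_doc)))
            labels.append(label)
    return texts, labels


def build_pipeline(k_features=10):
    """TF-IDF vectorization, chi-squared feature selection, random forest."""
    return Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("chi2", SelectKBest(chi2, k=k_features)),  # k is an assumed value
        ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
    ])


texts, labels = make_corpus()

# 75% train / 25% test, as on the slide; stratification is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

model = build_pipeline()

# 10-fold cross-validation on the training portion only.
cv_scores = cross_val_score(model, X_train, y_train, cv=10)

# Final fit on all training data, then one evaluation on the held-out test set.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

print(f"10-fold CV accuracy: {cv_scores.mean():.3f}")
print(f"Held-out test accuracy: {test_accuracy:.3f}")
```

Wrapping the vectorizer and the feature selector in the same `Pipeline` as the classifier means each cross-validation fold refits them on that fold's training data only, so no information from the validation fold leaks into the features.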

Slide 6

Slide 6 text

Random Forest Performance: Test accuracy: 62.5%. Random accuracy: 7.01%. Lift: 7.4x.
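The slide does not say how lift or the random baseline were computed. The sketch below assumes the most common definition, lift as model accuracy divided by baseline accuracy, and shows one possible baseline (guessing each class in proportion to its frequency). Both function names are hypothetical. With the slide's figures this ratio comes to about 8.9 rather than 7.4, so the slide's lift may rest on a different definition or on different underlying numbers.

```python
def lift(model_accuracy, baseline_accuracy):
    """Ratio of model accuracy to baseline accuracy (assumed definition)."""
    if baseline_accuracy <= 0:
        raise ValueError("baseline_accuracy must be positive")
    return model_accuracy / baseline_accuracy


def proportional_guess_accuracy(class_counts):
    """Expected accuracy when guessing each class with its own frequency.

    This is one possible meaning of "random accuracy"; the slide does not
    confirm it. For class proportions p_i the expectation is sum(p_i ** 2).
    """
    total = sum(class_counts)
    return sum((count / total) ** 2 for count in class_counts)


# Figures from the slide, expressed as fractions.
slide_lift = lift(0.625, 0.0701)
print(f"Lift under the assumed definition: {slide_lift:.2f}")

# Illustrative only: two equally sized classes give a 50% random baseline.
print(proportional_guess_accuracy([50, 50]))
```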

Slide 7

Slide 7 text

The classifier classifies my favorites: To validate, I looked at several websites manually.

Slide 8

Slide 8 text

The classifier classifies my favorites: To validate, I looked at several websites manually.

Slide 9

Slide 9 text

The classifier classifies my favorites: To validate, I looked at several websites manually.

Slide 10

Slide 10 text

Sites the Students Visit: 2 weeks of records for 600 students.

Slide 11

Slide 11 text

Christopher Rivera, Biochemist/Systems Biologist
[Figures: "Phyla over Time in Digester 1" and "Phyla over Time in Digester 2". Relative abundance by day for each phylum: Bacteroidetes, Caldiserica, Chloroflexi, Euryarchaeota, Firmicutes, Proteobacteria, Spirochaetes, Tenericutes, Thermotogae, WS6.]

Slide 12

Slide 12 text

Sites the Students Visit: 2 weeks of records for 600 students.

Slide 13

Slide 13 text

Random Forest Performance: Test accuracy: 62.5%. Random accuracy: 7.01%. Lift: 7.4x.