Insight Project

Insight Project for classifying websites

mrchristophr

October 04, 2015

Transcript

  1. The Problem: Classify Websites. The web is huge, and their students have already visited over 14 million different host sites.
  2. The “Data”: Go Guardian provided 2.6 million previously classified URLs. I collected 50,000 and created a set of 22,752 for modeling.
  3. Developing the Classifier: I trained Random Forest, Naive Bayes, and SVM classifiers. [Pipeline diagram; labels: URLs, HTML Processing, Parse Text, Vectorize (TF-IDF), Chi2, Random Forest. Data split labels: 75% Train (17,064), 10-Fold Cross Validation, 25% Test (5,688).]
  4. Christopher Rivera, Biochemist/Systems Biologist. [Figures: “Phyla over Time in Digester 1” and “Phyla over Time in Digester 2”; x-axis: Day; y-axis: Relative Abundance; legend (Phylum): Bacteroidetes, Caldiserica, Chloroflexi, Euryarchaeota, Firmicutes, Proteobacteria, Spirochaetes, Tenericutes, Thermotogae, WS6.]
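
The workflow named on slide 3 (vectorize page text with TF-IDF, select features with Chi2, train a Random Forest, hold out 25% for testing, and run 10-fold cross-validation) could be sketched roughly as follows. This is not the author's code: it assumes scikit-learn, and the toy vocabularies, the class labels, the function names, and the `k_features` and `n_estimators` values are all hypothetical stand-ins for text parsed from real pages.

```python
import random

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical vocabularies; real input would be text parsed from each site's HTML.
CLASS_WORDS = {
    "sports": ["football", "score", "league", "team", "coach", "match"],
    "news": ["election", "economy", "report", "senate", "weather", "headline"],
}
FILLER_WORDS = ["home", "page", "click", "more", "about", "contact"]


def make_toy_pages(pages_per_class=20, seed=0):
    """Build a small labeled corpus of fake page texts (assumption, not real data)."""
    rng = random.Random(seed)
    texts, labels = [], []
    for label, words in CLASS_WORDS.items():
        for _ in range(pages_per_class):
            tokens = rng.sample(words, 3) + rng.sample(FILLER_WORDS, 2)
            texts.append(" ".join(tokens))
            labels.append(label)
    return texts, labels


def build_pipeline(k_features=8):
    """TF-IDF vectorizer -> Chi2 feature selection -> Random Forest."""
    return Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("chi2", SelectKBest(chi2, k=k_features)),
        ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
    ])


def run_sketch():
    texts, labels = make_toy_pages()
    # 75% train / 25% test, the same proportions as the deck's 17,064 / 5,688 split.
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.25, stratify=labels, random_state=0
    )
    model = build_pipeline()
    # 10-fold cross-validation on the training portion only.
    cv_scores = cross_val_score(model, X_train, y_train, cv=10)
    model.fit(X_train, y_train)
    return {
        "n_train": len(X_train),
        "n_test": len(X_test),
        "cv_scores": cv_scores,
        "test_accuracy": model.score(X_test, y_test),
    }


if __name__ == "__main__":
    result = run_sketch()
    print("mean CV accuracy:", result["cv_scores"].mean())
    print("held-out accuracy:", result["test_accuracy"])
```

Wrapping the three steps in a single `Pipeline` means the TF-IDF vocabulary and the Chi2 selection are refit inside each cross-validation fold, so the held-out fold does not leak into feature selection. Swapping the final step for a Naive Bayes or SVM estimator would cover the other two model families the slide mentions.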