Insight Project

Insight Project for classifying websites

mrchristophr

October 04, 2015

Transcript

  1. The Problem: Classify Websites. The web is huge and their students have already visited over 14 million different host sites.
  2. The “Data”: Go Guardian provided 2.6 million previously classified URLs. I collected 50,000 and created a set of 22,752 for modeling.
  3. Developing the Classifier: I trained Random Forests, Naive Bayes, and SVM classifiers.
    [Pipeline diagram: URLs, HTML, Parse Text, Processing, Vectorize (TF-IDF), Chi2, Random Forest. Data split: 75% Train (17,064), 10-Fold Cross-Validation, 25% Test (5,688).]
  4. Christopher Rivera, Biochemist/Systems Biologist.
    [Figures: “Phyla over Time in Digester 1” and “Phyla over Time in Digester 2”. x-axis: Day; y-axis: Relative Abundance; legend: Phylum (Bacteroidetes, Caldiserica, Chloroflexi, Euryarchaeota, Firmicutes, Proteobacteria, Spirochaetes, Tenericutes, Thermotogae, WS6).]
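
The classifier workflow named in slide 3 (TF-IDF vectorization, chi-squared feature selection, a random forest, a 75/25 split, and 10-fold cross-validation) can be sketched roughly as follows. The slides show no code, so this is only an illustration under assumptions: it uses scikit-learn, a small made-up two-category corpus in place of the Go Guardian data, and hypothetical names and settings (`make_toy_corpus`, `build_pipeline`, `run_demo`, `k_features=8`, `n_estimators=50`). The order of the steps is inferred from the slide's diagram labels.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline


def make_toy_corpus():
    """Return hypothetical page texts and labels (not the Go Guardian data)."""
    sports = ["football", "league", "match", "goal", "team", "score"]
    news = ["election", "policy", "market", "senate", "budget", "economy"]
    texts, labels = [], []
    for i in range(20):
        # Each toy "page" is four consecutive words from one category's vocabulary.
        texts.append(" ".join(sports[(i + j) % 6] for j in range(4)))
        labels.append("sports")
        texts.append(" ".join(news[(i + j) % 6] for j in range(4)))
        labels.append("news")
    return texts, labels


def build_pipeline(k_features=8, seed=0):
    """TF-IDF vectorizer -> chi-squared feature selection -> random forest."""
    return Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("chi2", SelectKBest(chi2, k=k_features)),  # k is an assumed value
        ("forest", RandomForestClassifier(n_estimators=50, random_state=seed)),
    ])


def run_demo(seed=0):
    texts, labels = make_toy_corpus()
    # 75% train / 25% test, stratified so both classes appear in each part.
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.25, stratify=labels, random_state=seed
    )
    model = build_pipeline(seed=seed)
    # 10-fold cross-validation on the training portion only.
    cv_scores = cross_val_score(model, x_train, y_train, cv=10)
    # Refit on the full training portion, then score the held-out test set.
    model.fit(x_train, y_train)
    return {
        "n_train": len(x_train),
        "n_test": len(x_test),
        "cv_scores": cv_scores,
        "test_accuracy": model.score(x_test, y_test),
    }


if __name__ == "__main__":
    results = run_demo()
    print("train/test sizes:", results["n_train"], results["n_test"])
    print("mean CV accuracy:", results["cv_scores"].mean())
    print("test accuracy:", results["test_accuracy"])
```

Applied to the 22,752 pages from slide 2, the same `test_size=0.25` split gives 17,064 training and 5,688 test examples, which matches the counts on slide 3. Wrapping the vectorizer and feature selector in a `Pipeline` means they are refit inside each cross-validation fold, so held-out text does not influence the selected features.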