Pandora Ad Classification

Summary of a grad class project built to match advertisements to their companies via Machine Learning algorithms.

Greg Ziegan

December 18, 2014

Transcript

  1. WHY PANDORA? The company I work for, Vertical Knowledge, works

    on consulting projects for hedge funds, government departments, and other private agencies. Hedge funds want something from us: insights.
  2. A company like Pandora is a freemium service. It provides

    a usable platform for free users and incentives for premium members.
  3. Pandora needs to sustain and profit from even its free

    service. Premium members are charged a fee, but this revenue does not provide the company with large enough profits.
  4. In order to sustain its free service, Pandora will show

    advertisements from companies who believe these ads will somehow coax the user into visiting their company site.
  5. If we can discover the distribution of ads shown by

    these external companies, we may get a roughly accurate view of which companies have invested in Pandora and how much each has invested relative to the others.
  6. BUT WAIT, WHY NOT JUST FOLLOW THE AD'S LINK? We

    do not want to alter traffic to these sites. We are classifying across hundreds of stations and geographically distributed IPs. Why not just look at the link's URL? The ads are embedded in the audio player (making the links hard to find), and their URLs are often shortened and made unrecognizable.
  7. CLASSIFICATION: THE APPROACH We will take a moment here to

    cite Gary Doran, as he has helped us greatly in understanding unsupervised feature detection and multiple instance algorithms.
  8. STEPS TO SUCCESS 1. Image Segmentation 2. Feature Preparation 3.

    Kernel Selection 4. SVM and MISVM 5. Profit
  9. We sent the pixel data for each image through an

    implementation of the k-means clustering algorithm.
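
    A minimal sketch of this step, assuming scikit-learn's KMeans and scikit-image for loading images; the file name and cluster count below are placeholders, not the project's actual values:

        from skimage import io
        from sklearn.cluster import KMeans

        # Load one ad image and flatten it to a (num_pixels, 3) array of RGB values.
        image = io.imread("ad_example.png")[:, :, :3]
        pixels = image.reshape(-1, 3).astype(float)

        # Cluster the pixel colors; n_clusters is an illustrative choice.
        kmeans = KMeans(n_clusters=8, random_state=0).fit(pixels)
        cluster_map = kmeans.labels_.reshape(image.shape[:2])
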
  10. QUICKSHIFT "Segments image using quickshift clustering in Color-(x,y) space. Produces

    an oversegmentation of the image using the quickshift mode-seeking algorithm."
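
    The quoted description appears to be scikit-image's quickshift docstring; a usage sketch with illustrative parameters rather than the project's tuned values:

        from skimage import io
        from skimage.segmentation import quickshift
        from skimage.util import img_as_float

        image = img_as_float(io.imread("ad_example.png")[:, :, :3])
        # kernel_size, max_dist, and ratio are illustrative, not tuned values.
        segments = quickshift(image, kernel_size=3, max_dist=6, ratio=0.5)
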
  11. We thought quickshift meant quick. It was not. The algorithm

    took ~5 seconds to segment an image... total clustering time: 40 minutes :(
  12. We were very happy with SLIC and after reviewing the

    third algorithm we decided against it since we couldn't pronounce it.
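
    For comparison, a sketch of the scikit-image SLIC call the project appears to have settled on; n_segments and compactness are placeholder values:

        from skimage import io
        from skimage.segmentation import slic
        from skimage.util import img_as_float

        image = img_as_float(io.imread("ad_example.png")[:, :, :3])
        # n_segments and compactness are placeholders, not the project's settings.
        segments = slic(image, n_segments=50, compactness=10)
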
  13. Once we had clusters, we took the average RGB value

    for each cluster as a feature for the training set.
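
    A sketch of that feature extraction, assuming image is an RGB array and segments is the per-pixel cluster label map from the segmentation step:

        import numpy as np

        def cluster_color_features(image, segments):
            """Return one (R, G, B) mean-color feature per cluster."""
            features = []
            for label in np.unique(segments):
                mask = segments == label
                features.append(image[mask].mean(axis=0))
            return np.array(features)  # shape: (num_clusters, 3)
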
  14. We discussed adding Gabor wavelets to the clustering algorithms and

    using the refined clusters' RGB values/texture features as the example set.
  15. However, another task at Vertical Knowledge would deal with recognizing

    objects with texture, orientation, and depth. It is very likely more complicated features, including Gabor wavelets, would be needed to classify such an image.
  16. SVM

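
    The deck does not show the classifier setup; a minimal, self-contained sketch of fitting a standard SVM on per-image feature vectors with scikit-learn (the data here is synthetic and the kernel and C values are placeholders):

        import numpy as np
        from sklearn.model_selection import train_test_split
        from sklearn.svm import SVC

        # Placeholder data: one aggregated feature vector per ad image, one company label each.
        X = np.random.rand(515, 24)            # e.g. 8 clusters x 3 mean RGB values per image
        y = np.random.randint(0, 5, size=515)  # index of the company the ad belongs to

        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
        clf = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)
        print(clf.score(X_test, y_test))
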
  17. The multiple instance learner is still being tested on the

    data set. We're currently getting warnings and all zeroes for predictions.
  18. It's not nearly as quick as the SVM, as it

    considers all the instances in the bag built from a single image.
  19. The results are showing that some parameter is not tuned

    correctly. We're confident the algorithm will perform at least as well as the SVM, since Gary's implementation has previously achieved 99% accuracy on a similar data set.
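
    Slide 7 credits Gary Doran, so the multiple instance learner is presumably his misvm package; a usage sketch with placeholder bags, where each image becomes a bag of per-cluster feature vectors labeled +1 or -1 for a given company:

        import numpy as np
        import misvm  # Gary Doran's multiple-instance SVM package

        # Placeholder bags: one (num_clusters, 3) array of mean cluster colors per image.
        bags = [np.random.rand(8, 3) for _ in range(50)]
        labels = np.sign(np.random.rand(50) - 0.5)  # +1 if the ad is from the target company

        classifier = misvm.MISVM(kernel="linear", C=1.0, max_iters=50)
        classifier.fit(bags, labels)
        predictions = classifier.predict(bags)
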
  20. There are 20 other companies with more than 30 examples, but

    many of the remaining companies have fewer than 5 examples to train on.
  21. We discussed having the SVM be retrained at each new

    labeled instance. Another possibility was to add a feature for the time the ad was found, weighting more recent ads more heavily while still keeping the training set at a reasonable size.
  22. But while the MISVM took orders of magnitude longer, the SVM took

    less than a minute on 515 images. This means the first suggestion, retraining on each new label, is very feasible for the time being.
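
    A sketch of the retrain-on-every-new-label approach, with placeholder names; per the timing above, a full refit on roughly 515 images finishes in under a minute:

        import numpy as np
        from sklearn.svm import SVC

        def retrain(X, y, new_x, new_y):
            """Append the newly labeled ad image and refit the SVM from scratch."""
            X = np.vstack([X, np.asarray(new_x).reshape(1, -1)])
            y = np.append(y, new_y)
            clf = SVC(kernel="rbf", C=1.0).fit(X, y)
            return clf, X, y
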
  23. FIN