(Really) naive data mining

Joël Cox
September 21, 2012

Presented at Python Users Nederland (PUN)

Transcript

  1. BEFORE WE BEGIN • I’m not a data mining expert

    • I’ll be showing extremely easy algorithms There is no learning aspect to these algorithms, that is, they don’t form a model of the data. You can implement these on the train home.
  2. K-MEANS select k centroids while centroids move: assign each point

    to closest centroid recalculate centroids end while Distance is defined by a function such as Euclidean or cosine distance (see the Python sketch after the transcript).
  3. We pick k = 3 here. This specific instance needs

    5 iterations to converge (i.e. the centroids no longer move). You will see that the green cluster slowly but surely reclaims its group of points.
  4. CLASSIFICATION • Find the label for a ‘thing’ • If

    it looks like a duck, swims like a duck and quacks like a duck, then it probably is a duck. The important thing to note here is that you need a training set that is already labelled. Example problem: determine the origin of a wine from its chemical composition.
  5. K-NEAREST NEIGHBOR for point in training set: calculate distance to

    tested point end for determine k closest training points (see the Python sketch after the transcript)
  6. IN ACTION X X X X Y Y Y Y

    k = 5 We’re trying to classify the point in the center. This algorithm handles noise pretty well; an algorithm that only takes the closest point into account would have classified this point as X, but this algorithm looks for the most frequently occurring label and thus classifies the point as Y.
  7. PERFORMANCE

                                      k = 5    k = 9
    Standard                          73.4%    72.9%
    Normalized                        96%      96.6%
    Normalized + cosine distance *    95.7%    95.7%
    * Cross validation for k = 3, 177 training records, 354 testing records

    If you know the domain and have a bit of mathematical knowledge you can easily tweak these simple algorithms to perform better in your domain. In this example we spotted a lot of variance in the attribute data, so we normalized by making sure the standard deviation for each column is 1 (http://en.wikipedia.org/wiki/Standard_score). A Python sketch of this normalization follows the transcript.
  8. SO DATA MINING IS EASY, RIGHT? Sample bias, Supervised learning,

    Incomplete data, Decision trees, Time complexity, Space complexity, Overfitting, Underfitting, Association rule discovery, Preprocessing, Regression, Anomaly detection, Visualization, Naive Bayes, Neural networks, Support vector machines. I only covered a small part; there is a lot to learn to become a data mining expert.
  9. TAKE AWAY Venture outside your own field! This is especially

    true for web devs. Tackle new problems, then apply this newfound knowledge to your own field.
  10. LEARN MORE • Orange http://orange.biolab.si/ • “Introduction to Data Mining”

    by Tan, Steinbach and Kumar • UC Irvine Machine Learning Repository http://archive.ics.uci.edu/ml/ • github.com/joelcox/miner Orange is a GUI application and Python package. The book is probably one of the better academic books I’ve seen, with lots of examples. You can use the UCI repository for testing data (but preferably your own data!); it also includes links to relevant papers. Miner is a toy package that implements both algorithms covered in this talk (not too pretty).
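
A minimal k-means sketch in plain Python, as referenced from the K-MEANS slide. It is an illustration under my own assumptions (points are tuples of numbers, Euclidean distance, random initial centroids), not the code used in the talk or in the miner package.

    import math
    import random

    def euclidean(a, b):
        # Straight-line distance between two points of the same dimension
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def kmeans(points, k, max_iterations=100):
        # Select k centroids (here simply k random points from the data)
        centroids = random.sample(points, k)
        for _ in range(max_iterations):
            # Assign each point to its closest centroid
            clusters = [[] for _ in range(k)]
            for point in points:
                index = min(range(k), key=lambda i: euclidean(point, centroids[i]))
                clusters[index].append(point)
            # Recalculate each centroid as the mean of its cluster
            new_centroids = []
            for i, cluster in enumerate(clusters):
                if cluster:
                    new_centroids.append(tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
                else:
                    # Keep an empty cluster's centroid where it is
                    new_centroids.append(centroids[i])
            # Stop once the centroids no longer move
            if new_centroids == centroids:
                break
            centroids = new_centroids
        return centroids, clusters

For the clustering example in the talk you would call kmeans(points, 3) and watch the centroids settle after a handful of iterations.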
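
A matching k-nearest neighbour sketch for the K-NEAREST NEIGHBOR and IN ACTION slides, again an illustrative plain-Python version rather than the talk's actual code. It assumes the training set is a list of (point, label) pairs and uses math.dist (Python 3.8+) for the Euclidean distance.

    import math

    def knn_classify(training_set, new_point, k=5):
        # For each point in the training set, calculate the distance to the
        # tested point and keep the k closest training points
        neighbours = sorted(training_set, key=lambda pair: math.dist(pair[0], new_point))[:k]
        # Majority vote over the k labels, so a single noisy neighbour
        # cannot decide the outcome on its own
        labels = [label for _, label in neighbours]
        return max(set(labels), key=labels.count)

If most of the five closest neighbours are labelled Y, this returns 'Y' even when the single closest point happens to be an 'X', which is the noise-handling behaviour described on the IN ACTION slide.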
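
Finally, a small sketch of the normalization mentioned on the PERFORMANCE slide: rescaling each attribute column to a standard score (mean 0, standard deviation 1) before computing distances. This is a generic illustration of the standard score, not the exact preprocessing used to produce the reported numbers.

    import math

    def standardize(column):
        # Rescale one attribute column to mean 0 and standard deviation 1,
        # so attributes with large values no longer dominate the distance
        # (assumes the column is not constant)
        mean = sum(column) / len(column)
        std = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
        return [(x - mean) / std for x in column]

Apply it column by column, e.g. [standardize(col) for col in zip(*records)], and then run the same nearest-neighbour code on the rescaled data.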