(Really) naive data mining

Joël Cox
September 21, 2012

Presented at Python Users Nederland (PUN)

Transcript

  1. BEFORE WE BEGIN • I’m not a data mining expert

    • I’ll be showing extremely easy algorithms There is no learning aspect to these algorithms, that is, they don’t form a model of the data. You can implement these on the train home.
  2. K-MEANS select k centroids while centroids move: assign each point

    to closest centroid recalculate centroids end while Distance is defined by a function such as Euclidean or cosine distance (see the Python sketch after the transcript).
  3. We pick k = 3 here. This specific instance needs

    5 iterations to converge (i.e. the centroids no longer move). You will see that the green cluster slowly but surely reclaims its group of points.
  4. CLASSIFICATION • Find the label for a ‘thing’ • If

    it looks like a duck, swims like a duck and quacks like a duck, then it probably is a duck. The important thing to note here is that you need a training set that is already labelled. Example problem: determine the origin of a wine from its chemical composition.
  5. K-NEAREST NEIGHBOR for point in training set: calculate distance to

    tested point end for determine k closest training points (see the Python sketch after the transcript)
  6. IN ACTION X X X X Y Y Y Y

    k = 5 We’re trying to classify the point in the center. This algorithm handles noise pretty well; an algorithm that only takes the closest point into account would have classified this point as X, but this algorithm looks for the most frequently occurring label and thus classifies the point as Y.
  7. PERFORMANCE

                                      k = 5    k = 9
    Standard                          73.4%    72.9%
    Normalized                        96%      96.6%
    Normalized + cosine distance *    95.7%    95.7%
    * Cross validation for k = 3, 177 training records, 354 testing records

    If you know the domain and have a bit of mathematical knowledge you can easily tweak these simple algorithms to perform better in your domain. In this example we spotted a lot of variance in the attribute data, so we normalized by making sure the standard deviation for each column is 1 (http://en.wikipedia.org/wiki/Standard_score). A Python sketch of this normalization follows the transcript.
  8. SO DATA MINING IS EASY, RIGHT? Sample bias, Supervised learning,

    Incomplete data, Decision trees, Time complexity, Space complexity, Overfitting, Underfitting, Association rule discovery, Preprocessing, Regression, Anomaly detection, Visualization, Naive Bayes, Neural networks, Support vector machines. I only covered a small part; there is a lot to learn to become a data mining expert.
  9. TAKE AWAY Venture outside your own field! This is especially

    true for web devs. Tackle new problems, then apply this newfound knowledge to your own field.
  10. LEARN MORE • Orange http://orange.biolab.si/ • “Introduction to Data Mining”

    by Tan, Steinbach and Kumar • UC Irvine Machine Learning Repository http://archive.ics.uci.edu/ml/ • github.com/joelcox/miner Orange is a GUI application and Python package. The book is probably one of the better academic books I’ve seen, with lots of examples. You can use the UCI repository for testing data (but preferably your own data!); it also includes links to relevant papers. Miner is a toy package that implements both algorithms covered in this talk (not too pretty).
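
A minimal k-means sketch in plain Python, as referenced from the K-MEANS slide. It is an illustration under my own assumptions (points are tuples of numbers, Euclidean distance, random initial centroids), not the code used in the talk or in the miner package.

    import math
    import random

    def euclidean(a, b):
        # Straight-line distance between two points of the same dimension
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def kmeans(points, k, max_iterations=100):
        # Select k centroids (here simply k random points from the data)
        centroids = random.sample(points, k)
        for _ in range(max_iterations):
            # Assign each point to its closest centroid
            clusters = [[] for _ in range(k)]
            for point in points:
                index = min(range(k), key=lambda i: euclidean(point, centroids[i]))
                clusters[index].append(point)
            # Recalculate each centroid as the mean of its cluster
            new_centroids = []
            for i, cluster in enumerate(clusters):
                if cluster:
                    new_centroids.append(tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
                else:
                    # Keep an empty cluster's centroid where it is
                    new_centroids.append(centroids[i])
            # Stop once the centroids no longer move
            if new_centroids == centroids:
                break
            centroids = new_centroids
        return centroids, clusters

For the clustering example in the talk you would call kmeans(points, 3) and watch the centroids settle after a handful of iterations.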
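
A matching k-nearest neighbour sketch for the K-NEAREST NEIGHBOR and IN ACTION slides, again an illustrative plain-Python version rather than the talk's actual code. It assumes the training set is a list of (point, label) pairs and uses math.dist (Python 3.8+) for the Euclidean distance.

    import math

    def knn_classify(training_set, new_point, k=5):
        # For each point in the training set, calculate the distance to the
        # tested point and keep the k closest training points
        neighbours = sorted(training_set, key=lambda pair: math.dist(pair[0], new_point))[:k]
        # Majority vote over the k labels, so a single noisy neighbour
        # cannot decide the outcome on its own
        labels = [label for _, label in neighbours]
        return max(set(labels), key=labels.count)

If most of the five closest neighbours are labelled Y, this returns 'Y' even when the single closest point happens to be an 'X', which is the noise-handling behaviour described on the IN ACTION slide.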
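
Finally, a small sketch of the normalization mentioned on the PERFORMANCE slide: rescaling each attribute column to a standard score (mean 0, standard deviation 1) before computing distances. This is a generic illustration of the standard score, not the exact preprocessing used to produce the reported numbers.

    import math

    def standardize(column):
        # Rescale one attribute column to mean 0 and standard deviation 1,
        # so attributes with large values no longer dominate the distance
        # (assumes the column is not constant)
        mean = sum(column) / len(column)
        std = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
        return [(x - mean) / std for x in column]

Apply it column by column, e.g. [standardize(col) for col in zip(*records)], and then run the same nearest-neighbour code on the rescaled data.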