Slide 1

Slide 1 text

(REALLY) NAIVE DATA MINING By Joël Cox joelcox.nl @joelcox

Slide 2

Slide 2 text

WHAT TO EXPECT? • Intro • Cluster algorithm • Classification algorithm • Conclusion

Slide 3

Slide 3 text

A BIT ABOUT ME Information Science undergrad by day, freelancer/tinkerer by night

Slide 4

Slide 4 text

BEFORE WE BEGIN • I’m not a data mining expert • I’ll be showing extremely easy algorithms There is no learning aspect to these algorithms, that is, they don’t build a model of the data. You can implement them on the train home.

Slide 5

Slide 5 text

CLUSTERING Programmatically divide points into groups. Example problem: music taste.

Slide 6

Slide 6 text

K-MEANS

select k centroids
while centroids move:
    assign each point to closest centroid
    recalculate centroids
end while

Distance is defined by a function like Euclidean or cosine distance.
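The loop above can be sketched in a few lines of Python. This is a minimal sketch using plain lists of tuples and Euclidean distance; the `kmeans` and `euclidean` helpers are illustrative, not the versions from the miner package:

```python
import random

def euclidean(a, b):
    """Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def kmeans(points, k):
    """Naive k-means: iterate until the centroids stop moving."""
    centroids = random.sample(points, k)
    while True:
        # Assign each point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            closest = min(range(k), key=lambda i: euclidean(p, centroids[i]))
            clusters[closest].append(p)
        # Recalculate each centroid as the mean of its cluster
        # (keep the old centroid if a cluster ends up empty).
        new_centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # the centroids no longer move
            return clusters, centroids
        centroids = new_centroids
```

Note that the result depends on the randomly chosen starting centroids, which is why the slides pick k and watch the iterations converge.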

Slide 7

Slide 7 text

We pick k = 3 here. This specific instance needs 5 iterations to converge (i.e. the centroids no longer move). You will see that the green cluster slowly but surely reclaims its group of points.

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

CLASSIFICATION • Find the label for a ‘thing’ • If it looks like a duck, swims like a duck and quacks like a duck, then it probably is a duck. The important thing to note here is that you need a training set that is already labelled. Example problem: determine the origin of a wine from its chemical composition.

Slide 13

Slide 13 text

K-NEAREST NEIGHBOR

for point in training set:
    calculate distance to tested point
end for
determine the k closest training points
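The pseudocode above fits in one short Python function. This is a minimal sketch, assuming the training set is a list of `(features, label)` pairs; the `knn` helper is illustrative, not the miner package's implementation:

```python
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def knn(training, point, k):
    """Classify `point` by a majority vote among its k nearest labelled neighbours."""
    # Sort the training set by distance to the tested point.
    neighbours = sorted(training, key=lambda pair: euclidean(pair[0], point))
    # Take the k closest and return the most frequently occurring label.
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]
```

With k = 1 this degenerates into "copy the label of the single closest point", which is exactly the noise-sensitive behaviour the next slide contrasts against.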

Slide 14

Slide 14 text

IN ACTION X X X X Y Y Y Y k = 5 We’re trying to classify the point in the center. This algorithm handles noise pretty well; other algorithms that only take the single closest point into account would have classified this point as X, but this algorithm looks for the most frequently occurring label among the k neighbours and thus classifies the point as Y.

Slide 15

Slide 15 text

PERFORMANCE

                               k = 5   k = 9
Standard                       73.4%   72.9%
Normalized                     96%     96.6%
Normalized + cosine distance   95.7%   95.7%

* Cross-validation for k = 3, 177 training records, 354 testing records

If you know the domain and have a bit of mathematical knowledge you can easily tweak these simple algorithms to perform better in your domain. In this example we spotted a lot of variance in the attribute data, so we normalized by making sure the standard deviation of each column is 1 (http://en.wikipedia.org/wiki/Standard_score).
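The normalization step mentioned above (standard score, or z-score) is a few lines of Python. This is a minimal sketch; the `normalize` helper is a hypothetical name for illustration, not code from the talk:

```python
def normalize(columns):
    """Z-score normalization: rescale each column to mean 0 and standard deviation 1."""
    normalized = []
    for col in columns:
        mean = sum(col) / len(col)
        std = (sum((x - mean) ** 2 for x in col) / len(col)) ** 0.5
        normalized.append([(x - mean) / std for x in col])
    return normalized
```

After this step, a column measured in thousands no longer dominates the Euclidean distance over a column measured in fractions, which is where the jump from 73.4% to 96% comes from.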

Slide 16

Slide 16 text

SO DATA MINING IS EASY, RIGHT? Sample bias Supervised learning Incomplete data Decision trees Time complexity Space complexity Overfitting Underfitting Association rule discovery Preprocessing Regression Anomaly detection Visualization Naive Bayes Neural networks Support vector machines I only covered a small part. There are lots of things to learn to become a data mining expert.

Slide 17

Slide 17 text

TAKE AWAY Venture outside your own field! This is especially true for web devs. You’ll find new problems, and you can apply this newfound knowledge back to your own field.

Slide 18

Slide 18 text

LEARN MORE • Orange http://orange.biolab.si/ • “Introduction to Data Mining” by Tan, Steinbach and Kumar • UC Irvine Machine Learning Repository http://archive.ics.uci.edu/ml/ • github.com/joelcox/miner Orange is a GUI application and Python package. The book is probably one of the better academic books I’ve seen, with lots of examples. You can use the UC Irvine repository for testing data (but preferably use your own data!); it also includes links to relevant papers. Miner is a toy package that implements both algorithms covered in this talk (not too pretty).

Slide 19

Slide 19 text

THANKS FOR LISTENING! Questions?