ISI Programming Course - 06 - Scikit Learn

Jungwon Seo

October 22, 2018

  1. Supervised vs. Unsupervised

     Supervised Learning       | Unsupervised Learning
     Known number of classes   | Unknown number of classes
     Uses training dataset     | Uses input dataset
     For prediction            | For analysis
  2. Supervised Learning

     [Diagram: training feeds {features, class} pairs from the training data
     into the Model; testing feeds {features} from the testing data into the
     trained Model, which outputs the result: Class A? Class B? Class C?]
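     A minimal sketch of this training/testing flow in scikit-learn, assuming
     the bundled Iris data and a logistic regression as stand-ins (neither is
     named on the slide):

       from sklearn.datasets import load_iris
       from sklearn.linear_model import LogisticRegression
       from sklearn.model_selection import train_test_split

       X, y = load_iris(return_X_y=True)    # {features, class} pairs

       # Split into training data and testing data.
       X_train, X_test, y_train, y_test = train_test_split(
           X, y, test_size=0.3, random_state=0)

       model = LogisticRegression(max_iter=1000)
       model.fit(X_train, y_train)          # Training: learn from {features, class}
       result = model.predict(X_test)       # Testing: {features} -> predicted class
       print(model.score(X_test, y_test))   # Result: accuracy on unseen data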
  3. Classifier vs. Regressor

     [Diagram: from the same student info, a trained classifier outputs
     Pass / Fail, while a trained regressor outputs a score (e.g. 95/100).]
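     A hedged sketch of the contrast, on invented "student info" features
     (study hours, attendance rate); all numbers here are made up for
     illustration:

       import numpy as np
       from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

       X = np.array([[2, 0.5], [10, 0.9], [4, 0.6], [12, 0.95]])  # hours, attendance
       pass_fail = np.array([0, 1, 0, 1])   # discrete class label
       score = np.array([40, 92, 55, 97])   # continuous score out of 100

       clf = DecisionTreeClassifier().fit(X, pass_fail)   # classifier
       reg = DecisionTreeRegressor().fit(X, score)        # regressor

       print(clf.predict([[8, 0.8]]))   # a label, e.g. 1 (Pass)
       print(reg.predict([[8, 0.8]]))   # a number, e.g. a score near 92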
  4. Classification algorithms

     • Linear Classifiers: Logistic Regression, Naive Bayes Classifier
     • Decision Trees
     • Random Forest
     • Support Vector Machines
     • Neural Networks
     • Nearest Neighbor
     • Genetic Algorithm
     • …
  5. Logistic Regression

     • We train the model to find the weights (B).
     • The goal is to find the function that divides the data best (see the
       sketch below).
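     A minimal sketch, assuming the bundled breast-cancer dataset (our choice,
     not the slide's): after fitting, the learned weights B are exposed as
     coef_ and intercept_:

       from sklearn.datasets import load_breast_cancer
       from sklearn.linear_model import LogisticRegression

       X, y = load_breast_cancer(return_X_y=True)
       model = LogisticRegression(max_iter=5000).fit(X, y)  # training finds B

       print(model.coef_)        # learned weights B1..Bn, one per feature
       print(model.intercept_)   # bias term B0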
  6. Decision Tree

     [Flowchart: the joke "Mansplaining Decision Tree", whose split questions
     ("Did she ask?", "Do you know better than her?", "Did you ask if she
     needed it explained?") lead down Yes/No branches to Mansplaining or
     Not mansplaining leaves.]
  7. Decision Tree

     • The most important part is determining the split node, e.g. Gender?
       F : M, or Height? over 170 : under 170.
     • There are several indices you can use, e.g. entropy, Gini,
       misclassification error.
     • The training data is used to build the tree (see the sketch below).
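     A sketch of this in scikit-learn: the split index named on the slide is
     the criterion parameter ("gini" or "entropy"); the dataset is our choice:

       from sklearn.datasets import load_iris
       from sklearn.tree import DecisionTreeClassifier, export_text

       X, y = load_iris(return_X_y=True)
       tree = DecisionTreeClassifier(criterion="entropy", max_depth=3)
       tree.fit(X, y)            # the training data builds the tree

       print(export_text(tree))  # inspect the split nodes it chose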
  8. Random Forest

     • The randomness comes in while building each tree.
     • It is an improved version of the decision tree.
     • But you cannot explain its decision process (it is a black box); a
       sketch follows.
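     A sketch showing where the randomness enters: bootstrap row sampling per
     tree and a random feature subset at each split (the parameter names are
     real scikit-learn ones; the dataset is our choice):

       from sklearn.datasets import load_iris
       from sklearn.ensemble import RandomForestClassifier

       X, y = load_iris(return_X_y=True)
       forest = RandomForestClassifier(
           n_estimators=100,     # number of trees
           bootstrap=True,       # random row sample for each tree
           max_features="sqrt",  # random feature subset at each split
           random_state=0,
       ).fit(X, y)

       # feature_importances_ gives a partial peek into the black box.
       print(forest.feature_importances_)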
  9. Support Vector Machine

     1. Find the border with the largest margin.
     2. For the non-linear case, map inputs into high-dimensional feature
        spaces (see the sketch below).
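     A sketch of both points, with make_moons as our stand-in dataset: a
     linear max-margin border, and the kernel trick for the non-linear case:

       from sklearn.datasets import make_moons
       from sklearn.svm import SVC

       X, y = make_moons(noise=0.2, random_state=0)

       linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)    # max-margin border
       rbf_svm = SVC(kernel="rbf", gamma="scale").fit(X, y)  # implicit high-dim mapping

       print(linear_svm.score(X, y))  # struggles: moons are not linearly separable
       print(rbf_svm.score(X, y))     # the RBF kernel handles the non-linearity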
  10. SVM vs. Logistic Reg.

     • Logistic regression focuses on maximizing the probability of the data.
       The farther the data lies from the separating hyperplane (on the
       correct side), the happier LR is.
     • An SVM tries to find the separating hyperplane that maximizes the
       distance from the closest points (the support vectors) to the
       hyperplane. If a point is not a support vector, it doesn't really
       matter.
     http://www.cs.toronto.edu/~kswersky/wp-content/uploads/svm_vs_lr.pdf
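     A small sketch of the contrast on invented blob data: LR assigns a
     probability to every point, while the fitted SVM keeps only the support
     vectors that define its margin:

       from sklearn.datasets import make_blobs
       from sklearn.linear_model import LogisticRegression
       from sklearn.svm import SVC

       X, y = make_blobs(n_samples=100, centers=2, random_state=0)

       lr = LogisticRegression().fit(X, y)
       svm = SVC(kernel="linear").fit(X, y)

       print(lr.predict_proba(X[:3]))       # every point gets a probability
       print(len(svm.support_vectors_), "of", len(X),
             "points are support vectors")  # the rest don't matter to the SVM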
  11. K-Nearest Neighbor

     1. How do we determine K?
     2. What kind of distance shall we use? Euclidean? Manhattan? (Both
        questions map to constructor arguments in the sketch below.)
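     A minimal sketch with our own dataset choice:

       from sklearn.datasets import load_iris
       from sklearn.neighbors import KNeighborsClassifier

       X, y = load_iris(return_X_y=True)

       # K is n_neighbors; the distance is the metric parameter.
       knn_euclid = KNeighborsClassifier(n_neighbors=5, metric="euclidean").fit(X, y)
       knn_manhat = KNeighborsClassifier(n_neighbors=5, metric="manhattan").fit(X, y)

       print(knn_euclid.score(X, y))
       print(knn_manhat.score(X, y))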
  12. 2. Calculating similarities

     • Minkowski Distance
     • Manhattan Distance
     • Euclidean Distance
     • Chebyshev Distance
     • Cosine Distance
     Source Code
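     A sketch of the listed distances on two toy vectors, using SciPy (the
     vectors are invented here); note that Minkowski with p=1 is Manhattan and
     with p=2 is Euclidean:

       from scipy.spatial import distance

       a, b = [1, 2, 3], [4, 0, 3]

       print(distance.minkowski(a, b, p=3))  # Minkowski (order 3)
       print(distance.cityblock(a, b))       # Manhattan
       print(distance.euclidean(a, b))       # Euclidean
       print(distance.chebyshev(a, b))       # Chebyshev
       print(distance.cosine(a, b))          # Cosine distance = 1 - cosine similarity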