Machine Learning 101 - Tryolabs. August 2016

"AI is the new electricity" 3

Machine Learning 101 An introduction for developers 5

Agenda 1. Python & its benefits 2. Types of Machine
Learning problems 3. Steps needed to solve them 4. How two classification algorithms work a. K-Nearest Neighbors b. Support Vector Machines 5. Evaluating an algorithm a. Overfitting 6. Demo 7. What follows 6

Why Python? • Simple & powerful • Fast prototyping ◦
Batteries included • Lots of libraries ◦ For everything, not just Machine Learning ◦ Bindings to integrate with other languages • Community ◦ Very active scientific community ◦ Used in academia & industry 7

Data Science “Swiss army knife” Data preparation / exploration /
visualization • Pandas • Matplotlib • Seaborn Modeling / Machine Learning • Scikit-learn • Pylearn2 Text focused • Gensim • nltk Deep learning • Keras • TensorFlow, Theano 8

Types of ML problems Supervised • Learn through examples of
which we know the desired output. • Two types: ◦ Classification (discrete variables, ie. spam/no spam) ◦ Regression (continuous output, ie. temperature) 9 Unsupervised • Discover latent relationships in the data • Two types: ◦ Dimensionality reduction (curse of dimensionality) ◦ Clustering Also have semi-supervised (use labeled and unlabeled data).

Steps in every Machine Learning problem To solve any ML
task, need to go through: 1. Data gathering 2. Data processing & feature engineering 3. Algorithm & training 4. Applying model to make predictions (evaluate, improve) 10

Data gathering • Depends on the specific task • Might
need to put humans to work :( ◦ Ie. Manual labelling for supervised learning ◦ Domain knowledge. Maybe even experts. ◦ Can leverage Mechanical Turk / CrowdFlower. • May come for free, or “sort of” ◦ Ie. Machine Translation. ◦ Categorized articles in Amazon, etc. • The more the better ◦ Some algorithms need large amounts of data to be useful (ie. neural networks). 11

Data processing Is there anything wrong with the data? •
Missing values • Outliers • Bad encoding (for text) • Wrongly-labeled examples • Biased data ◦ Do I have many more samples of one class than the rest? Need to fix/remove data? 12

Feature engineering What is a feature? A feature is an
individual measurable property of a phenomenon being observed Our inputs are represented by a set of features. Eg: • To classify spam email, features could be: ◦ Number of times some word appears (ie. pharmacy) ◦ Number of words that have been ch4ng3d like this. ◦ Language of the email ◦ Number of emojis 13 Buy ch34p drugs from the ph4rm4cy now :) :) :) :) (1, 2, English, 4) Feature engineering

Feature engineering (2) • Extract more information from existing data
• Not adding “new” data per-se ◦ Making it more useful ◦ With good features, most algorithms can learn faster • It can be an art ◦ Requires thought and knowledge of the data Two steps: • Variable transformation (eg. dates into weekdays, normalizing) • Feature creation (eg. n-grams for texts, if word is capitalized to detect names, etc) 14

Algorithm & training Supervised • Linear classifier • Naive Bayes
• Support Vector Machines (SVM) • Decision Tree • Random Forests • k-Nearest Neighbors • Neural Networks (Deep learning) Unsupervised / dimensionality reduction • PCA • t-SNE • k-means • DBSCAN They all understand vectors of numbers. Data are points in multi-dimensional space. 15

k-Nearest Neighbors (k-NN) Classification or regression. Idea (classification): • Choose
a natural number k >= 1 (parameter of the algorithm) • Given the sample X we want to label: ◦ Calculate some distance (ie. euclidean) to all the samples in the training set. ◦ Keep the k samples with the shortest distance to X. These are the nearest neighbors. ◦ Assign to X whatever class the majority of the neighbors have. 17

18 K = 1 K = 5

k-Nearest Neighbors (k-NN) (2) • Fast computation of nearest neighbors
is an active area of research. • Naive brute-force approach computes distance to all points, need to keep entire dataset in memory at classification time (no offline training). • Need to experiment to get the right k. There are other algorithms that use approximations to deal with these inefficiencies. 19

Support Vector Machines (SVM) Idea: • Separate data linearly in
space. • There are infinite planes that can separate blue & red dots. • SVM finds the optimal hyperplane to separate the two classes (the one that leaves the most margin). 21

Support Vector Machines (SVM) (2) • SVMs focus only on
the points that are the most difficult to tell apart (other classifiers pay attention to all the points). • We call them the support vectors. • Decision boundary doesn’t change if we add more samples that are outside the margin. • Can achieve good accuracy with fewer training samples compared to other algorithms. Only works if data is linearly separable. • If not, can use a kernel to transform it to a higher dimension. 22

Support Vector Machines: Kernel Trick 23

Evaluating • Split train / test set. Should not overlap!
• Accuracy ◦ What % of samples did it get right? • Precision / Recall ◦ True Positives, True Negatives, False Positives, False Negatives ◦ Precision = TP / (TP + FP) (out of all the classifier labeled positive, % that actually was) ◦ Recall = TP / (TP + FN) (out of all the positive, how many did it get right?) ◦ F-measure (harmonic mean, 2 * Precision * Recall / (Precision + Recall)) • Confusion matrix • Many others 24

Evaluating a spam classifier: example 25 Id Is spam Predicted
spam 1 Yes No 2 Yes Yes 3 No Yes 4 Yes Yes 5 No No 6 Yes No 7 No No accuracy = 4/7 ~ 0.57 precision = 2/(2 + 1) = ⅔ ~ 0.667 recall = 2/(2 + 2) = ½ ~ 0.5 F-measure = 2*(2/3 * 1/2)/(2/3 + 1/2) = 4/7 ~ 0.57 Confusion matrix Predicted spam Predicted not spam Spam 2 (TP) 2 (FN) Not spam 1 (FP) 2 (TN)

Overfitting • Models that adjust very well (“too well”) to
the training data • It does not generalize to new data. This is a problem! 26

Preventing overfitting • Detect outliers in the data • Simple
is better than complex ◦ Fewer parameters to tune can bring better performance. ◦ Eliminate degrees of freedom. Eg. polynomial ◦ Regularization (penalize complexity) • K-fold cross validation (different train/test partitions) • Get more training data! 27

Demo: scikit-learn 28

What follows There is A LOT to learn! Many active
research areas. • Dealing with natural language ◦ Language understanding ◦ Question answering ◦ Automatic summarization • Feature/Representation learning ◦ Embeddings (ie. word2vec) • Neural networks ◦ Deep learning ◦ Cool applications to signals (image, videos, sounds) ◦ Generation of image captions, etc. 29

Stay tuned. Thank you :) 30

Machine Learning 101 - Tryolabs. August 2016

Machine Learning 101 - Tryolabs. August 2016

Tryolabs

More Decks by Tryolabs

Other Decks in Technology

Featured

Transcript

1

2

"AI is the new electricity" 3

4

Machine Learning 101 An introduction for developers 5

Agenda 1. Python & its benefits 2. Types of Machine

Why Python? • Simple & powerful • Fast prototyping ◦

Data Science “Swiss army knife” Data preparation / exploration /

Types of ML problems Supervised • Learn through examples of

Steps in every Machine Learning problem To solve any ML

Data gathering • Depends on the specific task • Might

Data processing Is there anything wrong with the data? •

Feature engineering What is a feature? A feature is an

Feature engineering (2) • Extract more information from existing data

Algorithm & training Supervised • Linear classifier • Naive Bayes

16

k-Nearest Neighbors (k-NN) Classification or regression. Idea (classification): • Choose

18 K = 1 K = 5

k-Nearest Neighbors (k-NN) (2) • Fast computation of nearest neighbors

20

Support Vector Machines (SVM) Idea: • Separate data linearly in

Support Vector Machines (SVM) (2) • SVMs focus only on

Support Vector Machines: Kernel Trick 23

Evaluating • Split train / test set. Should not overlap!

Evaluating a spam classifier: example 25 Id Is spam Predicted

Overfitting • Models that adjust very well (“too well”) to

Preventing overfitting • Detect outliers in the data • Simple

Demo: scikit-learn 28

What follows There is A LOT to learn! Many active

Stay tuned. Thank you :) 30