Machine Learning 101 - Tryolabs

#ML101Tryolabs 1

Agenda
1. What is Machine Learning
2. In practice: Python & its benefits
3. Types of Machine Learning problems
4. Steps needed to solve them
5. How two classification algorithms work
   a. K-Nearest Neighbors
   b. Support Vector Machines
6. Evaluating an algorithm
   a. Overfitting
7. Demo
8. (Very) basic intro to Deep Learning & what follows

"Software is eating the world"

"AI is the new electricity"

What is Machine Learning?
The subfield of computer science that "gives computers the ability to learn without being explicitly programmed" (Arthur Samuel, 1959).
"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E" (Tom Mitchell, 1997)
• Machine Learning vs AI?

AI with and without ML
• Without ML: Deep Blue (IBM), 1997.
• With ML: AlphaGo (DeepMind), 2016.

The “semantic” landscape
[Diagram: Artificial Intelligence, Data Science, Machine Learning, Deep Learning]

Some Python code

Why Python?
• Simple & powerful
• Fast prototyping
  ◦ Batteries included
• Lots of libraries
  ◦ For everything, not just Machine Learning
  ◦ Bindings to integrate with other languages
• Community
  ◦ Very active scientific community
  ◦ Used in academia & industry

Broad types of ML problems
• Supervised
• Unsupervised
• Reinforcement

Supervised learning (1)
Learn through examples for which we know the desired output (what we want to predict).
• I want to know if the emails I receive are spam or not.
• I want to see if the reviews of the movies are positive, negative or neutral.
• I want to predict the market value of houses, given the square meters, number of rooms, neighborhood, etc.

Supervised learning (2)
• Classification: output is a discrete variable (e.g. spam/no spam).
• Regression: output is continuous (e.g. price, temperature).

Unsupervised learning (1)
There is no desired output. Learn something about the data. Latent relationships.
• I have photos and want to put them in 20 groups.
• I want to find anomalies in the credit card usage patterns of my customers.

Unsupervised learning (2)
Useful for learning structure in the data (clustering), finding hidden correlations, reducing dimensionality, etc.
There is also semi-supervised learning, which combines a small amount of labeled data with a larger amount of unlabeled data.

Reinforcement learning
An agent interacts with an environment and watches the result of the interaction.
The environment gives feedback via a positive or negative reward signal.
Useful in games, robotics, etc.

Steps in every Machine Learning problem
To solve any ML task, you need to go through:
1. Data gathering
2. Data preprocessing & feature engineering
3. Algorithm & training
4. Applying the model to make predictions (evaluate, improve)

Data gathering
• Depends on the specific task
• Might need to put humans to work :(
  ◦ E.g. manual labelling for supervised learning
  ◦ Domain knowledge. Maybe even experts.
  ◦ Can leverage Mechanical Turk and others.
• May come for free, or “sort of”
  ◦ E.g. Machine Translation.
  ◦ Categorized articles in Amazon, etc.
• The more the better
  ◦ Some algorithms need large amounts of data to be useful (e.g. neural networks).

Data preprocessing
Is there anything wrong with the data?
• Missing values
• Outliers
• Bad encoding (for text)
• Wrongly-labeled examples
• Biased data
  ◦ Do I have many more samples of one class than the rest?
Need to fix/remove data?
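Two of these checks, missing values and class imbalance, can be sketched in plain Python. The `audit` helper and the toy rows are hypothetical, and storing a missing value as `None` is an assumption about how the data was loaded.

```python
from collections import Counter

def audit(rows, label_key="label"):
    """Count missing values per field and samples per class.

    Assumes each row is a dict and a missing value is stored as None.
    """
    missing = Counter()
    classes = Counter()
    for row in rows:
        for field, value in row.items():
            if value is None:
                missing[field] += 1
        classes[row[label_key]] += 1
    return dict(missing), dict(classes)

# Hypothetical toy dataset: one missing value, more "ham" than "spam".
rows = [
    {"length": 120, "links": 0, "label": "ham"},
    {"length": None, "links": 1, "label": "ham"},
    {"length": 300, "links": 7, "label": "spam"},
    {"length": 80, "links": 0, "label": "ham"},
]
missing, classes = audit(rows)
print(missing)  # {'length': 1}
print(classes)  # {'ham': 3, 'spam': 1}
```

Whether a flagged row should be fixed or removed still depends on the task; the counts only show where to look.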

Feature engineering
What is a feature?
A feature is an individual measurable property of a phenomenon being observed.
Our inputs are represented by a set of features. E.g.:
• To classify spam email, features could be:
  ◦ Number of times some word appears (e.g. pharmacy)
  ◦ Number of words that have been ch4ng3d like this
  ◦ Language of the email (0=English, 1=Spanish, …)
  ◦ Number of emojis
Example: "Buy ch34p drugs from the ph4rm4cy now :) :) :) :)" → feature engineering → (1, 2, 0, 4)
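The example vector (1, 2, 0, 4) can be reproduced with a small extractor. This is only a sketch: the `spam_features` function and the `LEET` substitution table are hypothetical, the language code is passed in rather than detected, and counting "pharmacy" once assumes the digit substitutions are undone before counting.

```python
import re

# Assumed leetspeak substitutions; a real filter would need a larger table.
LEET = str.maketrans({"4": "a", "3": "e", "1": "i", "0": "o"})

def spam_features(text, language_code=0):
    """Return (pharmacy count, changed words, language, emoticons).

    language_code is supplied by the caller (0=English, 1=Spanish),
    since language detection is out of scope for this sketch.
    """
    words = re.findall(r"[A-Za-z0-9]+", text)
    # A word counts as "changed" if it mixes letters with digits.
    changed = [w for w in words
               if re.search(r"[A-Za-z]", w) and re.search(r"\d", w)]
    normalized = [w.translate(LEET).lower() for w in words]
    pharmacy = normalized.count("pharmacy")
    emoticons = text.count(":)")
    return (pharmacy, len(changed), language_code, emoticons)

features = spam_features("Buy ch34p drugs from the ph4rm4cy now :) :) :) :)")
print(features)  # (1, 2, 0, 4)
```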

Feature engineering (2)
• Extract more information from existing data
• Not adding “new” data per se
  ◦ Making it more useful
  ◦ With good features, most algorithms can learn faster
• It can be an art
  ◦ Requires thought and knowledge of the data
Two steps:
• Variable transformation (e.g. dates into weekdays, normalizing)
• Feature creation (e.g. n-grams for texts, whether a word is capitalized to detect names, etc.)
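The two steps can be illustrated with three small helpers. The function names are hypothetical, and min-max scaling is just one common way to normalize; the slide does not say which one is meant.

```python
from datetime import date

def weekday_feature(d):
    """Variable transformation: map a date to its weekday (0=Monday ... 6=Sunday)."""
    return d.weekday()

def min_max_normalize(values):
    """Variable transformation: rescale values linearly to the range [0, 1].

    Assumes the values are not all equal (otherwise this divides by zero).
    """
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def word_ngrams(text, n=2):
    """Feature creation: list the n-grams (runs of n consecutive words) in a text."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

print(weekday_feature(date(2024, 1, 1)))   # 0 (a Monday)
print(min_max_normalize([50, 100, 150]))   # [0.0, 0.5, 1.0]
print(word_ngrams("buy cheap drugs now"))
# [('buy', 'cheap'), ('cheap', 'drugs'), ('drugs', 'now')]
```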

Algorithm & training
Supervised
• Linear classifier
• Naive Bayes
• Support Vector Machines (SVM)
• Decision Tree
• Random Forests
• k-Nearest Neighbors
• Neural Networks (Deep learning)
Unsupervised / dimensionality reduction
• PCA
• t-SNE
• k-means
• DBSCAN
They all understand vectors of numbers. Data are points in multi-dimensional space.

k-Nearest Neighbors (k-NN)
Classification or regression.
Idea (classification):
• Choose a natural number k >= 1 (parameter of the algorithm)
• Given the sample X we want to label:
  ◦ Calculate some distance (e.g. Euclidean) to all the samples in the training set.
  ◦ Keep the k samples with the shortest distance to X. These are the nearest neighbors.
  ◦ Assign to X whatever class the majority of the neighbors have.
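The classification steps above can be sketched as a brute-force function in plain Python. The `knn_classify` name and the toy training set are hypothetical; ties in the vote are broken in favor of the label that appears first among the nearest neighbors, which is a choice of this sketch, not part of the algorithm description.

```python
import math
from collections import Counter

def knn_classify(train, x, k=3):
    """Label x by majority vote among its k nearest training samples.

    train is a list of (point, label) pairs; distance is Euclidean.
    """
    # Distance to every training sample, closest first.
    by_distance = sorted(train, key=lambda sample: math.dist(sample[0], x))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical 2-D training set with two classes.
train = [
    ((1.0, 1.0), "blue"), ((1.5, 2.0), "blue"), ((2.0, 1.5), "blue"),
    ((6.0, 6.0), "red"), ((6.5, 7.0), "red"), ((7.0, 6.5), "red"),
]
print(knn_classify(train, (2.0, 2.0), k=3))  # blue
print(knn_classify(train, (6.0, 7.0), k=3))  # red
```

Note that there is no training step: every call scans the whole training set.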

[Figure: K = 1 vs. K = 5]

k-Nearest Neighbors (k-NN) (2)
• Fast computation of nearest neighbors is an active area of research.
• The naive brute-force approach computes the distance to all points and needs to keep the entire dataset in memory at classification time (there is no offline training).
• Need to experiment to get the right k.
There are other algorithms that use approximations to deal with these inefficiencies.

Support Vector Machines (SVM)
Idea:
• Separate data linearly in space.
• There are infinitely many planes that can separate blue & red dots.
• SVM finds the optimal hyperplane to separate the two classes (the one that leaves the most margin).
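To make "the most margin" concrete, the sketch below measures the margin that a given separating hyperplane leaves, so two candidates can be compared. It does not search for the optimal hyperplane, which is what an SVM solver does; the `margin` helper and the data points are hypothetical.

```python
import math

def margin(w, b, points):
    """Smallest distance from any labeled point to the hyperplane w.x + b = 0.

    points is a list of (x, y) pairs with y in {-1, +1}. Returns None if the
    hyperplane puts some point on the wrong side (it does not separate the classes).
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    distances = []
    for x, y in points:
        signed = y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        if signed <= 0:
            return None
        distances.append(signed)
    return min(distances)

# Hypothetical data: blue (-1) on the left, red (+1) on the right.
points = [((1.0, 1.0), -1), ((2.0, 3.0), -1), ((6.0, 2.0), +1), ((7.0, 4.0), +1)]

# Two candidate separating lines: x = 4 and x = 2.5.
print(margin((1.0, 0.0), -4.0, points))  # 2.0
print(margin((1.0, 0.0), -2.5, points))  # 0.5
```

Both lines separate the classes, but the first leaves four times the margin, so it is the better of these two candidates.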

Support Vector Machines (SVM) (2)
• SVMs focus only on the points that are the most difficult to tell apart (other classifiers pay attention to all the points).
• We call them the support vectors.
• The decision boundary doesn’t change if we add more samples that are outside the margin.
• Can achieve good accuracy with fewer training samples compared to other algorithms.
A linear SVM only works if the data is linearly separable.
• If not, can use a kernel to transform it to a higher dimension.

Support Vector Machines: Kernel Trick
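The idea of moving to a higher dimension can be shown with an explicit feature map on 1-D data. This is a sketch under stated assumptions: the map x -> (x, x^2), the helper names and the data are all hypothetical, and a kernel SVM gets the same effect without building the lifted coordinates explicitly. The separability check assumes no value is shared by both classes.

```python
def feature_map(x):
    """Explicitly lift a 1-D point into 2-D: x -> (x, x^2)."""
    return (x, x * x)

def threshold_separates(values, labels):
    """Check whether a single threshold puts one class entirely on each side."""
    ordered = [label for _, label in sorted(zip(values, labels))]
    changes = sum(1 for a, b in zip(ordered, ordered[1:]) if a != b)
    return changes == 1

# Hypothetical 1-D data: class +1 on the outside, class -1 in the middle.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [+1, +1, -1, -1, -1, +1, +1]

# On the original axis no single cut separates the classes.
print(threshold_separates(xs, ys))  # False

# On the lifted x^2 coordinate, the line x^2 = 2.5 separates them.
squared = [feature_map(x)[1] for x in xs]
print(threshold_separates(squared, ys))  # True
```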

Evaluating
• Split train / test set. Should not overlap!
• Accuracy
  ◦ What % of samples did it get right?
• Precision / Recall
  ◦ True Positives, True Negatives, False Positives, False Negatives
  ◦ Precision = TP / (TP + FP) (out of all that the classifier labeled positive, the % that actually was)
  ◦ Recall = TP / (TP + FN) (out of all the positives, how many did it get right?)
  ◦ F-measure (harmonic mean, 2 * Precision * Recall / (Precision + Recall))
• Confusion matrix
• Many others

Evaluating a spam classifier: example
Id  Is spam  Predicted spam
1   Yes      No
2   Yes      Yes
3   No       Yes
4   Yes      Yes
5   No       No
6   Yes      No
7   No       No
accuracy = 4/7 ≈ 0.57
precision = 2/(2 + 1) = 2/3 ≈ 0.667
recall = 2/(2 + 2) = 1/2 = 0.5
F-measure = 2 * (2/3 * 1/2) / (2/3 + 1/2) = 4/7 ≈ 0.57
Confusion matrix
          Predicted spam  Predicted not spam
Spam      2 (TP)          2 (FN)
Not spam  1 (FP)          2 (TN)
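The numbers in this example can be checked with a few lines of Python. The helper names are hypothetical, and the sketch assumes at least one predicted positive and one actual positive (otherwise precision or recall would divide by zero).

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, FN, TN for boolean labels (True = spam)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)
    fp = sum(1 for a, p in pairs if not a and p)
    fn = sum(1 for a, p in pairs if a and not p)
    tn = sum(1 for a, p in pairs if not a and not p)
    return tp, fp, fn, tn

def scores(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F-measure from the four counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure

# The seven emails from the example above, in order of Id.
is_spam        = [True, True, False, True, False, True, False]
predicted_spam = [False, True, True, True, False, False, False]

counts = confusion_counts(is_spam, predicted_spam)
print(counts)  # (2, 1, 2, 2)
print([round(s, 3) for s in scores(*counts)])  # [0.571, 0.667, 0.5, 0.571]
```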

Overfitting
• The model adjusts very well (“too well”) to the training data.
• It does not generalize to new data. This is a problem!

Preventing overfitting
• Detect outliers in the data
• Simple is better than complex
  ◦ Fewer parameters to tune can bring better performance.
  ◦ Eliminate degrees of freedom, e.g. lower the degree of a polynomial.
  ◦ Regularization (penalize complexity).
• K-fold cross validation (different train/test partitions)
• Get more training data!
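The "different train/test partitions" of k-fold cross validation can be sketched as an index generator. The `k_fold_indices` name is hypothetical, and the sketch splits samples in their given order, so the data should be shuffled first if its order is meaningful.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds.

    Every sample lands in the test set exactly once; fold sizes differ
    by at most one when n_samples is not a multiple of k.
    """
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

folds = list(k_fold_indices(6, 3))
for train, test in folds:
    print(train, test)
# [2, 3, 4, 5] [0, 1]
# [0, 1, 4, 5] [2, 3]
# [0, 1, 2, 3] [4, 5]
```

The model is trained and evaluated once per fold, and the k scores are averaged, so no single lucky or unlucky split decides the result.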

Demo: scikit-learn

What is Deep learning?
• In part, it is a rebranding of old technology.
• The first model of Artificial Neural Networks was proposed in 1943 (!).
• The analogy to the human brain has been greatly exaggerated.
[Figures: Perceptron (Rosenblatt, 1957); two-layer NN]
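For a sense of how simple the perceptron is, here is a minimal sketch of its learning rule in plain Python. The function names, the learning rate and the AND-gate data are hypothetical choices for illustration; a step activation with threshold 0 is assumed.

```python
def predict(w, b, x):
    """Step activation: output 1 if the weighted sum plus bias is positive."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

def train_perceptron(samples, epochs=10, lr=1.0):
    """Learn weights w and bias b with the perceptron update rule.

    samples is a list of (inputs, target) pairs with target in {0, 1}.
    """
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, target in samples:
            error = target - predict(w, b, x)
            # Weights only move when the prediction is wrong (error != 0).
            w = [wi + lr * error * xi for wi, xi in zip(w, x)]
            b += lr * error
    return w, b

# Logical AND is linearly separable, so a single perceptron can learn it.
and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(and_gate)
print([predict(w, b, x) for x, _ in and_gate])  # [0, 0, 0, 1]
```

A single perceptron can only draw one linear boundary, which is why problems that are not linearly separable need more layers.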

Deep learning
• Training deep networks in practice was only made possible by recent advances.
• Many state-of-the-art breakthroughs in recent years (2012+), powered by the same underlying model.
• Some tasks can now be performed at human accuracy (or even above!).

ImageNet challenge
• Created in 2010.
• The ImageNet dataset has 1000 categories and 1.2 million images.
• An architecture called Convolutional Neural Networks (CNN or ConvNets) has dominated since 2012.
• Krizhevsky et al. 2012.
  ◦ 10^14 pixels used in training.
  ◦ 138M parameters.

Deep learning
Traditional ML: Raw input → Manual feature extraction → Algorithm → Output
Deep Learning: Raw input → Deep Learning algorithm → Output
• These algorithms are very good at difficult tasks, like pattern recognition.
• They generalize when given sufficient training examples.
• They learn representations of the data.

Deep learning is already in your life

Deep learning is good with images

Deep learning games and art?

Deep learning on the road

What follows
There is A LOT to learn! Many active research areas and countless applications. We are living through a revolution.
• Rapid progress, need to keep up ;)
• Entry barrier for using these technologies is lower than ever.
• Processing power keeps increasing, models keep getting better, data storage gets cheaper.
• What is the limit?

Stay tuned. Thank you :) @tryolabs
