Slide 1

#ML101Tryolabs

Slide 2

Agenda

1. What is Machine Learning?
2. In practice: Python & its benefits
3. Types of Machine Learning problems
4. Steps needed to solve them
5. How two classification algorithms work
   a. k-Nearest Neighbors
   b. Support Vector Machines
6. Evaluating an algorithm
   a. Overfitting
7. Demo
8. (Very) basic intro to Deep Learning & what follows

Slide 3

"Software is eating the world" (Marc Andreessen, 2011)

Slide 4

"AI is the new electricity" (Andrew Ng)

Slide 5

What is Machine Learning?

The subfield of computer science that "gives computers the ability to learn without being explicitly programmed" (Arthur Samuel, 1959).

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E" (Tom Mitchell, 1997).

● Machine Learning vs AI?

Slide 6

AI with and without ML

● Deep Blue (IBM), 1997.
● AlphaGo (DeepMind), 2016.

Slide 7

The "semantic" landscape

(Diagram: Artificial Intelligence, Data Science, Machine Learning, Deep Learning)

Slide 8

Some Python code
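The slide itself shows a code screenshot that did not survive extraction. As a stand-in, a minimal sketch of the kind of concise Python such a slide typically shows (the word-count task is my own illustration):

```python
# Count word frequencies in a sentence -- a few readable lines,
# with the standard library doing the heavy lifting.
from collections import Counter

sentence = "the quick brown fox jumps over the lazy dog"
counts = Counter(sentence.split())

for word, n in counts.most_common(3):
    print(word, n)
```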

Slide 9

Why Python?

● Simple & powerful
● Fast prototyping
  ○ Batteries included
● Lots of libraries
  ○ For everything, not just Machine Learning
  ○ Bindings to integrate with other languages
● Community
  ○ Very active scientific community
  ○ Used in academia & industry

Slide 10

Broad types of ML problems

● Supervised
● Unsupervised
● Reinforcement

Slide 11

Supervised learning (1)

Learn through examples for which we know the desired output (what we want to predict).

● I want to know if the emails I receive are spam or not.
● I want to see whether the reviews of movies are positive, negative or neutral.
● I want to predict the market value of houses, given the square meters, number of rooms, neighborhood, etc.

Slide 12

Supervised learning (2)

● Classification: output is a discrete variable (e.g. spam/no spam).
● Regression: output is continuous (e.g. price, temperature).

Slide 13

Unsupervised learning (1)

There is no desired output; we want to learn something about the data itself, such as latent relationships.

● I have photos and want to put them in 20 groups.
● I want to find anomalies in the credit card usage patterns of my customers.

Slide 14

Unsupervised learning (2)

Useful for learning structure in the data (clustering), finding hidden correlations, reducing dimensionality, etc.

Related: semi-supervised learning, which combines a small amount of labeled data with a large amount of unlabeled data.

Slide 15

Reinforcement learning

An agent interacts with an environment and observes the result of each interaction. The environment gives feedback via a positive or negative reward signal.

Useful in games, robotics, etc.

Slide 16

Steps in every Machine Learning problem

To solve any ML task, we need to go through:

1. Data gathering
2. Data preprocessing & feature engineering
3. Algorithm & training
4. Applying the model to make predictions (evaluate, improve)

Slide 17

Data gathering

● Depends on the specific task
● Might need to put humans to work :(
  ○ E.g. manual labelling for supervised learning
  ○ Domain knowledge. Maybe even experts.
  ○ Can leverage Mechanical Turk and similar services.
● May come for free, or "sort of"
  ○ E.g. Machine Translation, categorized articles on Amazon, etc.
● The more the better
  ○ Some algorithms need large amounts of data to be useful (e.g. neural networks).

Slide 18

Data preprocessing

Is there anything wrong with the data?

● Missing values
● Outliers
● Bad encoding (for text)
● Wrongly-labeled examples
● Biased data
  ○ Do I have many more samples of one class than the rest?

Do we need to fix or remove data?

Slide 19

Feature engineering

What is a feature? A feature is an individual measurable property of a phenomenon being observed.

Our inputs are represented by a set of features. E.g., to classify spam email, features could be:

● Number of times some word appears (e.g. pharmacy)
● Number of words that have been ch4ng3d like this
● Language of the email (0=English, 1=Spanish, …)
● Number of emojis

Feature engineering maps the raw input to a feature vector:

"Buy ch34p drugs from the ph4rm4cy now :) :) :) :)" → (1, 2, 0, 4)
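To make the mapping concrete, a minimal sketch of such a feature extractor (the function name, the regular expressions and the language flag are my own illustrative assumptions, not code from the deck):

```python
import re

def extract_features(email: str, language: int = 0) -> tuple:
    """Map an email to the slide's 4-feature vector."""
    words = email.split()
    # 1. Occurrences of a suspicious word, allowing l33t-style spellings.
    n_pharmacy = sum(1 for w in words if re.fullmatch(r"ph[a4]rm[a4]cy", w.lower()))
    # 2. Words where letters have been ch4ng3d into digits.
    n_changed = sum(1 for w in words if re.search(r"[a-z]\d|\d[a-z]", w.lower()))
    # 3. Language flag is passed in (0 = English). 4. Emoji count.
    n_emojis = email.count(":)")
    return (n_pharmacy, n_changed, language, n_emojis)

print(extract_features("Buy ch34p drugs from the ph4rm4cy now :) :) :) :)"))
# (1, 2, 0, 4)
```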

Slide 20

Feature engineering (2)

● Extract more information from existing data
● Not adding "new" data per se
  ○ Making it more useful
  ○ With good features, most algorithms can learn faster
● It can be an art
  ○ Requires thought and knowledge of the data

Two steps (see the sketch below):

● Variable transformation (e.g. dates into weekdays, normalizing)
● Feature creation (e.g. n-grams for texts, whether a word is capitalized to detect names, etc.)
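A small sketch of both steps' flavor of variable transformation, under illustrative assumptions (the function names and data are mine):

```python
from datetime import date

# Variable transformation: a raw date becomes a categorical weekday.
def weekday_feature(d: date) -> int:
    return d.weekday()  # 0 = Monday ... 6 = Sunday

# Variable transformation: rescale a numeric feature to [0, 1].
def min_max_normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(weekday_feature(date(2017, 5, 1)))  # 0 (a Monday)
print(min_max_normalize([10, 20, 40]))    # [0.0, 0.333..., 1.0]
```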

Slide 21

Algorithm & training

Supervised:
● Linear classifier
● Naive Bayes
● Support Vector Machines (SVM)
● Decision Tree
● Random Forests
● k-Nearest Neighbors
● Neural Networks (Deep Learning)

Unsupervised / dimensionality reduction:
● PCA
● t-SNE
● k-means
● DBSCAN

They all understand vectors of numbers: data are points in multi-dimensional space.

Slide 22


Slide 23

k-Nearest Neighbors (k-NN)

Used for classification or regression. Idea (classification; see the sketch below):

● Choose a natural number k >= 1 (parameter of the algorithm).
● Given the sample X we want to label:
  ○ Calculate some distance (e.g. Euclidean) to all the samples in the training set.
  ○ Keep the k samples with the shortest distance to X. These are the nearest neighbors.
  ○ Assign to X whatever class the majority of the neighbors have.
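A from-scratch sketch of exactly these steps (toy data; brute-force Euclidean distance):

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    # Distance from x to every training sample, smallest first.
    dists = sorted((math.dist(p, x), label) for p, label in zip(train_X, train_y))
    # Majority vote among the k nearest neighbors.
    top_k = [label for _, label in dists[:k]]
    return Counter(top_k).most_common(1)[0][0]

train_X = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
train_y = ["red", "red", "red", "blue", "blue", "blue"]
print(knn_predict(train_X, train_y, (2, 2), k=3))  # red
```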

Slide 24

(Figure: k-NN classification with k = 1 and k = 5)

Slide 25

k-Nearest Neighbors (k-NN) (2)

● Fast computation of nearest neighbors is an active area of research.
● The naive brute-force approach computes the distance to all points and needs to keep the entire dataset in memory at classification time (there is no offline training).
● Need to experiment to get the right k (see the sketch below).

There are other algorithms that use approximations to deal with these inefficiencies.
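One common way to "experiment to get the right k" is cross-validated grid search; a scikit-learn sketch (the dataset and the candidate values of k are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Evaluate each candidate k with 5-fold cross validation
# and keep the best-scoring one.
search = GridSearchCV(KNeighborsClassifier(),
                      {"n_neighbors": [1, 3, 5, 7, 9]}, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```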

Slide 26


Slide 27

Support Vector Machines (SVM)

Idea:

● Separate the data linearly in space.
● There are infinitely many hyperplanes that can separate the blue & red dots.
● SVM finds the optimal hyperplane to separate the two classes (the one that leaves the most margin).

Slide 28

Support Vector Machines (SVM) (2)

● SVMs focus only on the points that are the most difficult to tell apart (other classifiers pay attention to all the points).
● We call these points the support vectors.
● The decision boundary doesn't change if we add more samples that are outside the margin.
● Can achieve good accuracy with fewer training samples compared to other algorithms.

This only works if the data is linearly separable. If not, we can use a kernel to transform it to a higher dimension (see the sketch below).
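A small scikit-learn sketch of the kernel idea on data that is not linearly separable (the synthetic dataset and parameters are illustrative):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: no straight line separates them in 2D.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)  # implicit mapping to a higher dimension

print("linear accuracy:", linear.score(X, y))  # poor, around chance
print("rbf accuracy:", rbf.score(X, y))        # near perfect
```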

Slide 29

Support Vector Machines: Kernel Trick

Slide 30

Evaluating

● Split into train / test sets. They should not overlap!
● Accuracy
  ○ What % of samples did it get right?
● Precision / Recall
  ○ Based on True Positives (TP), True Negatives (TN), False Positives (FP), False Negatives (FN).
  ○ Precision = TP / (TP + FP): out of everything the classifier labeled positive, what % actually was?
  ○ Recall = TP / (TP + FN): out of all the positive samples, how many did it get right?
  ○ F-measure: harmonic mean, 2 * Precision * Recall / (Precision + Recall).
● Confusion matrix
● Many others

Slide 31

Evaluating a spam classifier: example

Id   Is spam   Predicted spam
1    Yes       No
2    Yes       Yes
3    No        Yes
4    Yes       Yes
5    No        No
6    Yes       No
7    No        No

accuracy  = 4/7 ≈ 0.57
precision = 2/(2 + 1) = 2/3 ≈ 0.67
recall    = 2/(2 + 2) = 1/2 = 0.5
F-measure = 2 * (2/3 * 1/2)/(2/3 + 1/2) = 4/7 ≈ 0.57

Confusion matrix:

            Predicted spam   Predicted not spam
Spam        2 (TP)           2 (FN)
Not spam    1 (FP)           2 (TN)
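The table's numbers can be reproduced with scikit-learn's metrics (encoding spam as 1 and not spam as 0 is my choice):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# The seven emails from the table, in order.
y_true = [1, 1, 0, 1, 0, 1, 0]  # Is spam
y_pred = [0, 1, 1, 1, 0, 0, 0]  # Predicted spam

print(accuracy_score(y_true, y_pred))   # 0.571...
print(precision_score(y_true, y_pred))  # 0.666...
print(recall_score(y_true, y_pred))     # 0.5
print(f1_score(y_true, y_pred))         # 0.571...
print(confusion_matrix(y_true, y_pred)) # [[TN FP] [FN TP]] = [[2 1] [2 2]]
```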

Slide 32

Overfitting

● A model that adjusts very well ("too well") to the training data.
● It does not generalize to new data. This is a problem!

Slide 33

Preventing overfitting

● Detect outliers in the data
● Simple is better than complex
  ○ Fewer parameters to tune can bring better performance.
  ○ Eliminate degrees of freedom (e.g. lower the degree of a polynomial).
  ○ Regularization (penalize complexity).
● k-fold cross validation (different train/test partitions; see the sketch below)
● Get more training data!
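A minimal scikit-learn sketch of k-fold cross validation (the model and dataset are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# 5 folds: train on 4 of them, evaluate on the held-out 5th, rotate.
scores = cross_val_score(SVC(kernel="linear"), X, y, cv=5)
print(scores.mean(), scores.std())
```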

Slide 34

Demo: scikit-learn
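The demo itself is not in the transcript; a minimal scikit-learn example in its spirit (the dataset and classifier are my choices, not necessarily the ones shown live):

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load data and split into non-overlapping train / test sets.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Train on one part, evaluate on unseen data.
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))
```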

Slide 35

What is Deep Learning?

● In part, it is a rebranding of old technology.
● The first model of artificial neural networks (McCulloch & Pitts) was proposed in 1943 (!).
● The analogy to the human brain has been greatly exaggerated.

(Figures: the Perceptron (Rosenblatt, 1957) and a two-layer neural network)
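For intuition about the perceptron, a toy plain-Python sketch (the data, learning rate and epoch count are illustrative, not Rosenblatt's original formulation):

```python
def perceptron_train(X, y, epochs=10, lr=0.1):
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            # Update the weights only when the prediction is wrong.
            error = target - pred
            w = [wi + lr * error * xi for wi, xi in zip(w, x)]
            b += lr * error
    return w, b

# Learn the logical AND function.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 0, 1]
print(perceptron_train(X, y))
```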

Slide 36

Deep learning

● Training deep networks in practice was only made possible by recent advances.
● Many state-of-the-art breakthroughs in recent years (2012+), powered by the same underlying model.
● Some tasks can now be performed at human accuracy (or even above!).

Slide 37

ImageNet challenge

● Created in 2010.
● The ImageNet dataset has 1000 categories and 1.2 million images.
● An architecture called Convolutional Neural Networks (CNNs or ConvNets) has dominated since 2012.
● Krizhevsky et al., 2012:
  ○ ~10^14 pixels used in training.
  ○ 138M parameters.

Slide 38

Deep learning

Traditional ML:  Raw input → Manual feature extraction → Algorithm → Output
Deep Learning:   Raw input → Deep Learning algorithm → Output

● These algorithms are very good at difficult tasks, like pattern recognition.
● They generalize when given sufficient training examples.
● They learn representations of the data.

Slide 39

Deep learning is already in your life

Slide 40

Deep learning is good with images

Slide 41

Deep learning in games and art?

Slide 42

Deep learning on the road

Slide 43

What follows

There is A LOT to learn! Many active research areas and countless applications. We are living through a revolution.

● Rapid progress; need to keep up ;)
● The entry barrier for using these technologies is lower than ever.
● Processing power keeps increasing, models keep getting better, data storage gets cheaper.
● What is the limit?

Slide 44

Stay tuned. Thank you :)

@tryolabs