
Machine Learning 101 - Tryolabs

Tryolabs
August 31, 2017

Machine Learning 101 presentation deck by Tryolabs.

Transcript

  1. #ML101Tryolabs Agenda
     1. What is Machine Learning
     2. In practice: Python & its benefits
     3. Types of Machine Learning problems
     4. Steps needed to solve them
     5. How two classification algorithms work
        a. k-Nearest Neighbors
        b. Support Vector Machines
     6. Evaluating an algorithm
        a. Overfitting
     7. Demo
     8. (Very) basic intro to Deep Learning & what follows

  2. #ML101Tryolabs What is Machine Learning?
     The subfield of computer science that "gives computers the ability to learn without being explicitly programmed" (Arthur Samuel, 1959).
     "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E" (Tom Mitchell, 1997).
     • Machine Learning vs AI?

  3. #ML101Tryolabs Why Python?
     • Simple & powerful
     • Fast prototyping
       ◦ Batteries included
     • Lots of libraries
       ◦ For everything, not just Machine Learning
       ◦ Bindings to integrate with other languages
     • Community
       ◦ Very active scientific community
       ◦ Used in academia & industry

  4. #ML101Tryolabs Supervised learning (1)
     Learn through examples for which we know the desired output (what we want to predict).
     • I want to know if the emails I receive are spam or not.
     • I want to see if the reviews of movies are positive, negative, or neutral.
     • I want to predict the market value of houses, given the square meters, number of rooms, neighborhood, etc.

  5. #ML101Tryolabs Supervised learning (2)
     • Classification: output is a discrete variable (e.g. spam / not spam).
     • Regression: output is continuous (e.g. price, temperature).

  6. #ML101Tryolabs Unsupervised learning (1)
     There is no desired output. Learn something about the data: latent relationships.
     • I have photos and want to put them in 20 groups.
     • I want to find anomalies in the credit card usage patterns of my customers.

  7. #ML101Tryolabs Unsupervised learning (2)
     Useful for learning structure in the data (clustering), finding hidden correlations, reducing dimensionality, etc. (see the clustering sketch below). There is also semi-supervised learning.

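As a quick illustration of the clustering use case, here is a minimal scikit-learn sketch; the toy data and the number of groups are assumptions for illustration (the photo example above would use n_clusters=20):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D points forming two obvious blobs; no labels are given.
X = np.array([[0.1, 0.2], [0.0, 0.1], [0.2, 0.0],
              [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]])

# Group the points into 2 clusters based only on their positions.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # e.g. [0 0 0 1 1 1]: each point's assigned group
```
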
  8. #ML101Tryolabs Reinforcement learning
     An agent interacts with an environment and observes the result of each interaction. The environment gives feedback via a positive or negative reward signal. Useful in games, robotics, etc.

  9. #ML101Tryolabs Steps in every Machine Learning problem
     To solve any ML task, you need to go through:
     1. Data gathering
     2. Data preprocessing & feature engineering
     3. Algorithm & training
     4. Applying the model to make predictions (evaluate, improve)

  10. #ML101Tryolabs Data gathering
      • Depends on the specific task
      • Might need to put humans to work :(
        ◦ E.g. manual labelling for supervised learning
        ◦ Domain knowledge. Maybe even experts.
        ◦ Can leverage Mechanical Turk and others.
      • May come for free, or "sort of"
        ◦ E.g. machine translation.
        ◦ Categorized articles on Amazon, etc.
      • The more the better
        ◦ Some algorithms need large amounts of data to be useful (e.g. neural networks).

  11. #ML101Tryolabs Data preprocessing
      Is there anything wrong with the data?
      • Missing values
      • Outliers
      • Bad encoding (for text)
      • Wrongly-labeled examples
      • Biased data
        ◦ Do I have many more samples of one class than the rest?
      Do we need to fix or remove data? (A quick sanity check is sketched below.)

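A minimal sketch of these checks with pandas; the file name and the "label" column are assumptions for illustration:

```python
import pandas as pd

# Hypothetical dataset; the file name and "label" column are assumptions.
df = pd.read_csv("emails.csv")

# Missing values: count them per column.
print(df.isnull().sum())

# Outliers: flag numeric values more than 3 standard deviations from the mean.
numeric = df.select_dtypes("number")
print(((numeric - numeric.mean()).abs() > 3 * numeric.std()).sum())

# Biased data: do I have many more samples of one class than the rest?
print(df["label"].value_counts(normalize=True))
```
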
  12. #ML101Tryolabs Feature engineering
      What is a feature? A feature is an individual measurable property of a phenomenon being observed. Our inputs are represented by a set of features. E.g., to classify spam email, features could be:
      • Number of times some word appears (e.g. pharmacy)
      • Number of words that have been ch4ng3d like this
      • Language of the email (0=English, 1=Spanish, …)
      • Number of emojis
      "Buy ch34p drugs from the ph4rm4cy now :) :) :) :)" → (1, 2, 0, 4)

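A minimal sketch of such a feature extractor for the four features above; the digit-based test for "changed" words, the leetspeak pattern for pharmacy, and the hard-coded language are simplifying assumptions:

```python
import re

def email_features(text: str) -> tuple:
    """Turn an email into the (pharmacy, changed_words, language, emojis) vector."""
    words = text.split()
    # Words containing digits, like "ch34p", count as "changed".
    changed = sum(1 for w in words if re.search(r"\d", w))
    # Count leetspeak variants of "pharmacy" as occurrences of the word.
    pharmacy = len(re.findall(r"ph[a4]rm[a4]cy", text.lower()))
    language = 0  # assume English; a real language detector would go here
    emojis = text.count(":)")
    return (pharmacy, changed, language, emojis)

print(email_features("Buy ch34p drugs from the ph4rm4cy now :) :) :) :)"))
# (1, 2, 0, 4)
```
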
  13. #ML101Tryolabs Feature engineering (2)
      • Extract more information from existing data
      • Not adding "new" data per se
        ◦ Making it more useful
        ◦ With good features, most algorithms can learn faster
      • It can be an art
        ◦ Requires thought and knowledge of the data
      Two steps (sketched below):
      • Variable transformation (e.g. dates into weekdays, normalizing)
      • Feature creation (e.g. n-grams for texts, whether a word is capitalized to detect names, etc.)

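A minimal sketch of both steps with pandas; the column names and toy values are assumptions for illustration:

```python
import pandas as pd

# Hypothetical house-sales data; column names are assumptions.
df = pd.DataFrame({
    "sold_on": pd.to_datetime(["2017-08-28", "2017-09-02"]),
    "price": [250_000, 410_000],
})

# Variable transformation: dates into weekdays, normalizing a numeric column.
df["weekday"] = df["sold_on"].dt.dayofweek
df["price_norm"] = (df["price"] - df["price"].mean()) / df["price"].std()

# Feature creation: a new flag derived from an existing column.
df["weekend_sale"] = df["weekday"] >= 5
print(df)
```
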
  14. #ML101Tryolabs Algorithm & training
      Supervised:
      • Linear classifier
      • Naive Bayes
      • Support Vector Machines (SVM)
      • Decision Tree
      • Random Forests
      • k-Nearest Neighbors
      • Neural Networks (Deep Learning)
      Unsupervised / dimensionality reduction:
      • PCA
      • t-SNE
      • k-means
      • DBSCAN
      They all understand vectors of numbers. Data are points in multi-dimensional space.

  15. #ML101Tryolabs k-Nearest Neighbors (k-NN)
      Classification or regression. Idea (classification):
      • Choose a natural number k >= 1 (a parameter of the algorithm).
      • Given the sample X we want to label:
        ◦ Calculate some distance (e.g. Euclidean) to all the samples in the training set.
        ◦ Keep the k samples with the shortest distance to X. These are the nearest neighbors.
        ◦ Assign to X whichever class the majority of the neighbors have.
      (A brute-force sketch follows this list.)

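A minimal brute-force sketch of the idea above in NumPy; the toy data and variable names are illustrative:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Label x by majority vote among its k nearest training samples."""
    # Euclidean distance from x to every training sample.
    distances = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k closest samples: the nearest neighbors.
    nearest = np.argsort(distances)[:k]
    # Majority class among the neighbors.
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array(["blue", "blue", "red", "red"])
print(knn_predict(X_train, y_train, np.array([4.8, 5.1]), k=3))  # red
```
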
  16. #ML101Tryolabs k-Nearest Neighbors (k-NN) (2)
      • Fast computation of nearest neighbors is an active area of research.
      • The naive brute-force approach computes the distance to all points, and needs to keep the entire dataset in memory at classification time (there is no offline training).
      • Need to experiment to get the right k.
      There are other algorithms that use approximations, or tree-based indexes, to deal with these inefficiencies (see the sketch below).

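For instance, scikit-learn's KNeighborsClassifier can index the training set with a k-d tree so queries avoid scanning every point; a minimal sketch, reusing the toy data from the previous example:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array(["blue", "blue", "red", "red"])

# algorithm="kd_tree" builds a tree index instead of using brute force.
clf = KNeighborsClassifier(n_neighbors=3, algorithm="kd_tree")
clf.fit(X_train, y_train)
print(clf.predict([[4.8, 5.1]]))  # ['red']
```
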
  17. #ML101Tryolabs Support Vector Machines (SVM)
      Idea:
      • Separate data linearly in space.
      • There are infinitely many planes that can separate the blue & red dots.
      • SVM finds the optimal hyperplane separating the two classes (the one that leaves the most margin).

  18. #ML101Tryolabs Support Vector Machines (SVM) (2)
      • SVMs focus only on the points that are the most difficult to tell apart (other classifiers pay attention to all the points). We call these points the support vectors.
      • The decision boundary doesn't change if we add more samples that lie outside the margin.
      • Can achieve good accuracy with fewer training samples compared to other algorithms.
      • Only works if the data is linearly separable. If not, we can use a kernel to transform it into a higher dimension (see the sketch below).

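A minimal scikit-learn sketch of both cases; the XOR-style toy data is an assumption, chosen because no straight line separates it:

```python
import numpy as np
from sklearn.svm import SVC

# XOR-like data: no straight line separates class 0 from class 1.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)

print("linear accuracy:", linear.score(X, y))  # < 1.0: no separating hyperplane exists
print("rbf accuracy:   ", rbf.score(X, y))     # the kernel lifts the data into a space
                                               # where the classes become separable
print("support vectors:", rbf.support_vectors_)
```
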
  19. #ML101Tryolabs Evaluating
      • Split into train / test sets. They should not overlap!
      • Accuracy
        ◦ What % of samples did it get right?
      • Precision / Recall
        ◦ Based on True Positives (TP), True Negatives (TN), False Positives (FP), False Negatives (FN).
        ◦ Precision = TP / (TP + FP) (out of everything the classifier labeled positive, what % actually was?)
        ◦ Recall = TP / (TP + FN) (out of all the actual positives, how many did it get right?)
        ◦ F-measure (harmonic mean): 2 · Precision · Recall / (Precision + Recall)
      • Confusion matrix
      • Many others

  20. #ML101Tryolabs Evaluating a spam classifier: example

      Id | Is spam | Predicted spam
      ---|---------|---------------
      1  | Yes     | No
      2  | Yes     | Yes
      3  | No      | Yes
      4  | Yes     | Yes
      5  | No      | No
      6  | Yes     | No
      7  | No      | No

      accuracy  = 4/7 ≈ 0.57
      precision = 2/(2 + 1) = 2/3 ≈ 0.667
      recall    = 2/(2 + 2) = 1/2 = 0.5
      F-measure = 2 · (2/3 · 1/2) / (2/3 + 1/2) = 4/7 ≈ 0.57

      Confusion matrix:
                 | Predicted spam | Predicted not spam
      Spam       | 2 (TP)         | 2 (FN)
      Not spam   | 1 (FP)         | 2 (TN)

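The same numbers can be double-checked with scikit-learn's metrics; a minimal sketch encoding the table above:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 1, 0, 1, 0, 1, 0]  # Is spam (Yes=1, No=0), ids 1..7
y_pred = [0, 1, 1, 1, 0, 0, 0]  # Predicted spam

print(accuracy_score(y_true, y_pred))   # 0.571... = 4/7
print(precision_score(y_true, y_pred))  # 0.666... = 2/3
print(recall_score(y_true, y_pred))     # 0.5      = 2/4
print(f1_score(y_true, y_pred))         # 0.571... = 4/7
# Note: scikit-learn orders the matrix as [[TN, FP], [FN, TP]].
print(confusion_matrix(y_true, y_pred)) # [[2 1], [2 2]]
```
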
  21. #ML101Tryolabs Overfitting
      • A model that adjusts very well ("too well") to the training data...
      • ...does not generalize to new data. This is a problem!

  22. #ML101Tryolabs Preventing overfitting
      • Detect outliers in the data
      • Simple is better than complex
        ◦ Fewer parameters to tune can bring better performance.
        ◦ Eliminate degrees of freedom (e.g. the degree of a polynomial).
        ◦ Regularization (penalize complexity).
      • K-fold cross-validation (different train/test partitions; see the sketch below)
      • Get more training data!

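A minimal sketch of k-fold cross-validation with scikit-learn, using its built-in iris dataset for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: train on 4/5 of the data, test on the remaining
# 1/5, rotating the held-out fold so every sample is tested exactly once.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=5)
print(scores)         # one accuracy score per fold
print(scores.mean())  # averaged estimate of generalization
```
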
  23. #ML101Tryolabs What is Deep Learning?
      • In part, it is a rebranding of old technology.
      • The first model of Artificial Neural Networks was proposed in 1943 (!).
      • The analogy to the human brain has been greatly exaggerated.
      [Slide shows diagrams of the Perceptron (Rosenblatt, 1957) and a two-layer NN.]

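For reference, a minimal NumPy sketch of Rosenblatt's perceptron learning rule; the toy AND dataset and learning rate are assumptions for illustration:

```python
import numpy as np

# Toy AND dataset: linearly separable, so the perceptron converges.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])

w = np.zeros(2)  # weights
b = 0.0          # bias
lr = 0.1         # learning rate

for _ in range(20):                      # a few passes over the data
    for xi, target in zip(X, y):
        pred = int(w @ xi + b > 0)       # step activation
        w += lr * (target - pred) * xi   # Rosenblatt's update rule
        b += lr * (target - pred)

print([int(w @ xi + b > 0) for xi in X])  # [0, 0, 0, 1]
```
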
  24. #ML101Tryolabs Deep learning
      • Training deep networks in practice was only made possible by recent advances.
      • Many state-of-the-art breakthroughs in recent years (2012+), powered by the same underlying model.
      • Some tasks can now be performed at human accuracy (or even above!).

  25. #ML101Tryolabs ImageNet challenge
      • Created in 2010.
      • The ImageNet dataset has 1000 categories and 1.2 million images.
      • An architecture called Convolutional Neural Networks (CNNs or ConvNets) has dominated since 2012.
      • Krizhevsky et al. 2012:
        ◦ ~10^14 pixels used for training.
        ◦ 60M parameters.

  26. #ML101Tryolabs Deep learning
      Traditional ML:  Raw input → Manual feature extraction → Algorithm → Output
      Deep Learning:   Raw input → Deep Learning algorithm → Output
      • These algorithms are very good at difficult tasks, like pattern recognition.
      • They generalize when given sufficient training examples.
      • They learn representations of the data.

  27. #ML101Tryolabs What follows
      There is A LOT to learn! Many active research areas and countless applications. We are living through a revolution.
      • Rapid progress; need to keep up ;)
      • The entry barrier for using these technologies is lower than ever.
      • Processing power keeps increasing, models keep getting better, data storage gets cheaper.
      • What is the limit?