
Machine Learning 101 - Tryolabs. August 2016

Tryolabs
August 18, 2016


Slides of the meetup held at Tryolabs' offices which introduce key concepts about Machine Learning.


Transcript


  4. Agenda
     1. Python & its benefits
     2. Types of Machine Learning problems
     3. Steps needed to solve them
     4. How two classification algorithms work
        a. K-Nearest Neighbors
        b. Support Vector Machines
     5. Evaluating an algorithm
        a. Overfitting
     6. Demo
     7. What follows
  5. Why Python?
     • Simple & powerful
     • Fast prototyping
       ◦ Batteries included
     • Lots of libraries
       ◦ For everything, not just Machine Learning
       ◦ Bindings to integrate with other languages
     • Community
       ◦ Very active scientific community
       ◦ Used in academia & industry
  6. Data Science “Swiss army knife”
     • Data preparation / exploration / visualization: Pandas, Matplotlib, Seaborn
     • Modeling / Machine Learning: Scikit-learn, Pylearn2
     • Text focused: Gensim, NLTK
     • Deep learning: Keras, TensorFlow, Theano
  7. Types of ML problems
     Supervised
     • Learn through examples for which we know the desired output.
     • Two types:
       ◦ Classification (discrete output, e.g. spam / not spam)
       ◦ Regression (continuous output, e.g. temperature)
     Unsupervised
     • Discover latent relationships in the data.
     • Two types:
       ◦ Dimensionality reduction (to fight the curse of dimensionality)
       ◦ Clustering
     There is also semi-supervised learning, which uses both labeled and unlabeled data.
  8. Steps in every Machine Learning problem
     To solve any ML task, need to go through:
     1. Data gathering
     2. Data processing & feature engineering
     3. Algorithm & training
     4. Applying the model to make predictions (evaluate, improve)
  9. Data gathering
     • Depends on the specific task.
     • Might need to put humans to work :(
       ◦ E.g. manual labelling for supervised learning.
       ◦ Domain knowledge. Maybe even experts.
       ◦ Can leverage Mechanical Turk / CrowdFlower.
     • May come for free, or “sort of”
       ◦ E.g. Machine Translation, categorized articles on Amazon, etc.
     • The more the better
       ◦ Some algorithms need large amounts of data to be useful (e.g. neural networks).
  10. Data processing
     Is there anything wrong with the data?
     • Missing values
     • Outliers
     • Bad encoding (for text)
     • Wrongly-labeled examples
     • Biased data
       ◦ Do I have many more samples of one class than the rest?
     Need to fix/remove data?
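     The checks above can be sketched with Pandas (named earlier in the deck). The dataset and the chosen fixes here are hypothetical; filling with the median and clipping are just two common options among many:

```python
import numpy as np
import pandas as pd

# Toy dataset exhibiting some of the issues listed above (made-up values).
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 250],            # a missing value and an outlier
    "label": ["spam", "spam", "ham", "ham", "spam"],
})

# Missing values: fill with the column median (one common choice).
df["age"] = df["age"].fillna(df["age"].median())

# Outliers: clip to a plausible range instead of dropping the rows.
df["age"] = df["age"].clip(lower=0, upper=100)

# Biased data: check the class balance before training.
print(df["label"].value_counts())
```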
  11. Feature engineering
     What is a feature? A feature is an individual measurable property of a phenomenon being observed.
     Our inputs are represented by a set of features. E.g., to classify spam email, features could be:
     • Number of times some word appears (e.g. pharmacy)
     • Number of words that have been ch4ng3d like this
     • Language of the email
     • Number of emojis
     Example: “Buy ch34p drugs from the ph4rm4cy now :) :) :) :)” → (1, 2, English, 4)
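     A toy extractor for three of those features might look like the sketch below. The leet-speak mapping and the `:)`-as-emoji shortcut are simplifications for illustration, and language detection is left out since doing it reliably needs an external library:

```python
import re

# Map common leet-speak digits back to letters (a simplification).
LEET = str.maketrans("4310", "aeio")

def extract_features(text: str) -> tuple:
    words = text.split()
    # Feature 1: occurrences of "pharmacy", even obfuscated (ph4rm4cy).
    pharmacy = sum(1 for w in words if w.lower().translate(LEET) == "pharmacy")
    # Feature 2: words that contain digits, i.e. have been ch4ng3d.
    changed = sum(1 for w in words if re.search(r"\d", w))
    # Feature 3: count of :) emoticons (a stand-in for emoji counting).
    emojis = text.count(":)")
    return (pharmacy, changed, emojis)

print(extract_features("Buy ch34p drugs from the ph4rm4cy now :) :) :) :)"))
# → (1, 2, 4)
```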
  12. Feature engineering (2)
     • Extract more information from existing data
     • Not adding “new” data per se
       ◦ Making it more useful
       ◦ With good features, most algorithms can learn faster
     • It can be an art
       ◦ Requires thought and knowledge of the data
     Two steps:
     • Variable transformation (e.g. dates into weekdays, normalizing)
     • Feature creation (e.g. n-grams for texts, whether a word is capitalized to detect names, etc.)
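     Both transformations named above can be sketched in a few lines of standard-library Python (the date and the numbers are made up for illustration):

```python
from datetime import date

# Dates into weekdays: the raw date is hard to learn from, but the
# weekday may correlate with the target (e.g. spam sent on weekends).
d = date(2016, 8, 18)
weekday = d.strftime("%A")   # → "Thursday"

# Normalizing: rescale a numeric feature to the [0, 1] range.
values = [10.0, 20.0, 50.0]
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

print(weekday, normalized)   # → Thursday [0.0, 0.25, 1.0]
```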
  13. Algorithm & training
     Supervised
     • Linear classifier
     • Naive Bayes
     • Support Vector Machines (SVM)
     • Decision Trees
     • Random Forests
     • k-Nearest Neighbors
     • Neural Networks (deep learning)
     Unsupervised / dimensionality reduction
     • PCA
     • t-SNE
     • k-means
     • DBSCAN
     They all understand vectors of numbers: data are points in a multi-dimensional space.
  15. k-Nearest Neighbors (k-NN)
     Classification or regression. Idea (classification):
     • Choose a natural number k >= 1 (parameter of the algorithm).
     • Given the sample X we want to label:
       ◦ Calculate some distance (e.g. Euclidean) to all the samples in the training set.
       ◦ Keep the k samples with the shortest distance to X. These are the nearest neighbors.
       ◦ Assign to X whichever class the majority of the neighbors have.
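     The steps above fit in a short pure-Python sketch (toy points invented for illustration; real code would typically use scikit-learn's `KNeighborsClassifier`):

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    # Euclidean distance from x to every training sample.
    dists = [(math.dist(x, xi), yi) for xi, yi in zip(train_X, train_y)]
    # Keep the k nearest neighbors.
    neighbors = sorted(dists)[:k]
    # Majority vote among their labels.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Two made-up clusters of points.
train_X = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
train_y = ["blue", "blue", "blue", "red", "red", "red"]

print(knn_predict(train_X, train_y, (2, 2), k=3))   # → blue
print(knn_predict(train_X, train_y, (7, 8), k=3))   # → red
```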
  16. k-Nearest Neighbors (k-NN) (2)
     • Fast computation of nearest neighbors is an active area of research.
     • The naive brute-force approach computes the distance to all points, so the entire dataset must be kept in memory at classification time (there is no offline training).
     • Need to experiment to find the right k.
     Other algorithms use approximations to deal with these inefficiencies.
  18. Support Vector Machines (SVM)
     Idea:
     • Separate the data linearly in space.
     • There are infinitely many planes that can separate the blue & red dots.
     • SVM finds the optimal hyperplane to separate the two classes: the one that leaves the most margin.
  19. Support Vector Machines (SVM) (2)
     • SVMs focus only on the points that are hardest to tell apart (other classifiers pay attention to all the points).
     • We call these points the support vectors.
     • The decision boundary doesn’t change if we add more samples that lie outside the margin.
     • Can achieve good accuracy with fewer training samples than other algorithms.
     Only works if the data is linearly separable.
     • If it isn’t, a kernel can transform it into a higher dimension.
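     Scikit-learn (named earlier in the deck) provides an SVM implementation; a minimal sketch on made-up 2D points might look like this, with the kernel swap mentioned above just a parameter away:

```python
from sklearn.svm import SVC

# Two small, linearly separable clusters (hypothetical points).
X = [[1, 1], [2, 1], [1, 2], [6, 6], [7, 6], [6, 7]]
y = ["blue", "blue", "blue", "red", "red", "red"]

clf = SVC(kernel="linear")   # e.g. kernel="rbf" for non-separable data
clf.fit(X, y)

print(clf.predict([[2, 2], [6, 5]]))
# Only the points closest to the boundary end up as support vectors:
print(clf.support_vectors_)
```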
  20. Evaluating
     • Split into train / test sets. They should not overlap!
     • Accuracy
       ◦ What % of samples did it get right?
     • Precision / Recall
       ◦ Defined in terms of True Positives (TP), True Negatives (TN), False Positives (FP), False Negatives (FN)
       ◦ Precision = TP / (TP + FP): out of everything the classifier labeled positive, what % actually was?
       ◦ Recall = TP / (TP + FN): out of all the actual positives, how many did it find?
       ◦ F-measure: the harmonic mean, 2 * Precision * Recall / (Precision + Recall)
     • Confusion matrix
     • Many others
  21. Evaluating a spam classifier: example

     Id | Is spam | Predicted spam
     1  | Yes     | No
     2  | Yes     | Yes
     3  | No      | Yes
     4  | Yes     | Yes
     5  | No      | No
     6  | Yes     | No
     7  | No      | No

     accuracy  = 4/7 ≈ 0.57
     precision = 2/(2 + 1) = 2/3 ≈ 0.667
     recall    = 2/(2 + 2) = 1/2 = 0.5
     F-measure = 2 * (2/3 * 1/2) / (2/3 + 1/2) = 4/7 ≈ 0.57

     Confusion matrix:
               | Predicted spam | Predicted not spam
     Spam      | 2 (TP)         | 2 (FN)
     Not spam  | 1 (FP)         | 2 (TN)
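     The slide's numbers can be recomputed from the table with a few lines of standard-library Python (exact fractions avoid rounding noise):

```python
from fractions import Fraction

# The "Is spam" / "Predicted spam" columns from the table, in order.
actual    = ["Yes", "Yes", "No",  "Yes", "No", "Yes", "No"]
predicted = ["No",  "Yes", "Yes", "Yes", "No", "No",  "No"]

pairs = list(zip(actual, predicted))
tp = sum(a == "Yes" and p == "Yes" for a, p in pairs)
tn = sum(a == "No"  and p == "No"  for a, p in pairs)
fp = sum(a == "No"  and p == "Yes" for a, p in pairs)
fn = sum(a == "Yes" and p == "No"  for a, p in pairs)

accuracy  = Fraction(tp + tn, len(pairs))
precision = Fraction(tp, tp + fp)
recall    = Fraction(tp, tp + fn)
f_measure = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f_measure)   # → 4/7 2/3 1/2 4/7
```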
  22. Overfitting
     • A model that adjusts very well (“too well”) to the training data
     • does not generalize to new data. This is a problem!
  23. Preventing overfitting
     • Detect outliers in the data
     • Simple is better than complex
       ◦ Fewer parameters to tune can bring better performance.
       ◦ Eliminate degrees of freedom (e.g. the degree of a polynomial).
       ◦ Regularization (penalize complexity)
     • K-fold cross validation (different train/test partitions)
     • Get more training data!
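     K-fold cross validation can be sketched in pure Python as below. This toy version assigns samples to folds in an interleaved way (scikit-learn's `KFold` uses contiguous blocks instead); the point is that every sample lands in a test set exactly once:

```python
def k_fold_indices(n_samples, k):
    """Return k (train, test) index partitions of range(n_samples)."""
    folds = []
    for i in range(k):
        test = list(range(i, n_samples, k))              # every k-th sample
        train = [j for j in range(n_samples) if j not in test]
        folds.append((train, test))
    return folds

for train, test in k_fold_indices(6, 3):
    print(train, test)
# → [1, 2, 4, 5] [0, 3]
#   [0, 2, 3, 5] [1, 4]
#   [0, 1, 3, 4] [2, 5]
```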
  24. What follows
     There is A LOT to learn! Many active research areas:
     • Dealing with natural language
       ◦ Language understanding
       ◦ Question answering
       ◦ Automatic summarization
     • Feature/representation learning
       ◦ Embeddings (e.g. word2vec)
     • Neural networks
       ◦ Deep learning
       ◦ Cool applications to signals (images, videos, sounds)
       ◦ Generation of image captions, etc.