2. Python & its benefits
3. Types of Machine Learning problems
4. Steps needed to solve them
5. How two classification algorithms work
   a. K-Nearest Neighbors
   b. Support Vector Machines
6. Evaluating an algorithm
   a. Overfitting
7. Demo
8. (Very) basic intro to Deep Learning & what follows

that "gives computers the ability to learn without being explicitly programmed" (Arthur Samuel, 1959). "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E" (Tom Mitchell, 1997)
• Machine Learning vs AI?

◦ Batteries included
• Lots of libraries
◦ For everything, not just Machine Learning
◦ Bindings to integrate with other languages
• Community
◦ Very active scientific community
◦ Used in academia & industry

examples for which we know the desired output (what we want to predict).
• I want to know whether the emails I receive are spam or not.
• I want to see whether movie reviews are positive, negative or neutral.
• I want to predict the market value of houses, given the square meters, number of rooms, neighborhood, etc.

output. Learn something about the data: latent relationships.
• I have photos and want to put them in 20 groups.
• I want to find anomalies in the credit card usage patterns of my customers.

environment and watches the result of the interaction. The environment gives feedback via a positive or negative reward signal. Useful in games, robotics, etc.

ML task, need to go through:
1. Data gathering
2. Data preprocessing & feature engineering
3. Algorithm & training
4. Applying the model to make predictions (evaluate, improve)
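As a toy illustration of steps 1–4, here is a minimal end-to-end sketch in plain Python. The data, the min-max scaling and the nearest-centroid classifier are all invented for the example; a real project would use a library like scikit-learn.

```python
import math

# 1. Data gathering: toy labeled samples of (height cm, weight kg)
data = [((150, 50), "A"), ((160, 55), "A"), ((180, 80), "B"), ((190, 90), "B")]

# 2. Preprocessing: min-max scale each feature to [0, 1]
points = [p for p, _ in data]
lo = [min(p[i] for p in points) for i in range(2)]
hi = [max(p[i] for p in points) for i in range(2)]

def scale(p):
    return tuple((p[i] - lo[i]) / (hi[i] - lo[i]) for i in range(2))

# 3. Training: compute one centroid per class (a nearest-centroid classifier)
centroids = {}
for label in ("A", "B"):
    pts = [scale(p) for p, lbl in data if lbl == label]
    centroids[label] = tuple(sum(c) / len(pts) for c in zip(*pts))

# 4. Prediction: assign a new sample to the class with the closest centroid
def predict(p):
    q = scale(p)
    return min(centroids, key=lambda lbl: math.dist(q, centroids[lbl]))
```

A sample near the small class lands in "A", one near the large class in "B"; evaluating on held-out data would be the "evaluate, improve" part of step 4.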

Might need to put humans to work :(
◦ E.g. manual labelling for supervised learning
◦ Requires domain knowledge. Maybe even experts.
◦ Can leverage Mechanical Turk and similar services.
• May come for free, or "sort of"
◦ E.g. Machine Translation.
◦ Categorized articles on Amazon, etc.
• The more the better
◦ Some algorithms need large amounts of data to be useful (e.g. neural networks).

• Missing values
• Outliers
• Bad encoding (for text)
• Wrongly-labeled examples
• Biased data
◦ Do I have many more samples of one class than the rest? Need to fix/remove data?

an individual measurable property of a phenomenon being observed. Our inputs are represented by a set of features. E.g., to classify spam email, features could be:
◦ Number of times some word appears (e.g. "pharmacy")
◦ Number of words that have been ch4ng3d like this
◦ Language of the email (0=English, 1=Spanish, …)
◦ Number of emojis
"Buy ch34p drugs from the ph4rm4cy now :) :) :) :)" → (1, 2, 0, 4)
Feature engineering
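A sketch of how the example email could be turned into the vector (1, 2, 0, 4). The exact heuristics (the regexes, hardcoding the language flag) are invented for illustration; real spam filters use far richer features.

```python
import re

def featurize(email: str) -> tuple:
    words = email.split()
    # 1. occurrences of a suspicious word, allowing digit substitutions
    pharmacy = sum(1 for w in words
                   if re.fullmatch(r"ph[a4]rm[a4]cy\W*", w, re.I))
    # 2. words where digits were swapped in for letters (ch4ng3d words)
    changed = sum(1 for w in words
                  if re.search(r"[a-z]\d|\d[a-z]", w, re.I))
    # 3. language flag (0 = English; hardcoded here for simplicity)
    language = 0
    # 4. emoji / smiley count
    emojis = email.count(":)")
    return (pharmacy, changed, language, emojis)
```

Running it on the slide's example email yields the feature vector shown above.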

data
• Not adding "new" data per se
◦ Making it more useful
◦ With good features, most algorithms can learn faster
• It can be an art
◦ Requires thought and knowledge of the data
Two steps:
• Variable transformation (e.g. dates into weekdays, normalizing)
• Feature creation (e.g. n-grams for texts, whether a word is capitalized to detect names, etc.)
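Both steps can be sketched in a few lines of plain Python; the helper names here are made up for the example.

```python
from datetime import date

# Variable transformation: a raw date becomes a weekday feature
def weekday_feature(d: date) -> int:
    return d.weekday()  # 0 = Monday ... 6 = Sunday

# Variable transformation: min-max normalization to [0, 1]
def normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Feature creation: character n-grams, a common text feature
def ngrams(text: str, n: int = 2):
    return [text[i:i + n] for i in range(len(text) - n + 1)]
```

For instance, `ngrams("spam")` gives `["sp", "pa", "am"]`, which a text classifier can count just like whole words.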

Bayes
• Support Vector Machines (SVM)
• Decision Trees
• Random Forests
• k-Nearest Neighbors
• Neural Networks (Deep Learning)
Unsupervised / dimensionality reduction:
• PCA
• t-SNE
• k-means
• DBSCAN
They all understand vectors of numbers. Data are points in multi-dimensional space.

Choose a natural number k >= 1 (a parameter of the algorithm).
• Given the sample X we want to label:
◦ Calculate some distance (e.g. Euclidean) to all the samples in the training set.
◦ Keep the k samples with the shortest distance to X. These are the nearest neighbors.
◦ Assign to X whatever class the majority of the neighbors have.
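The steps above fit in a few lines; this is a minimal brute-force sketch (the function name and data layout are assumptions, not a standard API).

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples.

    `train` is a list of (point, label) pairs; points are equal-length tuples.
    """
    # Euclidean distance from the query to every training sample
    dists = sorted((math.dist(point, query), label) for point, label in train)
    # keep the k closest and take a majority vote on their labels
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]
```

Usage: with three "red" points clustered near (5, 5) and two "blue" points near the origin, `knn_predict(train, (5, 5), k=3)` returns `"red"`.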

neighbors is an active area of research.
• The naive brute-force approach computes the distance to all points, so the entire dataset must be kept in memory at classification time (there is no offline training).
• Need to experiment to find the right k.
There are other algorithms that use approximations to deal with these inefficiencies.

in space.
• There are infinitely many planes that can separate the blue & red dots.
• SVM finds the optimal hyperplane separating the two classes (the one that leaves the widest margin).

on the points that are the most difficult to tell apart (other classifiers pay attention to all the points).
• We call these the support vectors.
• The decision boundary doesn't change if we add more samples outside the margin.
• Can achieve good accuracy with fewer training samples than other algorithms.
• Only works if the data is linearly separable; if not, a kernel can transform it to a higher dimension.
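A tiny illustration of the kernel idea, with made-up 1-D data: points of class A sit near the origin and points of class B farther out, so no single threshold on the line separates them. Lifting each point x to (x, x²), a simple polynomial feature map, makes them separable by a straight line in 2-D.

```python
a = [-1.0, 0.5, 1.0]           # class A: near the origin
b = [-3.0, -2.5, 2.5, 3.0]     # class B: far from the origin

def lift(x):
    return (x, x * x)  # map each 1-D point into 2-D feature space

# In the lifted space the horizontal line y = 2 separates the classes:
assert all(lift(x)[1] < 2 for x in a)
assert all(lift(x)[1] > 2 for x in b)
```

Kernel SVMs get this effect implicitly, without ever computing the lifted coordinates.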

overlap!
• Accuracy
◦ What % of samples did it get right?
• Precision / Recall
◦ Based on True Positives, True Negatives, False Positives, False Negatives
◦ Precision = TP / (TP + FP) (of everything the classifier labeled positive, what % actually was?)
◦ Recall = TP / (TP + FN) (of all the positives, how many did it find?)
◦ F-measure (harmonic mean: 2 * Precision * Recall / (Precision + Recall))
• Confusion matrix
• Many others
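The formulas above can be checked directly; a small sketch (the function name and label encoding are assumptions):

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F-measure for one positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

E.g. with truth `[1, 1, 1, 0, 0]` and predictions `[1, 0, 1, 1, 0]` there are 2 TP, 1 FP and 1 FN, so precision = recall = F1 = 2/3.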

Simple is better than complex.
◦ Fewer parameters to tune can bring better performance.
◦ Eliminate degrees of freedom (e.g. the degree of a polynomial).
◦ Regularization (penalize complexity).
• K-fold cross-validation (different train/test partitions)
• Get more training data!
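To make k-fold cross-validation concrete, here is a minimal sketch of the index-splitting step (in practice one would use a library routine such as scikit-learn's `KFold`):

```python
def kfold_indices(n, k):
    """Split range(n) into k folds; yield (train_indices, test_indices) pairs.

    Each sample appears in exactly one test fold, so the model gets
    evaluated k times on k different train/test partitions.
    """
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    indices = list(range(n))
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size
```

The final score is usually the average of the k per-fold scores.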

rebranding of old technology.
• The first model of Artificial Neural Networks was proposed in 1943 (!).
• The analogy to the human brain has been greatly exaggerated.
Perceptron (Rosenblatt, 1957), a two-layer NN.

only made possible by recent advances.
• Many state-of-the-art breakthroughs in recent years (2012+), powered by the same underlying model.
• Some tasks can now be performed at human accuracy (or even above!).

ImageNet dataset has 1000 categories and 1.2 million images.
• An architecture called Convolutional Neural Networks (CNNs or ConvNets) has dominated since 2012.
• Krizhevsky et al., 2012:
◦ ~10^14 pixels used during training.
◦ 60M parameters.

Output
• These algorithms are very good at difficult tasks, like pattern recognition.
• They generalize when given sufficient training examples.
• They learn representations of the data.
Traditional ML: raw input → hand-crafted features → algorithm → output.
Deep Learning: raw input → deep learning algorithm → output.

active research areas and infinite applications. We are living through a revolution.
• Rapid progress; need to keep up ;)
• The entry barrier for using these technologies is lower than ever.
• Processing power keeps increasing, models keep getting better, data storage gets cheaper.
• What is the limit?