TABLE OF CONTENTS

01 ML CONCEPTS
Here we are learning basic concepts you need to know.

02 INTRODUCTION TO MODELS
Here we are learning different ML problems and models.

03 DEMOS
We will do a short demo to understand applications.

MACHINE LEARNING ALGORITHMS? MODELS? WTF?

An “algorithm” in machine learning is a procedure that is run on data to create a machine learning “model.” Machine learning algorithms perform “pattern recognition.” Algorithms “learn” from data, or are “fit” on a dataset.


ALGORITHM (DECISION TREE)

[Figure: decision tree. The root splits on Age < 30; lower nodes split on Income > $50K and Days Played > 728; the leaves are "Xbox-One Customer" and "Not Xbox-One Customer".]

MODEL: if-else statements with specific values.
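To make the "model is if-else statements with specific values" idea concrete, here is a minimal sketch of what a fitted tree could look like as code. The function name `predict_xbox_customer` is hypothetical, and the exact branch layout is an assumption: the slide's figure does not survive clearly enough to tell which side of each split leads to which leaf.

```python
def predict_xbox_customer(age, income, days_played):
    """Hand-written stand-in for a fitted decision tree.

    The thresholds (30, 50_000, 728) come from the slide; the way the
    branches are wired together is an assumption for illustration.
    """
    if age < 30:
        # Younger branch: decide on income alone.
        if income > 50_000:
            return "Xbox-One Customer"
        return "Not Xbox-One Customer"
    # Older branch: long-time players are treated as customers outright.
    if days_played > 728:
        return "Xbox-One Customer"
    # Otherwise fall back to the income test.
    if income > 50_000:
        return "Xbox-One Customer"
    return "Not Xbox-One Customer"


example = predict_xbox_customer(age=25, income=60_000, days_played=10)
```

The algorithm is the procedure that finds the splits and thresholds from data; the model is the resulting set of rules, which can then be applied to new rows without re-running the algorithm.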

TYPES OF ML

SUPERVISED LEARNING: Task driven. Predict next value.
UNSUPERVISED LEARNING: Data driven. Identify clusters.
REINFORCEMENT LEARNING: Learn from mistakes.

Forecasting

Forecasting is the process of making predictions based on past and present data, most commonly by analysis of trends. A commonplace example is estimating some variable of interest at a specified future date.
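As a rough illustration of forecasting by trend analysis, the sketch below fits a straight line to past values with ordinary least squares and extends it one step into the future. The helper names and the `monthly_sales` figures are made up for the example; real forecasting usually needs more than a straight line.

```python
def fit_linear_trend(values):
    """Least-squares line through the points (0, values[0]), (1, values[1]), ..."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast(values, steps_ahead):
    """Extend the fitted trend line steps_ahead periods past the last value."""
    slope, intercept = fit_linear_trend(values)
    future_x = len(values) - 1 + steps_ahead
    return intercept + slope * future_x


monthly_sales = [100, 110, 120, 130]  # made-up past observations
next_month = forecast(monthly_sales, steps_ahead=1)
```

With these made-up numbers the trend rises by 10 per month, so the estimate for the next month is 140.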

Categorical, Binary & Continuous Variables

Categorical variables contain a finite number of categories or distinct groups. Categorical data might not have a logical order. (e.g. Hair Color: Blonde / Red / Brown / Black)

A binary variable is a categorical variable that can take only one of two values, usually represented as a Boolean, True or False. (e.g. Gender: Male / Female)

Continuous variables are numeric variables that have an infinite number of values between any two values. A continuous variable can be numeric or date/time. (e.g. Age, Temperature)

Refer - https://statistics.laerd.com/statistical-guides/types-of-variable.php
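The variable type matters because models work on numbers. The sketch below shows one common way each type could be turned into numeric features: one-hot encoding for a categorical value, 0/1 for a binary value, and the raw number for a continuous value. One-hot encoding is not mentioned on the slides and is only one option; the helper names are hypothetical.

```python
HAIR_COLORS = ["Blonde", "Red", "Brown", "Black"]  # categories from the slide


def one_hot(value, categories):
    """One position per category; 1 marks the category that matches."""
    return [1 if value == category else 0 for category in categories]


def encode_person(hair_color, is_member, age):
    """Build one numeric feature row from three differently typed variables."""
    features = one_hot(hair_color, HAIR_COLORS)  # categorical
    features.append(1 if is_member else 0)       # binary (True/False)
    features.append(float(age))                  # continuous
    return features


row = encode_person("Brown", True, 31)
# row is [0, 0, 1, 0, 1, 31.0]
```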

Prediction vs Classification

Prediction, in the sense used here, is about estimating a missing or unknown continuous value in a dataset. Classification is the prediction of a categorical variable within a predefined vocabulary, based on training examples. The prediction of numerical (continuous) variables is called regression. In both cases a model (a classification or regression model) is built from data and then used to predict the outcome.

TRAINING DATA / TEST DATA

DATA SET: 75% Training Data, 25% Testing Data

In a dataset, the training set is used to build a model, while the test (or validation) set is used to validate the model that was built. The test data also has labels, so the model's predictions can be compared against the known answers.
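A minimal sketch of such a split, assuming a 75% / 25% ratio and a random shuffle before cutting. `split_train_test` is a hand-written helper for illustration, not a library function, and the rows are made-up (feature, label) pairs.

```python
import random


def split_train_test(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows, then cut off the first part as the test set."""
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Made-up labelled rows: (feature, label). Both subsets keep their labels.
rows = [(i, i % 2) for i in range(20)]
train_rows, test_rows = split_train_test(rows)
```

The model is fit only on `train_rows`; `test_rows` is held back so the score reflects data the model has not seen.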

CONFUSION MATRIX

A confusion matrix is an N x N matrix used for evaluating the performance of a classification model, where N is the number of target classes. The matrix compares the actual target values with those predicted by the machine learning model. This gives us a holistic view of how well our classification model is performing and what kinds of errors it is making.

TRAINING DATA / TEST DATA

Compare actual and predicted values for scoring the model.

Positive Class: HEART ATTACK (1)
Negative Class: NO HEART ATTACK (0)

Predicted:  1   0   1   1   0   0   1
Actual:     1   1   1   1   0   0   0
Outcome:    TP  FN  TP  TP  TN  TN  FP
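The scoring step can be sketched as below. It assumes the first row of numbers on the slide is the predicted values and the second row is the actual values, since that is the reading under which the slide's TP / FN / FP / TN labels are consistent. The helper name `outcome` is hypothetical.

```python
from collections import Counter


def outcome(actual, predicted, positive=1):
    """Name the confusion-matrix cell for one (actual, predicted) pair."""
    if predicted == positive:
        return "TP" if actual == positive else "FP"
    return "FN" if actual == positive else "TN"


# Values from the slide; row assignment is an assumption (see lead-in).
predicted = [1, 0, 1, 1, 0, 0, 1]
actual = [1, 1, 1, 1, 0, 0, 0]

labels = [outcome(a, p) for a, p in zip(actual, predicted)]
counts = Counter(labels)

# 2 x 2 confusion matrix: rows are actual class, columns are predicted class.
confusion_matrix = [
    [counts["TP"], counts["FN"]],  # actual positive (heart attack)
    [counts["FP"], counts["TN"]],  # actual negative (no heart attack)
]
```

For these seven patients the counts are 3 TP, 1 FN, 1 FP and 2 TN.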

WHAT IS MORE DANGEROUS? FN OR FP

Predicted:  1   0   1   1   0   0   1
Actual:     1   1   1   1   0   0   0
Outcome:    TP  FN  TP  TP  TN  TN  FP

In this case FNs are more dangerous: a false negative tells a patient they will not have a heart attack when they actually will, so a patient at real risk is missed. A false positive is a false alarm that further checks can rule out.
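When false negatives are the costly error, one metric commonly watched is recall, the share of actual positives the model catches. Recall and precision are not named on the slides; the sketch below only shows how they follow from the counts in the example (3 TP, 1 FN, 1 FP, 2 TN).

```python
def recall(tp, fn):
    """Share of actual positives that were predicted positive."""
    return tp / (tp + fn)


def precision(tp, fp):
    """Share of predicted positives that were actually positive."""
    return tp / (tp + fp)


tp, fn, fp, tn = 3, 1, 1, 2  # counts from the heart attack example

heart_attack_recall = recall(tp, fn)        # 3 / 4 = 0.75
heart_attack_precision = precision(tp, fp)  # 3 / 4 = 0.75
```

Every false negative lowers recall, so a model for this problem would typically be tuned to push recall up even at the cost of some extra false positives.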


DATA PRE-PROCESSING & TRANSFORMATION

Data pre-processing is an important step in the data mining process. Data-gathering methods are often loosely controlled, resulting in out-of-range values, impossible data combinations, missing values, and so on.

Pre-processing is sometimes 99% of the work.
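A first pre-processing pass often just finds the bad values. The sketch below flags missing and out-of-range fields; the records, field names, and valid ranges are all made up for illustration.

```python
def find_problems(records, valid_ranges):
    """Return (row index, field, problem) for each missing or out-of-range value."""
    problems = []
    for index, record in enumerate(records):
        for field, (low, high) in valid_ranges.items():
            value = record.get(field)
            if value is None:
                problems.append((index, field, "missing"))
            elif not low <= value <= high:
                problems.append((index, field, "out of range"))
    return problems


# Made-up records with one impossible age and one missing temperature.
records = [
    {"age": 34, "temperature_c": 36.6},
    {"age": -5, "temperature_c": 37.0},
    {"age": 51, "temperature_c": None},
]
valid_ranges = {"age": (0, 120), "temperature_c": (30.0, 45.0)}

problems = find_problems(records, valid_ranges)
```

What to do with each flagged value (drop the row, fill in a default, ask for the data again) depends on the problem and is a separate decision.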

PRE-PROCESSING > HANDLE CLASS IMBALANCE

● Resample the training set
  ○ Under-sampling
  ○ Over-sampling, e.g. Synthetic Minority Oversampling Technique (SMOTE)
● Use K-fold Cross-Validation
● Resample with different ratios

Refer - https://www.kdnuggets.com/2017/06/7-techniques-handle-imbalanced-data.html
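The two resampling options can be sketched with plain random sampling, as below. This is not SMOTE, which synthesizes new minority rows instead of copying existing ones. The helper names and the 8-to-2 example data are made up.

```python
import random


def group_by_class(rows, label_of):
    """Collect rows into one list per class label."""
    groups = {}
    for row in rows:
        groups.setdefault(label_of(row), []).append(row)
    return groups


def random_oversample(rows, label_of, seed=0):
    """Duplicate random minority rows until every class matches the largest."""
    rng = random.Random(seed)
    groups = group_by_class(rows, label_of)
    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


def random_undersample(rows, label_of, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    groups = group_by_class(rows, label_of)
    target = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, target))
    return balanced


# Made-up training rows: (feature, label), 8 negatives and 2 positives.
train_rows = [(i, 0) for i in range(8)] + [(100, 1), (101, 1)]
label_of = lambda row: row[1]

oversampled = random_oversample(train_rows, label_of)
undersampled = random_undersample(train_rows, label_of)
```

Only the training set is resampled; the test set keeps its original class balance so the score reflects real-world proportions.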

TRANSFORMATION > NORMALIZATION / SCALING

A machine learning algorithm just sees numbers. If features differ vastly in range, say a few ranging in the thousands and a few in the tens, the algorithm can implicitly treat the higher-ranging numbers as more important. Scaling puts the features on a comparable range.

● Min Max Scaler
● Standard Scaler
● Max Abs Scaler

Refer - https://en.wikipedia.org/wiki/Feature_scaling
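The three scalers listed above follow simple formulas, sketched here in plain Python for a single feature column. These are hand-written versions of the usual definitions, not a library API, and they assume the column is not constant (otherwise the divisions fail). The `incomes` values are made up.

```python
import math


def min_max_scale(values):
    """Map the smallest value to 0 and the largest to 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def standard_scale(values):
    """Shift to mean 0 and divide by the (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


def max_abs_scale(values):
    """Divide by the largest absolute value, so results lie in [-1, 1]."""
    biggest = max(abs(v) for v in values)
    return [v / biggest for v in values]


incomes = [20_000, 50_000, 80_000]  # made-up feature in the tens of thousands

incomes_min_max = min_max_scale(incomes)   # [0.0, 0.5, 1.0]
incomes_standard = standard_scale(incomes)
incomes_max_abs = max_abs_scale(incomes)   # [0.25, 0.625, 1.0]
```

In practice the scaling parameters (min, max, mean, standard deviation) would be computed on the training data only and then reused to transform the test data.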

LIBRARIES / TOOLS

Keras is an open-source neural-network library written in Python. It is capable of running on top of TensorFlow, Microsoft Cognitive Toolkit, R, Theano, or PlaidML. Designed to enable fast experimentation with deep neural networks, it focuses on being user-friendly, modular, and extensible.

NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.

SciPy is a free and open-source Python library used for scientific computing and technical computing. SciPy contains modules for optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing, ODE solvers and other tasks common in science and engineering.

TensorFlow is a free and open-source software library for dataflow and differentiable programming across a range of tasks. It is a symbolic math library, and is also used for machine learning applications such as neural networks.

Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.

Google Colaboratory is a free online cloud-based Jupyter notebook environment. You can use it to run Python code and ML projects. Colab notebooks allow you to combine executable code and rich text in a single document, along with images, HTML, LaTeX and more.

Visit https://colab.research.google.com