Machine Learning for Complete Beginners

MACHINE LEARNING FOR COMPLETE BEGINNERS
NISHAN CHATHURANGA
Software Engineer, 99X (Pvt) Ltd.
University of Moratuwa, Faculty of Information Technology

TABLE OF CONTENTS
01 ML CONCEPTS: Here we are learning basic concepts you need to know
02 INTRODUCTION TO MODELS: Here we are learning different ML problems and models
03 DEMOS: We will do a short demo to understand applications.

WHY MACHINE LEARNING VALUE

MACHINE LEARNING ALGORITHMS? MODELS? WTF?
An “algorithm” in machine learning is a procedure that is run on data to create a machine learning “model.” Machine learning algorithms perform “pattern recognition”: they “learn” from data, or are “fit” on a dataset.

ALGORITHM (DECISION TREE)
[Figure: decision tree with split nodes "Age < 30", "Income > $50K" and "Days Played > 728", and leaf nodes "Xbox-One Customer" / "Not Xbox-One Customer"]

MODEL (IF-ELSE STATEMENTS WITH SPECIFIC VALUES)
[Figure: the same decision tree as on the previous slide, now labeled as the model]
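The point that a fitted decision tree is just if-else statements with specific values can be sketched in Python. The thresholds (30, $50K, 728) come from the slide. The exact branch layout and the function name `is_xbox_one_customer` are assumptions, since the tree's structure cannot be fully recovered from the slide text.

```python
def is_xbox_one_customer(age, income, days_played):
    """Hypothetical model: nested if-else rules with fixed thresholds.

    The branch layout is an assumed reading of the slide's tree.
    """
    if age < 30:
        # Younger customers: income decides.
        return income > 50_000
    if days_played > 728:
        # Older, long-time players: income decides again.
        return income > 50_000
    # Older customers who have played 728 days or fewer.
    return True


print(is_xbox_one_customer(age=25, income=60_000, days_played=100))
print(is_xbox_one_customer(age=45, income=40_000, days_played=900))
```

The algorithm is what searches the data for good questions and thresholds; the model is the finished set of rules, which can then be applied to new customers without looking at the training data again.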

TYPES OF ML
SUPERVISED LEARNING: Task Driven (Predict next value)
UNSUPERVISED LEARNING: Data Driven (Identify clusters)
REINFORCEMENT LEARNING: Learn from mistakes

Forecasting
Forecasting is the process of making predictions based on past and present data and most commonly by analysis of trends. A commonplace example might be estimation of some variable of interest at some specified future date.

Daily climate analysis in the city of Delhi

Categorical, Binary & Continuous Variables
• Categorical variables contain a finite number of categories or distinct groups. Categorical data might not have a logical order. (e.g. Hair Color: Blonde / Red / Brown / Black)
• A binary variable is a categorical variable that can take only one of two values, usually represented as a Boolean, True or False. (e.g. Gender: Male / Female)
• Continuous variables are numeric variables that have an infinite number of values between any two values. A continuous variable can be numeric or date/time. (e.g. Age, Temperature)
Refer - https://statistics.laerd.com/statistical-guides/types-of-variable.php

Prediction vs Classification
Prediction, in this narrower sense, is about estimating a missing or unknown continuous value in a dataset; a model is built from known examples to predict that value. Classification is the prediction of a categorical variable, drawn from a predefined set of classes, based on training examples. The prediction of numerical (continuous) variables is called regression.

LABELED DATA

HEART ATTACK POSSIBILITY

age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  target
63   1    3   145       233   1    0        150      0      2.3      0      0   1     1
37   1    2   130       250   0    1        187      0      3.5      0      0   2     1
41   0    1   130       204   0    0        172      0      1.4      2      0   2     1
56   1    1   120       236   0    1        178      0      0.8      2      0   2     1
55   1    0   160       289   0    0        145      1      0.8      1      1   3     0
64   1    0   120       246   0    0        96       1      2.2      0      1   2     0
70   1    0   130       322   0    0        109      0      2.4      1      3   2     0
51   1    0   140       299   0    1        173      1      1.6      2      0   3     0
58   1    0   125       300   0    0        171      0      0        2      2   3     0
56   0    0   134       409   0    0        150      1      1.9      1      2   3     ?

We need to predict the label of an unlabeled data point (the last row).
The variable we want to predict - CLASS VARIABLE (here, target)

TRAINING DATA / TEST DATA
DATA SET: 75% Training Data, 25% Testing Data
In a dataset, a training set is used to build a model, while a test (or validation) set is used to validate the model built. Test data also has the label.
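A random 75/25 split can be sketched with nothing but the Python standard library. The helper name `split_train_test`, the seed, and the toy data are assumptions for illustration; libraries such as scikit-learn ship their own split utilities.

```python
import random


def split_train_test(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of the rows and split off a fraction as test data."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # The first n_test shuffled rows become the test set, the rest train.
    return shuffled[n_test:], shuffled[:n_test]


# Toy labeled data: (feature, label) pairs, purely illustrative.
data = [(i, i % 2) for i in range(20)]
train, test = split_train_test(data)
print(len(train), len(test))  # 15 5
```

Shuffling before splitting matters: if the file happens to be sorted by label, taking the last 25% as-is would give a test set containing only one class.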

TRAINING DATA / TEST DATA
Selecting 25% (or 33%) of the data randomly as test data.

age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  target
55   0    1   132       342   0    1        166      0      1.2      2      0   2     1
41   1    1   120       157   0    1        182      0      0        2      0   2     1
38   1    2   138       175   0    1        173      0      0        2      4   2     1
38   1    2   138       175   0    1        173      0      0        2      4   2     1
67   1    0   160       286   0    0        108      1      1.5      1      3   2     0
67   1    0   120       229   0    0        129      1      2.6      1      2   3     0
62   0    0   140       268   0    0        160      0      3.6      0      2   2     0

Remove class variable.

trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal
…         …     …    …        …        0      1.2      2      0   2
120       157   0    1        182      0      0        2      0   2
138       175   0    1        173      0      0        2      4   2
138       175   0    1        173      0      0        2      4   2
160       286   0    0        108      1      1.5      1      3   2
120       229   0    0        129      1      2.6      1      2   3
140       268   0    0        160      0      3.6      0      2   2

Predict new class variable using model.

trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  predicted
…         …     …    …        …        0      1.2      2      0   2     1
120       157   0    1        182      0      0        2      0   2     0
138       175   0    1        173      0      0        2      4   2     1
138       175   0    1        173      0      0        2      4   2     1
160       286   0    0        108      1      1.5      1      3   2     0
120       229   0    0        129      1      2.6      1      2   3     0
140       268   0    0        160      0      3.6      0      2   2     1

Compare actual and predicted values for scoring the model

actual  predicted
1       1
1       0
1       1
1       1
0       0
0       0
0       1

Model accuracy = 5/7 × 100 = 71.43%
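The accuracy figure can be reproduced in a few lines of Python. The two lists are the actual and predicted class values for the seven test rows on the slides; the variable names are just illustrative.

```python
actual = [1, 1, 1, 1, 0, 0, 0]
predicted = [1, 0, 1, 1, 0, 0, 1]

# Count the rows where the model's prediction matches the true label.
correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual) * 100

print(f"{correct}/{len(actual)} correct, accuracy = {accuracy:.2f}%")
# 5/7 correct, accuracy = 71.43%
```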

CONFUSION MATRIX
A confusion matrix is an N × N matrix used for evaluating the performance of a classification model, where N is the number of target classes. The matrix compares the actual target values with those predicted by the machine learning model. This gives us a holistic view of how well our classification model is performing and what kinds of errors it is making.

CONFUSION MATRIX Not So Confusing!
Positive Class - CATS
Negative Class - DOGS
• TP - True Positive
• TN - True Negative
• FP - False Positive (Type 1 Error)
• FN - False Negative (Type 2 Error)
Refer - https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/

ACCURACY, PRECISION AND RECALL
Positive Class - CATS
Negative Class - DOGS
Refer - https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9


Compare actual and predicted values for scoring the model
Positive Class - HEART ATTACK (1)
Negative Class - NO HEART ATTACK (0)

actual  predicted  outcome
1       1          TP
1       0          FN
1       1          TP
1       1          TP
0       0          TN
0       0          TN
0       1          FP
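The TP / TN / FP / FN labels, and precision and recall derived from them, can be sketched as follows. The data is the same seven test rows; the helper name `outcome` is an assumption, and the formulas are the standard ones (precision = TP / (TP + FP), recall = TP / (TP + FN)).

```python
from collections import Counter

actual = [1, 1, 1, 1, 0, 0, 0]
predicted = [1, 0, 1, 1, 0, 0, 1]


def outcome(a, p):
    """Label one (actual, predicted) pair, treating 1 as the positive class."""
    if a == 1:
        return "TP" if p == 1 else "FN"
    return "FP" if p == 1 else "TN"


counts = Counter(outcome(a, p) for a, p in zip(actual, predicted))
tp, tn, fp, fn = (counts[k] for k in ("TP", "TN", "FP", "FN"))

precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found

print(dict(counts))  # {'TP': 3, 'FN': 1, 'TN': 2, 'FP': 1}
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
# precision = 0.75, recall = 0.75
```

The four counts are exactly the cells of the 2 × 2 confusion matrix for this test set.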

WHAT IS MORE DANGEROUS? FN Or FP

actual  predicted  outcome
1       1          TP
1       0          FN
1       1          TP
1       1          TP
0       0          TN
0       0          TN
0       1          FP

In this case FNs are more dangerous, because an FN predicts that a patient will not have a heart attack when they actually will.

KDD (KNOWLEDGE DISCOVERY IN DATABASES) PROCESS

DATA PRE-PROCESSING & TRANSFORMATION
Data pre-processing is an important step in the data mining process. Data-gathering methods are often loosely controlled, resulting in out-of-range values, impossible data combinations, missing values, etc.
Sometimes it is 99% of the work.

PRE-PROCESSING > HANDLE MISSING VALUES
• Deleting Rows
• Replacing With Mean/Median/Mode
• Assigning A Unique Category
• Predicting The Missing Values
• Using Algorithms Which Support Missing Values
Refer - https://analyticsindiamag.com/5-ways-handle-missing-values-machine-learning-datasets/
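The first two options in the list above, deleting rows and replacing with the mean, can be sketched in plain Python. The cholesterol readings and the use of `None` as the missing-value marker are illustrative assumptions.

```python
from statistics import mean

# Toy cholesterol readings; None marks a missing value (illustrative data).
chol = [233, None, 204, 236, None, 246]

known = [v for v in chol if v is not None]

# Option 1: delete the rows that have a missing value.
dropped = known

# Option 2: replace each missing value with the mean of the known values.
fill = mean(known)
imputed = [fill if v is None else v for v in chol]

print(dropped)  # [233, 204, 236, 246]
print(imputed)  # [233, 229.75, 204, 236, 229.75, 246]
```

Deleting rows is simplest but throws data away; filling with the mean keeps every row at the cost of making the column look less spread out than it really is.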

PRE-PROCESSING > HANDLE CLASS IMBALANCE
• Resample the training set
  ◦ Under-sampling
  ◦ Over-sampling
• Use K-fold Cross-Validation
• Resample with different ratios
Synthetic Minority Oversampling Technique (SMOTE)
Refer - https://www.kdnuggets.com/2017/06/7-techniques-handle-imbalanced-data.html
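Over-sampling can be sketched in its simplest form: duplicating randomly chosen minority-class rows until the classes are the same size. This is plain random over-sampling, not SMOTE (which synthesizes new points instead of copying existing ones). The helper name `random_oversample` and the toy data are assumptions.

```python
import random
from collections import Counter


def random_oversample(rows, label_of, seed=0):
    """Duplicate randomly chosen minority-class rows until classes balance."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    # Every class is topped up to the size of the largest class.
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced


# Toy training set: 8 rows of class 0 and 2 rows of class 1 (illustrative).
train = [(i, 0) for i in range(8)] + [(i, 1) for i in range(2)]
balanced = random_oversample(train, label_of=lambda row: row[1])
print(Counter(label for _, label in balanced))
```

Resampling is applied to the training set only; the test set should keep its original class ratio so the score reflects real-world data.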

TRANSFORMATION > NORMALIZATION / SCALING
A machine learning algorithm just sees numbers. If the features differ vastly in range, say some in the thousands and others in the tens, the algorithm may implicitly assume that the higher-ranging numbers are more important.
• Min Max Scaler
• Standard Scaler
• Max Abs Scaler
Refer - https://en.wikipedia.org/wiki/Feature_scaling
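The arithmetic behind the three scalers can be sketched in plain Python. The function names are assumptions that mirror the slide's terms, and the sample values are illustrative; the standard scaler here divides by the population standard deviation.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1] (assumes max > min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standard_scale(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def max_abs_scale(values):
    """Divide by the largest absolute value, giving the range [-1, 1]."""
    biggest = max(abs(v) for v in values)
    return [v / biggest for v in values]


chol = [233, 250, 204, 236, 289]  # a feature with values in the hundreds
print(min_max_scale(chol))
print(standard_scale(chol))
print(max_abs_scale(chol))
```

After any of these transformations a feature measured in hundreds and one measured in single digits end up on comparable scales, so neither dominates just because of its units.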

LIBRARIES / TOOLS

Keras is an open-source neural-network library written in Python. It is capable of running on top of TensorFlow, Microsoft Cognitive Toolkit, R, Theano, or PlaidML. Designed to enable fast experimentation with deep neural networks, it focuses on being user-friendly, modular, and extensible.

NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.

SciPy is a free and open-source Python library used for scientific computing and technical computing. SciPy contains modules for optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing, ODE solvers and other tasks common in science and engineering.

TensorFlow is a free and open-source software library for dataflow and differentiable programming across a range of tasks. It is a symbolic math library, and is also used for machine learning applications such as neural networks.

Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.

Google Colaboratory is a free online cloud-based Jupyter notebook environment. You can use it to run Python code and ML projects. Colab notebooks allow you to combine executable code and rich text in a single document, along with images, HTML, LaTeX and more.
Visit https://colab.research.google.com

DEMOS AND CODE github.com/nishanc/MLforBeginners

THANK YOU
[email protected] @NishanTheDev NishanChathuranga

Doesn’t matter how slow you go, as long as you’re
moving forward
