Nishan Chathuranga
June 13, 2021

# Machine Learning for Complete Beginners

Related to the session conducted for Faculty of IT, University of Moratuwa on 13/06/2020

## Transcript

1. MACHINE LEARNING
FOR COMPLETE BEGINNERS
NISHAN CHATHURANGA
Software Engineer
99X (Pvt) Ltd.
University of Moratuwa
Faculty of Information Technology

2. TABLE OF CONTENTS
01 ML CONCEPTS
Here we are learning basic concepts you need to know
02 INTRODUCTION TO MODELS
Here we are learning different ML problems and models
03 DEMOS
We will do a short demo to understand applications.

3. WHY MACHINE LEARNING
VALUE

4. MACHINE LEARNING ALGORITHMS ? MODELS ? WTF ?
An “algorithm” in machine
learning is a procedure that
is run on data to create a
machine learning “model.”
Machine learning algorithms
perform “pattern recognition.”
Algorithms “learn” from data,
or are “fit” on a dataset.

5. ALGORITHM (DECISION TREE)
Age < 30
Income > \$50K
Xbox-One Customer
Not Xbox-One Customer
Days Played > 728
Income > \$50K
Xbox-One Customer
Not Xbox-One Customer
Xbox-One Customer

6. ALGORITHM (DECISION TREE)
Age < 30
Income > \$50K
Xbox-One Customer
Not Xbox-One Customer
Days Played > 728
Income > \$50K
Xbox-One Customer
Not Xbox-One Customer
Xbox-One Customer
MODEL (IF-ELSE STATEMENTS WITH SPECIFIC VALUES.)
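
The idea that a trained model is just if-else statements with specific values can be sketched in Python. The thresholds (30, \$50K, 728) come from the slide, but the slide text does not say which side of each split is "yes", so the branch directions below are assumptions. The function name `is_xbox_one_customer` is hypothetical.

```python
def is_xbox_one_customer(age, income, days_played):
    """Hand-written stand-in for a trained decision tree.

    Thresholds are taken from the slide. The yes/no direction of each
    branch is an assumption made only for illustration.
    """
    if age < 30:
        # Assumed branch: younger people are decided on income alone.
        return income > 50_000
    if days_played > 728:
        # Assumed branch: long-time players are decided on income.
        return income > 50_000
    # Assumed leaf: everyone else is predicted to be a customer.
    return True


print(is_xbox_one_customer(age=25, income=60_000, days_played=100))
```

The algorithm is the procedure that picks these questions and thresholds from data. The model is the finished set of rules, which can then be applied to new people.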

7. TYPES OF ML
SUPERVISED LEARNING
Task Driven
Predict next value
UNSUPERVISED LEARNING
Data Driven
Identify clusters
REINFORCEMENT LEARNING
Learn from mistakes

8. Forecasting
Forecasting is the process of making predictions based on past and present data
and most commonly by analysis of trends. A commonplace example might be
estimation of some variable of interest at some specified future date.
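
Forecasting by trend analysis can be sketched by fitting a straight line to past values and extending it one step ahead. This is only one simple approach, chosen here for illustration. The temperatures are made-up numbers, not values from the Delhi climate dataset, and the function names are hypothetical.

```python
def fit_linear_trend(values):
    """Least-squares line through the points (0, values[0]), (1, values[1]), ..."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast(values, steps_ahead):
    """Extend the fitted trend line `steps_ahead` steps past the last value."""
    slope, intercept = fit_linear_trend(values)
    future_x = len(values) - 1 + steps_ahead
    return slope * future_x + intercept


# Made-up daily mean temperatures in degrees Celsius.
temperatures = [20.0, 21.0, 22.0, 23.0, 24.0]
print(forecast(temperatures, steps_ahead=1))
```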

9. Daily climate analysis in the city of Delhi

10. Categorical, Binary & Continuous Variables
Categorical variables contain a finite number of categories or distinct groups. Categorical data might not have a logical order. (e.g. Hair Color: Blonde / Red / Brown / Black)
A binary variable is a categorical variable that can only take one of two values, usually represented as a Boolean, True or False. (e.g. Gender: Male / Female)
Continuous variables are numeric variables that have an infinite number of possible values between any two values. A continuous variable can be numeric or date/time. (e.g. Age, Temperature)
Refer - https://statistics.laerd.com/statistical-guides/types-of-variable.php

11. Prediction vs Classification
Prediction is about estimating a missing or unknown element of a dataset. In prediction, a classification or regression model is built from known examples to predict the outcome.
Classification is the prediction of a categorical variable within a predefined vocabulary, based on training examples. The prediction of numerical (continuous) variables is called regression.
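
The difference between the two comes down to the type of the target. A minimal sketch, assuming a 1-nearest-neighbour rule (picked only because it is short, not because the session uses it) and made-up data: the same function does classification when the targets are categories and regression when they are numbers.

```python
def nearest_neighbor(train_x, train_y, query):
    """Return the target of the training point closest to `query`."""
    best_index = min(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    return train_y[best_index]


# Made-up training examples: one input feature (age).
ages = [25, 35, 45, 55, 65]

# Classification: the target is a category from a fixed set.
risk_class = ["low", "low", "high", "high", "high"]

# Regression: the target is a continuous number.
cholesterol = [180.0, 195.0, 220.0, 240.0, 260.0]

predicted_class = nearest_neighbor(ages, risk_class, query=52)
predicted_value = nearest_neighbor(ages, cholesterol, query=52)
print(predicted_class, predicted_value)
```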

12. LABELED DATA

13. HEART ATTACK POSSIBILITY
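| age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 55 | 1 | 0 | 160 | 289 | 0 | 0 | 145 | 1 | 0.8 | 1 | 1 | 3 | 0 |
| 64 | 1 | 0 | 120 | 246 | 0 | 0 | 96 | 1 | 2.2 | 0 | 1 | 2 | 0 |
| 70 | 1 | 0 | 130 | 322 | 0 | 0 | 109 | 0 | 2.4 | 1 | 3 | 2 | 0 |
| 51 | 1 | 0 | 140 | 299 | 0 | 1 | 173 | 1 | 1.6 | 2 | 0 | 3 | 0 |
| 58 | 1 | 0 | 125 | 300 | 0 | 0 | 171 | 0 | 0 | 2 | 2 | 3 | 0 |
| 56 | 0 | 0 | 134 | 409 | 0 | 0 | 150 | 1 | 1.9 | 1 | 2 | 3 | |
We need to predict the label of an unlabeled data point (the last row).
The variable we want to predict (target) is the CLASS VARIABLE.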

14. TRAINING DATA / TEST DATA
DATA SET
75% Training Data
25% Testing Data
A training set is used to build a model, while a test (or validation) set is used to validate the model that was built.
Test data also has the label.

15. TRAINING DATA / TEST DATA
55 0 1 132 342 0 1 166 0 1.2 2 0 2 1
41 1 1 120 157 0 1 182 0 0 2 0 2 1
38 1 2 138 175 0 1 173 0 0 2 4 2 1
38 1 2 138 175 0 1 173 0 0 2 4 2 1
67 1 0 160 286 0 0 108 1 1.5 1 3 2 0
67 1 0 120 229 0 0 129 1 2.6 1 2 3 0
62 0 0 140 268 0 0 160 0 3.6 0 2 2 0
Selecting 25% (or 33%) of the data randomly as test data.
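
Randomly holding out a quarter of the rows can be sketched with the standard library alone. The function name `train_test_split` and its parameters are chosen here for illustration, and the fixed seed is an assumption that keeps the split repeatable.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of `rows` and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in dataset: 100 row ids instead of real patient records.
rows = list(range(100))
train, test = train_test_split(rows)
print(len(train), len(test))
```

Shuffling before splitting matters: if the file is sorted by label, taking the last 25% of rows would give a test set that contains only one class.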

16. TRAINING DATA / TEST DATA
132 342 0 1 166 0 1.2 2 0 2
120 157 0 1 182 0 0 2 0 2
138 175 0 1 173 0 0 2 4 2
138 175 0 1 173 0 0 2 4 2
160 286 0 0 108 1 1.5 1 3 2
120 229 0 0 129 1 2.6 1 2 3
140 268 0 0 160 0 3.6 0 2 2
Remove class variable.

17. TRAINING DATA / TEST DATA
132 342 0 1 166 0 1.2 2 0 2 1
120 157 0 1 182 0 0 2 0 2 0
138 175 0 1 173 0 0 2 4 2 1
138 175 0 1 173 0 0 2 4 2 1
160 286 0 0 108 1 1.5 1 3 2 0
120 229 0 0 129 1 2.6 1 2 3 0
140 268 0 0 160 0 3.6 0 2 2 1
Predict new class variable using model.

18. TRAINING DATA / TEST DATA
Compare actual and predicted values for scoring the model
| Predicted | Actual |
|---|---|
| 1 | 1 |
| 0 | 1 |
| 1 | 1 |
| 1 | 1 |
| 0 | 0 |
| 0 | 0 |
| 1 | 0 |
Model accuracy = 5 / 7 × 100 = 71.43%
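
The accuracy figure can be reproduced in a few lines. This sketch assumes the first column of values on the slide is the model's prediction (it matches the predicted labels on slide 17) and the second is the actual label (it matches the labels on slide 15).

```python
predicted = [1, 0, 1, 1, 0, 0, 1]
actual = [1, 1, 1, 1, 0, 0, 0]


def accuracy(actual, predicted):
    """Fraction of rows where the predicted label equals the actual label."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


score = accuracy(actual, predicted)
print(f"{score * 100:.2f}%")  # 71.43%
```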

19. CONFUSION MATRIX
A Confusion matrix is an N x N
matrix used for evaluating the
performance of a classification
model, where N is the number
of target classes.
The matrix compares the
actual target values with those
predicted by the machine
learning model. This gives us a
holistic view of how well our
classification model is
performing and what kinds of
errors it is making.
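
A minimal sketch of building an N x N confusion matrix by counting (actual, predicted) pairs. The labels are made up, and putting actual classes on rows and predicted classes on columns is an assumed convention (some references use the opposite orientation).

```python
from collections import Counter


def confusion_matrix(actual, predicted, classes):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]


# Made-up labels for a three-class problem, so N = 3.
actual = ["cat", "cat", "dog", "dog", "cat", "bird"]
predicted = ["cat", "dog", "dog", "dog", "cat", "cat"]
classes = ["bird", "cat", "dog"]

matrix = confusion_matrix(actual, predicted, classes)
for label, row in zip(classes, matrix):
    print(label, row)
```

The diagonal holds the correct predictions. Every off-diagonal cell shows one specific kind of mistake, for example a bird predicted as a cat.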

20. CONFUSION MATRIX
Not So Confusing!
Refer - https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/
TP - True Positive
TN - True Negative
FP - False Positive – Type 1 Error
FN - False Negative – Type 2 Error
Positive Class - CATS
Negative Class - DOGS

21. ACCURACY, PRECISION AND RECALL
Refer - https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9
Positive Class - CATS
Negative Class - DOGS

23. TRAINING DATA / TEST DATA
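Compare actual and predicted values for scoring the model
| Predicted | Actual | Outcome |
|---|---|---|
| 1 | 1 | TP |
| 0 | 1 | FN |
| 1 | 1 | TP |
| 1 | 1 | TP |
| 0 | 0 | TN |
| 0 | 0 | TN |
| 1 | 0 | FP |
Positive Class - HEART ATTACK (1)
Negative Class - NO HEART ATTACK (0)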
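
The TP / FN / TN / FP labels can be derived mechanically, and precision and recall follow from their counts. This sketch assumes the first column of values is the prediction and the second is the actual label, which is the only reading consistent with the outcome labels on the slide.

```python
predicted = [1, 0, 1, 1, 0, 0, 1]
actual = [1, 1, 1, 1, 0, 0, 0]


def outcome(actual_label, predicted_label, positive=1):
    """Name the confusion-matrix cell that one prediction falls into."""
    if predicted_label == positive:
        return "TP" if actual_label == positive else "FP"
    return "FN" if actual_label == positive else "TN"


outcomes = [outcome(a, p) for a, p in zip(actual, predicted)]
tp = outcomes.count("TP")
fp = outcomes.count("FP")
fn = outcomes.count("FN")

# Precision: of the predicted heart attacks, how many were real?
precision = tp / (tp + fp)
# Recall: of the real heart attacks, how many did the model catch?
recall = tp / (tp + fn)

print(outcomes)
print(precision, recall)
```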

24. WHAT IS MORE DANGEROUS? FN OR FP
| Predicted | Actual | Outcome |
|---|---|---|
| 1 | 1 | TP |
| 0 | 1 | FN |
| 1 | 1 | TP |
| 1 | 1 | TP |
| 0 | 0 | TN |
| 0 | 0 | TN |
| 1 | 0 | FP |
In this case FNs are more dangerous: predicting that a patient will not have a heart attack when he actually will is the worse of the two errors.

26. KDD (KNOWLEDGE DISCOVERY IN
DATABASES) PROCESS

27. DATA PRE-PROCESSING & TRANSFORMATION
Data pre-processing is an important step in the data mining process. Data-gathering methods are often loosely controlled, resulting in out-of-range values, impossible data combinations, missing values, etc.
Sometimes it is 99% of the work.

28. PRE-PROCESSING > HANDLE MISSING VALUES
● Deleting Rows
● Replacing With Mean/Median/Mode
● Assigning a Unique Category
● Predicting the Missing Values
● Using Algorithms Which Support Missing Values
Refer - https://analyticsindiamag.com/5-ways-handle-missing-values-machine-learning-datasets/
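
The first two options in the list can be sketched with the standard library. Missing values are represented as `None` here, which is an assumption about how the data was loaded. The function names and the cholesterol numbers are made up for illustration.

```python
from statistics import mean, median


def drop_missing(values):
    """Option 1: delete the entries that have no value."""
    return [v for v in values if v is not None]


def fill_missing(values, strategy=mean):
    """Option 2: replace missing entries with the mean (or median, or mode)."""
    fill_value = strategy(drop_missing(values))
    return [fill_value if v is None else v for v in values]


# Made-up cholesterol readings with one missing entry.
chol = [233, 250, None, 236, 289]

print(drop_missing(chol))
print(fill_missing(chol))
print(fill_missing(chol, strategy=median))
```

Deleting rows is the simplest fix but throws data away. Filling with the mean or median keeps every row at the cost of inventing a plausible value.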

29. PRE-PROCESSING > HANDLE CLASS IMBALANCE
● Resample the training set
○ Under-sampling
○ Over-sampling
● Use K-fold Cross-Validation
● Resample with different ratios
Synthetic Minority Oversampling Technique (SMOTE)
Refer - https://www.kdnuggets.com/2017/06/7-techniques-handle-imbalanced-data.html
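
Over-sampling can be sketched as duplicating randomly chosen minority-class rows until every class is the same size. This is plain random over-sampling, not SMOTE, which creates new synthetic rows instead of copies. The function name and data are made up for illustration.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate random rows of the smaller classes until all classes match."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == label]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(label)
    return out_rows, out_labels


# Made-up imbalanced training set: six rows of class 0, two of class 1.
rows = ["a", "b", "c", "d", "e", "f", "g", "h"]
labels = [0, 0, 0, 0, 0, 0, 1, 1]

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

Resampling is normally applied to the training set only, so that the test set still reflects the real class balance.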

30. TRANSFORMATION > NORMALIZATION / SCALING
A machine learning algorithm just sees numbers. If features differ vastly in range, say some in the thousands and some in the tens, the algorithm may implicitly assume that the larger values are more important.
● Min Max Scaler
● Standard Scaler
● Max Abs Scaler
Refer - https://en.wikipedia.org/wiki/Feature_scaling
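
The three scalers can be sketched as plain functions over one feature column. These are hand-written illustrations of the usual formulas, not the library implementations. The standard scaler here uses the population standard deviation, which is an assumption, and the numbers are made up.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standard_scale(values):
    """Shift to mean 0 and rescale to standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def max_abs_scale(values):
    """Divide by the largest absolute value, so results lie in [-1, 1]."""
    biggest = max(abs(v) for v in values)
    return [v / biggest for v in values]


# Made-up cholesterol readings: a feature that ranges in the hundreds.
chol = [200, 250, 300, 400]

print(min_max_scale(chol))
print(standard_scale(chol))
print(max_abs_scale(chol))
```

After scaling, a feature measured in the hundreds and one measured in single digits contribute on a comparable scale.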

31. LIBRARIES / TOOLS
Keras is an open-source neural-network library written in Python. It is capable of running on top of TensorFlow, Microsoft Cognitive Toolkit, R, Theano, or PlaidML. Designed to enable fast experimentation with deep neural networks, it focuses on being user-friendly, modular, and extensible.
NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
SciPy is a free and open-source Python library used for scientific computing and technical computing. SciPy contains modules for optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing, ODE solvers and other tasks common in science and engineering.

32. TensorFlow is a free and open-source software library for dataflow and differentiable programming across a range of tasks. It is a symbolic math library, and is also used for machine learning applications such as neural networks.
Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.
You can use Google Colaboratory to run Python code and ML projects. Google Colaboratory is a free, online, cloud-based Jupyter notebook environment. Colab notebooks allow you to combine executable code and rich text in a single document, along with images, HTML, LaTeX and more.
Visit https://colab.research.google.com

33. DEMOS AND CODE
github.com/nishanc/MLforBeginners

34. THANK YOU
@NishanTheDev
NishanChathuranga

35. Doesn’t matter how slow you
go, as long as you’re moving
forward