
Machine Learning for Complete Beginners

Related to the session conducted for Faculty of IT, University of Moratuwa on 13/06/2020

Nishan Chathuranga

June 13, 2021

Transcript

  1. MACHINE LEARNING FOR COMPLETE BEGINNERS

    Nishan Chathuranga, Software Engineer, 99X (Pvt) Ltd. University of Moratuwa, Faculty of Information Technology
  2. TABLE OF CONTENTS

    01 ML CONCEPTS: Here we are learning basic concepts you need to know.
    02 INTRODUCTION TO MODELS: Here we are learning different ML problems and models.
    03 DEMOS: We will do a short demo to understand applications.
  3. MACHINE LEARNING ALGORITHMS? MODELS? WTF?

    An “algorithm” in machine learning is a procedure that is run on data to create a machine learning “model.” Machine learning algorithms perform “pattern recognition.” Algorithms “learn” from data, or are “fit” on a dataset.
  4. ALGORITHM (DECISION TREE)

    [Figure: decision tree. Decision nodes: Age < 30, Income > $50K, Days Played > 728. Leaves: Xbox-One Customer, Not Xbox-One Customer.]
  5. ALGORITHM (DECISION TREE)

    [Figure: decision tree. Decision nodes: Age < 30, Income > $50K, Days Played > 728. Leaves: Xbox-One Customer, Not Xbox-One Customer.]

    MODEL (IF-ELSE STATEMENTS WITH SPECIFIC VALUES.)
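The difference between an algorithm and a model can be sketched in plain Python. This is only an illustration, not the method used in the session: `fit_stump` and `make_model` are hypothetical helper names, the ages and labels are made-up toy data, and the "algorithm" here learns a single split (a decision stump) rather than a full decision tree.

```python
def fit_stump(ages, labels):
    """ALGORITHM: a procedure run on data. It tries every candidate
    threshold and keeps the one that makes the fewest mistakes."""
    best_threshold, best_errors = None, len(labels) + 1
    for t in sorted(set(ages)):
        # Candidate rule: predict 1 (customer) when age < t, otherwise 0.
        errors = sum((1 if a < t else 0) != y for a, y in zip(ages, labels))
        if errors < best_errors:
            best_threshold, best_errors = t, errors
    return best_threshold


def make_model(threshold):
    """MODEL: an if-else statement with a specific, learned value."""
    def model(age):
        return 1 if age < threshold else 0
    return model


# Made-up toy data: 1 = customer, 0 = not a customer.
ages = [18, 22, 25, 29, 35, 41, 52, 60]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

threshold = fit_stump(ages, labels)  # the algorithm "learns" from the data
model = make_model(threshold)        # the model is the resulting rule

print(threshold, model(20), model(50))
```

The same algorithm run on different data would produce a different threshold, and therefore a different model.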
  6. TYPES OF ML

    SUPERVISED LEARNING: Task Driven (Predict next value)
    UNSUPERVISED LEARNING: Data Driven (Identify clusters)
    REINFORCEMENT LEARNING: Learn from mistakes
  7. FORECASTING

    Forecasting is the process of making predictions based on past and present data, most commonly by analysis of trends. A commonplace example might be estimation of some variable of interest at some specified future date.
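As a minimal sketch of forecasting by trend analysis, the snippet below fits a straight line to past values with ordinary least squares and extends it to a future time step. The function names and the `history` values are assumptions made up for illustration, and a straight-line trend is only one of many possible forecasting approaches.

```python
def fit_linear_trend(values):
    """Fit y = intercept + slope * t to values observed at t = 0, 1, 2, ..."""
    n = len(values)
    times = range(n)
    mean_t = sum(times) / n
    mean_y = sum(values) / n
    cov = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, values))
    var = sum((t - mean_t) ** 2 for t in times)
    slope = cov / var  # needs at least two observations
    intercept = mean_y - slope * mean_t
    return slope, intercept


def forecast(values, future_step):
    """Estimate the variable of interest at a specified future time step."""
    slope, intercept = fit_linear_trend(values)
    return intercept + slope * future_step


history = [10, 12, 14, 16]         # made-up past observations at t = 0..3
prediction = forecast(history, 5)  # estimate for t = 5
print(prediction)
```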
  8. Categorical, Binary & Continuous Variables

    Categorical variables contain a finite number of categories or distinct groups. Categorical data might not have a logical order. (e.g., Hair Color: Blonde / Red / Brown / Black)

    A binary variable is a categorical variable that can take only one of two values, usually represented as a Boolean: True or False. (e.g., Gender: Male / Female)

    Continuous variables are numeric variables that have an infinite number of values between any two values. A continuous variable can be numeric or date/time. (e.g., Age, Temperature)

    Refer - https://statistics.laerd.com/statistical-guides/types-of-variable.php
  9. Prediction vs Classification

    Prediction is about estimating a missing or unknown element (here, a continuous value) of a dataset. A model is built from the data to predict that outcome.

    Classification is the prediction of a categorical variable within a predefined vocabulary (a fixed set of classes), based on training examples.

    The prediction of numerical (continuous) variables is called regression.
  10. HEART ATTACK POSSIBILITY

    age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  target
    63   1    3   145       233   1    0        150      0      2.3      0      0   1     1
    37   1    2   130       250   0    1        187      0      3.5      0      0   2     1
    41   0    1   130       204   0    0        172      0      1.4      2      0   2     1
    56   1    1   120       236   0    1        178      0      0.8      2      0   2     1
    55   1    0   160       289   0    0        145      1      0.8      1      1   3     0
    64   1    0   120       246   0    0        96       1      2.2      0      1   2     0
    70   1    0   130       322   0    0        109      0      2.4      1      3   2     0
    51   1    0   140       299   0    1        173      1      1.6      2      0   3     0
    58   1    0   125       300   0    0        171      0      0        2      2   3     0
    56   0    0   134       409   0    0        150      1      1.9      1      2   3     ?

    We need to predict the label of an unlabeled data point (the last row, marked "?").
    The variable we want to predict is the CLASS VARIABLE.
  11. TRAINING DATA / TEST DATA

    DATA SET: 75% Training Data, 25% Testing Data

    In a dataset, a training set is used to build a model, while a test (or validation) set is used to validate the model built. Test data also has the label.
  12. TRAINING DATA / TEST DATA

    55   0    1   132       342   0    1        166      0      1.2      2      0   2     1
    41   1    1   120       157   0    1        182      0      0        2      0   2     1
    38   1    2   138       175   0    1        173      0      0        2      4   2     1
    38   1    2   138       175   0    1        173      0      0        2      4   2     1
    67   1    0   160       286   0    0        108      1      1.5      1      3   2     0
    67   1    0   120       229   0    0        129      1      2.6      1      2   3     0
    62   0    0   140       268   0    0        160      0      3.6      0      2   2     0

    Selecting 25% (or 33%) of the data randomly as test data.
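Selecting a random 25% of the rows as test data can be sketched with the standard library alone. `split_rows` is a hypothetical helper written for this illustration (libraries such as scikit-learn ship their own splitting utilities), and the integer "rows" stand in for real records.

```python
import random


def split_rows(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of the rows and hold out a fraction as test data."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    # The first n_test shuffled rows become the test set, the rest train.
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(8))  # placeholder records
train_rows, test_rows = split_rows(rows)
print(len(train_rows), len(test_rows))  # prints 6 2
```

Passing `test_fraction=0.33` gives the alternative split mentioned on the slide.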
  13. TRAINING DATA / TEST DATA

    132  342  0  1  166  0  1.2  2  0  2
    120  157  0  1  182  0  0    2  0  2
    138  175  0  1  173  0  0    2  4  2
    138  175  0  1  173  0  0    2  4  2
    160  286  0  0  108  1  1.5  1  3  2
    120  229  0  0  129  1  2.6  1  2  3
    140  268  0  0  160  0  3.6  0  2  2

    Remove class variable.
  14. TRAINING DATA / TEST DATA

    132  342  0  1  166  0  1.2  2  0  2  1
    120  157  0  1  182  0  0    2  0  2  0
    138  175  0  1  173  0  0    2  4  2  1
    138  175  0  1  173  0  0    2  4  2  1
    160  286  0  0  108  1  1.5  1  3  2  0
    120  229  0  0  129  1  2.6  1  2  3  0
    140  268  0  0  160  0  3.6  0  2  2  1

    Predict new class variable using model (last column).
  15. TRAINING DATA / TEST DATA

    Compare actual and predicted values for scoring the model.

    Predicted: 1 0 1 1 0 0 1
    Actual:    1 1 1 1 0 0 0

    Model accuracy = 5/7 × 100 = 71.43%
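The accuracy figure can be reproduced with a few lines of Python. The two lists are the predicted and actual values from the slide; the `accuracy` function is a hypothetical helper for illustration.

```python
predicted = [1, 0, 1, 1, 0, 0, 1]
actual = [1, 1, 1, 1, 0, 0, 0]


def accuracy(actual, predicted):
    """Fraction of test rows where the prediction matches the real label."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


score = accuracy(actual, predicted)
print(f"{score * 100:.2f}%")  # prints 71.43%
```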
  16. CONFUSION MATRIX

    A confusion matrix is an N x N matrix used for evaluating the performance of a classification model, where N is the number of target classes. The matrix compares the actual target values with those predicted by the machine learning model. This gives us a holistic view of how well our classification model is performing and what kinds of errors it is making.
  17. CONFUSION MATRIX

    Not So Confusing!

    TP - True Positive
    TN - True Negative
    FP - False Positive (Type 1 Error)
    FN - False Negative (Type 2 Error)

    Positive Class - CATS
    Negative Class - DOGS

    Refer - https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/
  19. TRAINING DATA / TEST DATA

    Compare actual and predicted values for scoring the model.

    Predicted: 1  0  1  1  0  0  1
    Actual:    1  1  1  1  0  0  0
    Outcome:   TP FN TP TP TN TN FP

    Positive Class - HEART ATTACK (1)
    Negative Class - NO HEART ATTACK (0)
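The TP / FN / TN / FP labels follow mechanically from each actual and predicted pair. A minimal sketch, where `outcome` is a hypothetical helper and class 1 (heart attack) is treated as the positive class, as on the slide:

```python
from collections import Counter


def outcome(actual, predicted, positive=1):
    """Name the confusion-matrix cell for one actual/predicted pair."""
    if predicted == positive:
        return "TP" if actual == positive else "FP"
    return "FN" if actual == positive else "TN"


actual = [1, 1, 1, 1, 0, 0, 0]
predicted = [1, 0, 1, 1, 0, 0, 1]

outcomes = [outcome(a, p) for a, p in zip(actual, predicted)]
counts = Counter(outcomes)

print(outcomes)  # ['TP', 'FN', 'TP', 'TP', 'TN', 'TN', 'FP']
print(counts["TP"], counts["FN"], counts["TN"], counts["FP"])  # 3 1 2 1
```

The four counts are the cells of the 2 x 2 confusion matrix for this test set.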
  20. WHAT IS MORE DANGEROUS? FN or FP

    Predicted: 1  0  1  1  0  0  1
    Actual:    1  1  1  1  0  0  0
    Outcome:   TP FN TP TP TN TN FP

    In this case FNs are more dangerous: a false negative predicts that a patient will not have a heart attack when the patient actually will have one.
  22. DATA PRE-PROCESSING & TRANSFORMATION

    Data pre-processing is an important step in the data mining process. Data-gathering methods are often loosely controlled, resulting in out-of-range values, impossible data combinations, missing values, etc.

    Sometimes it is 99% of the work.
  23. PRE-PROCESSING > HANDLE MISSING VALUES

    • Deleting Rows
    • Replacing With Mean/Median/Mode
    • Assigning a Unique Category
    • Predicting the Missing Values
    • Using Algorithms Which Support Missing Values

    Refer - https://analyticsindiamag.com/5-ways-handle-missing-values-machine-learning-datasets/
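The first two options in the list (deleting rows, replacing with the mean) can be sketched in plain Python. Missing values are represented as `None` here, which is an assumption of this example; `drop_missing` and `impute_mean` are hypothetical helpers, and the numbers are made up.

```python
from statistics import mean


def drop_missing(rows):
    """Deleting rows: keep only rows that have no missing value."""
    return [row for row in rows if None not in row]


def impute_mean(column):
    """Replacing with the mean: fill gaps with the mean of observed values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]


rows = [(63, 200), (37, None), (41, 220), (56, 240)]  # made-up (age, chol) pairs
complete_rows = drop_missing(rows)

chol = [row[1] for row in rows]
filled = impute_mean(chol)

print(complete_rows)
print(filled)
```

Swapping `mean` for `statistics.median` or `statistics.mode` gives the other two replacement variants named on the slide.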
  24. PRE-PROCESSING > HANDLE CLASS IMBALANCE

    • Resample the training set
      ◦ Under-sampling
      ◦ Over-sampling
    • Use K-fold Cross-Validation
    • Resample with different ratios

    Synthetic Minority Oversampling Technique (SMOTE)

    Refer - https://www.kdnuggets.com/2017/06/7-techniques-handle-imbalanced-data.html
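Over-sampling in its simplest form duplicates randomly chosen minority-class rows until every class is as large as the biggest one. The sketch below shows that plain random over-sampling, not SMOTE (which synthesizes new points instead of copying existing ones). `random_oversample` is a hypothetical helper and the data is made up.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate random minority rows until all classes have equal counts."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the largest class
    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == label]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))  # copy an existing row
            out_labels.append(label)
    return out_rows, out_labels


rows = ["a", "b", "c", "d", "e", "f"]  # placeholder training records
labels = [0, 0, 0, 0, 0, 1]            # class 1 is the minority
balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

As the slide says, resampling is applied to the training set; the test set should keep its original class balance.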
  25. TRANSFORMATION > NORMALIZATION / SCALING

    A machine learning algorithm just sees numbers. If there is a vast difference in range, say some features ranging in the thousands and others in the tens, it makes the underlying assumption that higher-ranging numbers have superiority of some sort.

    • Min Max Scaler
    • Standard Scaler
    • Max Abs Scaler

    Refer - https://en.wikipedia.org/wiki/Feature_scaling
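The three scalers in the list correspond to simple formulas, sketched below for a single feature column. These are hand-written illustrations rather than library code: the function names and sample values are assumptions, and the standard scaler uses the population standard deviation. All three assume the column is not constant (otherwise they divide by zero).

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Min-max: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standard_scale(values):
    """Standard: subtract the mean, divide by the standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def max_abs_scale(values):
    """Max-abs: divide by the largest absolute value, keeping the sign."""
    biggest = max(abs(v) for v in values)
    return [v / biggest for v in values]


income = [20000, 50000, 80000]  # made-up feature in the thousands
age = [25, 40, 55]              # made-up feature in the tens

print(min_max_scale(income))  # [0.0, 0.5, 1.0]
print(min_max_scale(age))     # [0.0, 0.5, 1.0]
```

After scaling, both features occupy the same range, so neither dominates just because of its units.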
  26. LIBRARIES / TOOLS

    Keras is an open-source neural-network library written in Python. It is capable of running on top of TensorFlow, Microsoft Cognitive Toolkit, R, Theano, or PlaidML. Designed to enable fast experimentation with deep neural networks, it focuses on being user-friendly, modular, and extensible.

    NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.

    SciPy is a free and open-source Python library used for scientific computing and technical computing. SciPy contains modules for optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing, ODE solvers and other tasks common in science and engineering.
  27. TensorFlow is a free and open-source software library for dataflow and differentiable programming across a range of tasks. It is a symbolic math library, and is also used for machine learning applications such as neural networks.

    Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.

    Google Colaboratory is a free online cloud-based Jupyter notebook environment. You can use it to run Python code and ML projects. Colab notebooks allow you to combine executable code and rich text in a single document, along with images, HTML, LaTeX and more. Visit https://colab.research.google.com