Machine Learning for Complete Beginners

Related to the session conducted for Faculty of IT, University of Moratuwa on 13/06/2020

Nishan Chathuranga

June 13, 2021

Transcript

  1. MACHINE LEARNING
    FOR COMPLETE BEGINNERS
    NISHAN CHATHURANGA
    Software Engineer
    99X (Pvt) Ltd.
    University of Moratuwa
    Faculty of Information Technology

  2. TABLE OF CONTENTS
    01 ML CONCEPTS
       Here we are learning basic concepts you need to know
    02 INTRODUCTION TO MODELS
       Here we are learning different ML problems and models
    03 DEMOS
       We will do a short demo to understand applications.

  3. WHY MACHINE LEARNING
    VALUE

  5. MACHINE LEARNING ALGORITHMS ? MODELS ? WTF ?
    An “algorithm” in machine
    learning is a procedure that
    is run on data to create a
    machine learning “model.”
    Machine learning algorithms
    perform “pattern recognition.”
    Algorithms “learn” from data,
    or are “fit” on a dataset.

  6. ALGORITHM (DECISION TREE)
    Age < 30
    Income > $50K
    Xbox-One Customer
    Not Xbox-One Customer
    Days Played > 728
    Income > $50K
    Xbox-One Customer
    Not Xbox-One Customer
    Xbox-One Customer

  7. ALGORITHM (DECISION TREE)
    Age < 30
    Income > $50K
    Xbox-One Customer
    Not Xbox-One Customer
    Days Played > 728
    Income > $50K
    Xbox-One Customer
    Not Xbox-One Customer
    Xbox-One Customer
    MODEL (IF-ELSE STATEMENTS WITH SPECIFIC VALUES.)
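
    A minimal sketch of what "model = if-else statements with specific values" could look like in code. The branch layout below is an assumption read off the flattened diagram (the transcript does not say which side of each test leads where), and the function name and parameter names are hypothetical.

```python
def predict_customer(age: int, income: float, days_played: int) -> str:
    """Hand-written stand-in for a fitted decision tree (branch layout assumed)."""
    if age < 30:
        # Assumed left branch: under 30, decided by income alone.
        if income > 50_000:
            return "Xbox-One Customer"
        return "Not Xbox-One Customer"
    # Assumed right branch: 30 or older, look at days played first.
    if days_played > 728:
        if income > 50_000:
            return "Xbox-One Customer"
        return "Not Xbox-One Customer"
    return "Xbox-One Customer"


# The thresholds (30, 50K, 728) are the "specific values" a training
# algorithm would normally pick from data; here they are copied from the slide.
example = predict_customer(age=25, income=60_000, days_played=100)
```

    The algorithm is the procedure that chooses the tests and thresholds; the model is the resulting set of rules that can be applied to new rows.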

  8. TYPES OF ML
    SUPERVISED LEARNING
      Task Driven: Predict next value
    UNSUPERVISED LEARNING
      Data Driven: Identify clusters
    REINFORCEMENT LEARNING
      Learn from mistakes

  10. Forecasting
    Forecasting is the process of making predictions based on past and present data
    and most commonly by analysis of trends. A commonplace example might be
    estimation of some variable of interest at some specified future date.
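
    As a rough illustration of forecasting by trend analysis, the sketch below fits a straight line to past observations and extends it to a future time step. The straight-line assumption, the function names, and the sample readings are all hypothetical; real series often need seasonal or more flexible models.

```python
def fit_linear_trend(values):
    """Least-squares line through the points (0, values[0]), (1, values[1]), ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_t = (n - 1) / 2
    mean_y = sum(values) / n
    cov = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(values))
    var = sum((t - mean_t) ** 2 for t in range(n))
    slope = cov / var
    intercept = mean_y - slope * mean_t
    return slope, intercept


def forecast(values, steps_ahead):
    """Extend the fitted trend `steps_ahead` steps past the last observation."""
    slope, intercept = fit_linear_trend(values)
    future_t = len(values) - 1 + steps_ahead
    return intercept + slope * future_t


past_temperatures = [20.0, 21.0, 22.0, 23.0]  # hypothetical daily readings
next_day = forecast(past_temperatures, steps_ahead=1)
```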

  11. Daily climate analysis in the city of Delhi

  12. Categorical, Binary & Continuous Variables
    Categorical variables contain a finite number of categories or distinct groups.
    Categorical data might not have a logical order. (e.g. Hair Color: Blonde / Red /
    Brown / Black)
    A binary variable is a categorical variable that can take only one of two values,
    usually represented as a Boolean, True or False. (e.g. Gender: Male / Female)
    Continuous variables are numeric variables that have an infinite number of
    values between any two values. A continuous variable can be numeric or
    date/time. (e.g. Age, Temperature)
    Refer - https://statistics.laerd.com/statistical-guides/types-of-variable.php
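
    The three kinds of variable can be told apart with a crude heuristic, sketched below. The function name and rules are assumptions made for illustration; in particular, a categorical column stored as integer codes (such as cp in the heart dataset) would be misread as continuous, so real projects usually rely on the data dictionary instead.

```python
def variable_kind(values):
    """Guess whether a column is binary, continuous, or categorical."""
    distinct = set(values)
    if len(distinct) == 2:
        # Exactly two distinct values, whatever their type.
        return "binary"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        # Purely numeric columns are treated as continuous here.
        return "continuous"
    return "categorical"


hair = variable_kind(["Blonde", "Red", "Brown", "Black"])
gender = variable_kind(["Male", "Female", "Male"])
age = variable_kind([63, 37.5, 41])
```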

  13. Prediction vs Classification
    Prediction is about estimating a missing or unknown element of a dataset,
    often a continuous value. A classification or regression model is built to
    predict that outcome.
    Classification is the prediction of a categorical variable within a predefined
    vocabulary based on training examples. The prediction of numerical
    (continuous) variables is called regression.

  14. LABELED DATA

  15. HEART ATTACK POSSIBILITY
    age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  target
    63   1    3   145       233   1    0        150      0      2.3      0      0   1     1
    37   1    2   130       250   0    1        187      0      3.5      0      0   2     1
    41   0    1   130       204   0    0        172      0      1.4      2      0   2     1
    56   1    1   120       236   0    1        178      0      0.8      2      0   2     1
    55   1    0   160       289   0    0        145      1      0.8      1      1   3     0
    64   1    0   120       246   0    0        96       1      2.2      0      1   2     0
    70   1    0   130       322   0    0        109      0      2.4      1      3   2     0
    51   1    0   140       299   0    1        173      1      1.6      2      0   3     0
    58   1    0   125       300   0    0        171      0      0        2      2   3     0
    56   0    0   134       409   0    0        150      1      1.9      1      2   3     ?
    The variable we want to predict (target) is the CLASS VARIABLE.
    We need to predict the label of an unlabeled data point (the last row, marked ?).

  16. TRAINING DATA / TEST DATA
    DATA SET
    75% Training Data
    25% Testing Data
    In a dataset, a training set is used to build a model, while a test (or
    validation) set is used to validate the model built.
    Test data also has the label.

  17. TRAINING DATA / TEST DATA
    55 0 1 132 342 0 1 166 0 1.2 2 0 2 1
    41 1 1 120 157 0 1 182 0 0 2 0 2 1
    38 1 2 138 175 0 1 173 0 0 2 4 2 1
    38 1 2 138 175 0 1 173 0 0 2 4 2 1
    67 1 0 160 286 0 0 108 1 1.5 1 3 2 0
    67 1 0 120 229 0 0 129 1 2.6 1 2 3 0
    62 0 0 140 268 0 0 160 0 3.6 0 2 2 0
    Selecting 25% (or 33%) of the data randomly as test data.
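
    A minimal sketch of a random 75/25 split using only the standard library. The function name, the seed parameter, and the stand-in rows are assumptions for illustration; libraries such as scikit-learn provide their own helpers for this.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and hold out `test_fraction` of them for testing."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)                      # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)      # seeded so the split is repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(100))  # stand-in for 100 labeled data rows
train_rows, test_rows = train_test_split(rows, test_fraction=0.25, seed=42)
```

    Every row lands in exactly one of the two sets, so the model is never scored on rows it was trained on.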

  18. TRAINING DATA / TEST DATA
    132 342 0 1 166 0 1.2 2 0 2
    120 157 0 1 182 0 0 2 0 2
    138 175 0 1 173 0 0 2 4 2
    138 175 0 1 173 0 0 2 4 2
    160 286 0 0 108 1 1.5 1 3 2
    120 229 0 0 129 1 2.6 1 2 3
    140 268 0 0 160 0 3.6 0 2 2
    Remove class variable.

  19. TRAINING DATA / TEST DATA
    132 342 0 1 166 0 1.2 2 0 2 1
    120 157 0 1 182 0 0 2 0 2 0
    138 175 0 1 173 0 0 2 4 2 1
    138 175 0 1 173 0 0 2 4 2 1
    160 286 0 0 108 1 1.5 1 3 2 0
    120 229 0 0 129 1 2.6 1 2 3 0
    140 268 0 0 160 0 3.6 0 2 2 1
    Predict new class variable using model.

  20. TRAINING DATA / TEST DATA
    Compare actual and predicted values for scoring the model
    predicted  actual
    1          1
    0          1
    1          1
    1          1
    0          0
    0          0
    1          0
    Model accuracy = 5/7 × 100 = 71.43%
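
    The accuracy figure can be reproduced with a few lines of Python. This sketch treats the first list on the slide as the model's predictions and the second as the true labels, which matches the rows on the earlier slides; the function name is hypothetical.

```python
def accuracy(actual, predicted):
    """Fraction of rows where the predicted label equals the actual label."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


predicted = [1, 0, 1, 1, 0, 0, 1]
actual = [1, 1, 1, 1, 0, 0, 0]

score = accuracy(actual, predicted)   # 5 matching rows out of 7
percent = round(score * 100, 2)
```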

  21. CONFUSION MATRIX
    A Confusion matrix is an N x N
    matrix used for evaluating the
    performance of a classification
    model, where N is the number
    of target classes.
    The matrix compares the
    actual target values with those
    predicted by the machine
    learning model. This gives us a
    holistic view of how well our
    classification model is
    performing and what kinds of
    errors it is making.
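
    A minimal sketch of building an N x N confusion matrix for any number of classes. Rows are actual classes and columns are predicted classes, which is one common convention (some tools transpose it); the function name and the three-class sample data are hypothetical.

```python
def confusion_matrix(actual, predicted, classes):
    """Return matrix[i][j] = number of rows with actual class i predicted as class j."""
    index = {label: i for i, label in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


classes = ["cat", "dog", "bird"]  # hypothetical target classes, so N = 3
actual = ["cat", "cat", "dog", "bird", "dog", "bird"]
predicted = ["cat", "dog", "dog", "bird", "dog", "cat"]

matrix = confusion_matrix(actual, predicted, classes)
# Correct predictions sit on the diagonal; everything else is an error.
correct = sum(matrix[i][i] for i in range(len(classes)))
```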

  22. CONFUSION MATRIX
    Not So Confusing!
    Positive Class - CATS
    Negative Class - DOGS
    TP - True Positive
    TN - True Negative
    FP - False Positive (Type 1 Error)
    FN - False Negative (Type 2 Error)
    Refer - https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/

  23. ACCURACY, PRECISION AND RECALL
    Positive Class - CATS
    Negative Class - DOGS
    Refer - https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9

  25. TRAINING DATA / TEST DATA
    Compare actual and predicted values for scoring the model
    Positive Class - HEART ATTACK (1)
    Negative Class - NO HEART ATTACK (0)
    predicted  actual  outcome
    1          1       TP
    0          1       FN
    1          1       TP
    1          1       TP
    0          0       TN
    0          0       TN
    1          0       FP
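
    The TP/TN/FP/FN labels can be derived mechanically, and precision and recall follow from the counts. In this sketch the first list on the slide is taken as the predictions and the second as the actual labels, since that ordering reproduces the outcomes shown; the function name is hypothetical.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            # Model said "heart attack": right if the patient really is positive.
            counts["TP" if a == positive else "FP"] += 1
        else:
            # Model said "no heart attack": a miss if the patient is positive.
            counts["FN" if a == positive else "TN"] += 1
    return counts


predicted = [1, 0, 1, 1, 0, 0, 1]
actual = [1, 1, 1, 1, 0, 0, 0]

counts = confusion_counts(actual, predicted)
precision = counts["TP"] / (counts["TP"] + counts["FP"])  # of predicted positives, how many were right
recall = counts["TP"] / (counts["TP"] + counts["FN"])     # of actual positives, how many were found
```

    Recall is the number to watch when false negatives are the costly mistake, because every FN lowers it.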

  26. WHAT IS MORE DANGEROUS? FN Or FP
    predicted  actual  outcome
    1          1       TP
    0          1       FN
    1          1       TP
    1          1       TP
    0          0       TN
    0          0       TN
    1          0       FP
    In this case FNs are more dangerous: predicting that a patient will not
    have a heart attack when they actually will is the costlier mistake.

  28. KDD (KNOWLEDGE DISCOVERY IN DATABASES) PROCESS

  29. DATA PRE-PROCESSING & TRANSFORMATION
    Data pre-processing is an important step in the data mining process.
    Data-gathering methods are often loosely controlled, resulting in
    out-of-range values, impossible data combinations, missing values, etc.
    Sometimes it is 99% of the work.

  30. PRE-PROCESSING > HANDLE MISSING VALUES
    ● Deleting Rows
    ● Replacing With Mean/Median/Mode
    ● Assigning A Unique Category
    ● Predicting The Missing Values
    ● Using Algorithms Which Support Missing Values
    Refer - https://analyticsindiamag.com/5-ways-handle-missing-values-machine-learning-datasets/
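
    The first two options in the list can be sketched with the standard library alone. Missing cells are represented as None here, and the function names and sample rows are hypothetical; a column with no known values at all would make the mean undefined and raise an error.

```python
from statistics import mean


def drop_incomplete_rows(rows):
    """Option 1: delete every row that has at least one missing value."""
    return [row for row in rows if None not in row]


def fill_with_column_mean(rows):
    """Option 2: replace each missing value with the mean of its column."""
    columns = list(zip(*rows))
    means = [mean(v for v in col if v is not None) for col in columns]
    return [
        [means[i] if v is None else v for i, v in enumerate(row)]
        for row in rows
    ]


# Hypothetical rows (age, trestbps, chol) with two missing cells.
rows = [[63, 145, 233], [37, None, 250], [41, 130, None]]

dropped = drop_incomplete_rows(rows)
filled = fill_with_column_mean(rows)
```

    Deleting rows is simplest but throws data away; filling with the mean keeps every row at the cost of flattening the column's variation.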

  31. PRE-PROCESSING > HANDLE CLASS IMBALANCE
    ● Resample the training set
    ○ Under-sampling
    ○ Over-sampling
    ● Use K-fold Cross-Validation
    ● Resample with different ratios
    Synthetic Minority Oversampling Technique (SMOTE)
    Refer - https://www.kdnuggets.com/2017/06/7-techniques-handle-imbalanced-data.html
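
    A minimal sketch of random over-sampling and under-sampling on a training set. Note that this is plain random resampling, not SMOTE, which creates new synthetic minority rows instead of duplicating existing ones. The function names and the 8-to-2 sample data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until every class matches the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == label]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(label)
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_rows, out_labels = [], []
    for label in counts:
        pool = [r for r, l in zip(rows, labels) if l == label]
        for row in rng.sample(pool, target):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


rows = [[i] for i in range(10)]   # ten hypothetical training rows
labels = [0] * 8 + [1] * 2        # imbalanced: eight of class 0, two of class 1

over_rows, over_labels = random_oversample(rows, labels)
under_rows, under_labels = random_undersample(rows, labels)
```

    Resampling is applied to the training set only, so the test set still reflects the real class balance.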

  33. TRANSFORMATION > NORMALIZATION / SCALING
    A machine learning algorithm just sees numbers. If there is a vast difference
    in range, say some features ranging in the thousands and others in the tens,
    it makes the underlying assumption that higher-ranging numbers have
    superiority of some sort.
    ● Min Max Scaler
    ● Standard Scaler
    ● Max Abs Scaler
    Refer - https://en.wikipedia.org/wiki/Feature_scaling
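
    The three scalers can be sketched as plain functions over one feature column. The function names and sample values are hypothetical, and the standard scaler here divides by the population standard deviation, which is an assumption; library implementations should be checked for their exact convention.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]   # constant column: nothing to spread out
    return [(v - lo) / (hi - lo) for v in values]


def standard_scale(values):
    """Shift to mean 0 and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def max_abs_scale(values):
    """Divide by the largest absolute value so results lie in [-1, 1]."""
    biggest = max(abs(v) for v in values)
    if biggest == 0:
        return [0.0 for _ in values]
    return [v / biggest for v in values]


incomes = [1000, 2000, 3000]   # hypothetical feature in the thousands
scaled_min_max = min_max_scale(incomes)
scaled_standard = standard_scale(incomes)
scaled_max_abs = max_abs_scale(incomes)
```

    After scaling, a feature measured in thousands and one measured in tens occupy comparable ranges, so neither dominates simply because of its units.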

  34. LIBRARIES / TOOLS
    Keras is an open-source neural-
    network library written in Python. It is
    capable of running on top of
    TensorFlow, Microsoft Cognitive
    Toolkit, R, Theano, or PlaidML.
    Designed to enable fast
    experimentation with deep neural
    networks, it focuses on being user-
    friendly, modular, and extensible.
    NumPy is a library for the
    Python programming language,
    adding support for large, multi-
    dimensional arrays and
    matrices, along with a large
    collection of high-level
    mathematical functions to
    operate on these arrays.
    SciPy is a free and open-
    source Python library used for
    scientific computing and
    technical computing. SciPy
    contains modules for
    optimization, linear algebra,
    integration, interpolation,
    special functions, FFT, signal
    and image processing, ODE
    solvers and other tasks
    common in science and
    engineering.

  35. TensorFlow is a free and open-source software library for dataflow
    and differentiable programming across a range of tasks. It is a
    symbolic math library, and is also used for machine learning
    applications such as neural networks.
    Pandas is a software library written for the Python
    programming language for data manipulation and
    analysis. In particular, it offers data structures and
    operations for manipulating numerical tables and time
    series.
    You can use Google Colaboratory to run Python code and ML
    projects. Colab notebooks allow you to combine executable
    code and rich text in a single document, along with images,
    HTML, LaTeX and more. Google Colaboratory is a free online
    cloud-based Jupyter notebook environment.
    Visit https://colab.research.google.com

  36. DEMOS AND CODE
    github.com/nishanc/MLforBeginners

  37. THANK YOU
    @NishanTheDev
    NishanChathuranga

  38. Doesn’t matter how slow you
    go, as long as you’re moving
    forward
