Introduction to Machine Learning

Machine Learning – and its recent developments in Deep Learning – is a wide branch of Artificial Intelligence, with countless applications.

This brief introduction tries to concisely describe its basic concepts – with no claim of completeness.

This work was inspired by the beautiful book "Java Deep Learning Essentials" (https://www.packtpub.com/big-data-and-business-intelligence/java-deep-learning-essentials) by Yusuke Sugomori, which includes further topics and details, in addition to mathematical descriptions and source code, as well as by Wikipedia.

Gianluca Costa

December 05, 2017

Transcript

  1. Gianluca Costa
    Introduction to Machine Learning
    including Deep Learning
    http://gianlucacosta.info/

  2. Preface

    Machine Learning – and its recent developments in
    Deep Learning – is a wide branch of Artificial
    Intelligence, with countless applications.

    This brief introduction tries to concisely describe its basic
    concepts – with no claim of completeness.

    This work was inspired by:

    the beautiful book Java Deep Learning Essentials by
    Yusuke Sugomori, which includes further topics and
    details, in addition to mathematical descriptions and
    source code

    Wikipedia

  3. AI history in brief – 3 ages
    Late 1950s – Classification based on fixed rules
    – Breakthroughs: tree algorithms (Depth-First, Breadth-First, ...)
    – Focus on: efficiency, cutting branches to simplify massive
    computations, board games.
    – Critical problem: frame problem

    1980s – Knowledge Representation
    – Breakthroughs: expert systems, semantic web.
    – Focus on: effective representations, granularity, finding a way to
    conveniently input an immense knowledge base.
    – Critical problem: symbol grounding problem

    Machine Learning (ML) – A machine can learn without being explicitly
    programmed, by tuning the parameters of a model, using training data.
    – Breakthroughs: several methods and algorithms
    – Focus on: pattern recognition, based on important aspects of the dataset
    (features), multi-class and multi-dimensional processing
    – Critical problem: at first, lack of sufficient amounts of good-quality data
    (as the machine cannot evaluate such quality); then, feature engineering.

  4. The 3 main problems for AI

    Frame problem: a machine cannot take into account
    the countless details of a realistic problem – as they
    would require an infinite description and infinite
    computation. Machines excel at board games because
    the domain is limited and simplified.

    Symbol grounding problem: even if we were able to
    codify all the knowledge, a machine would still
    manipulate symbols, not concepts.

    Feature engineering: a machine can’t identify
    features on its own – they must be provided by the
    human operator, keeping the domain in mind and often
    after several iterations.

  5. Concepts and pandas
    [Figure: pictures of pandas illustrating a SIGN, which pairs a
    SIGNIFIED (the animal itself) with a SIGNIFIER (the word PANDA)]

    You could easily understand that the above pictures depict the same
    animal even if you didn’t know the term panda – because the human brain
    can naturally detect features – therefore, the signified itself.

    On the other hand, traditional machine learning algorithms just can’t.

  6. Feature engineering as the root problem

    If a machine could automatically detect features, the feature
    engineering problem would be solved, by definition

    Furthermore, the symbol grounding problem would be
    overcome as well – because identifying features leads to
    understanding concepts

    Finally, a machine able to understand concepts would have no
    frame problem

    Deep Learning is a set of machine learning methods enabling
    machines to automatically detect features.

    In other words, via Deep Learning, a machine can discover new
    concepts by itself. As an example, in 2012 Google announced that a
    deep neural network had learned to recognize cats after scanning
    millions of unlabeled images taken from YouTube videos.

  7. Machine Learning at a glance
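    [Diagram: a map of the field. AI contains Machine Learning, which
    is split into Supervised and Unsupervised methods. Labeled items:
    Support Vector Machine, Hidden Markov Model, Reinforcement Learning,
    Logistic Regression, Clustering, Neural Networks, Perceptron, MLP,
    Deep Learning, DBN, SDA, CNN]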
  8. Training and Test
    LEARNING = TRAINING + TEST
    Iterative process

    Training: optimizes the model parameters by using the items
    in the training dataset

    Test: ensures that the model obtained during the training
    phase is effective on another dataset (called test dataset)

  9. Supervised and Unsupervised

    Supervised: the items in the training
    dataset are labeled – that is, their
    expected output is provided. ML methods
    focused on categorization are usually
    supervised

    Unsupervised: items in the training set
    are provided with no related output.
    Unsupervised ML methods usually focus on the
    structure of the data: one of the most
    important examples is clustering
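
    The difference between the two settings can be sketched with a toy
    dataset. The values, the labels and the helper two_means are all
    hypothetical: the function is a minimal 1-D clustering with 2 groups,
    chosen only to show that structure can be found without labels.

```python
# Supervised: every training item carries its expected output (label).
labeled_items = [(1.0, "small"), (1.2, "small"), (8.9, "large"), (9.3, "large")]

# Unsupervised: the same measurements, with no labels attached.
unlabeled_items = [1.0, 1.2, 8.9, 9.3]


def two_means(values, iterations=10):
    """Toy 1-D clustering into 2 groups (a minimal k-means with k=2)."""
    low, high = min(values), max(values)  # initial centroids
    near_low, near_high = list(values), []
    for _ in range(iterations):
        # Assign every value to the closest centroid
        near_low = [v for v in values if abs(v - low) <= abs(v - high)]
        near_high = [v for v in values if abs(v - low) > abs(v - high)]
        # Move each centroid to the mean of its group
        low = sum(near_low) / len(near_low)
        if near_high:
            high = sum(near_high) / len(near_high)
    return sorted(near_low), sorted(near_high)


clusters = two_means(unlabeled_items)
```

    Here the two groups are recovered from the raw values alone, while a
    supervised method would instead be trained on the (value, label) pairs.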

  10. Support Vector Machine (SVM)

    Given data items as points in an n-dimensional space,
    let’s define the support vectors as the items in a
    class closest to items of another class

    We can then define the decision boundary as the
    hyperplane (or set of hyperplanes) that separates
    classes so that the margin (the distance from the
    boundary to the support vectors) is maximum

    Kernel trick: if items cannot be linearly separated,
    it is possible to map them to a higher dimensional
    space, where linear separation is feasible – but at a
    higher computational cost.
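
    The idea of moving to a higher dimensional space can be sketched as
    follows. This is not an SVM: it only builds an explicit mapping
    x -> (x, x^2), a hypothetical choice, to show that points which no
    single threshold can separate in 1-D become separable in 2-D. Real
    kernels usually compute inner products in the mapped space without
    building the mapping explicitly.

```python
def feature_map(x):
    """Map a 1-D point to 2-D: x -> (x, x^2). Illustrative choice."""
    return (x, x * x)


def separable_by_threshold(values, labels):
    """True if a single threshold on the values splits the two classes."""
    ones = [v for v, y in zip(values, labels) if y == 1]
    zeros = [v for v, y in zip(values, labels) if y == 0]
    return max(ones) < min(zeros) or max(zeros) < min(ones)


points = [-2.0, -1.0, 1.0, 2.0]
labels = [1, 0, 0, 1]  # outer points are class 1, inner points are class 0

# In 1-D no threshold separates the outer points from the inner ones
separable_in_1d = separable_by_threshold(points, labels)

# In the mapped space, a threshold on the second coordinate (x^2)
# is a linear boundary that separates the classes
mapped = [feature_map(x) for x in points]
separable_in_2d = separable_by_threshold([m[1] for m in mapped], labels)
```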

  11. Logistic Regression

    It is a regression model (in statistical
    analysis) where the dependent variable is
    categorical – that is, discrete

    The model parameters are:
    – Weight vector
    – Bias value

    Although structurally very different from neural
    networks, it actually presents several
    similarities

    Can be easily extended to multi-class
    classification
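
    A minimal sketch of how the two model parameters are used in the
    binary case, assuming the usual sigmoid function. The weight and bias
    values below are hypothetical, not learned from data.

```python
import math


def sigmoid(z):
    """Squash any real number into the (0, 1) interval."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(weights, bias, features):
    """Probability of class 1: sigmoid of the weighted sum plus the bias."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


def predict_class(weights, bias, features, threshold=0.5):
    """Turn the probability into a discrete (categorical) output."""
    return 1 if predict_probability(weights, bias, features) >= threshold else 0


weights = [2.0, -1.0]  # hypothetical weight vector
bias = -0.5            # hypothetical bias value
```

    For example, predict_class(weights, bias, [1.0, 0.0]) yields class 1,
    because the weighted sum 1.5 is positive.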

  12. Neural Networks (NN)

    Wide variety of methods inspired by the human brain

    Composed of artificial neurons (named units), grouped in layers and
    connected by weighted links

    Neural networks are usually trained by adjusting link weights via an
    iterative process which employs information on output error to tune the
    overall network.

    Except in deep learning, the input of traditional neural network methods
    consists of engineered, manually selected features.

    Another common trait is the learning rate (η), a parameter describing
    how much weights are affected. The learning rate:
    – Should not be too small, in order to avoid stagnation in local minimum points
    – Should not be too big, to allow the specific algorithm to converge

    There are different techniques (momentum, ADAGRAD, ADADELTA ...)
    to optimize the learning rate, usually involving a gradual reduction of its
    value.
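
    The effect of the learning rate can be sketched on a toy error
    function, here assumed to be f(w) = w^2, whose gradient is 2w. This
    convex function has no local minima, so the sketch only shows the
    speed and convergence trade-off, not the stagnation issue.

```python
def gradient_descent(learning_rate, start=1.0, steps=50):
    """Minimize f(w) = w^2 with plain gradient descent; return the final w."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * w            # derivative of w^2
        w -= learning_rate * gradient  # the learning rate scales every update
    return w


slow = gradient_descent(0.001)    # too small: still far from the minimum at 0
good = gradient_descent(0.1)      # reasonable: very close to 0
diverging = gradient_descent(1.1)  # too big: every step overshoots further
```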

  13. Perceptron

    The very first and simplest feed-forward
    neural network, consisting of just 1 layer
    of links (between input layer and
    output layer)

    Can only perform linear classification
    for 2 classes.

    If the items are linearly separable, the
    value of the learning rate is irrelevant, as
    in such case the perceptron algorithm
    always converges.
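
    A minimal sketch of the perceptron learning rule, assuming a step
    activation, zero initial weights and the logical AND as a linearly
    separable toy problem. Function names and defaults are illustrative.

```python
def classify(weights, bias, features):
    """Step activation: class 1 if the weighted sum is positive, else 0."""
    total = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if total > 0 else 0


def train_perceptron(samples, learning_rate=1.0, max_epochs=100):
    """Adjust weights after each misclassified item until an epoch has no errors."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(max_epochs):
        errors = 0
        for features, target in samples:
            delta = target - classify(weights, bias, features)
            if delta != 0:
                errors += 1
                weights = [w + learning_rate * delta * x
                           for w, x in zip(weights, features)]
                bias += learning_rate * delta
        if errors == 0:
            break
    return weights, bias


and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_gate)
```

    Training again with a different learning rate, such as 0.5, still ends
    with all four items classified correctly.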

  14. Multi-Layer Perceptron (MLP)

    Requirement: extending the initial perceptron so as to
    perform non-linear classification – and multi-class as well

    Key idea: introducing a new layer between input and output

    Input layer and output layer are called visible layers and
    are not directly connected: between them there is a hidden
    layer. Units too can be visible and hidden

    The weights in the hidden layer are often randomly
    initialized, to avoid constantly ending in the same local
    minimum of the error function

    Backpropagation: generalizes the perceptron’s tuning
    algorithm by calculating the gradient of the error function –
    reflecting the output error both on the hidden layer and the
    input layer.

    The output layer is often just logistic regression

    MLP can approximate any continuous function (both linear and
    non-linear) – provided there is a suitable number of hidden units
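
    Why a hidden layer enables non-linear classification can be sketched
    with XOR, which a single perceptron cannot represent. The weights
    below are hand-picked for illustration, not learned via
    backpropagation, and a step activation is assumed.

```python
def step(z):
    """Step activation: 1 if the input is positive, else 0."""
    return 1 if z > 0 else 0


def xor_mlp(x1, x2):
    """Forward pass of a tiny MLP: 2 inputs, 2 hidden units, 1 output."""
    # Hidden layer: one unit behaves like OR, the other like AND
    hidden_or = step(x1 + x2 - 0.5)
    hidden_and = step(x1 + x2 - 1.5)
    # Output unit: fires when OR is on and AND is off, which is XOR
    return step(hidden_or - hidden_and - 0.5)


truth_table = {(a, b): xor_mlp(a, b) for a in (0, 1) for b in (0, 1)}
```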

  15. Stochastic Gradient Descent (SGD)

    Whenever the parameters of the model
    have to be tuned using the gradient
    descent method, computations can
    quickly become very intensive

    When applying SGD, the gradient is
    computed on a subset (called minibatch)
    of the data

    Can be applied to different ML methods
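
    A minimal sketch of minibatch SGD, assuming a 1-D linear model
    y = w*x + b with a squared error function. The dataset, batch size
    and other defaults are hypothetical choices for illustration.

```python
import random


def sgd_linear_fit(data, learning_rate=0.05, batch_size=2, epochs=500, seed=0):
    """Fit y = w*x + b, computing each gradient on a minibatch only."""
    rng = random.Random(seed)
    items = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(items)  # visit the items in a new random order
        for start in range(0, len(items), batch_size):
            batch = items[start:start + batch_size]
            # Gradient of the mean squared error, on this minibatch only
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Noise-free toy data generated from y = 2x + 1
data = [(x, 2.0 * x + 1.0) for x in [0.0, 1.0, 2.0, 3.0]]
w, b = sgd_linear_fit(data)
```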

  16. Hidden Markov Model

    Relies on a Markov Stochastic Process,
    that is a process satisfying the Markov
    Property → the future state only
    depends on the present state, with no
    direct dependency on previous states

    Hidden Markov Model is widely employed
    to detect sequence patterns - such as in
    gene analysis or Natural Language
    Processing (NLP)
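
    How such a model scores a sequence can be sketched with the standard
    forward algorithm, which is not described in this deck. The two
    hidden states, the observations and all probabilities below are a
    hypothetical toy model.

```python
def forward_probability(observations, states, start_p, trans_p, emit_p):
    """Probability of an observation sequence, summed over all hidden paths."""
    # alpha[s]: probability of the observations so far, ending in state s
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        # Markov property: the next state depends only on the current one
        alpha = {
            s: emit_p[s][obs] * sum(alpha[prev] * trans_p[prev][s] for prev in states)
            for s in states
        }
    return sum(alpha.values())


states = ["Rainy", "Sunny"]
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
          "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

p_walk_shop = forward_probability(["walk", "shop"], states, start_p, trans_p, emit_p)
```

    The states are hidden in the sense that only the observations (walk,
    shop, clean) are visible, never Rainy or Sunny directly.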

  17. Reinforcement Learning

    Method with no labeled output (often treated as a
    paradigm of its own, beside supervised and
    unsupervised learning) where an agent
    acts on the given environment, altering it
    and receiving a positive/negative
    feedback from it

    Such feedback alters the agent’s
    behavior, as it tries to maximize its
    performance.
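
    The feedback loop can be sketched with a very simplified agent,
    assuming an epsilon-greedy strategy and an environment reduced to a
    fixed reward per action (so the agent does not really alter it). All
    names and values are hypothetical.

```python
import random


def run_agent(rewards, episodes=200, epsilon=0.1, seed=0):
    """Let an agent learn, from feedback only, which action pays best."""
    rng = random.Random(seed)
    estimates = [0.0] * len(rewards)  # the agent's current belief per action
    counts = [0] * len(rewards)
    for _ in range(episodes):
        if rng.random() < epsilon:
            action = rng.randrange(len(rewards))  # explore a random action
        else:
            # Exploit the action currently believed to be the best
            action = max(range(len(rewards)), key=lambda a: estimates[a])
        feedback = rewards[action]  # positive or negative feedback
        counts[action] += 1
        # Running average of the feedback received for this action
        estimates[action] += (feedback - estimates[action]) / counts[action]
    return estimates, counts


estimates, counts = run_agent([-1.0, 1.0])
```

    After a few episodes the agent mostly repeats the action with positive
    feedback, which is how its behavior is altered.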

  18. Machine Learning – Basic steps
    1. Choose a Machine Learning algorithm
    2. Perform feature engineering by selecting the features
    (= aspects of raw data that will be used as input) and
    possibly converting them to the format required by the
    algorithm
    3. Split data into a training set and a test set
    4. Adjust the model and its parameters in the training
    phase, until you get acceptable values of the
    effectiveness indicators. You might consider
    returning to point 1.
    5. Apply the model to the test set, check the effectiveness
    indicators and return to point 4 until satisfied.
    You might also consider returning to point 1.
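
    The data split in the steps above can be sketched as follows. The
    function name, the 25% test fraction and the fixed seed are
    assumptions made for illustration.

```python
import random


def train_test_split(items, test_fraction=0.25, seed=0):
    """Shuffle the items, then cut them into non-overlapping train and test sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    test_size = int(round(len(shuffled) * test_fraction))
    return shuffled[test_size:], shuffled[:test_size]


training_set, test_set = train_test_split(range(20))
```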

  19. Overfitting

    The training dataset and the test dataset must be
    non-overlapping – as the purpose of the test is to ensure that
    the model is not too tailored to the training set →
    overfitting problem

    Overfitting can have 3 main causes:
    – Noise in the training set; it can be due to unclean data, as well
    as to exceptional situations
    – Employing a training set that does not uniformly represent the
    domain, but only a subset
    – Ineffective feature engineering

    To avoid overfitting:
    – Increase the number of tests
    – Increase the size and variety of the dataset

    Both conditions can be satisfied by applying techniques such
    as k-fold cross-validation
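
    A minimal sketch of how k-fold cross-validation builds its splits:
    the data is cut into k folds, and each fold is used once as the test
    set while the others form the training set. The function name is
    hypothetical and, for brevity, no shuffling is performed.

```python
def k_fold_splits(n_items, k):
    """Return k (train_indices, test_indices) pairs covering every item once."""
    indices = list(range(n_items))
    # Fold sizes differ by at most 1 when n_items is not a multiple of k
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    splits = []
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        splits.append((train_idx, test_idx))
        start += size
    return splits


splits = k_fold_splits(10, 3)
```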

  20. Effectiveness Indicators (for binary output)
    2-class Confusion Matrix:

                   Predicted: YES        Predicted: NO
    Actual: YES    True Positive (TP)    False Negative (FN)
    Actual: NO     False Positive (FP)   True Negative (TN)

    $\text{ACCURACY} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{\text{TRUE PREDICTIONS}}{\text{TOTAL PREDICTIONS}}$

    $\text{PRECISION} = \frac{TP}{TP + FP} = \frac{\text{TRUE POSITIVES}}{\text{PREDICTED POSITIVES}}$

    $\text{RECALL} = \frac{TP}{TP + FN} = \frac{\text{TRUE POSITIVES}}{\text{ACTUAL POSITIVES}}$

    $F_1 = \frac{2 \cdot \text{PRECISION} \cdot \text{RECALL}}{\text{PRECISION} + \text{RECALL}}$

    Effectiveness indicators evaluate the current model
    during both training and test

    The general formulas of such metrics can be applied to
    models with multi-class output
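
    The four indicators can be computed from the confusion matrix counts
    as in the sketch below. The function names and the sample predictions
    are hypothetical; returning 0.0 when a denominator is zero is a
    convention assumed here.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP, FN for binary outputs (1 = YES, 0 = NO)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, tn, fp, fn


def indicators(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from the confusion matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]
counts = confusion_counts(actual, predicted)
scores = indicators(*counts)
```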

  21. Limits of neural networks

    Perceptron cannot perform non-linear
    classifications

    Multi-layer perceptron can approximate
    any function – by adding hidden units, we
    can express more patterns; of course, this
    requires time and computational
    resources, in addition to the risk of
    overfitting

    What about adding further hidden
    layers?

  22. Vanishing gradient problem

    Most unfortunately, adding hidden layers to a
    neural network without a proper strategy
    usually degrades effectiveness.

    Vanishing gradient problem:
    backpropagation becomes less and less
    effective when moving from the output layer
    through the hidden layers up to the input layer,
    where it becomes actually ineffective, leaving
    the weights almost unaltered. It gets worse:
    – As the number of hidden layers increases
    – As the number of links increases
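
    A rough numeric sketch of the effect, assuming sigmoid activations
    and one link of weight 1 per layer: by the chain rule, the error
    signal is multiplied by the sigmoid derivative at each layer, and
    that derivative is at most 0.25.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    """Derivative of the sigmoid: s * (1 - s), never larger than 0.25."""
    s = sigmoid(z)
    return s * (1.0 - s)


def gradient_scale(num_layers, z=0.0, weight=1.0):
    """Factor by which the error signal shrinks after crossing the layers."""
    scale = 1.0
    for _ in range(num_layers):
        scale *= weight * sigmoid_derivative(z)
    return scale


one_layer = gradient_scale(1)    # 0.25, the best case for a sigmoid
ten_layers = gradient_scale(10)  # 0.25 ** 10, below one millionth
```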

  23. Deep Learning – First Idea
    The very heart of the first Deep Learning algorithms is a neural
    network with multiple hidden layers and a final output layer
    (usually logistic regression). In lieu of standard training, there
    are 2 steps:

    PRE-TRAINING
    Individually trains every single hidden layer, starting
    from the one receiving the network input, so that the
    output of a trained layer becomes the input of the following
    layer.
    Pre-training is usually unsupervised, as it focuses on
    patterns: subsequent layers get more fine-grained details.
    The actual training method depends on the DL algorithm.

    FINE-TUNING
    Supervised training of the network as a whole, or just of
    the output layer.
    The vanishing gradient problem is no longer an issue,
    as weights are almost correct.
    Should not use the dataset used for pre-training.

  24. Deep Learning – First algorithms

    Deep Belief Network (DBN): a feed-
    forward network of Restricted
    Boltzmann Machines (RBMs), whose
    theory takes into account the energy of
    the neural network.

    Stacked Denoising Autoencoders
    (SDA): a feed-forward network of
    Denoising Autoencoders (DA), which
    learn by
    – Corruption and encoding: adding noise to the input, then encoding it
    – Decoding: restoring the original, noise-free value

  25. Evolution of Deep Learning

    Pre-training can be skipped without
    degrading effectiveness, and even
    improving performance, provided
    that we introduce other strategies

    Valuable ideas:
    – Dropout → Making the network randomly
    sparse
    – Convolutional neural networks (CNN) →
    Transform pipeline, very effective when
    dealing with multi-dimensional data, such as
    2D or 3D images

  26. Dropout

    Core idea: if backpropagation is affected by network
    density, can we make the network sparse?

    At every iteration, dropout randomly turns off hidden
    units in the network by zeroing the weight of all their
    connected links → Easy and computationally fast

    The process is ruled by a dropout probability –
    usually an engineering parameter manually chosen

    It is similar to the encoding corruption in DAs, but:
    – Dropout occurs only in hidden layers, not in visible layers
    – In DAs, the very same corrupt input data is used at every
    epoch, whereas dropout randomly changes its activation
    masks
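
    A minimal sketch of a dropout mask applied to the outputs of a
    hidden layer. Zeroing a unit's output has the same downstream effect
    as zeroing the weights of its outgoing links. The rescaling of the
    surviving units used by many implementations is omitted, and all
    names and values are hypothetical.

```python
import random


def apply_dropout(activations, drop_probability, rng):
    """Turn off each hidden unit with the given probability."""
    # A fresh random mask is drawn at every call, that is at every iteration
    mask = [0 if rng.random() < drop_probability else 1 for _ in activations]
    dropped = [a * m for a, m in zip(activations, mask)]
    return dropped, mask


rng = random.Random(0)
hidden_outputs = [0.8, 0.1, 0.5, 0.9, 0.3]
first_pass, first_mask = apply_dropout(hidden_outputs, 0.5, rng)
second_pass, second_mask = apply_dropout(hidden_outputs, 0.5, rng)
```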

  27. Dropout effectiveness

    Dropout works, but usually requires:
    – More iterations than pre-training
    – A more effective and efficient activation function
    in layers: a typical sigmoid is not suggested, firstly
    because it saturates at large values

    Rectified Linear Unit (ReLU): f(x) = max(x, 0)
    Activation function much simpler (and faster to
    compute) than the sigmoid, not saturating for
    positive inputs and having a simple derivative.
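
    The two activation functions and their derivatives can be compared
    with the sketch below. Setting the ReLU derivative to 0 at x = 0 is a
    common convention assumed here.

```python
import math


def relu(x):
    """Rectified Linear Unit: f(x) = max(x, 0)."""
    return max(x, 0.0)


def relu_derivative(x):
    """1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)


# At a large input the sigmoid is saturated: its derivative is almost 0,
# whereas the ReLU derivative is still 1
large_input = 10.0
relu_slope = relu_derivative(large_input)
sigmoid_slope = sigmoid_derivative(large_input)
```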

  28. Convolutional Neural Networks
    [Diagram: N-dimensional data → layer pipeline (convolutional
    layers, pooling layers) → 1-dimensional features → traditional
    ML algorithm]

    Each layer applies one or more filters (kernels) to its
    N-dimensional input. Each kernel obtains an M-dimensional
    output (usually, M <= N) called feature map.

    Feature maps become the input for the following layers.

    Convolutional layers actually learn, as kernel values are
    automatically tuned by the algorithm.

    Pooling layers behave as passive filters, performing
    conversions between convolutional layers.

    Layers can be in any order – although usually the first is
    convolutional and the last is pooling (to flatten data to 1-D)

    Backpropagation flows back into the pipeline, leaving
    pooling layers unaltered and training convolutional layers.
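
    A minimal sketch of one convolutional layer followed by one pooling
    layer on a 2-D input, assuming no padding, a stride of 1 for the
    convolution and max-pooling on 2x2 windows. The kernel is hand-picked
    to react to vertical edges, whereas in a real CNN its values would be
    learned. As in most deep learning libraries, the kernel is not
    flipped.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [sum(image[r + i][c + j] * kernel[i][j]
             for i in range(kh) for j in range(kw))
         for c in range(out_w)]
        for r in range(out_h)
    ]


def max_pool(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size window."""
    return [
        [max(feature_map[r + i][c + j]
             for i in range(size) for j in range(size))
         for c in range(0, len(feature_map[0]) - size + 1, size)]
        for r in range(0, len(feature_map) - size + 1, size)
    ]


def flatten(feature_map):
    """Turn a 2-D feature map into 1-D features."""
    return [value for row in feature_map for value in row]


# 5x5 image: dark on the left (0), bright on the right (1)
image = [[0, 0, 1, 1, 1] for _ in range(5)]
vertical_edge_kernel = [[-1, 1], [-1, 1]]

feature_map = convolve2d(image, vertical_edge_kernel)  # 4x4
pooled = max_pool(feature_map)                         # 2x2
features = flatten(pooled)                             # 4 values
```

    The resulting 1-D features could then be passed to a traditional ML
    algorithm, such as logistic regression.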

  29. More about CNNs

    In lieu of fully connected neurons, CNNs are based on
    convolution and pooling – which are both able to provide
    translation invariance

    Both their structure and the property of translation invariance
    make convolutional neural networks suitable for image
    recognition, where they have reached excellent results

    They require several engineering choices:
    – The structure of the pipeline – that is, the type and order of
    layers
    – Number of kernels per convolutional layer
    – Size of each kernel
    – Pooling functions (e.g., max-pooling or average-pooling) and
    sizes

    Machine Learning in general – and convolutional networks in
    particular - usually benefit from GPU cards

  30. Deep Learning today

    Deep Learning is widely adopted in 2
    sectors:
    – Image recognition/tagging: perhaps one
    of the most successful applications, where
    Deep Learning has reached better results
    than humans on some benchmarks
    – Natural language processing (NLP): very
    active research field → in particular, Deep
    Learning in this domain requires dedicated
    structures in order to deal with time and
    context

  31. DL in lieu of traditional ML?

    Silver bullets do not exist – and Deep Learning is no exception

    Deep Learning can reach brilliant effectiveness; however, it has
    drawbacks:
    – Despite the considerable research activity, it’s still a very recent field
    – It is not suitable for simple problems, as it requires huge amounts of data
    – It is also very demanding in terms of computational time and resources
    – It requires a considerable amount of trial-and-error iterations in order to
    set up its many structural parameters

    Therefore, Deep Learning usually shines for problems where
    features are unknown and massive amounts of data are available.

    For simple problems, traditional Machine Learning is often a more
    suitable choice.

  32. A few software libraries

    TensorFlow: created by Google, written in C++ with APIs
    also for Java and Python. https://www.tensorflow.org/

    Deep Learning for Java: Open-Source, Distributed, Deep
    Learning Library for the JVM. https://deeplearning4j.org/
    It employs a fluent, almost declarative API starting from the
    method NeuralNetConfiguration.Builder().
    It also supports convolutional and shallow networks

    Caffe: Deep learning framework made with expression,
    speed, and modularity in mind.
    http://caffe.berkeleyvision.org/

  33. Thanks for your attention! ^__^