Introduction to Machine Learning

Machine Learning – and its recent developments in Deep Learning – is a wide branch of Artificial Intelligence, with countless applications.

This brief introduction tries to concisely describe its basic concepts – with no claim of completeness.

This work was inspired by the beautiful book "Java Deep Learning Essentials" by Yusuke Sugomori, which includes further topics and details, in addition to mathematical descriptions and source code, as well as by Wikipedia.


Gianluca Costa

December 05, 2017


  1. Introduction to Machine Learning including Deep Learning

    Gianluca Costa
  2. Preface

    • Machine Learning – and its recent developments in Deep Learning – is a wide branch of Artificial Intelligence, with countless applications.
    • This brief introduction tries to concisely describe its basic concepts – with no claim of completeness.
    • This work was inspired by:
      – the beautiful book Java Deep Learning Essentials by Yusuke Sugomori, which includes further topics and details, in addition to mathematical descriptions and source code
      – Wikipedia
  3. AI history in brief – 3 ages

    Late 1950s – Classification based on fixed rules
      Breakthroughs: tree algorithms (Depth-First, Breadth-First, ...)
      Focus on: efficiency, cutting branches to simplify massive computations, board games.
      Critical problem: frame problem
    1980s – Knowledge Representation
      Breakthroughs: expert systems, semantic web.
      Focus on: effective representations, granularity, finding a way to conveniently input an immense knowledge base.
      Critical problem: symbol grounding problem
    Machine Learning (ML) – A machine can learn without being explicitly programmed, by tuning the parameters of a model using training data.
      Breakthroughs: several methods and algorithms
      Focus on: pattern recognition, based on important aspects of the dataset (features), multi-class and multi-dimensional processing
      Critical problem: at first, lack of sufficient amounts of good-quality data (as the machine cannot evaluate such quality); then, feature engineering.
  4. The 3 main problems for AI

    • Frame problem: a machine cannot take into account the countless details of a realistic problem, as they would require an infinite description and infinite computation. Machines excel at board games because the domain is limited and simplified.
    • Symbol grounding problem: even if we were able to codify all the knowledge, a machine would still manipulate symbols, not concepts.
    • Feature engineering: a machine can't identify features on its own – they must be provided by the human operator, keeping the domain in mind and often after several iterations.
  5. Concepts and pandas

    [Figure: a SIGN made of a SIGNIFIER (the word PANDA) and a SIGNIFIED (pictures of the animal)]

    • You could easily understand that the pictures above depict the same animal even if you didn't know the term panda, because the human brain can naturally detect features and, therefore, the signified itself.
    • Traditional machine learning algorithms, on the other hand, cannot.
  6. Feature engineering as the root problem

    • If a machine could automatically detect features, the feature engineering problem would be solved, by definition
    • Furthermore, the symbol grounding problem would be overcome as well, because identifying features leads to understanding concepts
    • Finally, a machine able to understand concepts would have no frame problem

    Deep Learning is a set of machine learning methods enabling machines to automatically detect features. In other words, via Deep Learning, a machine can discover new concepts by itself. As an example, Google announced that a deep neural network was able to learn what a cat is after scanning millions of unlabeled images taken from YouTube videos.
  7. Machine Learning at a glance

    [Diagram of areas and methods. Labels: AI, Machine Learning, Neural Networks, Supervised, Unsupervised, Support Vector Machine, Hidden Markov Model, Reinforcement Learning, Logistic Regression, Clustering, Perceptron, Deep Learning, DBN, SDA, MLP, CNN]
  8. Training and Test

    LEARNING = TRAINING + TEST (iterative process)
    • Training: optimizes the model parameters by using the items in the Training dataset
    • Test: ensures that the model obtained during the Training phase is effective on another dataset (called Test dataset)
  9. Supervised and Unsupervised

    • Supervised: the items in the training dataset are labeled, that is, their expected output is provided. ML methods focused on categorization are usually supervised
    • Unsupervised: the items in the training set are provided with no related output. Unsupervised ML usually focuses on the structure of the data: one of the most important examples is clustering
  10. Support Vector Machine (SVM)

    • Given data items as points in an n-dimensional space, let's define the support vectors as the items in a class that are closest to the items of another class
    • We can then define the decision boundary as the hyperplane (or set of hyperplanes) that separates the classes with the largest margin, that is, the largest distance between the boundary and the support vectors
    • Kernel trick: if items cannot be linearly separated, it is possible to map them to a higher-dimensional space, where linear separation is feasible, but at a higher computational cost. In practice, a kernel function computes the inner products of the mapped items without building them explicitly.
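
The kernel trick can be illustrated with a small sketch. It assumes a degree-2 polynomial kernel on 2-D points, chosen only for illustration (the slides do not name a specific kernel); `polynomial_kernel`, `feature_map` and `dot` are hypothetical helper names. The point is that the kernel evaluated in the original space equals the inner product in the mapped, higher-dimensional space.

```python
import math


def polynomial_kernel(x, y):
    """Degree-2 homogeneous polynomial kernel on 2-D points: k(x, y) = (x . y)^2."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2


def feature_map(x):
    """Explicit 2-D -> 3-D map whose inner product reproduces the kernel above."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)


def dot(u, v):
    """Plain inner product of two equally sized tuples."""
    return sum(a * b for a, b in zip(u, v))


# Illustrative points (assumption: any 2-D points would do).
x, y = (1.0, 2.0), (3.0, -1.0)
kernel_value = polynomial_kernel(x, y)                   # computed in the original 2-D space
explicit_value = dot(feature_map(x), feature_map(y))     # computed in the mapped 3-D space
```

Both values agree up to floating-point rounding, so a method that only needs inner products can work in the larger space without ever building the mapped points.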
  11. Logistic Regression

    • It is a regression model (in statistical analysis) where the dependent variable is categorical, that is, discrete
    • The model parameters are:
      – Weight vector
      – Bias value
    • Although structurally very different from neural networks, it actually presents several similarities
    • Can be easily extended to multi-class classification
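
How the weight vector and the bias value produce a categorical output can be sketched as follows. This is a minimal illustration for the binary case, assuming the standard logistic (sigmoid) function; the parameter values are hand-picked for the example, not learned from data.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    """Probability of the positive class: sigmoid(w . x + b)."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


def predict(weights, bias, features, threshold=0.5):
    """Discrete (categorical) output obtained by thresholding the probability."""
    return 1 if predict_proba(weights, bias, features) >= threshold else 0


# Hypothetical parameters, chosen by hand for illustration only.
weights = [2.0, -1.0]
bias = -0.5
```

Training would consist of tuning `weights` and `bias` so that the predicted probabilities match the labels of the training set.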
  12. Neural Networks (NN)

    • Wide variety of methods inspired by the human brain
    • Composed of artificial neurons (named units), grouped in layers and connected by weighted links
    • Neural networks are usually trained by adjusting link weights via an iterative process which employs information on the output error to tune the overall network.
    • Except in deep learning, the input of traditional neural network methods consists of engineered, manually selected features.
    • Another common trait is the learning rate (η), a parameter describing how much the weights are changed at each update. The learning rate:
      – Should not be too small, in order to avoid stagnation in local minimum points
      – Should not be too big, to allow the specific algorithm to converge
    • There are different techniques (momentum, ADAGRAD, ADADELTA ...) to optimize the learning rate, usually involving a gradual reduction of its value.
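
The effect of the learning rate can be seen on a toy problem. The sketch below assumes plain gradient descent on the one-parameter function f(w) = w², which has a single minimum at w = 0; it therefore shows slow progress and divergence, not local minima. The function name and the rate values are illustrative choices.

```python
def gradient_descent(learning_rate, start=1.0, steps=50):
    """Minimize f(w) = w^2 with the update w <- w - learning_rate * f'(w)."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * w          # derivative of w^2
        w -= learning_rate * gradient
    return w


too_small = gradient_descent(0.001)   # barely moves away from the starting point
reasonable = gradient_descent(0.1)    # gets very close to the minimum at 0
too_big = gradient_descent(1.1)       # overshoots more at every step and diverges
```

With the same number of steps, only the intermediate rate ends up near the minimum.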
  13. Perceptron

    • The very first and simplest feed-forward neural network, consisting of just 1 layer of links (between input layer and output layer)
    • Can only perform linear classification for 2 classes.
    • If the items are linearly separable, the value of the learning rate is irrelevant, as in such a case the perceptron algorithm always converges.
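
A minimal sketch of perceptron training, assuming a step activation, 0/1 labels and weights initialized to zero (all illustrative choices). The logical AND function is used as training data because it is linearly separable.

```python
def predict(weights, bias, x):
    """Step activation applied to the weighted sum of the inputs."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0


def train_perceptron(samples, labels, learning_rate=1.0, max_epochs=100):
    """Classic perceptron rule: adjust weights only on misclassified items."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, target in zip(samples, labels):
            delta = target - predict(weights, bias, x)
            if delta != 0:
                weights = [w + learning_rate * delta * xi for w, xi in zip(weights, x)]
                bias += learning_rate * delta
                errors += 1
        if errors == 0:      # every item classified correctly: converged
            break
    return weights, bias


# Logical AND: linearly separable, so training is expected to converge.
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 0, 1]
weights, bias = train_perceptron(samples, labels)
predictions = [predict(weights, bias, x) for x in samples]
```

Replacing the labels with those of XOR (`[0, 1, 1, 0]`) makes the loop run until `max_epochs` without ever reaching zero errors, since XOR is not linearly separable.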
  14. Multi-Layer Perceptron (MLP)

    • Requirement: extending the initial perceptron so as to perform non-linear classification, and multi-class classification as well
    • Key idea: introducing a new layer between input and output
    • Input layer and output layer are called visible layers and are not directly connected: between them there is a hidden layer. Units, too, can be visible or hidden
    • The weights in the hidden layer are often randomly initialized, to avoid always ending up in the same local minimum
    • Backpropagation: generalizes the perceptron's tuning algorithm by calculating the gradient of the error function, propagating the output error back through the hidden layer towards the input layer.
    • The output layer is often just logistic regression
    • MLP can approximate any continuous function (both linear and non-linear), provided there is a suitable number of hidden units
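
Why a hidden layer enables non-linear classification can be sketched with XOR, which a single perceptron cannot represent. The weights below are hand-picked for illustration rather than learned by backpropagation, and the step activation is an assumption made to keep the example short.

```python
def step(z):
    """Step activation: 1 for positive input, 0 otherwise."""
    return 1 if z > 0 else 0


def xor_mlp(x1, x2):
    """Tiny network with 2 hidden units and hand-picked (not learned) weights."""
    h_or = step(x1 + x2 - 0.5)      # hidden unit 1 fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)     # hidden unit 2 fires only if both inputs are 1
    return step(h_or - h_and - 0.5)  # output: "or" but not "and"


outputs = [xor_mlp(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

Each hidden unit draws one linear boundary; the output unit combines the two, which yields a decision region that no single line could produce.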
  15. Stochastic Gradient Descent (SGD)

    • Whenever the parameters of the model have to be tuned using the gradient descent method, computations can quickly become very intensive, as every update requires the gradient over the whole training set
    • When applying SGD, the gradient of each update is computed on a subset (called minibatch) of the data
    • Can be applied to different ML methods
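
A minimal sketch of minibatch SGD, assuming a one-variable linear model y = w·x + b trained with a squared-error loss on synthetic, noise-free data. The model, the data and the hyperparameter values are illustrative assumptions.

```python
import random


def minibatch_sgd(xs, ys, learning_rate=0.1, batch_size=4, epochs=500, seed=0):
    """Fit y = w * x + b, estimating the gradient on one minibatch per update."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)                       # new random minibatches every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = grad_b = 0.0
            for i in batch:
                error = (w * xs[i] + b) - ys[i]
                grad_w += 2.0 * error * xs[i]      # d(error^2)/dw
                grad_b += 2.0 * error              # d(error^2)/db
            w -= learning_rate * grad_w / len(batch)
            b -= learning_rate * grad_b / len(batch)
    return w, b


# Synthetic data generated from y = 2x + 1 (assumption: no noise).
xs = [i / 19 for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
w, b = minibatch_sgd(xs, ys)
```

Each update touches only `batch_size` items instead of the whole dataset, which is what keeps the cost per update low on large datasets.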
  16. Hidden Markov Model

    • Relies on a Markov Stochastic Process, that is, a process satisfying the Markov Property → the future state only depends on the present state, with no direct dependency on previous states
    • The states are hidden: they are not observed directly, only through the outputs they emit
    • Hidden Markov Model is widely employed to detect sequence patterns, such as in gene analysis or Natural Language Processing (NLP)
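
As an illustration of how the Markov Property is used in practice, the sketch below implements the forward algorithm, which computes the probability of an observation sequence under a Hidden Markov Model. The two-state weather model and all of its probabilities are hypothetical values made up for the example.

```python
def forward(observations, states, start_prob, trans_prob, emit_prob):
    """Probability of an observation sequence, summed over all hidden state paths."""
    # alpha[s]: probability of the observations so far, ending in hidden state s
    alpha = {s: start_prob[s] * emit_prob[s][observations[0]] for s in states}
    for obs in observations[1:]:
        # Markov Property: the next state depends only on the current one
        alpha = {
            s: emit_prob[s][obs] * sum(alpha[prev] * trans_prob[prev][s] for prev in states)
            for s in states
        }
    return sum(alpha.values())


# Hypothetical model: hidden weather states, observed activities.
states = ("Rainy", "Sunny")
start_prob = {"Rainy": 0.6, "Sunny": 0.4}
trans_prob = {
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}
emit_prob = {
    "Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}
likelihood = forward(["walk", "shop", "clean"], states, start_prob, trans_prob, emit_prob)
```

Because each step only needs the previous `alpha`, the cost grows linearly with the sequence length instead of exponentially with the number of possible state paths.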
  17. Reinforcement Learning

    • Method with no labeled outputs (often treated as a paradigm of its own, alongside supervised and unsupervised learning), where an agent acts on the given environment, altering it and receiving positive or negative feedback from it
    • Such feedback alters the agent's behavior, as it tries to maximize its performance.
  18. Machine Learning – Basic steps

    1. Choose a Machine Learning algorithm
    2. Perform feature engineering by selecting the features (= aspects of raw data that will be used as input) and possibly converting them to the format required by the algorithm
    3. Split data into a training set and a test set
    4. Adjust the model and its parameters in the training phase, until you get acceptable values of the effectiveness indicators. You might consider returning to point 1.
    5. Apply the model to the test set, check the effectiveness indicators and return to point 4 until satisfied. You might also consider returning to point 1.
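
Step 3 can be sketched as a shuffled split. The function name, the 25% test fraction and the fixed seed are illustrative assumptions; the only requirement taken from the steps above is that the two sets are distinct parts of the same data.

```python
import random


def train_test_split(items, test_fraction=0.25, seed=0):
    """Shuffle the items and split them into non-overlapping training and test sets."""
    rng = random.Random(seed)          # fixed seed: the split is reproducible
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


train_set, test_set = train_test_split(range(20))
```

Shuffling before splitting reduces the risk that the test set covers only one region of the data, for instance when the items are stored sorted by class.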
  19. Overfitting

    • The training dataset and the test dataset must be non-overlapping, as the purpose of the test is to ensure that the model is not too tailored to the training set → overfitting problem
    • Overfitting can have 3 main causes:
      – Noise in the training set; it can be due to unclean data, as well as to exceptional situations
      – Employing a training set that does not uniformly represent the domain, but only a subset
      – Ineffective feature engineering
    • To avoid overfitting:
      – Increase the number of tests
      – Increase the size and variety of the dataset
    • Both conditions can be satisfied by applying techniques such as k-fold cross-validation
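
The index bookkeeping behind k-fold cross-validation can be sketched as follows. This is a minimal version without shuffling, and `k_fold_indices` is a hypothetical helper name: the data is cut into k folds, and each fold is used once as test set while the remaining folds form the training set.

```python
def k_fold_indices(n_items, k):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_items))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(10, 3))
```

Every item is tested exactly once and trained on k - 1 times, which increases both the number of tests and the variety of data seen across runs.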
  20. Effectiveness Indicators (for binary output)

    2-class Confusion Matrix

                    Predicted: YES         Predicted: NO
    Actual: YES     True Positive (TP)     False Negative (FN)
    Actual: NO      False Positive (FP)    True Negative (TN)

    ACCURACY  = (TP + TN) / (TP + TN + FP + FN) = TRUE PREDICTIONS / TOTAL PREDICTIONS
    PRECISION = TP / (TP + FP) = TRUE POSITIVES / PREDICTED POSITIVES
    RECALL    = TP / (TP + FN) = TRUE POSITIVES / ACTUAL POSITIVES
    F1        = 2 * PRECISION * RECALL / (PRECISION + RECALL)

    • Effectiveness indicators evaluate the current model during both training and test
    • The general formulas of such metrics can be applied to models with multi-class output
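
The indicators can be computed directly from the four cells of the confusion matrix. The sketch below uses a small made-up list of labels; returning 0.0 when a denominator is zero is a convention chosen here, not something stated in the slides.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, FN, TN for binary labels (1 = YES, 0 = NO)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn


def indicators(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from the confusion-matrix cells."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0   # convention: 0.0 if undefined
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels for illustration.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(actual, predicted)
scores = indicators(*counts)
```

Precision and recall answer different questions: precision penalizes false positives, recall penalizes false negatives, and F1 combines the two into a single value.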
  21. Limits of neural networks

    • Perceptron cannot perform non-linear classifications
    • Multi-layer perceptron can approximate any function: by adding hidden units, we can express more patterns; of course, this requires time and computational resources, in addition to the risk of overfitting
    • What about adding further hidden layers?
  22. Vanishing gradient problem

    • Most unfortunately, adding hidden layers to a neural network without a proper strategy usually degrades effectiveness.
    • Vanishing gradient problem: backpropagation becomes less and less effective when moving from the output layer through the hidden layers up to the input layer, where it becomes actually ineffective, leaving the weights almost unaltered. It gets worse:
      – As the number of hidden layers increases
      – As the number of links increases
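
A rough numerical intuition for the vanishing gradient, under strong simplifying assumptions: one sigmoid unit per layer, the same weight on every link, and the same pre-activation value everywhere. During backpropagation the error signal is multiplied by one factor per layer, and the derivative of the sigmoid is never larger than 0.25.

```python
import math


def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    """Derivative of the sigmoid; its maximum value is 0.25, reached at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)


def gradient_scale(num_layers, pre_activation=0.0, weight=1.0):
    """Product of the per-layer factors weight * sigmoid'(z) along one path."""
    scale = 1.0
    for _ in range(num_layers):
        scale *= weight * sigmoid_derivative(pre_activation)
    return scale


shallow = gradient_scale(2)    # 0.25 ** 2
deep = gradient_scale(10)      # 0.25 ** 10, below one millionth
```

Even in this best case for the sigmoid (z = 0), ten layers shrink the signal by a factor of about a million, so the weights closest to the input barely change.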
  23. Deep Learning – First Idea

    The very heart of the first Deep Learning algorithms is a neural network with multiple hidden layers and a final output layer (usually logistic regression). In lieu of standard training, there are 2 steps:

    PRE-TRAINING
      • Individually trains every single hidden layer, starting from the one receiving the network input, so that the output of a trained layer becomes the input of the following layer.
      • Pre-training is usually unsupervised, as it focuses on patterns: subsequent layers get more fine-grained details.
      • The actual training method depends on the DL algorithm.

    FINE-TUNING
      • Supervised training of the network as a whole, or just of the output layer.
      • The vanishing gradient problem is no longer an issue, as weights are almost correct.
      • Should not use the dataset used for pre-training.
  24. Deep Learning – First algorithms

    • Deep Belief Network (DBN): a feed-forward network of Restricted Boltzmann Machines (RBMs), whose theory takes into account the energy of the neural network.
    • Stacked Denoising Autoencoders (SDA): a feed-forward network of Denoising Autoencoders (DA), which learn by
      – Encoding: adding noise to the input and mapping the corrupted input to a hidden representation
      – Decoding: restoring the original (noise-free) value
  25. Evolution of Deep Learning

    • Pre-training can be skipped without degrading effectiveness, and even with better performance, provided that we introduce other strategies
    • Valuable ideas:
      – Dropout → Making the network randomly sparse
      – Convolutional neural networks (CNN) → Transform pipeline, very effective when dealing with multi-dimensional data, such as 2D or 3D images
  26. Dropout

    • Core idea: if backpropagation is affected by network density, can we make the network sparse?
    • At every iteration, dropout randomly turns off hidden units in the network by zeroing the weight of all their connected links → Easy and computationally fast
    • The process is ruled by a dropout probability, usually an engineering parameter manually chosen
    • It is similar to the encoding corruption in DAs, but:
      – Dropout occurs only in hidden layers, not in visible layers
      – In DAs, the very same corrupt input data is used at every epoch, whereas dropout randomly changes its activation masks
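
The masking step can be sketched as follows. Zeroing a unit's output has the same effect as zeroing the weights of its outgoing links, so the sketch works on activations. Many implementations also rescale the surviving activations; that refinement is left out here on purpose, and the function name and probability value are illustrative.

```python
import random


def dropout(activations, drop_probability, rng):
    """Turn off each hidden unit independently with the given probability."""
    # mask[i] == 0 means unit i is dropped for this iteration
    mask = [0 if rng.random() < drop_probability else 1 for _ in activations]
    dropped = [a * m for a, m in zip(activations, mask)]
    return dropped, mask


rng = random.Random(0)
hidden = [1.0] * 100
# A new random mask is drawn at every call, that is, at every training iteration.
first_output, first_mask = dropout(hidden, 0.5, rng)
second_output, second_mask = dropout(hidden, 0.5, rng)
```

Because the mask changes at every iteration, each update effectively trains a different sparse sub-network.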
  27. Dropout effectiveness

    • Dropout works, but usually requires:
      – More iterations than pre-training
      – A more effective and efficient activation function in layers: a typical sigmoid is not suggested, firstly because it saturates at large values
    • Rectified Linear Unit (ReLU): f(x) = max(x, 0)
      Activation function much simpler (and faster to compute) than the sigmoid, which does not saturate for positive inputs and has a simple derivative.
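
The difference between the two activation functions shows up in their derivatives. This is a minimal sketch; setting the derivative of ReLU at exactly 0 to 0.0 is a common convention assumed here.

```python
import math


def relu(x):
    """Rectified Linear Unit: f(x) = max(x, 0)."""
    return max(x, 0.0)


def relu_derivative(x):
    """1 for positive inputs, 0 otherwise (convention: 0 at x == 0)."""
    return 1.0 if x > 0 else 0.0


def sigmoid(x):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x):
    """Derivative of the sigmoid; close to 0 when |x| is large (saturation)."""
    s = sigmoid(x)
    return s * (1.0 - s)


large_input = 10.0
relu_slope = relu_derivative(large_input)        # stays at 1.0
sigmoid_slope = sigmoid_derivative(large_input)  # almost 0: the sigmoid has saturated
```

For large positive inputs the sigmoid passes almost no gradient back, while ReLU passes it unchanged.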
  28. Convolutional Neural Networks

    [Diagram: N-dimensional data → layer pipeline (convolutional layers, pooling layers) → 1-dimensional features → traditional ML algorithm]

    • Each layer applies one or more filters (kernels) to its N-dimensional input.
    • Each kernel produces an M-dimensional output (usually, M <= N) called feature map. Feature maps become the input for the following layers.
    • Convolutional layers actually learn, as kernel values are automatically tuned by the algorithm.
    • Pooling layers behave as passive filters, performing conversions between convolutional layers.
    • Layers can be in any order, although usually the first is convolutional and the last is pooling (to flatten data to 1-D)
    • Backpropagation flows back into the pipeline, leaving pooling layers unaltered and training convolutional layers.
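
One convolutional layer followed by one pooling layer can be sketched in plain Python. The kernel below is a hand-picked vertical-edge detector (in a real CNN its values would be learned), and the operation is implemented without flipping the kernel, as deep learning libraries usually do. The image and the sizes are illustrative.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) to build a feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool2d(feature_map, size):
    """Keep the maximum of every non-overlapping size x size block."""
    return [
        [
            max(feature_map[r + i][c + j] for i in range(size) for j in range(size))
            for c in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for r in range(0, len(feature_map) - size + 1, size)
    ]


# 5x5 image with a vertical edge between the second and third column.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
edge_kernel = [[-1, 1], [-1, 1]]     # hand-picked, not learned
feature_map = convolve2d(image, edge_kernel)
pooled = max_pool2d(feature_map, 2)
```

The feature map responds only where the edge is, and pooling keeps that response while halving each dimension.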
  29. More about CNNs

    • In lieu of fully connected neurons, CNNs are based on convolution and pooling, which are both able to provide translation invariance
    • Both their structure and the property of translation invariance make convolutional neural networks suitable for image recognition, where they have reached excellent results
    • They require several engineering choices:
      – The structure of the pipeline, that is, the type and order of layers
      – Number of kernels per convolutional layer
      – Size of each kernel
      – Pooling functions (e.g., max-pooling or average-pooling) and sizes
    • Machine Learning in general, and convolutional networks in particular, usually benefit from GPU cards
  30. Deep Learning today

    • Deep Learning is widely adopted in 2 sectors:
      – Image recognition/tagging: perhaps one of the most successful applications, where Deep Learning has matched or exceeded human results on some benchmarks
      – Natural language processing (NLP): very active research field → in particular, Deep Learning in this domain requires dedicated structures in order to deal with time and context
  31. DL in lieu of traditional ML?

    • Silver bullets do not exist, and Deep Learning is no exception
    • Deep Learning can reach brilliant effectiveness; however, it has drawbacks:
      – Despite the considerable research activity, it's still a very recent field
      – It is not suitable for simple problems, as it requires huge amounts of data
      – It is also very demanding in terms of computational time and resources
      – It requires a considerable amount of trial-and-error iterations in order to set up its many structural parameters
    • Therefore, Deep Learning usually shines for problems where features are unknown and massive amounts of data are available.
    • For simple problems, traditional Machine Learning is often a more suitable choice.
  32. A few software libraries

    • TensorFlow: created by Google, written in C++ with APIs also for Java and Python.
    • Deeplearning4j (Deep Learning for Java): Open-Source, Distributed, Deep Learning Library for the JVM. It employs a fluent, almost declarative API starting from the builder class NeuralNetConfiguration.Builder. It also supports convolutional and shallow networks
    • Caffe: Deep learning framework made with expression, speed, and modularity in mind.
  33. Thanks for your attention! ^__^