And Then There Are Algorithms – Part 1

Machine Learning for the Enterprise Conference, Rome, October 28th, 2019

Machine Learning = Algorithms + Data + Tools

Part 1

Danilo Poccia

October 28, 2019

Transcript

  1. © 2019, Amazon Web Services, Inc. or its Affiliates.
    Danilo Poccia
    Principal Evangelist
    AWS
    @danilop
    danilop.net
    And Then There Are Algorithms

  2. Credit: Gerry Cranham/Fox Photos/Getty Images
    http://www.telegraph.co.uk/travel/destinations/europe/united-kingdom/england/london/galleries/The-history-of-the-Tube-in-pictures-150-years-of-London-Underground/1939-ticket-examin/

  3. 1939 London Underground
    Credit: Gerry Cranham/Fox Photos/Getty Images
    http://www.telegraph.co.uk/travel/destinations/europe/united-kingdom/england/london/galleries/The-history-of-the-Tube-in-pictures-150-years-of-London-Underground/1939-ticket-examin/

  4. Data Predictions

  5. Data Model Predictions

  6. http://www.thehudsonvalley.com/articles/60-years-ago-today-local-technology-demonstrated-artificial-intelligence-for-the-first-time
    1959 Arthur Samuel

  7. Machine Learning

  8. Machine Learning
    Supervised Learning: inferring a model from labeled training data

  9. Machine Learning
    Supervised Learning: inferring a model from labeled training data
    Unsupervised Learning: inferring a model to describe hidden structure from unlabeled data

  10. Machine Learning
    Supervised Learning
    Unsupervised Learning
    Reinforcement Learning: achieving a goal in a dynamic environment

  11. Reinforcement
    Learning
    Driving a vehicle
    Playing a game
    against an opponent

  12. Unsupervised
    Learning Clustering

  13. Unsupervised
    Learning Clustering

  14. Re:Tip Try topic modeling with your own emails ;-)
    Unsupervised
    Learning Topic Modeling
    Discovering abstract “topics”
    that occur in a collection of documents
    For example, looking for “infrequent” words
    that are used more often in a document
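
The hint above (words that are rare across the collection but frequent inside one document) is the intuition behind TF-IDF weighting. TF-IDF is not a topic model by itself, so the following is only a minimal sketch of that intuition; the function name `tf_idf` and the three sample documents are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: score} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            # Term frequency times inverse document frequency.
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

# Hypothetical mini-collection of "emails".
docs = [
    "the train is late the train is full",
    "the meeting is late",
    "the invoice is due",
]
scores = tf_idf(docs)
top_word = max(scores[0], key=scores[0].get)  # most characteristic word of docs[0]
```

Words that appear in every document (such as "the") score zero, while a word that is repeated in one document only stands out.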

  15. Supervised
    Learning
    Regression: “How many bikes will be rented tomorrow?” → 1, 0, 100K
    Binary Classification: “Is this email spam?” → Yes / No, True / False, %
    Multi-Class Classification: “What is the sentiment of this tweet, or of this social media comment?” → Happy, Sad, Angry, Confused, Disgusted, Surprised, Calm, Unknown

  16. Supervised
    Learning
    Training the Model
    Minimizing the Error
    of using the Model on the Labeled Data

  17. Be Careful of Overfitting
    Supervised
    Learning

  18. Be Careful of Overfitting
    Supervised
    Learning

  19. Be Careful of Overfitting
    Supervised
    Learning

  20. Better Fitting
    Supervised
    Learning

  21. Supervised
    Learning Better Fitting

  22. Supervised
    Learning Different Models ⇒ Different Predictions

  23. Supervised
    Learning
    Validation
    How well is this Model working on New Data?

  24. Supervised
    Learning
    Labeled Data

  25. Supervised
    Learning
    Labeled Data
    70%
    30%
    Training
    Validation
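
A 70% / 30% split of the labeled data can be sketched as follows. The helper name `split_labeled_data`, the fixed seed, and the toy data are assumptions for illustration; shuffling before the cut is a common precaution so that both parts look alike.

```python
import random

def split_labeled_data(examples, train_fraction=0.7, seed=42):
    """Shuffle a copy of the labeled examples and cut it into two parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical labeled data: (input, label) pairs.
labeled = [(x, 2 * x + 1) for x in range(10)]
training, validation = split_labeled_data(labeled)
```

The model is then fitted on `training` only, and `validation` is kept aside to answer "How well is this model working on new data?".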

  26. Letter from Ada Lovelace to Charles Babbage 1843
    In this letter, Lovelace suggests an example of a calculation
    which “may be worked out by the engine without having been
    worked out by human head and hands first”.

  27. Science Museum Group Collection
    © The Board of Trustees of the Science Museum

  28. Diagram of an algorithm for the Analytical Engine for the computation of Bernoulli numbers, from Sketch of
    The Analytical Engine Invented by Charles Babbage by Luigi Menabrea with notes by Ada Lovelace

  29. “You use code to tell a computer what to do.
    Before you write code you need an algorithm.
    An algorithm is a list of rules to follow
    in order to solve a problem.”
    BBC Bitesize
    What is an Algorithm?
    https://commons.wikimedia.org/wiki/File:Euclid_flowchart.svg
    By Somepics (Own work) [CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)], via Wikimedia Commons
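
As an example of "a list of rules to follow", here is Euclid's algorithm for the greatest common divisor in its remainder-based form. The referenced flowchart may draw a different variant of the same algorithm, so treat this as a sketch rather than a transcription of the figure.

```python
def gcd(a, b):
    """Greatest common divisor of two non-negative integers."""
    while b != 0:
        # Replace the pair (a, b) with (b, a mod b) until nothing is left over.
        a, b = b, a % b
    return a
```

For instance, `gcd(1071, 462)` goes through the pairs (462, 147), (147, 21), (21, 0) and returns 21.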

  30. Linear Learner
    Regression
    Estimate a real valued function
    Binary Classification
    Predict a 0/1 class
    Supervised
    Classification, Regression

  31. Bike Sharing Prediction (Regression)
    Date         Time    Temperature (Celsius)   Relative Humidity   Rain (mm/h)   Rented Bikes
    2018-04-01   08:30   13                      64                  2             45
    2018-04-01   11:30   18                      57                  0             156
    2018-04-02   08:30   14                      69                  8             87
    2018-04-02   11:30   17                      73                  12            34
    …            …       …                       …                   …             …

  32. Bike Sharing Prediction (Regression)
    Date         Time    Temperature (Celsius)   Relative Humidity   Rain (mm/h)   Rented Bikes
    2018-04-01   08:30   13                      64                  2             45
    2018-04-01   11:30   18                      57                  0             156
    2018-04-02   08:30   14                      69                  8             87
    2018-04-02   11:30   17                      73                  12            34
    2018-06-14   16:30   23                      56                  0             ???
    Date & Time

  33. Bike Sharing Prediction (Regression)
    Day of the Year   Weekday   Public Holiday   Time (seconds)   Temperature (Celsius)   Relative Humidity   Rain (mm/h)   Rented Bikes
    91                7         1                30600            13                      64                  2             45
    91                7         1                41400            18                      57                  0             156
    92                1         1                30600            14                      69                  8             87
    92                1         1                41400            17                      73                  12            34
    104               6         0                59400            23                      56                  0             ???
    Date & Time (Feature Engineering)
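
The date and time columns can be turned into numeric features along these lines. This is a sketch: the function name `date_time_features` and the holiday set are assumptions, and the weekday numbering (1 = Monday to 7 = Sunday) is inferred from the table.

```python
from datetime import datetime

def date_time_features(timestamp, public_holidays=frozenset()):
    """Convert a 'YYYY-MM-DD HH:MM' string into numeric features."""
    moment = datetime.strptime(timestamp, "%Y-%m-%d %H:%M")
    return {
        "day_of_year": moment.timetuple().tm_yday,
        "weekday": moment.isoweekday(),  # 1 = Monday ... 7 = Sunday
        "public_holiday": int(moment.date().isoformat() in public_holidays),
        "time_seconds": moment.hour * 3600 + moment.minute * 60,
    }

# Assumed holiday calendar: Easter Sunday and Easter Monday 2018.
holidays = {"2018-04-01", "2018-04-02"}
features = date_time_features("2018-04-01 08:30", holidays)
```

With these assumptions, the first row of the table (2018-04-01 08:30) maps to day 91, weekday 7, public holiday 1, and 30600 seconds.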

  34. Linear Learner
    basis functions
    basis functions can be nonlinear
    Supervised
    Classification, Regression
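
A model stays "linear" as long as it is linear in its weights, even when the basis functions applied to the input are not. The sketch below illustrates this; the particular basis functions and weights are arbitrary choices for the example, not the ones used by any specific Linear Learner implementation.

```python
import math

# Arbitrary example basis functions: constant, linear, quadratic, sine.
BASIS_FUNCTIONS = [
    lambda x: 1.0,
    lambda x: x,
    lambda x: x ** 2,
    lambda x: math.sin(x),
]

def predict(weights, x):
    """Weighted sum of the basis functions evaluated at x."""
    return sum(w * phi(x) for w, phi in zip(weights, BASIS_FUNCTIONS))

weights = [1.0, 0.5, -2.0, 3.0]
doubled = [2 * w for w in weights]
```

Doubling every weight doubles the prediction for any `x`, which is what linearity in the weights means, even though the prediction is a nonlinear curve in `x`.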

  35. Minimizing the Error
    you know the expected values
    (use separate datasets for
    training and validation)
    this is always positive
    (convex function)
    Supervised

  36. Objective Function
    loss function: measures how predictive our model is on your data
    regularization term: measures the complexity of the model
    Supervised
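
One common way to combine the two terms is a mean squared error loss plus an L2 penalty on the weights. The slide does not say which loss or regularizer is meant, so both choices, the `strength` parameter, and the toy examples below are assumptions.

```python
def squared_error_loss(weights, examples):
    """Mean squared error of a linear model on (features, target) pairs."""
    total = 0.0
    for features, target in examples:
        prediction = sum(w * x for w, x in zip(weights, features))
        total += (prediction - target) ** 2
    return total / len(examples)

def l2_regularization(weights):
    """Sum of squared weights: larger weights count as a more complex model."""
    return sum(w * w for w in weights)

def objective(weights, examples, strength=0.1):
    """Loss term plus weighted regularization term."""
    return squared_error_loss(weights, examples) + strength * l2_regularization(weights)

# Toy data generated by target = 1 * x0 + 2 * x1.
examples = [([1.0, 2.0], 5.0), ([1.0, 3.0], 7.0)]
```

With weights `[1.0, 2.0]` the loss is zero on this data, so the objective reduces to the penalty alone: 0.1 × (1 + 4) = 0.5.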

  37. Stochastic Gradient Descent (SGD)
    https://en.wikipedia.org/wiki/Himmelblau's_function
    Global
    Vs
    Local
    Minimum
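
To see how the starting point decides which minimum is reached, here is plain (non-stochastic) gradient descent on Himmelblau's function. This is a sketch: the learning rate, step count, and starting points are assumptions picked for the example. Himmelblau's function has four minima that all have value 0; on a general non-convex function the minima would differ in value, which is where the global versus local distinction matters.

```python
def himmelblau(x, y):
    return (x**2 + y - 11) ** 2 + (x + y**2 - 7) ** 2

def gradient(x, y):
    """Partial derivatives of Himmelblau's function."""
    dx = 4 * x * (x**2 + y - 11) + 2 * (x + y**2 - 7)
    dy = 2 * (x**2 + y - 11) + 4 * y * (x + y**2 - 7)
    return dx, dy

def gradient_descent(start, learning_rate=0.002, steps=20000):
    """Repeatedly step against the gradient from the given start point."""
    x, y = start
    for _ in range(steps):
        dx, dy = gradient(x, y)
        x -= learning_rate * dx
        y -= learning_rate * dy
    return x, y

# Assumed starting points, one per quadrant direction around the origin.
starts = [(0.0, 0.0), (-1.0, 1.0), (-1.0, -1.0), (1.0, -1.0)]
minima = [gradient_descent(start) for start in starts]
```

Stochastic gradient descent follows the same update rule but estimates the gradient from a random subset of the training data at each step.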

  38. Factorization Machines
    • It is an extension of a linear model that is
    designed to parsimoniously capture
    interactions between features within high
    dimensional sparse datasets
    • Factorization machines are a good choice for
    tasks such as click prediction and item
    recommendation
    • They are usually trained by stochastic gradient
    descent (SGD), alternating least squares (ALS),
    or Markov chain Monte Carlo (MCMC)
    Factorization Machines
    Steffen Rendle
    Department of Reasoning for Intelligence
    The Institute of Scientific and Industrial Research
    Osaka University, Japan
    [email protected]
    Abstract—In this paper, we introduce Factorization Machines
    (FM) which are a new model class that combines the advantages
    of Support Vector Machines (SVM) with factorization models.
    Like SVMs, FMs are a general predictor working with any
    real valued feature vector. In contrast to SVMs, FMs model all
    interactions between variables using factorized parameters. Thus
    they are able to estimate interactions even in problems with huge
    sparsity (like recommender systems) where SVMs fail. We show
    that the model equation of FMs can be calculated in linear time
    and thus FMs can be optimized directly. So unlike nonlinear
    SVMs, a transformation in the dual form is not necessary and
    the model parameters can be estimated directly without the need
    of any support vector in the solution. We show the relationship
    to SVMs and the advantages of FMs for parameter estimation
    in sparse settings.
    On the other hand there are many different factorization mod-
    els like matrix factorization, parallel factor analysis or specialized
    models like SVD++, PITF or FPMC. The drawback of these
    models is that they are not applicable for general prediction tasks
    but work only with special input data. Furthermore their model
    equations and optimization algorithms are derived individually
    for each task. We show that FMs can mimic these models just
    by specifying the input data (i.e. the feature vectors). This makes
    FMs easily applicable even for users without expert knowledge
    in factorization models.
    Index Terms—factorization machine; sparse data; tensor fac-
    torization; support vector machine
    I. INTRODUCTION
    Support Vector Machines are one of the most popular
    predictors in machine learning and data mining. Nevertheless
    in settings like collaborative filtering, SVMs play no important
    role and the best models are either direct applications of
    standard matrix/ tensor factorization models like PARAFAC
    [1] or specialized models using factorized parameters [2], [3],
    [4]. In this paper, we show that the only reason why standard
    SVM predictors are not successful in these tasks is that they
    cannot learn reliable parameters (‘hyperplanes’) in complex
    (non-linear) kernel spaces under very sparse data. On the other
    hand, the drawback of tensor factorization models and even
    more for specialized factorization models is that (1) they are
    not applicable to standard prediction data (e.g. a real valued
    feature vector in Rn.) and (2) that specialized models are
    usually derived individually for a specific task requiring effort
    in modelling and design of a learning algorithm.
    In this paper, we introduce a new predictor, the Factor-
    ization Machine (FM), that is a general predictor like SVMs
    but is also able to estimate reliable parameters under very
    high sparsity. The factorization machine models all nested
    variable interactions (comparable to a polynomial kernel in
    SVM), but uses a factorized parametrization instead of a
    dense parametrization like in SVMs. We show that the model
    equation of FMs can be computed in linear time and that it
    depends only on a linear number of parameters. This allows
    direct optimization and storage of model parameters without
    the need of storing any training data (e.g. support vectors) for
    prediction. In contrast to this, non-linear SVMs are usually
    optimized in the dual form and computing a prediction (the
    model equation) depends on parts of the training data (the
    support vectors). We also show that FMs subsume many of
    the most successful approaches for the task of collaborative
    filtering including biased MF, SVD++ [2], PITF [3] and FPMC
    [4].
    In total, the advantages of our proposed FM are:
    1) FMs allow parameter estimation under very sparse data
    where SVMs fail.
    2) FMs have linear complexity, can be optimized in the
    primal and do not rely on support vectors like SVMs.
    We show that FMs scale to large datasets like Netflix
    with 100 millions of training instances.
    3) FMs are a general predictor that can work with any real
    valued feature vector. In contrast to this, other state-of-
    the-art factorization models work only on very restricted
    input data. We will show that just by defining the feature
    vectors of the input data, FMs can mimic state-of-the-art
    models like biased MF, SVD++, PITF or FPMC.
    II. PREDICTION UNDER SPARSITY
    The most common prediction task is to estimate a function
    y : Rn → T from a real valued feature vector x ∈ Rn to a
    target domain T (e.g. T = R for regression or T = {+, −}
    for classification). In supervised settings, it is assumed that
    there is a training dataset D = {(x(1), y(1)), (x(2), y(2)), . . .}
    of examples for the target function y given. We also investigate
    the ranking task where the function y with target T = R can
    be used to score feature vectors x and sort them according to
    their score. Scoring functions can be learned with pairwise
    training data [5], where a feature tuple (x(A), x(B)) ∈ D
    means that x(A) should be ranked higher than x(B). As the
    pairwise ranking relation is antisymmetric, it is sufficient to
    use only positive training instances.
    In this paper, we deal with problems where x is highly
    sparse, i.e. almost all of the elements xi of a vector x are
    zero. Let m(x) be the number of non-zero elements in the
    2010
    Supervised
    Classification, Regression

  39. Factorization Machines
    Source: data-artisans.com
    2010
    Supervised
    Classification, Regression

  40. Vectors ⇾ “Bearer of Information”
    how much are
    they related?

  41. Factorization Machines
    not in a Linear Learner
    2010
    Supervised
    Classification, Regression
    Alternating least squares (ALS)
    features
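
The degree-2 factorization machine from Rendle's paper adds, on top of a linear model, one interaction term per pair of features, weighted by the dot product of two learned factor vectors. The sketch below computes the prediction both ways: with the explicit pairwise sum and with the linear-time rewrite mentioned in the abstract. All parameter values are made up for illustration; in practice they are learned.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def fm_predict_pairwise(x, w0, w, v):
    """Bias + linear part + sum over all feature pairs i < j."""
    n = len(x)
    interactions = sum(
        dot(v[i], v[j]) * x[i] * x[j]
        for i in range(n) for j in range(i + 1, n)
    )
    return w0 + dot(w, x) + interactions

def fm_predict_linear_time(x, w0, w, v):
    """Same value, computed per factor in time linear in the number of features."""
    k = len(v[0])
    interactions = 0.0
    for f in range(k):
        sum_vx = sum(v[i][f] * x[i] for i in range(len(x)))
        sum_v2x2 = sum((v[i][f] * x[i]) ** 2 for i in range(len(x)))
        interactions += 0.5 * (sum_vx ** 2 - sum_v2x2)
    return w0 + dot(w, x) + interactions

# Hypothetical parameters: 4 features, k = 2 factors per feature.
x = [1.0, 0.0, 1.0, 0.5]
w0 = 0.1
w = [0.2, -0.1, 0.4, 0.3]
v = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4], [0.5, 0.1]]
```

The pairwise interaction term is the part that a plain linear model does not have.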

  42. Factorization Machines (k=4)
    Movie           1 (action)   2 (romantic)   3 (thriller)   4 (horror)
    Blade Runner    0.4          0.3            0.5            0.2
    Notting Hill    0.2          0.8            0.1            0.01
    Arrival         0.2          0.4            0.6            0.1
    But you cannot really control how features are used!
    2010
    Supervised
    Classification, Regression
    Intuitively, each “feature” describes a property of the “items”
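
One way to ask "how much are they related?" is the dot product of two factor vectors. The sketch below uses the k=4 values from the table above; the helper name `relatedness` is an assumption.

```python
# Factor vectors from the table (action, romantic, thriller, horror).
factors = {
    "Blade Runner": [0.4, 0.3, 0.5, 0.2],
    "Notting Hill": [0.2, 0.8, 0.1, 0.01],
    "Arrival": [0.2, 0.4, 0.6, 0.1],
}

def relatedness(first, second):
    """Dot product of the factor vectors of two movies."""
    return sum(a * b for a, b in zip(factors[first], factors[second]))

sci_fi_pair = relatedness("Blade Runner", "Arrival")
mixed_pair = relatedness("Blade Runner", "Notting Hill")
```

With these numbers, Blade Runner comes out as more related to Arrival (about 0.52) than to Notting Hill (about 0.37).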

  43. K-Nearest Neighbors (k-NN)
    1991
    Supervised
    Classification, Regression
    An Introduction to Kernel and Nearest Neighbor
    Nonparametric Regression
    N. S. Altman*
    Biometrics Unit
    Cornell University
    Ithaca, NY14853
    ABSTRACT
    Nonparametric regression is a set of techniques for estimating a regression curve
    without making strong assumptions about the shape of the true regression function. These
    techniques are therefore useful for building and checking parametric models, as well as for
    data description. Kernel and nearest neighbor regression estimators are local versions
    of univariate location estimators, and so they can readily be introduced to beginning
    students, and consulting clients who are familiar with such summaries as the sample
    mean and median.
    Key Words: Confidence intervals; Local linear regression; Model building; Model check-
    ing; Smoothing.
    *N. S. Altman is Assistant Professor, Biometrics Unit, Cornell University, Ithaca,
    NY14853. Preparation of this article was partially funded by Hatch Grant 151410 NYF.
    The author thanks C. McCulloch for comments that substantially improved this arti-
    cle and L. Molinari and H. Henderson, for providing the data used in Examples B and
    C. Two anonymous referees and an associate editor provided numerous comments that
    substantially improved the presentation of the material.
    By Alisneaky, svg version by User:Zirguezi - Own work, CC BY-SA 4.0,
    https://commons.wikimedia.org/w/index.php?curid=47868867
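
A k-NN classifier can be sketched in a few lines: find the k training points closest to the query and let them vote. Euclidean distance, k = 3, and the toy points below are assumptions for the example.

```python
import math
from collections import Counter

def knn_classify(training, query, k=3):
    """Majority label among the k training points nearest to the query."""
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labeled points: (coordinates, label).
training = [
    ((1.0, 1.0), "red"), ((1.2, 0.8), "red"), ((0.9, 1.1), "red"),
    ((4.0, 4.2), "blue"), ((4.1, 3.9), "blue"), ((3.8, 4.0), "blue"),
]
```

For regression, the vote would be replaced by the average of the neighbors' target values.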

  44. K-Means Clustering
    SOME METHODS FOR
    CLASSIFICATION AND ANALYSIS
    OF MULTIVARIATE OBSERVATIONS
    J. MACQUEEN
    UNIVERSITY OF CALIFORNIA, LOS ANGELES
    1. Introduction
    The main purpose of this paper is to describe a process for partitioning an
    N-dimensional population into k sets on the basis of a sample. The process,
    which is called 'k-means,' appears to give partitions which are reasonably
    efficient in the sense of within-class variance. That is, if p is the probability mass
    function for the population, $S = \{S_1, S_2, \ldots, S_k\}$ is a partition of $E_N$, and $u_i$,
    $i = 1, 2, \ldots, k$, is the conditional mean of $p$ over the set $S_i$, then
    $W^2(S) = \sum_{i=1}^{k} \int_{S_i} |z - u_i|^2 \, dp(z)$ tends to be low for the partitions $S$ generated by the
    method. We say 'tends to be low,' primarily because of intuitive considerations,
    corroborated to some extent by mathematical analysis and practical computa-
    tional experience. Also, the k-means procedure is easily programmed and is
    computationally economical, so that it is feasible to process very large samples
    on a digital computer. Possible applications include methods for similarity
    grouping, nonlinear prediction, approximating multivariate distributions, and
    nonparametric tests for independence among several variables.
    In addition to suggesting practical classification methods, the study of k-means
    has proved to be theoretically interesting. The k-means concept represents a
    generalization of the ordinary sample mean, and one is naturally led to study the
    pertinent asymptotic behavior, the object being to establish some sort of law of
    large numbers for the k-means. This problem is sufficiently interesting, in fact,
    for us to devote a good portion of this paper to it. The k-means are defined in
    section 2.1, and the main results which have been obtained on the asymptotic
    behavior are given there. The rest of section 2 is devoted to the proofs of these
    results. Section 3 describes several specific possible applications, and reports
    some preliminary results from computer experiments conducted to explore the
    possibilities inherent in the k-means idea. The extension to general metric spaces
    is indicated briefly in section 4.
    The original point of departure for the work described here was a series of
    problems in optimal classification (MacQueen [9]) which represented special
    This work was supported by the Western Management Science Institute under a grant from
    the Ford Foundation, and by the Office of Naval Research under Contract No. 233(75), Task
    No. 047-041.
    Bulletin de l'Académie Polonaise des Sciences
    Cl. III, Vol. IV, No. 12, 1956
    MATHEMATICS
    Sur la division des corps matériels en parties (On the division of material bodies into parts) [1]
    by H. STEINHAUS
    Presented on 19 October 1956
    A body Q is, by definition, a distribution of matter in space, given by a function f(P); this
    function is called the density of the body in question; it is defined for all points P of space;
    it is non-negative and measurable. The characteristic set of the body, $E = E_P\{f(P) > 0\}$, is
    assumed to be bounded and of positive measure; the integral of f(P) over E is also assumed to be
    finite: it is the mass of the body Q. Two bodies whose densities are equal up to a set of measure
    zero are considered identical.
    By decomposing the characteristic set of a body Q into n subsets $E_i$ (i = 1, 2, ..., n) of
    positive measure, one obtains a division of the body in question into n partial bodies; their
    respective characteristic sets are the $E_i$, and their densities are defined by the values that
    the density of the body Q takes on these partial sets. Denoting the partial bodies by $Q_i$, we
    write $Q = Q_1 + Q_2 + \ldots + Q_n$. When n bodies $Q_i$ are given first, whose characteristic
    sets are pairwise disjoint up to measure zero, there obviously exists a body Q having these $Q_i$
    as its parts; we write $Q_1 + Q_2 + \ldots + Q_n = Q$. These remarks suffice to explain the
    division and the composition of bodies.
    The problem of this Note is the division of a body into n parts $K_i$ (i = 1, 2, ..., n) and the
    choice of n points $A_i$ so as to make as small as possible the sum
    (1) $S(K, A) = \sum_{i=1}^{n} I(K_i, A_i)$, with $K \equiv \{K_i\}$, $A \equiv \{A_i\}$,
    where I(Q, P) denotes, in general, the moment of inertia of an arbitrary body Q with respect to
    an arbitrary point P. To treat this elementary problem we shall use the following lemmas:
    [1] This article by Hugo Steinhaus is the first to formulate explicitly, in finite dimension, the
    problem of partitioning by k-means (k-moyennes), also called "nuées dynamiques". Its classical
    algorithm is the same as that of Lloyd-Max optimal quantization. As it is hard to access in
    digital form, it is given here as converted by Maciej Denkowski, passed on by Jérôme Bolte, and
    transcribed by Laurent Duval, in July/August 2015. An effort was made to stay close to the
    original pagination.
    1956-1967
    Unsupervised
    Clustering

  45. K-Means Clustering
    1956-1967
    Unsupervised
    Clustering
    Clustering converges
    when the centers
    “don’t move” anymore
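
The stopping rule above can be sketched with the standard assign-then-update loop (often called Lloyd's algorithm). The hand-picked initial centers and the sample points are assumptions; real implementations usually choose initial centers at random or with a seeding heuristic.

```python
import math

def k_means(points, centers, max_iterations=100):
    """Alternate assignment and update steps until the centers stop moving."""
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for point in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(point, centers[i]))
            clusters[nearest].append(point)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
        if new_centers == centers:  # converged: the centers did not move
            break
        centers = new_centers
    return centers, clusters

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centers, clusters = k_means(points, centers=[(0.0, 0.0), (10.0, 10.0)])
```

On this toy data the loop stops after the second pass, because the second assignment step reproduces the same two groups.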

  46. Principal Component Analysis (PCA)
    • PCA is an unsupervised learning algorithm that
    attempts to reduce the dimensionality (number
    of features) within a dataset while still retaining
    as much information as possible
    • This is done by finding a new set of features
    called components, which are composites of the
    original features that are uncorrelated with one
    another
    • They are also constrained so that the first
    component accounts for the largest possible
    variability in the data, the second component the
    second most variability, and so on
    Pearson, K. 1901. On lines and planes of closest fit to systems of points in space. Philosophical Magazine 2:559-572.
    http://pbil.univ-lyon1.fr/R/pearson1901.pdf
    1901
    Unsupervised
    Dimensionality Reduction
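
The first component can be approximated without a linear algebra library by running power iteration on the covariance matrix. This is a sketch under assumptions: power iteration is swapped in here for the eigendecomposition or SVD that libraries normally use, the all-ones starting vector must not be orthogonal to the component, and the sample points are made up.

```python
import math

def first_principal_component(points, iterations=200):
    """Unit vector along the direction of largest variance."""
    n = len(points)
    dims = len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(dims)]
    centered = [[p[d] - means[d] for d in range(dims)] for p in points]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(row[a] * row[b] for row in centered) / (n - 1) for b in range(dims)]
        for a in range(dims)
    ]
    vector = [1.0] * dims  # assumed starting vector
    for _ in range(iterations):
        # Multiply by the covariance matrix, then renormalize.
        vector = [sum(cov[a][b] * vector[b] for b in range(dims)) for a in range(dims)]
        norm = math.sqrt(sum(c * c for c in vector))
        vector = [c / norm for c in vector]
    return vector

# Hypothetical 2-D points lying close to the line y = x.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]
component = first_principal_component(points)
```

Projecting each centered point onto `component` gives a single number per point, which is the dimensionality reduction from two features to one.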

  47. Principal Component Analysis (PCA)
    1901
    Unsupervised
    Dimensionality Reduction

  48. Principal Component Analysis (PCA)
    1901
    Unsupervised
    Dimensionality Reduction

  49. XGBoost
    • Ensemble methods use multiple learning
    algorithms to improve predictions
    • Boosting: “Can a set of weak learners create a
    single strong learner?”
    • eXtreme Gradient Boosting
    • https://xgboost.ai
    • https://github.com/dmlc/xgboost
    • Supports regression, classification, ranking
    and user defined objectives
    XGBoost: A Scalable Tree Boosting System
    Tianqi Chen
    University of Washington
    [email protected]
    Carlos Guestrin
    University of Washington
    [email protected]
    ABSTRACT
    Tree boosting is a highly effective and widely used machine
    learning method. In this paper, we describe a scalable end-
    to-end tree boosting system called XGBoost, which is used
    widely by data scientists to achieve state-of-the-art results
    on many machine learning challenges. We propose a novel
    sparsity-aware algorithm for sparse data and weighted quan-
    tile sketch for approximate tree learning. More importantly,
    we provide insights on cache access patterns, data compres-
    sion and sharding to build a scalable tree boosting system.
    By combining these insights, XGBoost scales beyond billions
    of examples using far fewer resources than existing systems.
    Keywords
    Large-scale Machine Learning
    1. INTRODUCTION
    Machine learning and data-driven approaches are becom-
    ing very important in many areas. Smart spam classifiers
    protect our email by learning from massive amounts of spam
    data and user feedback; advertising systems learn to match
    the right ads with the right context; fraud detection systems
    protect banks from malicious attackers; anomaly event de-
    tection systems help experimental physicists to find events
    that lead to new physics. There are two important factors
    that drive these successful applications: usage of effective
    (statistical) models that capture the complex data depen-
    dencies and scalable learning systems that learn the model
    of interest from large datasets.
    Among the machine learning methods used in practice,
    gradient tree boosting [10]1 is one technique that shines
    in many applications. Tree boosting has been shown to
    give state-of-the-art results on many standard classification
    benchmarks [16]. LambdaMART [5], a variant of tree boost-
    ing for ranking, achieves state-of-the-art result for ranking
    1Gradient tree boosting is also known as gradient boosting
    machine (GBM) or gradient boosted regression tree (GBRT)
    Permission to make digital or hard copies of part or all of this work for personal or
    classroom use is granted without fee provided that copies are not made or distributed
    for profit or commercial advantage and that copies bear this notice and the full citation
    on the first page. Copyrights for third-party components of this work must be honored.
    For all other uses, contact the owner/author(s).
    KDD ’16, August 13-17, 2016, San Francisco, CA, USA
    © 2016 Copyright held by the owner/author(s).
    problems. Besides being used as a stand-alone predictor, it
    is also incorporated into real-world production pipelines for
    ad click through rate prediction [15]. Finally, it is the de-
    facto choice of ensemble method and is used in challenges
    such as the Netflix prize [3].
    In this paper, we describe XGBoost, a scalable machine
    learning system for tree boosting. The system is available as
    an open source package2. The impact of the system has been
    widely recognized in a number of machine learning and data
    mining challenges. Take the challenges hosted by the ma-
    chine learning competition site Kaggle for example. Among
    the 29 challenge winning solutions 3 published at Kaggle’s
    blog during 2015, 17 solutions used XGBoost. Among these
    solutions, eight solely used XGBoost to train the model,
    while most others combined XGBoost with neural nets in en-
    sembles. For comparison, the second most popular method,
    deep neural nets, was used in 11 solutions. The success
    of the system was also witnessed in KDDCup 2015, where
    XGBoost was used by every winning team in the top-10.
    Moreover, the winning teams reported that ensemble meth-
    ods outperform a well-configured XGBoost by only a small
    amount [1].
    These results demonstrate that our system gives state-of-
    the-art results on a wide range of problems. Examples of
    the problems in these winning solutions include: store sales
    prediction; high energy physics event classification; web text
    classification; customer behavior prediction; motion detec-
    tion; ad click through rate prediction; malware classification;
    product categorization; hazard risk prediction; massive on-
    line course dropout rate prediction. While domain depen-
    dent data analysis and feature engineering play an important
    role in these solutions, the fact that XGBoost is the consen-
    sus choice of learner shows the impact and importance of
    our system and tree boosting.
    The most important factor behind the success of XGBoost
    is its scalability in all scenarios. The system runs more than
    ten times faster than existing popular solutions on a single
    machine and scales to billions of examples in distributed or
    memory-limited settings. The scalability of XGBoost is due
    to several important systems and algorithmic optimizations.
    These innovations include: a novel tree learning algorithm
    is for handling sparse data; a theoretically justified weighted
    quantile sketch procedure enables handling instance weights
    in approximate tree learning. Parallel and distributed com-
    puting makes learning faster which enables quicker model ex-
    ploration. More importantly, XGBoost exploits out-of-core
    2https://github.com/dmlc/xgboost
    3Solutions come from of top-3 teams of each competitions.
    arXiv:1603.02754v3 [cs.LG] 10 Jun 2016
    2016
    Supervised
    Classification, Regression

  50. XGBoost
    Classification And Regression Trees (CART)
    2016
    Supervised
    Classification, Regression

  51. XGBoost
    Tree Ensemble
    2016
    Supervised
    Classification, Regression

  52. Gradient Boosting
    Gradient boosting: Distance to target – Terence Parr and Jeremy Howard
    https://explained.ai/gradient-boosting/L2-loss.html
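
The core idea of gradient boosting with squared error (each weak learner is fitted to the residuals left by the model so far) can be sketched with one-split decision stumps. This is plain gradient boosting, not XGBoost itself, and the data, learning rate, and number of rounds are assumptions for the example.

```python
def fit_stump(xs, residuals):
    """Weak learner: the single threshold split that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, rounds=20, learning_rate=0.5):
    """Start from the mean, then add stumps fitted to the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)

    for _ in range(rounds):
        # For squared error, the negative gradient is the residual.
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict

def mean_squared_error(predict, xs, ys):
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical 1-D training data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.2, 1.9, 3.1, 3.9, 8.2, 8.8, 10.1, 10.9]
model = gradient_boost(xs, ys)
baseline = lambda x: sum(ys) / len(ys)  # always predicts the mean
```

Each stump alone is a weak learner, but the boosted sum fits the training data much better than the mean baseline; the usual overfitting caveats apply, so the error should also be checked on validation data.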

  53. © 2019, Amazon Web Services, Inc. or its Affiliates.
    Danilo Poccia
    Principal Evangelist
    AWS
    @danilop
    danilop.net
    And Then There Are Algorithms
