And Then There Are Algorithms – Part 1

And Then There Are Algorithms – Part 1

Machine Learning for the Enterprise Conference, Rome, October 28th, 2019

Machine Learning = Algorithms + Data + Tools

Part 1


Danilo Poccia

October 28, 2019


  1. © 2019, Amazon Web Services, Inc. or its Affiliates. Danilo

    Poccia Principal Evangelist AWS @danilop And Then There Are Algorithms
  2. Credit: Gerry Cranham/Fox Photos/Getty Images

  3. 1939 London Underground Credit: Gerry Cranham/Fox Photos/Getty Images

  4. Data Predictions

  5. Data Model Predictions

  6. Model

  7. 1959 Arthur Samuel

  8. Machine Learning

  9. Machine Learning Supervised Learning Inferring a model from labeled training

  10. Machine Learning Supervised Learning Unsupervised Learning Inferring a model from

    labeled training data Inferring a model to describe hidden structure from unlabeled data
  11. Reinforcement Learning Perform a certain goal in a dynamic environment

    Machine Learning Supervised Learning Unsupervised Learning
  12. Reinforcem ent Learning Driving a vehicle Playing a game against

    an opponent
  13. Unsupervised Learning Clustering

  14. Unsupervised Learning Clustering

  15. Re:Tip Try topic modeling with your own emails ;-) Unsupervised

    Learning Topic Modeling Discovering abstract “topics” that occur in a collection of documents For example, looking for “infrequent” words that are used more often in a document
  16. Supervised Learning Regression “How many bikes will be rented tomorrow?”

    Happy, Sad, Angry, Confused, Disgusted, Surprised, Calm, Unknown Binary Classification Multi-Class Classification “Is this email spam?” “What is the sentiment of this tweet, or of this social media comment?” 1, 0, 100K Yes / No True / False %
  17. Supervised Learning Training the Model Minimizing the Error of using

    the Model on the Labeled Data
  18. Be Careful of Overfitting Supervised Learning

  19. Be Careful of Overfitting Supervised Learning

  20. Be Careful of Overfitting Supervised Learning

  21. Better Fitting Supervised Learning

  22. Supervised Learning Better Fitting

  23. Supervised Learning Different Models ⇒ Different Predictions

  24. Supervised Learning Validation How well is this Model working on

    New Data?
  25. Supervised Learning Labeled Data

  26. Supervised Learning Labeled Data 70% 30% Training Validation

  27. Algorithms

  28. Letter from Ada Lovelace to Charles Babbage 1843 In this

    letter, Lovelace suggests an example of a calculation which “may be worked out by the engine without having been worked out by human head and hands first”.
  29. Science Museum Group Collection © The Board of Trustees of

    the Science Museum
  30. Diagram of an algorithm for the Analytical Engine for the

    computation of Bernoulli numbers, from Sketch of The Analytical Engine Invented by Charles Babbage by Luigi Menabrea with notes by Ada Lovelace
  31. “You use code to tell a computer what to do.

    Before you write code you need an algorithm. An algorithm is a list of rules to follow in order to solve a problem.” BBC Bitesize What is an Algorithm? By Somepics (Own work) [CC BY-SA 4.0 (], via Wikimedia Commons
  32. Linear Learner Regression Estimate a real valued function Binary Classification

    Predict a 0/1 class Supervised Classification, Regression
  33. Bike Sharing Prediction (Regression) Date Time Temperature (Celsius) Relative Humidity

    Rain (mm/h) Rented Bikes 2018-04-01 08:30 13 64 2 45 2018-04-01 11:30 18 57 0 156 2018-04-02 08:30 14 69 8 87 2018-04-02 11:30 17 73 12 34 … … … … … …
  34. Bike Sharing Prediction (Regression) Date Time Temperature (Celsius) Relative Humidity

    Rain (mm/h) Rented Bikes 2018-04-01 08:30 13 64 2 45 2018-04-01 11:30 18 57 0 156 2018-04-02 08:30 14 69 8 87 2018-04-02 11:30 17 73 12 34 2018-06-14 16:30 23 56 0 ??? Date & Time
  35. Bike Sharing Prediction (Regression) Day of the Year Weekday Public

    Holiday Time (seconds) Temperature (Celsius) Relative Humidity Rain (mm/h) Rented Bikes 91 7 1 30600 13 64 2 45 91 7 1 41400 18 57 0 156 92 1 1 30600 14 69 8 87 92 1 1 41400 17 73 12 34 104 6 0 59400 23 56 0 ??? Date & Time (Feature Engineering)
  36. Linear Learner basis functions basis functions can be nonlinear Supervised

    Classification, Regression
  37. Minimizing the Error you know the expected values (use separate

    datasets for training and validation) this is always positive (convex function) Supervised
  38. Objective Function loss function regularization term measures how predictive our

    model is on your data measures the complexity of the model Supervised
  39. Stochastic Gradient Descent (SGD)'s_function Global Vs Local Minimum

  40. Factorization Machines • It is an extension of a linear

    model that is designed to parsimoniously capture interactions between features within high dimensional sparse datasets • Factorization machines are a good choice for tasks such as click prediction and item recommendation • They are usually trained by stochastic gradient descent (SGD), alternative least square (ALS), or Markov chain Monte Carlo (MCMC) Factorization Machines Steffen Rendle Department of Reasoning for Intelligence The Institute of Scientific and Industrial Research Osaka University, Japan Abstract—In this paper, we introduce Factorization Machines (FM) which are a new model class that combines the advantages of Support Vector Machines (SVM) with factorization models. Like SVMs, FMs are a general predictor working with any real valued feature vector. In contrast to SVMs, FMs model all interactions between variables using factorized parameters. Thus they are able to estimate interactions even in problems with huge sparsity (like recommender systems) where SVMs fail. We show that the model equation of FMs can be calculated in linear time and thus FMs can be optimized directly. So unlike nonlinear SVMs, a transformation in the dual form is not necessary and the model parameters can be estimated directly without the need of any support vector in the solution. We show the relationship to SVMs and the advantages of FMs for parameter estimation in sparse settings. On the other hand there are many different factorization mod- els like matrix factorization, parallel factor analysis or specialized models like SVD++, PITF or FPMC. The drawback of these models is that they are not applicable for general prediction tasks but work only with special input data. Furthermore their model equations and optimization algorithms are derived individually for each task. We show that FMs can mimic these models just by specifying the input data (i.e. the feature vectors). This makes FMs easily applicable even for users without expert knowledge in factorization models. Index Terms—factorization machine; sparse data; tensor fac- torization; support vector machine I. INTRODUCTION Support Vector Machines are one of the most popular predictors in machine learning and data mining. Nevertheless in settings like collaborative filtering, SVMs play no important role and the best models are either direct applications of standard matrix/ tensor factorization models like PARAFAC [1] or specialized models using factorized parameters [2], [3], [4]. In this paper, we show that the only reason why standard SVM predictors are not successful in these tasks is that they cannot learn reliable parameters (‘hyperplanes’) in complex (non-linear) kernel spaces under very sparse data. On the other hand, the drawback of tensor factorization models and even more for specialized factorization models is that (1) they are not applicable to standard prediction data (e.g. a real valued feature vector in Rn.) and (2) that specialized models are usually derived individually for a specific task requiring effort in modelling and design of a learning algorithm. In this paper, we introduce a new predictor, the Factor- ization Machine (FM), that is a general predictor like SVMs but is also able to estimate reliable parameters under very high sparsity. The factorization machine models all nested variable interactions (comparable to a polynomial kernel in SVM), but uses a factorized parametrization instead of a dense parametrization like in SVMs. We show that the model equation of FMs can be computed in linear time and that it depends only on a linear number of parameters. This allows direct optimization and storage of model parameters without the need of storing any training data (e.g. support vectors) for prediction. In contrast to this, non-linear SVMs are usually optimized in the dual form and computing a prediction (the model equation) depends on parts of the training data (the support vectors). We also show that FMs subsume many of the most successful approaches for the task of collaborative filtering including biased MF, SVD++ [2], PITF [3] and FPMC [4]. In total, the advantages of our proposed FM are: 1) FMs allow parameter estimation under very sparse data where SVMs fail. 2) FMs have linear complexity, can be optimized in the primal and do not rely on support vectors like SVMs. We show that FMs scale to large datasets like Netflix with 100 millions of training instances. 3) FMs are a general predictor that can work with any real valued feature vector. In contrast to this, other state-of- the-art factorization models work only on very restricted input data. We will show that just by defining the feature vectors of the input data, FMs can mimic state-of-the-art models like biased MF, SVD++, PITF or FPMC. II. PREDICTION UNDER SPARSITY The most common prediction task is to estimate a function y : Rn → T from a real valued feature vector x ∈ Rn to a target domain T (e.g. T = R for regression or T = {+, −} for classification). In supervised settings, it is assumed that there is a training dataset D = {(x(1), y(1)), (x(2), y(2)), . . .} of examples for the target function y given. We also investigate the ranking task where the function y with target T = R can be used to score feature vectors x and sort them according to their score. Scoring functions can be learned with pairwise training data [5], where a feature tuple (x(A), x(B)) ∈ D means that x(A) should be ranked higher than x(B). As the pairwise ranking relation is antisymmetric, it is sufficient to use only positive training instances. In this paper, we deal with problems where x is highly sparse, i.e. almost all of the elements xi of a vector x are zero. Let m(x) be the number of non-zero elements in the 2010 Supervised Classification, Regression
  41. Factorization Machines Source: 2010 Supervised Classification, Regression ? ?

    ? ? ? ? ?
  42. Vectors ⇾ “Bearer of Information” how much are they related?

  43. Factorization Machines not in a Linear Learner 2010 Supervised Classification,

    Regression Alternative least square (ALS) features
  44. Factorization Machines (k=4) Movie 1 action 2 romantic 3 thriller

    4 horror Blade Runner 0.4 0.3 0.5 0.2 Notting Hill 0.2 0.8 0.1 0.01 Arrival 0.2 0.4 0.6 0.1 But you cannot really control how features are used! 2010 Supervised Classification, Regression Intuitively, each “feature” describes a property of the “items”
  45. K-Nearest Neighbors (k-NN) 1991 Supervised Classification, Regression An Introduction to

    Kernel and Nearest Neighbor Nonparametric Regression N. S. Altman* Biometrics Unit Cornell University Ithaca, NY14853 ABSTRACT Nonparametric regressiOn 1s a set of techniques for estimating a regression curve without making strong assumptions about the shape of the true regression function. These techniques are therefore useful for building and checking parametric models, as well as for data description. Kernel and nearest neighbor regression estimators are local versions of univariate location estimators, and so they can readily be introduced to beginning students, and consulting clients who are familiar with such summaries as the sample mean and median. Key Words: Confidence intervals; Local linear regression; Model building; Model check- ing; Smoothing. *N. S. Altman is Assistant Professor, Biometrics Unit, Cornell University, Ithaca, NY14853. Preparation of this article was partially funded by Hatch Grant 151410 NYF. The author thanks C. McCulloch for comments that substantially improved this arti- cle and L. Molinari and H. Henderson, for providing the data used in Examples B and C. Two anonymous referees and an associate editor provided numerous comments that substantially improved the presentation of the material. 1 By Alisneaky, svg version by User:Zirguezi - Own work, CC BY-SA 4.0,

    OBSERVATIONS J. MACQUEEN UNIVERSITY OF CALIFORNIA, Los ANGELES 1. Introduction The main purpose of this paper is to describe a process for partitioning an N-dimensional population into k sets on the basis of a sample. The process, which is called 'k-means,' appears to give partitions which are reasonably efficient in the sense of within-class variance. That is, if p is the probability mass function for the population, S = {S1, S2, - * *, Sk} is a partition of EN, and ui, i = 1, 2, * - , k, is the conditional mean of p over the set Si, then W2(S) = ff=ISi f z - u42 dp(z) tends to be low for the partitions S generated by the method. We say 'tends to be low,' primarily because of intuitive considerations, corroborated to some extent by mathematical analysis and practical computa- tional experience. Also, the k-means procedure is easily programmed and is computationally economical, so that it is feasible to process very large samples on a digital computer. Possible applications include methods for similarity grouping, nonlinear prediction, approximating multivariate distributions, and nonparametric tests for independence among several variables. In addition to suggesting practical classification methods, the study of k-means has proved to be theoretically interesting. The k-means concept represents a generalization of the ordinary sample mean, and one is naturally led to study the pertinent asymptotic behavior, the object being to establish some sort of law of large numbers for the k-means. This problem is sufficiently interesting, in fact, for us to devote a good portion of this paper to it. The k-means are defined in section 2.1, and the main results which have been obtained on the asymptotic behavior are given there. The rest of section 2 is devoted to the proofs of these results. Section 3 describes several specific possible applications, and reports some preliminary results from computer experiments conducted to explore the possibilities inherent in the k-means idea. The extension to general metric spaces is indicated briefly in section 4. The original point of departure for the work described here was a series of problems in optimal classification (MacQueen [9]) which represented special This work was supported by the Western Management Science Institute under a grant from the Ford Foundation, and by the Office of Naval Research under Contract No. 233(75), Task No. 047-041. 281 Bulletin de l’acad´ emie polonaise des sciences Cl. III — Vol. IV, No. 12, 1956 MATH´ EMATIQUE Sur la division des corps mat´ eriels en parties 1 par H. STEINHAUS Pr´ esent´ e le 19 Octobre 1956 Un corps Q est, par d´ efinition, une r´ epartition de mati` ere dans l’espace, donn´ ee par une fonction f(P) ; on appelle cette fonction la densit´ e du corps en question ; elle est d´ efinie pour tous les points P de l’espace ; elle est non- n´ egative et mesurable. On suppose que l’ensemble caract´ eristique du corps E =E P {f(P) > 0} est born´ e et de mesure positive ; on suppose aussi que l’int´ egrale de f(P) sur E est finie : c’est la masse du corps Q. On consid` ere comme identiques deux corps dont les densit´ es sont ´ egales ` a un ensemble de mesure nulle pr` es. En d´ ecomposant l’ensemble caract´ eristique d’un corps Q en n sous-ensembles Ei (i = 1, 2, . . . , n) de mesures positives, on obtient une division du corps en question en n corps partiels ; leurs ensembles caract´ eristiques respectifs sont les Ei et leurs densit´ es sont d´ efinies par les valeurs que prend la densit´ e du corps Q dans ces ensembles partiels. En d´ esignant les corps partiels par Qi , on ´ ecrira Q = Q1 + Q2 + . . . + Qn . Quand on donne d’abord n corps Qi , dont les ensembles caract´ eristiques sont disjoints deux ` a deux ` a la mesure nulle pr` es, il existe ´ evidemment un corps Q ayant ces Qi comme autant de parties ; on ´ ecrira Q1 + Q2 + . . . + Qn = Q. Ces remarques su sent pour expliquer la division et la composition des corps. Le probl` eme de cette Note est la division d’un corps en n parties Ki (i = 1, 2, . . . , n) et le choix de n points Ai de mani` ere ` a rendre aussi petite que possible la somme (1) S(K, A) = n X i=1 I(Ki, Ai ) (K ⌘ {Ki }, A ⌘ {Ai }), o` u I(Q, P) d´ esigne, en g´ en´ eral, le moment d’inertie d’un corps quelconque Q par rapport ` a un point quelconque P. Pour traiter ce probl` eme ´ el´ ementaire nous aurons recours aux lemmes suivants : 1. Cet article de Hugo Steinhaus est le premier formulant de mani` ere explicite, en dimen- sion finie, le probl` eme de partitionnement par les k-moyennes (k-means), dites aussi “nu´ ees dynamiques”. Son algorithme classique est le mˆ eme que celui de la quantification optimale de Lloyd-Max. ´ Etant di cilement accessible sous format num´ erique, le voici transduit par Maciej Denkowski, transmis par J´ erˆ ome Bolte, transcrit par Laurent Duval, en juillet/aoˆ ut 2015. Un e↵ort a ´ et´ e fourni pour conserver une proximit´ e avec la pagination originale. 801 1956-1967 Unsupervised Clustering
  47. K-Means Clustering 1956-1967 Unsupervised Clustering Clustering converges when the centers

    “don’t move” anymore
  48. Principal Component Analysis (PCA) • PCA is an unsupervised learning

    algorithm that attempts to reduce the dimensionality (number of features) within a dataset while still retaining as much information as possible • This is done by finding a new set of features called components, which are composites of the original features that are uncorrelated with one another • They are also constrained so that the first component accounts for the largest possible variability in the data, the second component the second most variability, and so on Pearson, K. 1901. On lines and planes of closest fit to systems of points in space. Philosophical Magazine 2:559-572. 1901 Unsupervised Dim ensionality Reduction
  49. Principal Component Analysis (PCA) 1901 Unsupervised Dim ensionality Reduction

  50. Principal Component Analysis (PCA) 1901 Unsupervised Dim ensionality Reduction

  51. XGBoost • Ensemble methods use multiple learning algorithms to improve

    predictions • Boosting: “Can a set of weak learners create a single strong learner?” • eXtreme Gradient Boosting • • • Supports regression, classification, ranking and user defined objectives XGBoost: A Scalable Tree Boosting System Tianqi Chen University of Washington Carlos Guestrin University of Washington ABSTRACT Tree boosting is a highly e↵ective and widely used machine learning method. In this paper, we describe a scalable end- to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and weighted quan- tile sketch for approximate tree learning. More importantly, we provide insights on cache access patterns, data compres- sion and sharding to build a scalable tree boosting system. By combining these insights, XGBoost scales beyond billions of examples using far fewer resources than existing systems. Keywords Large-scale Machine Learning 1. INTRODUCTION Machine learning and data-driven approaches are becom- ing very important in many areas. Smart spam classifiers protect our email by learning from massive amounts of spam data and user feedback; advertising systems learn to match the right ads with the right context; fraud detection systems protect banks from malicious attackers; anomaly event de- tection systems help experimental physicists to find events that lead to new physics. There are two important factors that drive these successful applications: usage of e↵ective (statistical) models that capture the complex data depen- dencies and scalable learning systems that learn the model of interest from large datasets. Among the machine learning methods used in practice, gradient tree boosting [10]1 is one technique that shines in many applications. Tree boosting has been shown to give state-of-the-art results on many standard classification benchmarks [16]. LambdaMART [5], a variant of tree boost- ing for ranking, achieves state-of-the-art result for ranking 1Gradient tree boosting is also known as gradient boosting machine (GBM) or gradient boosted regression tree (GBRT) Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). KDD ’16, August 13-17, 2016, San Francisco, CA, USA c 2016 Copyright held by the owner/author(s). ACM ISBN . DOI: problems. Besides being used as a stand-alone predictor, it is also incorporated into real-world production pipelines for ad click through rate prediction [15]. Finally, it is the de- facto choice of ensemble method and is used in challenges such as the Netflix prize [3]. In this paper, we describe XGBoost, a scalable machine learning system for tree boosting. The system is available as an open source package2. The impact of the system has been widely recognized in a number of machine learning and data mining challenges. Take the challenges hosted by the ma- chine learning competition site Kaggle for example. Among the 29 challenge winning solutions 3 published at Kaggle’s blog during 2015, 17 solutions used XGBoost. Among these solutions, eight solely used XGBoost to train the model, while most others combined XGBoost with neural nets in en- sembles. For comparison, the second most popular method, deep neural nets, was used in 11 solutions. The success of the system was also witnessed in KDDCup 2015, where XGBoost was used by every winning team in the top-10. Moreover, the winning teams reported that ensemble meth- ods outperform a well-configured XGBoost by only a small amount [1]. These results demonstrate that our system gives state-of- the-art results on a wide range of problems. Examples of the problems in these winning solutions include: store sales prediction; high energy physics event classification; web text classification; customer behavior prediction; motion detec- tion; ad click through rate prediction; malware classification; product categorization; hazard risk prediction; massive on- line course dropout rate prediction. While domain depen- dent data analysis and feature engineering play an important role in these solutions, the fact that XGBoost is the consen- sus choice of learner shows the impact and importance of our system and tree boosting. The most important factor behind the success of XGBoost is its scalability in all scenarios. The system runs more than ten times faster than existing popular solutions on a single machine and scales to billions of examples in distributed or memory-limited settings. The scalability of XGBoost is due to several important systems and algorithmic optimizations. These innovations include: a novel tree learning algorithm is for handling sparse data; a theoretically justified weighted quantile sketch procedure enables handling instance weights in approximate tree learning. Parallel and distributed com- puting makes learning faster which enables quicker model ex- ploration. More importantly, XGBoost exploits out-of-core 2 3Solutions come from of top-3 teams of each competitions. arXiv:1603.02754v3 [cs.LG] 10 Jun 2016 2016 Supervised Classification, Regression
  52. XGBoost Classification And Regression Trees (CART) 2016 Supervised Classification, Regression

  53. XGBoost Tree Ensemble 2016 Supervised Classification, Regression

  54. Gradient Boosting Gradient boosting: Distance to target – Terence Parr

    and Jeremy Howard
  55. © 2019, Amazon Web Services, Inc. or its Affiliates. Danilo

    Poccia Principal Evangelist AWS @danilop And Then There Are Algorithms