learning is statistics minus any checking of models and assumptions.” -- Brian Ripley, UseR! 2004 (provocatively paraphrased)
Copyright 2012 Cloudera Inc. All rights reserved
source operating system
• “” Configuration Management
• “” Coordination Service
• “” File System API
• “” Efficient and Extensible File Formats
• “” Efficient and Extensible RPC Libraries
• Feature Engineering
• Model Validation/Evaluation
• Works Well For Certain Model Fitting Problems
• Collaborative Filtering Algorithms
• Expectation Maximization
• Decision Trees (PLANET; Gradient Boosted Decision Trees)
• Not A Practical Option for Many Kinds of Problems
• Way More Detail in the KDD 2011 Talk
learning algorithms
• Not machine-learning-in-a-box
• Custom tweaks/modifications are the rule
• A disparate collection of algorithms for:
• Recommendations
• Clustering
• Classification
• Frequent Itemset Mining
• Oldest project, most widely deployed in production
• SVD implementation is particularly active
• Good Libraries: Online SGD
• Does not use MapReduce
• Vowpal Wabbit is faster, has L-BFGS option
• Roll Your Own Instead: Naïve Bayes
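To make the "Online SGD" bullet concrete, here is a minimal sketch of online stochastic gradient descent for logistic regression. It is an illustration only, not the code of any library named on the slide; the names `sigmoid`, `online_sgd`, and `learning_rate` are assumptions made up for this example. The point it shows is why the approach does not need MapReduce: the model is updated one example at a time in a single pass over the stream.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def online_sgd(stream, n_features, learning_rate=0.1):
    """Fit logistic-regression weights with one update per example.

    `stream` yields (features, label) pairs, where `features` is a list of
    floats of length `n_features` and `label` is 0 or 1. All names here are
    hypothetical, chosen for illustration.
    """
    weights = [0.0] * n_features
    for features, label in stream:
        # Predict with the current weights.
        prediction = sigmoid(sum(w * x for w, x in zip(weights, features)))
        # Gradient of the log loss for this single example.
        error = prediction - label
        for j, x in enumerate(features):
            weights[j] -= learning_rate * error * x
    return weights
```

A usage example: `online_sgd([([1.0, 2.0], 1), ([1.0, -2.0], 0)] * 200, n_features=2)` returns weights whose second component is positive, since the second feature separates the two classes.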
the allreduce operation
• N machines each have a number => each machine has the sum of the numbers
• At the heart of Vowpal Wabbit’s performance
• Implemented in C++
• Can be patched into Apache Hadoop and used today
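The contract stated above (N machines each have a number, afterwards each machine has the sum) can be sketched as a single-process simulation. This is a sketch under assumptions, not the C++ implementation the slide refers to: it uses a binary-tree reduce followed by a broadcast, which is one common way to realize allreduce, and the function name `allreduce_sum` is hypothetical.

```python
def allreduce_sum(values):
    """Simulate allreduce over len(values) machines.

    values[i] is the number initially held by machine i. The returned list
    gives what each machine holds afterwards: the sum of all inputs.
    """
    n = len(values)
    held = list(values)  # held[i] is the current value on machine i

    # Reduce phase: partial sums flow up a binary tree toward machine 0.
    step = 1
    while step < n:
        for i in range(0, n, 2 * step):
            partner = i + step
            if partner < n:
                held[i] += held[partner]
        step *= 2

    # Broadcast phase: the total flows back down the same tree.
    step //= 2
    while step >= 1:
        for i in range(0, n, 2 * step):
            partner = i + step
            if partner < n:
                held[partner] = held[i]
        step //= 2

    return held
```

In a learning setting the "number" on each machine would typically be a gradient or weight vector rather than a scalar; the tree pattern is the same, applied element-wise.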
• Defines operations on distributed in-memory collections
• Written in Scala
• Supports reading from and writing to HDFS
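As a rough illustration of what "operations on distributed in-memory collections" means, the sketch below models a collection split into partitions, with `map`, `filter`, and `reduce` applied partition by partition. It is a toy, single-process stand-in written in Python; the class `PartitionedCollection` and its methods are hypothetical and are not the API of the Scala system described on the slide.

```python
from functools import reduce


class PartitionedCollection:
    """Toy model of a collection held in memory as several partitions.

    In a real system each partition would live on a different machine;
    here they are plain Python lists in one process.
    """

    def __init__(self, partitions):
        self.partitions = [list(p) for p in partitions]

    @classmethod
    def from_items(cls, items, num_partitions):
        # Deal items round-robin into the requested number of partitions.
        items = list(items)
        return cls([items[i::num_partitions] for i in range(num_partitions)])

    def map(self, fn):
        # Each partition is transformed independently of the others.
        return PartitionedCollection([[fn(x) for x in p] for p in self.partitions])

    def filter(self, predicate):
        return PartitionedCollection(
            [[x for x in p if predicate(x)] for p in self.partitions]
        )

    def reduce(self, fn):
        # Reduce inside each non-empty partition, then combine the partials.
        # `fn` should be associative and commutative.
        # Raises TypeError if the whole collection is empty.
        partials = [reduce(fn, p) for p in self.partitions if p]
        return reduce(fn, partials)
```

Usage: `PartitionedCollection.from_items(range(1, 11), 3).filter(lambda x: x % 2 == 0).map(lambda x: x * x).reduce(lambda a, b: a + b)` sums the squares of the even numbers from 1 to 10. Because the intermediate collections stay in memory, chained operations like this avoid writing to disk between steps.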
• (but higher than MPI)
• Map/Reduce => Update/Sort
• Flexible, allows for asynchronous computations
• Reads from HDFS