two-stage MapReduce into arbitrary DAGs
– Enables in-memory dataset caching
– Improved usability
• Reduced disk reads/writes deliver significant speedups
– Especially for iterative algorithms like ML
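Why caching helps iterative algorithms can be sketched in plain Python. This is an analogy under stated assumptions, not Spark code: `DiskDataset`, `cache()`, `rows()`, `write_sample`, and `iterative_mean` are hypothetical names invented for the illustration. An iterative algorithm makes one full pass over the data per iteration, so without caching every iteration pays for a disk read.

```python
import json
import os
import tempfile


class DiskDataset:
    """Toy dataset (hypothetical) that counts how often it is read from disk."""

    def __init__(self, path):
        self.path = path
        self.disk_reads = 0
        self._cache_enabled = False
        self._cached = None

    def cache(self):
        # Ask for the data to be kept in memory after the first read.
        self._cache_enabled = True
        return self

    def rows(self):
        if self._cached is not None:
            return self._cached  # served from memory, no disk access
        with open(self.path) as f:
            data = json.load(f)
        self.disk_reads += 1
        if self._cache_enabled:
            self._cached = data
        return data


def write_sample(values):
    """Write the values to a temporary JSON file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".json")
    with os.fdopen(fd, "w") as f:
        json.dump(values, f)
    return path


def iterative_mean(dataset, iterations=20, lr=0.5):
    """Gradient descent toward the mean: one full pass over the data per iteration."""
    estimate = 0.0
    for _ in range(iterations):
        rows = dataset.rows()
        gradient = sum(estimate - x for x in rows) / len(rows)
        estimate -= lr * gradient
    return estimate


path = write_sample([1.0, 2.0, 3.0, 4.0, 5.0])
uncached = DiskDataset(path)
cached = DiskDataset(path).cache()
iterative_mean(uncached)
iterative_mean(cached)
print(uncached.disk_reads, cached.disk_reads)  # prints "20 1"
os.remove(path)
```

Both runs reach the same estimate. The only difference is that the uncached dataset is read once per iteration, while the cached one is read once in total.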
requirement for tuning
• Need to understand memory requirements for caching
• Significant performance benefits associated with “getting it right”
• Auto-tuning is coming…
How do I adjust intelligently?
– Number of trees? Depth of trees? Splitting criteria?
• Tedious trial and error
• Overfitting danger
• Intelligent automatic search
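One simple form of automatic search is random search over the tuning knobs, scored on held-out data to limit the overfitting danger. The sketch below is a minimal illustration and rests on assumptions: `SEARCH_SPACE`, `toy_validation_score`, and `random_search` are hypothetical names, and the scoring function is an invented stand-in for "train on the training split, score on a validation split", not a real model.

```python
import random

# Hypothetical search space mirroring the questions above.
SEARCH_SPACE = {
    "n_trees": [10, 50, 100, 200, 500],
    "max_depth": [2, 4, 8, 16, 32],
    "criterion": ["gini", "entropy"],
}


def toy_validation_score(params):
    """Invented stand-in for a held-out validation score.

    Rewards more trees with diminishing returns and penalizes depths far
    from a moderate value, loosely mimicking underfitting and overfitting.
    """
    tree_gain = 1.0 - 1.0 / (1.0 + params["n_trees"] / 50.0)
    depth_penalty = abs(params["max_depth"] - 8) / 32.0
    criterion_bonus = 0.01 if params["criterion"] == "entropy" else 0.0
    return tree_gain - depth_penalty + criterion_bonus


def random_search(space, evaluate, n_trials=30, seed=0):
    """Sample random configurations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.choice(choices) for name, choices in space.items()}
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


best_params, best_score = random_search(SEARCH_SPACE, toy_validation_score)
print(best_params, round(best_score, 3))
```

In practice `evaluate` would train a real model and return its score on data the model has not seen. The search loop stays the same, which is what makes it a candidate for automation.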
• How does the business benefit from the insights?
• Operationalization is frequently the weak link
– Operationalizing PowerPoint?
– Hand-rolled scoring flows
– Can introduce more tuning requirements
• Provides an opportunity for AutoML
– Automatically determine good solutions
• Understand when it’s appropriate
• Don’t forget about operationalization