This is extreme data! • Do these solutions help or hinder Big Data for the rest of us? • “Exabytes of data…” • “1500 manual labelers…” • “Sub-second global propagation of likes…”
Data Acquisition • Data Cleansing • Feature Engineering • Model Selection and Training • Model Optimization • Model Deployment • Model Feedback and Retraining • Important to consider all steps before deciding on an approach • Upstream decisions can severely limit downstream options
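The stages above can be sketched as a chain of composable steps. This is a minimal illustration in plain Python, not any particular library's API; the step names and toy data are invented. Note how the cleansing choice (dropping rows with missing values) forecloses downstream options such as imputation — the "upstream decisions limit downstream options" point in miniature.

```python
# Minimal sketch of the pipeline stages as composable steps.
# All names and data here are illustrative, not from any framework.

def acquire():
    # Toy data: (feature, label) pairs, one with a missing feature
    return [(1.0, 0), (None, 1), (2.0, 1), (3.0, 0)]

def cleanse(rows):
    # Dropping incomplete rows here means no imputation is possible later:
    # an upstream decision limiting downstream options
    return [(x, y) for x, y in rows if x is not None]

def engineer_features(rows):
    # Simple derived feature: the squared value
    return [((x, x * x), y) for x, y in rows]

def train(rows):
    # Placeholder "model": always predicts the majority label
    labels = [y for _, y in rows]
    majority = max(set(labels), key=labels.count)
    return lambda features: majority

pipeline = [cleanse, engineer_features]
data = acquire()
for step in pipeline:
    data = step(data)
model = train(data)
print(model((2.0, 4.0)))
```

A real pipeline would add the remaining stages (optimization, deployment, feedback), but the shape — each step consuming the previous step's output — is the same.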
Are there HIPAA, PII, or GDPR concerns? • Is the data spread across multiple systems? • Can the systems communicate? • Data fusion • Move the compute to the data… • Legacy infrastructure decisions can dictate the optimal approach
offered by different frameworks • Don’t just jump to the most complex! • Easy to automate the selection process • Just click ‘go’ • Automate hyperparameter optimization • Beyond the nested for-loop!
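"Beyond the nested for-loop" usually means replacing exhaustive grid search with a smarter or cheaper search strategy. The sketch below shows the simplest such alternative, random search, in plain Python; the `score` function and parameter ranges are invented stand-ins for real cross-validation and a real model's hyperparameters.

```python
import random

# Hedged sketch: random search over a hyperparameter space instead of
# exhaustive nested for-loops. score() stands in for cross-validation.

def score(params):
    # Toy objective with its optimum near lr=0.1, depth=5 (illustrative only)
    return -((params["lr"] - 0.1) ** 2) - 0.01 * ((params["depth"] - 5) ** 2)

space = {
    "lr": lambda: random.uniform(0.001, 1.0),
    "depth": lambda: random.randint(2, 10),
}

random.seed(0)
best_params, best_score = None, float("-inf")
for _ in range(50):  # fixed budget of 50 trials, versus the full grid
    params = {name: sample() for name, sample in space.items()}
    s = score(params)
    if s > best_score:
        best_params, best_score = params, s

print(best_params)
```

Dedicated tools (e.g. scikit-learn's `RandomizedSearchCV`, or Bayesian optimizers) follow the same pattern: a search space, a budget, and a scoring function, with the trial loop handled for you.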
• How does the business benefit from the insights? • Operationalization is frequently the weak link • Operationalizing PowerPoint? • Hand-rolled scoring flows?
different data platform to training • Framework-specific persistence formats • Complex data preprocessing requirements • Data cleansing and feature engineering • Batch training versus real-time/stream scoring • How frequently are models updated? • How is performance monitored?
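The framework-specific persistence problem is easy to see with Python's own default, pickle. A minimal sketch, with an invented toy model class: the saved bytes are only loadable where the same class (and compatible library versions) are importable, which is exactly what makes scoring on a different platform from training painful.

```python
import io
import pickle

# Sketch: framework-specific persistence. Pickle couples the saved model
# to the Python class and library versions that produced it -- a scoring
# platform without that code cannot load the artifact at all.

class ThresholdModel:
    """Toy model standing in for a trained estimator (invented for illustration)."""
    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, x):
        return 1 if x > self.threshold else 0

model = ThresholdModel(0.5)

buf = io.BytesIO()
pickle.dump(model, buf)      # persistence format = Python pickle
buf.seek(0)
restored = pickle.load(buf)  # works only where ThresholdModel is importable

print(restored.predict(0.7))
```

Cross-platform formats like PMML or PFA exist precisely to break this coupling by describing the model rather than serializing the object.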
framework-agnostic model representation • Frequently requires helper scripts • PFA is the potential successor… • Addresses many of PMML’s shortcomings • Scoring engines accepting R or Python scripts • Easy to use AWS Lambda!
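A scoring engine on AWS Lambda can be as small as one handler function. The sketch below shows the general shape only; the event format, feature names, and coefficients are all invented for illustration, and a real deployment would load a persisted model outside the handler rather than hard-code weights.

```python
import json
import math

# Hedged sketch of a Lambda-style scoring handler. The event shape and
# the "trained" coefficients are invented; a real handler would load a
# persisted model at module import time.

COEFFS = {"intercept": -1.0, "x1": 2.0, "x2": 0.5}  # pretend trained weights

def handler(event, context=None):
    features = event["features"]  # e.g. {"x1": 0.8, "x2": 1.2}
    z = COEFFS["intercept"] + sum(COEFFS[k] * v for k, v in features.items())
    score = 1.0 / (1.0 + math.exp(-z))  # logistic score
    return {"statusCode": 200, "body": json.dumps({"score": round(score, 4)})}

# Invoking the handler directly, as Lambda would with an event payload:
resp = handler({"features": {"x1": 0.8, "x2": 1.2}})
print(resp["body"])
```

Because the handler is a plain function taking a JSON-like event, the same code can be unit-tested locally and deployed unchanged.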
value • Why is this outcome being predicted? • What action should be taken as a result? • Avoid ML models that are “black boxes” • Tools for providing prediction explanations are emerging • e.g. LIME
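The core idea behind explanation tools like LIME is to probe a black-box model around a single instance and see which features move the prediction. This is a simplified sensitivity-analysis sketch of that idea, not LIME's actual algorithm (LIME fits a local surrogate model over many perturbed samples); `black_box` is an invented stand-in for an opaque model.

```python
# Hedged sketch of the idea behind prediction-explanation tools:
# perturb each feature of one instance and measure how the black-box
# score responds. black_box() is an invented stand-in model.

def black_box(features):
    # Opaque scoring function (illustrative only)
    return 3.0 * features["income"] - 0.2 * features["age"] + 1.0

def explain(instance, eps=1e-3):
    base = black_box(instance)
    contributions = {}
    for name in instance:
        perturbed = dict(instance)
        perturbed[name] += eps
        # Local sensitivity: score change per unit change in this feature
        contributions[name] = (black_box(perturbed) - base) / eps
    return contributions

print(explain({"income": 0.5, "age": 40.0}))
```

The output ranks features by local influence — the kind of "why this prediction?" answer the business questions above demand.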
the end-to-end solution • Don’t prematurely optimize • Great Python tooling • e.g. Jupyter Notebooks, Cloudera Data Science Workbench • Don’t let the data leak to laptops!
massive available functionality • Pure Python is typically hundreds of times slower than C • Many Python implementations leverage C under the hood • Even naive Scala or Java implementations are slow
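The speed gap is easy to demonstrate without any third-party library: an interpreted per-element loop versus Python's built-in `sum()`, which is implemented in C. Exact ratios vary by machine, so the sketch reports timings rather than asserting a specific speedup.

```python
import time

# Sketch of the slowdown claim: a pure-Python accumulation loop versus
# the C-implemented built-in sum(). Exact ratios depend on the machine.

data = list(range(1_000_000))

start = time.perf_counter()
total_py = 0
for x in data:       # interpreted bytecode executed per element
    total_py += x
loop_time = time.perf_counter() - start

start = time.perf_counter()
total_c = sum(data)  # one call into C code
c_time = time.perf_counter() - start

assert total_py == total_c
print(f"loop: {loop_time:.4f}s, builtin: {c_time:.4f}s")
```

Libraries like NumPy push this further by keeping whole array operations in C, which is why "Python" data science code can still be fast.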
Increasingly rich set of ML algorithms • Still missing some common algorithms • e.g. multiclass GBTs • Not all OSS implementations are good • Hard to correctly resource Spark jobs • Autotuning systems are available
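When a framework's GBT (or any classifier) only supports binary labels, a common workaround is one-vs-rest: train one binary model per class and predict the class whose model scores highest. The sketch below shows that wrapper pattern in plain Python; the "binary model" is a trivial centroid scorer invented for illustration, standing in for a real binary GBT.

```python
# Hedged sketch of the one-vs-rest workaround for a missing multiclass
# algorithm. train_binary() is a toy stand-in for a real binary learner.

def train_binary(rows, positive_class):
    pos = [x for x, y in rows if y == positive_class]
    centroid = sum(pos) / len(pos)
    # Higher score = closer to this class's centroid
    return lambda x: -abs(x - centroid)

def train_ovr(rows):
    classes = sorted({y for _, y in rows})
    models = {c: train_binary(rows, c) for c in classes}

    def predict(x):
        # Pick the class whose binary model scores the instance highest
        return max(models, key=lambda c: models[c](x))

    return predict

rows = [(0.1, "a"), (0.2, "a"), (5.0, "b"), (5.2, "b"), (9.9, "c")]
model = train_ovr(rows)
print(model(5.1))
```

The same wrapper shape applies around a real binary estimator — train k models for k classes, then arbitrate between their scores at prediction time.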
constraints • Aggregate data size is very different from the size of the individual data sets • A data lake can contain petabytes, but each dataset may be only tens of GB… • Is the raw data bigger or smaller than the final data consumed by the model? • Spark for ETL • Is the algorithm itself parallel?
systems can now measure in tens of terabytes • Likely to expand further with NVDIMMs • A 40-vCPU, ~1 TB x86 machine is only $4/hour on Google Cloud • Many high-performance single-node ML libraries exist!
constrained to Hive or Impala for security reasons • Can be very limiting for ‘real’ data science • Hivemall for analytics • Is a traditional DB a better choice? • Better performance in many instances • Apache MADlib for analytics
a successful ML project than a cool model • Not all frameworks play together • Decisions can limit downstream options • Need to think about the problem end-to-end • From data acquisition to model deployment