Big Data for the rest of us


Presentation at International Data Engineering and Science Association (IDEAS) SOCAL 2018


Lawrence Spracklen

October 20, 2018


  1. Big Data for the rest of us Lawrence Spracklen SupportLogic
  2. SupportLogic
     • Extract signals from enterprise CRM systems
     • Applied machine learning
     • Complete vertical solution
     • Go-live in days!
     • We are hiring!
  3. @Scale 2018
     • Sound like your Big Data problems?
     • This is Extreme data!
     • Do these solutions help or hinder Big Data for the rest of us?
     • “Exabytes of data…” “1500 manual labelers…” “Sub-second global propagation of likes…”
  4. End-2-End Planning
     • Numerous steps/obstacles to successfully leveraging ML:
       • Data Acquisition
       • Data Cleansing
       • Feature Engineering
       • Model Selection and Training
       • Model Optimization
       • Model Deployment
       • Model Feedback and Retraining
     • Important to consider all steps before deciding on an approach
     • Upstream decisions can severely limit downstream options
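The steps listed above can be sketched as an explicit, ordered pipeline, which makes the "upstream decisions limit downstream options" point concrete: each stage only sees what the previous stage emits. All function and field names here are illustrative, not from the talk.

```python
# Sketch: the end-to-end ML workflow as explicit, ordered stages.
# Each stage's output is the next stage's input.

def acquire(source):
    """Pull raw records from a data source (stubbed with a list)."""
    return list(source)

def cleanse(records):
    """Drop records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def engineer(records):
    """Derive a simple feature from the raw fields."""
    for r in records:
        r["ratio"] = r["clicks"] / max(r["views"], 1)
    return records

def run_pipeline(source):
    """Run the stages in order."""
    return engineer(cleanse(acquire(source)))

raw = [
    {"clicks": 3, "views": 10},
    {"clicks": None, "views": 5},   # dropped by the cleansing stage
    {"clicks": 8, "views": 16},
]
features = run_pipeline(raw)
print(len(features), features[0]["ratio"])  # 2 0.3
```

A real system would add the training, deployment, and feedback stages, but the shape is the same: any decision made in `acquire` constrains everything downstream.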
  5. ML Landscape
     • How do I build a successful production-grade solution from all these disparate components that don’t play well together?
  6. Data Set Availability
     • Is the necessary data available?
     • Are there HIPAA, PII, or GDPR concerns?
     • Is it spread across multiple systems?
       • Can the systems communicate?
       • Data fusion
       • Move the compute to the data…
     • Legacy infrastructure decisions can dictate the optimal approach
  7. Feature Engineering
     • Essential for model performance, efficacy, robustness and simplicity:
       • Feature extraction
       • Feature selection
       • Feature construction
       • Feature elimination
       • Dimensionality reduction
     • Traditionally a laborious manual process
     • Automation techniques becoming available
       • e.g. TransmogrifAI, Featuretools
     • Leverage feature stores!
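As a minimal sketch of one of the feature-elimination techniques named above, dropping near-constant (low-variance) columns is about the simplest useful example. Pure Python for clarity; the threshold value is an arbitrary illustration.

```python
# Feature elimination sketch: keep only columns whose variance
# exceeds a small threshold (near-constant columns carry no signal).

def variance(col):
    mean = sum(col) / len(col)
    return sum((x - mean) ** 2 for x in col) / len(col)

def high_variance_columns(rows, threshold=0.001):
    """Return indices of columns worth keeping."""
    cols = list(zip(*rows))
    return [i for i, c in enumerate(cols) if variance(c) > threshold]

X = [
    [1.0, 0.0, 5.2],
    [2.0, 0.0, 5.1],
    [3.0, 0.0, 5.3],
]
keep = high_variance_columns(X)
print(keep)  # [0, 2] — column 1 is constant and gets eliminated
```

Libraries like scikit-learn ship this as a built-in transformer, and the automation tools the slide mentions go much further, but the principle is the same.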
  8. Model Training
     • Big differences in the range of algorithms offered by different frameworks
     • Don’t just jump to the most complex!
       • Easy to automate the selection process: just click ‘go’
     • Automate hyperparameter optimization
       • Beyond the nested for-loop!
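"Beyond the nested for-loop" can be illustrated with random search: instead of exhaustively sweeping a grid, draw candidate configurations from the space and keep the best. The objective below is a toy stand-in for a real cross-validation score; the parameter names are illustrative.

```python
# Random search over a hyperparameter space (vs. an exhaustive grid).
import math
import random

random.seed(0)

# Each entry draws one candidate value; log-uniform for the learning rate.
space = {
    "learning_rate": lambda: 10 ** random.uniform(-4, -1),
    "max_depth": lambda: random.randint(2, 10),
}

def score(params):
    """Toy objective: pretend models prefer lr near 0.01 and depth near 6."""
    return (-(math.log10(params["learning_rate"]) + 2) ** 2
            - (params["max_depth"] - 6) ** 2 / 10)

best = max(
    ({name: draw() for name, draw in space.items()} for _ in range(50)),
    key=score,
)
print(best)
```

Tools like scikit-learn's randomized search or Bayesian optimizers (e.g. Hyperopt) wrap this idea with cross-validation and smarter sampling.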
  9. Model Ops
     • What happens after the models are created?
     • How does the business benefit from the insights?
     • Operationalization is frequently the weak link
       • Operationalizing PowerPoint?
       • Hand-rolled scoring flows?
  10. Barriers to Model Ops
     • Scoring often performed on a different data platform than training
     • Framework-specific persistence formats
     • Complex data preprocessing requirements
       • Data cleansing and feature engineering
     • Batch training versus RT/stream scoring
     • How frequently are models updated?
     • How is performance monitored?
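The persistence-format barrier is easy to demonstrate: pickling a trained model is trivial, but the resulting bytes are Python-specific, so a scoring platform written in another language cannot consume them directly. The model class below is a stand-in for a real trained estimator.

```python
# Why framework-specific persistence is a Model Ops barrier:
# the pickle round-trip works, but only from Python.
import pickle

class ThresholdModel:
    """Stand-in for a trained model."""
    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, x):
        return 1 if x > self.threshold else 0

model = ThresholdModel(threshold=0.7)
blob = pickle.dumps(model)       # opaque, Python-only bytes

restored = pickle.loads(blob)    # a JVM or C++ scoring engine cannot do this
print(restored.predict(0.9), restored.predict(0.1))  # 1 0
```

Framework-agnostic formats such as PMML and PFA (next slide) exist precisely to break this coupling.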
  11. Typical Deployments

  12. PMML & PFA
     • PMML has long been available as a framework-agnostic model representation
       • Frequently requires helper scripts
     • PFA is the potential successor…
       • Addresses many of PMML’s shortcomings
     • Scoring engines accepting R or Python scripts
       • Easy to use AWS Lambda!
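The "easy to use AWS Lambda" point refers to the shape of a Lambda scoring endpoint: a handler function that receives a JSON event and returns a prediction. The sketch below is self-contained and illustrative; a real deployment would load a persisted model at cold start rather than hard-coding coefficients, and the field names are assumptions.

```python
# Lambda-style scoring handler: JSON event in, JSON prediction out.
import json

# Stand-in for a trained linear model's parameters.
COEFFS = {"views": 0.02, "clicks": 0.5}
INTERCEPT = -1.0

def handler(event, context=None):
    """AWS Lambda-shaped entry point: (event, context) -> response dict."""
    features = json.loads(event["body"])
    score = INTERCEPT + sum(COEFFS[k] * features[k] for k in COEFFS)
    return {"statusCode": 200, "body": json.dumps({"score": score})}

resp = handler({"body": json.dumps({"views": 100, "clicks": 4})})
print(resp["body"])  # {"score": 3.0}
```

Because the handler owns nothing but the scoring step, it sidesteps the heavyweight serving infrastructure that hand-rolled flows often accumulate.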
  13. Interpreting Models
     • A prediction without an explanation limits its value
       • Why is this outcome being predicted?
       • What action should be taken as a result?
     • Avoid ML models that are “black boxes”
     • Tools for providing prediction explanations are emerging
       • e.g. LIME
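A greatly simplified sketch of the intuition behind perturbation explainers like LIME: nudge each feature of one input and see how much the black-box prediction moves. Real LIME fits a local surrogate model over many sampled perturbations; this one-at-a-time sensitivity check is just the core idea, and the model and feature names are invented for illustration.

```python
# Per-feature sensitivity around a single prediction:
# the intuition behind local explanation tools like LIME.

def black_box(x):
    """Stand-in for an opaque trained model."""
    return 3.0 * x["income"] - 0.5 * x["debt"] + 0.0 * x["zip_code"]

def explain(model, point, eps=1.0):
    """Nudge each feature by eps and record the change in the prediction."""
    base = model(point)
    contributions = {}
    for feature in point:
        nudged = dict(point)
        nudged[feature] += eps
        contributions[feature] = model(nudged) - base
    return contributions

contribs = explain(black_box, {"income": 50.0, "debt": 10.0, "zip_code": 94103.0})
print(contribs)  # income dominates; zip_code contributes nothing
```

An explanation like this tells the user *why* an outcome was predicted, which is what turns a score into an action.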
  14. Example LIME output

  15. Prototype in Python
     • Explore the space!
     • Work through the end-2-end solution
       • Don’t prematurely optimize
     • Great Python tooling
       • e.g. Jupyter Notebooks, Cloudera Data Science Workbench
     • Don’t let the data leak to laptops!
  16. Python is slow
     • Python is simple, flexible and has massive available functionality
     • Pure Python typically hundreds of times slower than C
       • Many Python implementations leverage C under the hood
     • Even naive Scala or Java implementations are slow
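The gap is easy to see even without leaving the standard library: the same reduction runs much faster through a C-backed builtin (`sum`) than through a pure-Python loop. Exact speedups vary by machine; the "hundreds of times" figure applies to heavier numeric kernels in libraries like NumPy.

```python
# Pure-Python loop vs. the C-implemented builtin doing identical work.
import timeit

data = list(range(100_000))

def py_sum(xs):
    total = 0
    for x in xs:
        total += x
    return total

t_loop = timeit.timeit(lambda: py_sum(data), number=50)
t_builtin = timeit.timeit(lambda: sum(data), number=50)

assert py_sum(data) == sum(data)  # same answer, very different cost
print(f"loop: {t_loop:.3f}s  builtin: {t_builtin:.3f}s")
```

This is why "Python" code in practice is usually a thin layer over C, and why a naive reimplementation of a hot loop in any interpreted style stays slow.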
  17. 1000X faster….

  18. Everything Python
     • Python wrappers are available for most packages
     • Even momentum in Spark is moving to Python
     • Wrappers for C++ libraries like Shogun
  19. Spark
     • Optimizing for speed, data size or both?
     • Increasingly rich set of ML algorithms
       • Still missing common algorithms, e.g. multiclass GBTs
       • Not all OSS implementations are good
     • Hard to correctly resource Spark jobs
       • Autotuning systems available
  20. System Sizing
     • Why go multi-node? CPU or memory constraints
     • Aggregate data size is very different from the size of the individual datasets
       • A data lake can contain petabytes, but each dataset may be only tens of GB…
     • Is the raw data bigger or smaller than the final data consumed by the model?
       • Spark for ETL
     • Is the algorithm itself parallel?
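The sizing question above reduces to a back-of-envelope check: does the working dataset, not the whole data lake, fit in one node's RAM? The inflation factor below is an assumption (in-memory representations are typically several times larger than raw bytes); the numbers are illustrative.

```python
# Back-of-envelope single-node sizing check.

def fits_single_node(dataset_gb, node_ram_gb, overhead=3.0):
    """Assume the in-memory representation inflates raw size by `overhead`x."""
    return dataset_gb * overhead <= node_ram_gb

# A "petabyte data lake" whose individual datasets are tens of GB:
print(fits_single_node(dataset_gb=40, node_ram_gb=1024))   # True: one 1 TB node
print(fits_single_node(dataset_gb=600, node_ram_gb=1024))  # False: go multi-node
```

When the check passes, the single-node libraries on the next slide are usually the simpler, faster option; when it fails, that is the real argument for Spark.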
  21. Single Node ML
     • Single-node memory, even on x86 systems, can now measure in tens of terabytes
       • Likely to expand further with NVDIMMs
     • 40 vCPU, ~1 TB x86 instance only $4/hour on Google Cloud
     • Many high-performance single-node ML libraries exist!
  22. Hive & Postgres
     • On Hadoop, many data scientists are constrained to Hive or Impala for security reasons
       • Can be very limiting for ‘real’ data science
       • Hivemall for analytics
     • Is a traditional DB a better choice?
       • Better performance in many instances
       • Apache MADlib for analytics
  23. Conclusions
     • No one-size-fits-all!
     • Much more to a successful ML project than a cool model
     • Not all frameworks play together
       • Decisions can limit downstream options
     • Need to think about the problem end-2-end
       • From data acquisition to model deployment