
Big Data for the rest of us

Presentation at International Data Engineering and Science Association (IDEAS) SOCAL 2018

Lawrence Spracklen

October 20, 2018

Transcript

  1. Big Data for the rest of us Lawrence Spracklen SupportLogic

    lawrence@supportlogic.io www.linkedin.com/in/spracklen
  2. SupportLogic • Extract Signals from enterprise CRM systems • Applied

    machine learning • Complete vertical solution • Go-live in days! • We are hiring!
  3. @Scale 2018 • Sound like your Big Data problems? •

    This is Extreme data! • Do these solutions help or hinder Big Data for the rest of us? “Exabytes of data…..” “1500 manual labelers…..” “Sub second global propagation of likes…..”
  4. End-2-End Planning • Numerous steps/obstacles to successfully leveraging ML •

    Data Acquisition • Data Cleansing • Feature Engineering • Model Selection and Training • Model Optimization • Model Deployment • Model Feedback and Retraining • Important to consider all steps before deciding on an approach • Upstream decisions can severely limit downstream options
  5. ML Landscape • How do I build a successful production-grade

    solution from all these disparate components that don’t play well together?
  6. Data Set Availability • Is the necessary data available? •

    Are there HIPAA, PII, GDPR concerns? • Is it spread across multiple systems? • Can the systems communicate? • Data fusion • Move the compute to the data… • Legacy infrastructure decisions can dictate optimal approach
  7. Feature Engineering • Essential for model performance, efficacy, robustness and

    simplicity • Feature extraction • Feature selection • Feature construction • Feature elimination • Dimensionality reduction • Traditionally a laborious manual process • Automation techniques becoming available • e.g. TransmogrifAI, Featuretools • Leverage feature stores!
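The slide names TransmogrifAI and Featuretools for automated feature engineering; as a minimal dependency-light sketch of the same idea (automated feature selection rather than the full pipelines those tools provide), scikit-learn's `SelectKBest` scores each feature against the target and keeps only the strongest — the synthetic dataset and the choice of `k=5` here are illustrative assumptions, not from the deck:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 20 features, of which only 5 are informative
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)

# Score each feature against the target (ANOVA F-test) and keep the top 5
selector = SelectKBest(score_func=f_classif, k=5)
X_reduced = selector.fit_transform(X, y)

print(X.shape, "->", X_reduced.shape)  # (500, 20) -> (500, 5)
```

Automated selection like this replaces one piece of the "laborious manual process"; construction and extraction still need domain knowledge or heavier tooling.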
  8. Model Training • Big differences in the range of algorithms

    offered by different frameworks • Don’t just jump to the most complex! • Easy to automate selection process • Just click ‘go’ • Automate hyperparameter optimization • Beyond the nested for-loop!
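"Beyond the nested for-loop" can be sketched with scikit-learn's `RandomizedSearchCV`, which samples the hyperparameter space and cross-validates each candidate instead of exhaustively looping; the model, grid, and budget below are illustrative assumptions, not the deck's:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# Candidate hyperparameters: randomized search samples this space
# rather than iterating over every combination
param_dist = {"n_estimators": [50, 100, 200],
              "max_depth": [2, 4, 8, None]}

search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_dist, n_iter=6, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The same interface scales to Bayesian or bandit-based optimizers (e.g. in Optuna or scikit-optimize) without changing the surrounding code.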
  9. Model Ops • What happens after the models are created?

    • How does the business benefit from the insights? • Operationalization is frequently the weak link • Operationalizing PowerPoint? • Hand-rolled scoring flows?
  10. Barriers to Model Ops • Scoring often performed on a

    different data platform than the one used for training • Framework-specific persistence formats • Complex data preprocessing requirements • Data cleansing and feature engineering • Batch training versus RT/stream scoring • How frequently are models updated? • How is performance monitored?
  11. Typical Deployments

  12. PMML & PFA • PMML has long been available as a

    framework-agnostic model representation • Frequently requires helper scripts • PFA is the potential successor…. • Addresses lots of PMML’s shortcomings • Scoring engines accepting R or Python scripts • Easy to use with AWS Lambda!
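The framework-specific persistence problem the slide describes can be seen in the default scikit-learn route: a `joblib` artifact ties the scoring side to the same framework (and version) as training — exactly what PMML/PFA export aims to avoid (actual PMML export needs extra tooling such as `sklearn2pmml`, not shown here). A minimal sketch of the framework-specific baseline, with an illustrative model and temp path:

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Framework-specific persistence: whoever loads this artifact must run
# a compatible version of scikit-learn — the scoring platform is coupled
# to the training platform
path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(model, path)
restored = joblib.load(path)

print((restored.predict(X) == model.predict(X)).all())  # True
```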
  13. Interpreting Models • A prediction without an explanation limits its

    value • Why is this outcome being predicted? • What action should be taken as a result? • Avoid ML models that are “black boxes” • Tools for providing prediction explanations are emerging • E.g. LIME
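LIME itself lives in the separate `lime` package; as a dependency-light stand-in that illustrates the same goal (telling you *why* a model predicts what it does, here globally rather than per-prediction as LIME does), scikit-learn's permutation importance shuffles each feature and measures the accuracy drop — dataset and model below are illustrative assumptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

data = load_breast_cancer()
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(data.data, data.target)

# Shuffle each feature in turn; the resulting accuracy drop measures
# how much the model relies on that feature
result = permutation_importance(model, data.data, data.target,
                                n_repeats=5, random_state=0)

# Report the three features the model leans on most
top = result.importances_mean.argsort()[::-1][:3]
for i in top:
    print(data.feature_names[i], round(result.importances_mean[i], 4))
```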
  14. Example LIME output

  15. Prototype in Python • Explore the space! • Work through

    the end-2-end solution • Don’t prematurely optimize • Great Python tooling • e.g. Jupyter Notebooks, Cloudera Data Science Workbench • Don’t let the data leak to laptops!
  16. Python is slow • Python is simple, flexible and has

    a massive ecosystem of available functionality • Pure Python is typically hundreds of times slower than C • Many Python implementations leverage C under-the-hood • Even naive Scala or Java implementations are slow
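The "C under-the-hood" point can be demonstrated in a few lines: the same sum of squares computed with a pure-Python loop and with NumPy, whose `sum` runs in compiled C. The exact speedup depends on the machine, so the ratio printed here is indicative only:

```python
import time
import numpy as np

n = 1_000_000
xs = list(range(n))

# Pure-Python loop: every addition and multiply goes through the interpreter
t0 = time.perf_counter()
total_py = 0
for x in xs:
    total_py += x * x
t_py = time.perf_counter() - t0

# NumPy: a single call into a compiled C loop over a contiguous array
arr = np.arange(n, dtype=np.int64)
t0 = time.perf_counter()
total_np = int((arr * arr).sum())
t_np = time.perf_counter() - t0

print(total_py == total_np, f"speedup ~{t_py / t_np:.0f}x")
```

This is why "everything Python" is viable in practice: the interpreted layer only orchestrates, while the heavy lifting runs in native code.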
  17. 1000X faster….

  18. Everything Python • Python wrappers are available for most packages

    • Even in Spark, momentum is moving to Python • Wrappers exist for C++ libraries like Shogun
  19. Spark • Optimizing for speed, data size or both? •

    Increasingly rich set of ML algorithms • Still missing common algorithms • E.g. Multiclass GBTs • Not all OSS implementations are good • Hard to correctly resource Spark jobs • Autotuning systems available
  20. System Sizing • Why go multi-node? • CPU or Memory

    constraints • Aggregate data size is very different from the size of the individual datasets • A data lake can contain petabytes, but each dataset may be only tens of GB…. • Is the raw data bigger or smaller than the final data consumed by the model? • Spark for ETL • Is the algorithm itself parallel?
  21. Single Node ML • Single-node memory, even on x86

    systems, can now measure in tens of terabytes • Likely to expand further with NVDIMMs • 40 vCPUs, ~1TB x86 only $4/hour on Google Cloud • Many high-performance single-node ML libraries exist!
  22. Hive & Postgres • On Hadoop, many data scientists are

    constrained to Hive or Impala for security reasons • Can be very limiting for ‘real’ data science • Hivemall for analytics • Is a traditional DB a better choice? • Better performance in many instances • Apache MADlib for analytics
  23. Conclusions • No one-size fits all! • Much more to

    a successful ML project than a cool model • Not all frameworks play together • Decisions can limit downstream options • Need to think about the problem end-2-end • From data acquisition to model deployment