
Global Big Data Conference: Speed Meets Scale For Predictive Analytics

OmniSci
April 03, 2018

Use of the humble GPU has spiked over the past couple of years as machine learning and data analytics workloads have been optimized to take advantage of the GPU's parallelism and memory bandwidth. Even though these operations (the steps of the machine learning pipeline) could all run on the same GPUs, they were typically isolated, and much slower than they needed to be, because data was serialized and deserialized between steps over PCIe. That inefficiency was recently addressed by the formation of the GPU Open Analytics Initiative (GOAI, http://gpuopenanalytics.com/), an industry initiative founded by MapD, H2O.ai, and Anaconda. The group created the GPU data frame (GDF), based on Apache Arrow, for passing data between processes while keeping it all on the GPU. In this talk we will explain how the GDF works, show how it is enabling a diverse set of GPU workloads, and demonstrate how to use Python in a Jupyter notebook to take advantage of it. Using a very large dataset, we'll demonstrate how to run a full machine learning pipeline with minimal data exchange overhead between MapD's SQL engine and H2O's generalized linear model (GLM) library.
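The core idea of the GDF, producing and consuming one shared columnar buffer instead of serializing copies between pipeline steps, can be illustrated on the CPU with nothing but the standard library. This is only a conceptual analogue (Python's buffer protocol standing in for Arrow-format GPU memory), not GOAI code:

```python
from array import array

# A toy columnar "data frame": one contiguous buffer per column,
# mimicking the Arrow-style layout the GDF is built on.
ages = array("d", [34.0, 21.0, 58.0])

# A downstream pipeline step receives a zero-copy view of the column:
# no serialization, no copy of the underlying bytes.
view = memoryview(ages)
assert view[2] == 58.0

# A write through the view is visible to the producer, showing that the
# two steps share one buffer rather than exchanging serialized copies.
view[2] = 60.0
assert ages[2] == 60.0
```

With the GDF the same principle applies to device memory: the SQL engine and the ML library read one Arrow buffer on the GPU instead of round-tripping data over PCIe.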

Transcript

  1. © MapD 2018 | Speed Meets Scale For Predictive Analytics: Running Billions Of Data Points Through A Full Machine Learning Pipeline | Aaron Williams | April 3, 2018
  2. © MapD 2018 | Aaron Williams, VP of Global Community | @_arw_ | [email protected] | /in/aaronwilliams/ | /williamsaaron | slides: https://speakerdeck.com/mapd
  3. © MapD 2018 | "Every business will become a software business, build applications, use advanced analytics and provide SaaS services." - Smart CEO Guy
  4. © MapD 2018 | The Evolution of Data as a Weapon: Collect It, Make It Actionable, Make It Predictive
  5. © MapD 2018 | Advanced memory management: three-tier caching, to GPU RAM for speed and to SSDs for persistent storage.
     - GPU RAM (L1), compute layer: 24GB to 256GB at 1000-6000 GB/sec. Hot data: speedup of 1500x to 5000x over cold data.
     - CPU RAM (L2): 32GB to 3TB at 70-120 GB/sec. Warm data: speedup of 35x to 120x over cold data.
     - SSD or NVRAM storage (L3), storage layer: 250GB to 20TB at 1-2 GB/sec. Cold data, backed by the data lake/data warehouse/system of record.
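The quoted speedups track the tier bandwidths fairly closely: the warm-data range is exactly the ratio of the CPU RAM and SSD bandwidth extremes. A quick sanity check on the slide's figures (the hot-data range of 1500x to 5000x does not reduce to raw bandwidth alone, so presumably other factors contribute there):

```python
def ratio_range(fast, slow):
    """Min/max speedup implied by two tiers' (low, high) bandwidths in GB/sec."""
    return fast[0] // slow[1], fast[1] // slow[0]

ssd = (1, 2)           # L3: SSD or NVRAM storage
cpu_ram = (70, 120)    # L2: CPU RAM
gpu_ram = (1000, 6000) # L1: GPU RAM

# Warm vs. cold: 70/2 = 35x up to 120/1 = 120x, matching the slide's 35x-120x.
assert ratio_range(cpu_ram, ssd) == (35, 120)

# Hot vs. cold: raw bandwidth alone gives 500x-6000x; the slide quotes 1500x-5000x.
assert ratio_range(gpu_ram, ssd) == (500, 6000)
```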
  6. © MapD 2018 | The GPU Open Analytics Initiative (GOAI): creating common data frameworks to accelerate data science on GPUs. Repos: /mapd/pymapd, /gpuopenanalytics/pygdf
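The two repositories named on this slide are the GOAI building blocks for Python. A minimal sketch of how they fit together (not runnable as shown: it assumes a reachable MapD server with an NVIDIA GPU, and the credentials, table, and column names are placeholders, not from the talk):

```python
# Sketch only: requires a GPU-backed MapD server; all names below are placeholders.
import pymapd

con = pymapd.connect(user="mapd", password="changeme",
                     host="localhost", dbname="mapd")

# select_ipc_gpu returns the result set as a pygdf GPU data frame,
# delivered over Arrow IPC so the data stays in GPU memory.
gdf = con.select_ipc_gpu("SELECT feature_1, feature_2, label FROM flights")

# The same device memory can now feed a GOAI-aware ML library
# (e.g. an H2O GLM) without a round trip through host RAM.
```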
  7. © MapD 2018 | Machine Learning Pipeline. Personas in the analytics lifecycle (illustrative): Business Analyst, Data Scientist, Data Engineer, IT Systems Admin. Pipeline stages (Data Scientist / Business Analyst, running on GPUs): Data Preparation, Data Discovery & Feature Engineering, Model & Validate, Predict, Operationalize, Monitoring & Refinement, Evaluate & Decide.
  8. © MapD 2018 | ML Examples: We've published a few notebooks showing how to connect to a MapD database and use an ML algorithm to make predictions. We've also shared a real-world example of churn, which we implemented with VW. Repos: /gpuopenanalytics/demo-docker, /mapd/mapd-ml-demo
  9. © MapD 2018 | Next Steps:
     - mapd.com/demos: play with our demos
     - mapd.com/platform/download-community/: get our free Community Edition and start playing
     - mapd.com/cloud: get your own instance in 60 seconds
     - community.mapd.com: ask questions and share your experiences
  10. © MapD 2018 | Aaron Williams, VP of Global Community | @_arw_ | [email protected] | /in/aaronwilliams/ | /williamsaaron | slides: https://speakerdeck.com/mapd | Thank you! Any questions?