Jupyter Popup: Accelerating the Machine Learning Pipeline on Very Large Datasets with the GPU Data Frame

OmniSci
March 21, 2018

Use of the humble GPU has spiked over the past couple of years as machine learning and data analytics workloads have been optimized to take advantage of the GPU’s parallelism and memory bandwidth. Even though these operations (the steps of the Machine Learning Pipeline) could all run on the same GPUs, they typically ran in isolation and much more slowly than necessary, because data was serialized and deserialized between steps and moved over PCIe.

That inefficiency was recently addressed by the formation of the GPU Open Analytics Initiative (GOAI, http://gpuopenanalytics.com/), an industry initiative founded by MapD, H2O.ai and Anaconda. The group created the GPU data frame (GDF), based on Apache Arrow, for passing data between processes while keeping it all in GPU memory.
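
The difference between a serialized handoff and a shared-buffer handoff can be sketched on the CPU with the Python standard library alone. This is only an analogy, not the GDF API: the `column` array is a made-up stand-in for one data frame column, `pickle` stands in for the serialize/deserialize step, and `memoryview` stands in for handing the next stage a reference to memory that stays where it is.

```python
import array
import pickle

# Hypothetical stand-in for one float64 column of a data frame.
column = array.array("d", [1.0, 2.0, 3.0, 4.0])

# Serialized handoff: the consumer receives an independent copy,
# paying for an encode and a decode on every step of the pipeline.
copied = pickle.loads(pickle.dumps(column))

# Shared-buffer handoff: the consumer receives a view onto the same
# memory, so no bytes are copied or re-encoded.
shared = memoryview(column)

# A producer-side update is visible through the view, not in the copy.
column[0] = 10.0
print(shared[0])  # 10.0
print(copied[0])  # 1.0
```

The view sees the update because it points at the same bytes, while the pickled copy does not. That is the property a shared, agreed-upon columnar layout gives to cooperating processes.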

In this talk, Aaron will explain how the GDF technology works, show how it is enabling a diverse set of GPU workloads, and demonstrate how to use a Jupyter Notebook to take advantage of it. We’ll demonstrate on a very large dataset how to manage a full Machine Learning Pipeline with minimal data exchange overhead between MapD’s SQL engine and H2O’s generalized linear model library (GLM).

https://www.eventbrite.com/e/jupyter-pop-up-tickets-42550005211

Transcript

  2. Advanced memory management

     GPU RAM (L1): 24GB to 256GB, 1000-6000 GB/sec. Hot data: speedup = 1500x to 5000x over cold data
     CPU RAM (L2): 32GB to 3TB, 70-120 GB/sec. Warm data: speedup = 35x to 120x over cold data
     SSD or NVRAM storage (L3): 250GB to 20TB, 1-2 GB/sec. Cold data

     Diagram labels: COMPUTE LAYER, STORAGE LAYER, Data Lake/Data Warehouse/System Of Record
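
The warm-data speedup on this slide can be reproduced from the listed bandwidths if one assumes the speedup is simply the ratio of tier bandwidths. That reading is an assumption, not something the transcript states. A minimal sketch, with the `tiers` dictionary and the `speedup_range` helper introduced here purely for illustration:

```python
# Bandwidth ranges in GB/sec as listed on the slide: (low, high).
tiers = {
    "storage": (1, 2),       # SSD or NVRAM (L3)
    "cpu_ram": (70, 120),    # CPU RAM (L2)
    "gpu_ram": (1000, 6000), # GPU RAM (L1)
}

def speedup_range(fast, slow):
    """Return the (min, max) bandwidth ratio of a fast tier over a slow one."""
    # Worst case: slowest fast-tier figure against fastest slow-tier figure.
    # Best case: fastest fast-tier figure against slowest slow-tier figure.
    return fast[0] / slow[1], fast[1] / slow[0]

warm = speedup_range(tiers["cpu_ram"], tiers["storage"])
hot = speedup_range(tiers["gpu_ram"], tiers["storage"])
print(warm)  # (35.0, 120.0)
print(hot)   # (500.0, 6000.0)
```

The warm result matches the slide's 35x to 120x. The same ratio for GPU RAM gives 500x to 6000x, which brackets the slide's 1500x to 5000x, so the hot-data figure presumably rests on assumptions that are not visible in the transcript.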