
Using GPUs for Lightning Fast Analytics on MapD

OmniSci
October 25, 2017


GPU-powered in-memory databases and analytics platforms are the logical successors to CPU in-memory systems, largely due to recent increases in the onboard memory available on GPUs. With sufficient memory, GPUs possess numerous advantages over CPUs, including much greater compute and memory bandwidth, as well as a native graphics pipeline for visualization.

In this tutorial, Aaron Williams, VP of Community at MapD, will demonstrate how MapD leverages multiple GPUs per server to deliver orders-of-magnitude performance increases over CPU-based systems, bringing interactive querying and visualization to multi-billion (with a 'b') row datasets.


Transcript

  1. MapD: Extreme Analytics

     100x Faster Queries. MapD Core: the world's fastest columnar database, powered by GPUs (https://github.com/mapd)

     Visualization at the Speed of Thought. MapD Immerse: a visualization front end that leverages the speed and rendering superiority of GPUs
  2. GPUs Recharge Moore's Law

     GPU processing power: +50% per year
     Data growth: +40% per year
     CPU processing power: +20% per year
  3. Core Density Makes a Huge Difference

     GPU processing: 39,000+ cores | CPU processing: 20 cores

     Latency: time to do a task. Throughput: number of tasks per unit time.

     Fictitious example:
     CPU: 1 ns per task -> (1 task/ns) x (20 cores) = 20 tasks/ns
     GPU: 10 ns per task -> (0.1 tasks/ns) x (40,000 cores) = 4,000 tasks/ns
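The slide's fictitious latency-versus-throughput arithmetic can be checked with a few lines of Python (the per-task latencies and core counts are the slide's invented numbers, not real hardware figures):

```python
# Per-core rate = 1 / latency; aggregate throughput = per-core rate * cores.

def throughput_tasks_per_ns(latency_ns_per_task: float, cores: int) -> float:
    """Aggregate throughput in tasks per nanosecond."""
    return (1.0 / latency_ns_per_task) * cores

cpu = throughput_tasks_per_ns(latency_ns_per_task=1.0, cores=20)
gpu = throughput_tasks_per_ns(latency_ns_per_task=10.0, cores=40_000)

print(cpu)        # 20.0 tasks/ns
print(gpu)        # 4000.0 tasks/ns
print(gpu / cpu)  # 200.0x more throughput despite 10x worse per-task latency
```

The point of the example: throughput scales with core count, so a GPU wins on aggregate work even though each individual task is slower.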
  4. MapD Benchmarks

     Blogger Mark Litwintschik benchmarked MapD on a billion-row taxi data set and found it to be 6x to 12,500x faster than the fastest CPU databases.

     MapD Core: Comparative Query Acceleration*

     System                                         | Q1      | Q2     | Q3     | Q4
     BrytlytDB & 2-node p2.16xlarge cluster         | 36x     | 47x    | 25x    | 12x
     ClickHouse, Intel Core i5 4670K                | 49x     | 58x    | 32x    | 25x
     Redshift, 6-node ds2.8xlarge cluster           | 74x     | 24x    | 14x    | 6x
     BigQuery                                       | 95x     | 38x    | 6x     | 6x
     Presto, 50-node n1-standard-4 cluster          | 190x    | 75x    | 61x    | 41x
     Amazon Athena                                  | 305x    | 117x   | 37x    | 13x
     Elasticsearch (heavily tuned)                  | 386x    | 343x   | n/a    | n/a
     Spark 2.1, 11 x m3.xlarge cluster w/ HDFS      | 485x    | 153x   | 119x   | 169x
     Presto, 10-node n1-standard-4 cluster          | 524x    | 189x   | 127x   | 61x
     Vertica, Intel Core i5 4670K                   | 685x    | 607x   | 203x   | 132x
     Elasticsearch (lightly tuned)                  | 1,642x  | 1,194x | n/a    | n/a
     Presto, 5-node m3.xlarge cluster w/ HDFS       | 1,667x  | 735x   | 388x   | 159x
     Presto, 50-node m3.xlarge cluster w/ S3        | 2,048x  | 849x   | 164x   | 86x
     PostgreSQL 9.5 & cstore_fdw                    | 7,238x  | 3,302x | 1,424x | 722x
     Spark 1.6, 5-node m3.xlarge cluster w/ S3      | 12,571x | 5,906x | 3,758x | 1,884x

     *All speed comparisons are to the "MapD & 1 Nvidia Pascal DGX-1" benchmark. Source: http://tech.marksblogg.com/benchmarks.html
  5. MapD Core: the world's fastest in-memory GPU database powers the world's most immersive data exploration experience.
  6. Keeping Data Close to Compute

     MapD maximizes performance by optimizing memory use across a tiered hierarchy (speed increases up the stack, space increases down it):

     Compute layer:
     GPU RAM (L1), 24GB to 256GB, 1,000-6,000 GB/sec: hot data (speedup 1,500x to 5,000x over cold data)
     CPU RAM (L2), 32GB to 3TB, 70-120 GB/sec: warm data (speedup 35x to 120x over cold data)

     Storage layer:
     SSD or NVRAM storage (L3), 250GB to 20TB, 1-2 GB/sec: cold data
     Data lake / data warehouse / system of record
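The tiering idea above (keep the working set in the fastest tier it fits in) can be sketched in a few lines. This is a hypothetical illustration, not MapD's actual memory manager; the capacities are the upper ends of the slide's ranges:

```python
# Tiers ordered fastest-first: (name, capacity in GB, bandwidth in GB/s).
# Capacities are illustrative values taken from the slide, not real limits.
TIERS = [
    ("GPU RAM (L1, hot)",    256,   6000),
    ("CPU RAM (L2, warm)",   3000,  120),
    ("SSD/NVRAM (L3, cold)", 20000, 2),
]

def place(dataset_gb: float) -> str:
    """Return the fastest tier whose capacity can hold the dataset."""
    for name, capacity_gb, _bandwidth in TIERS:
        if dataset_gb <= capacity_gb:
            return name
    return "data lake / system of record"

print(place(100))    # GPU RAM (L1, hot)
print(place(1500))   # CPU RAM (L2, warm)
print(place(10000))  # SSD/NVRAM (L3, cold)
```

A real system manages data at a finer granularity (chunks of columns rather than whole datasets), but the fastest-tier-that-fits rule is the same.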
  7. Query Compilation with LLVM

     Traditional DBs can be highly inefficient:
     - Each operator in SQL is treated as a separate function
     - This incurs tremendous overhead and prevents vectorization

     MapD compiles queries with LLVM to create one custom function:
     - Queries run at speeds approaching hand-written functions
     - LLVM enables generic targeting of different architectures (GPUs, x86, ARM, etc.)
     - Code can be generated to run a query on CPU and GPU simultaneously
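The operator-at-a-time versus compiled-query contrast can be illustrated with a toy sketch. MapD actually emits LLVM IR and compiles it to native CPU/GPU code; here Python's built-in compile() merely stands in for that JIT step, and the query, column names, and rows are made up:

```python
# Made-up "table" for the illustration.
rows = [{"fare": 12.5, "tip": 2.0}, {"fare": 40.0, "tip": 8.0}, {"fare": 7.0, "tip": 0.0}]

# Operator-at-a-time: each SQL operator is a separate function, with
# per-operator call overhead and an intermediate result between steps.
def op_filter(rows, pred):
    return [r for r in rows if pred(r)]

def op_project(rows, expr):
    return [expr(r) for r in rows]

slow = op_project(op_filter(rows, lambda r: r["fare"] > 10), lambda r: r["tip"])

# "Compiled": the same query generated once as a single fused loop, so there
# is one function for the whole query instead of one call per operator.
src = "def q(rows):\n    return [r['tip'] for r in rows if r['fare'] > 10]"
ns = {}
exec(compile(src, "<query>", "exec"), ns)
fast = ns["q"](rows)

print(slow)  # [2.0, 8.0]
print(fast)  # [2.0, 8.0] -> same result, no per-operator dispatch
```

The fused form is what makes vectorization possible: the compiler sees one tight loop over the data rather than opaque calls between operators.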
  8. MapD Immerse: using a hybrid approach to speed and scale visualization

     - Basic charts are frontend-rendered using D3 and related toolkits
     - Scatterplots, pointmaps, and polygons are backend-rendered using the Iris Rendering Engine on GPUs
     - Geo-viz is composited over a frontend-rendered basemap
  9. Server-Side Rendering Process

     [Diagram: frontend sends a Vega spec to the backend's query-to-render pipeline, which returns a PNG]

     Data goes from the compute (CUDA) to the graphics (OpenGL) pipeline without a copy, and comes back as a compressed PNG (~100 KB) rather than raw data (>1 GB).
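The bandwidth saving claimed above is easy to check as back-of-envelope arithmetic (using the slide's ~100 KB and 1 GB figures):

```python
# Bytes sent to the browser: raw query results vs. a server-rendered PNG.
raw_bytes = 1 * 1024**3  # 1 GB of raw rows
png_bytes = 100 * 1024   # ~100 KB compressed PNG

print(raw_bytes // png_bytes)  # 10485 -> roughly four orders of magnitude less data
```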
  10. Hit Testing: a render-to-data operation to get a row id

     - Use an auxiliary integer buffer to store row ids per pixel
     - Use PBOs for GPU-to-CPU transfer for caching
     - Apply a Gaussian-weighted kernel to resolve hits near boundaries
     - Run a SQL query using the row id as a filter

     [Diagram: per-pixel row-id buffer with regions of ids 0, 1, and 2]
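The Gaussian-weighted boundary resolution can be sketched on a tiny id buffer. This is a hypothetical CPU-side illustration of the idea, not MapD's GPU kernel; the buffer contents, neighborhood radius, and sigma are made up:

```python
import math

# Auxiliary per-pixel buffer of row ids; 0 means background.
ID_BUFFER = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 2],
    [0, 2, 2, 2, 2],
]

def hit_test(buf, x, y, radius=1, sigma=1.0):
    """Return the row id with the largest Gaussian-weighted vote near (x, y)."""
    votes = {}
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            px, py = x + dx, y + dy
            if 0 <= py < len(buf) and 0 <= px < len(buf[0]):
                rid = buf[py][px]
                if rid:  # ignore background pixels
                    # Weight falls off with squared distance from the cursor.
                    w = math.exp(-(dx * dx + dy * dy) / (2 * sigma * sigma))
                    votes[rid] = votes.get(rid, 0.0) + w
    return max(votes, key=votes.get) if votes else None

print(hit_test(ID_BUFFER, 3, 1))  # click on the 1/2 boundary -> 1 wins the vote
```

The resolved id then becomes a WHERE-clause filter in an ordinary SQL query to fetch the underlying row.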
  11. Today's MapD Geospatial Use Cases

     - Telecommunications: predictive network performance
     - Energy: dynamic oil well management
     - Federal: cyber-security
     - Telematics: location-based fleet management
     - Adtech: location-based ad effectiveness
     - Social: location modeling
  12. Collaboration with Harvard CGA

     Center for Geographic Analysis at Harvard: accelerating geospatial research and contributing to a vibrant open source community.

     First project: improving access to hydrological models used in water management and public safety, with faster visualization of datasets from the U.S. National Water Model.
  13. What's Coming Next

     - New native data types
     - New spatial analytics functions
     - High-performance clustering and joins
     - Standards: ISO/IEC 13249-3 (SQL/MM, 1999) and Open Geospatial Consortium

     Any guinea pigs out there?
  14. Try MapD Tonight! It's free and it's easy.

     - Play with the live demos: https://www.mapd.com/demos/
     - Download the Community Edition: https://www.mapd.com/platform/download-community/
     - Join our forums: https://community.mapd.com/
     - Review these slides: https://speakerdeck.com/mapd
  15. AWS Credits Available: Free GPU Compute!

     We're looking for interesting geospatial use cases. Email Aaron Williams ([email protected]) with your ideas!