
When HPC Meets Big Data: Emerging HPC Technologies for Real-Time Data Analytics

SciTech
June 15, 2015

Big data has become a buzzword. Among the many big-data challenges, real-time data analytics has been identified as one of the most exciting and promising areas for both academia and industry. The challenges arise at all levels, ranging from sophisticated algorithms and procedures that mine the gold from massive data to high-performance computing (HPC) techniques and systems that deliver the useful data in time. Our research focuses on the system design and implementation of HPC technologies to address the performance requirements of real-time data analytics. Interestingly, we have also observed an interplay between HPC and real-time data analytics: real-time data analytics in turn poses significant challenges to the design and implementation of HPC technologies. In this talk, I will present our recent research efforts in developing real-time data analytics systems with GPUs and on the cloud. Finally, I will outline our research agenda. More details about our research can be found at http://pdcc.ntu.edu.sg/xtra/.


Transcript

  1. When HPC Meets Big Data: Emerging HPC Technologies for High-Performance Data Management Systems. Bingsheng He, Nanyang Technological University.
  2. The Big Picture
     • Big data is not just big.
       – High performance is a must, not an option.
       – “One size does not fit all”.
     • High performance computing (HPC) hardware & software architectures: parallelism and heterogeneity.
       – Scale up: multicore, many-core, ...
       – Scale out: cluster, cloud, ...
       – Heterogeneity is common in both hardware and software architectures.
     • We report our experience and insights on leveraging emerging HPC technologies for high-performance data management systems.
  3. Outline
     • Motivations
       – Emerging HPC techniques
       – 3 ANYs in big data
     • Our experience on building high-performance data management systems
       – GPGPU for real-time data analytics
       – Scalable and efficient cloud infrastructures for big data
     • Summary
     • Ongoing and future work
  4. Emerging HPC Hardware: Parallelism and Heterogeneity
     • Towards many cores: dual cores >> multi-core array >> scalar plus many cores >> many-core array.
     • From CPU to accelerators (co-processors): GPU, Xeon Phi, FPGA.
     (Figures are adopted from Intel, NVIDIA and Altera.)
  5. Emerging HPC Hardware: Parallelism and Heterogeneity (Cont’d)
     • Towards tightly coupled heterogeneous systems, e.g., Intel-Altera heterogeneous accelerators, the AMD APU, ...
     (Figures are adopted from AMD, Intel and Altera.)
  6. Cloud as Software Infrastructure: Parallelism and Heterogeneity
     • Pay-as-you-go virtual cluster.
     • Heterogeneous virtual machine (VM) offerings
       – Amazon offers 47 on-demand and 39 spot types.
     • Heterogeneity observations
       – Observation 1: VMs of the same type have very different actual computational capability, I/O and network performance.
     [Figure: the distribution of the average VM-to-VM network bandwidth in a virtual cluster of 200 medium VMs on Amazon EC2 over one week.]
  7. Cloud as Software Infrastructure: Parallelism and Heterogeneity
     • Pay-as-you-go virtual cluster.
     • Heterogeneous virtual machine (VM) offerings
       – Amazon offers 47 on-demand and 39 spot types.
     • Heterogeneity observations
       – Observation 1: VMs of the same type have very different actual computational capability, I/O and network performance.
       – Observation 2: the I/O and network bandwidth of the same VM fluctuate significantly (a simple measurement sketch follows this slide).
     [Figure: consecutive bandwidth measurements on the same medium VM pair for one week.]
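     The fluctuation behind Observations 1 and 2 can be quantified with a simple probe. Below is a minimal Python sketch (not the measurement tool used in the talk) that repeatedly times a fixed TCP transfer between two VMs and reports the mean bandwidth and its coefficient of variation; the sink service, port number and payload size are assumptions for illustration.

     # Hedged sketch: probe VM-to-VM TCP bandwidth repeatedly and summarize its
     # fluctuation. Assumes a sink process on the peer VM that accepts connections
     # on SINK_PORT and discards whatever it receives.
     import socket, statistics, time

     SINK_PORT = 5001            # hypothetical port of a byte-sink service on the peer VM
     PAYLOAD = b"x" * (4 << 20)  # 4 MiB per probe

     def probe_bandwidth_mbps(peer_ip: str) -> float:
         """Send a fixed payload to the peer and report achieved Mbit/s."""
         with socket.create_connection((peer_ip, SINK_PORT)) as s:
             start = time.perf_counter()
             s.sendall(PAYLOAD)
             elapsed = time.perf_counter() - start
         return (len(PAYLOAD) * 8 / 1e6) / elapsed

     def summarize(peer_ip: str, rounds: int = 20, pause_s: float = 1.0) -> dict:
         """Repeat the probe and report mean bandwidth and coefficient of variation (CV)."""
         samples = []
         for _ in range(rounds):
             samples.append(probe_bandwidth_mbps(peer_ip))
             time.sleep(pause_s)
         mean = statistics.mean(samples)
         cv = statistics.stdev(samples) / mean if len(samples) > 1 else 0.0
         return {"mean_mbps": mean, "cv": cv, "min": min(samples), "max": max(samples)}

     # Example: summarize("10.0.0.12") on one VM, with a sink listening on the other.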
  8. 3 ANYs in Big Data
     • From enterprises to anyone
       – Internet of Things, mobile, NGS (next-generation sequencing), ...
     • From structured data to any form
       – Data warehouse, text, streaming, graphs, JSON, ...
     • From SQL to any analytics/processing
       – MapReduce, R, eScience, ...
     “One size does not fit all.”
  9. When HPC Meets Big Data
     • (Emerging) data-intensive applications meet emerging hardware & software architectures.
     • System issues: performance, programmability, energy consumption, user interfaces, ...
     • Vision: pervasive HPC for big data. Anyone can leverage HPC to tame the big data challenges anytime and anywhere.
  10. Outline
     • Motivations
       – Emerging HPC techniques
       – 3 ANYs in big data
     • Our experience on building high-performance data management systems
       – GPGPU for real-time data analytics
       – Scalable and efficient cloud infrastructures for big data
     • Summary
     • Ongoing and future work
  11. Our Expeditions on Emerging HPC Technologies
     • GPGPU for real-time data analytics [SIGMOD 08/11, SC07, PACT 08, VLDB 10/11/13 (1+2 demo)/14/15 (2), TPDS (5), ...]
     • Scalable and efficient cloud infrastructures for big data [HPDC15, SC14 (2+1 poster), ICS14, IPDPS14, SoCC 10/12, CIDR13, CLUSTER13, SIGMOD 2010 demo, TPDS (4), TCC (4), ...]
  12. When HPC Meets Big Data
     [Figure: a two-axis chart of performance requirement (“low” to “high”) against data footprint (“small” to “big”). Low-requirement, small-footprint workloads are “trivial”; big footprints call for cluster and cloud computing; high performance requirements call for GPGPU and other emerging hardware; the combination of both leads to the hardware-accelerated cloud.]
  13. Outline
     • Motivations
       – Emerging HPC techniques
       – 3 ANYs in big data
     • Our experience on building high-performance data management systems
       – GPGPU for real-time data analytics
       – Scalable and efficient cloud infrastructures for big data
     • Summary
     • Ongoing and future work
  14. GPU Accelerations
     • The GPU has much higher memory bandwidth than the CPU.
     • The massive thread parallelism of the GPU fits data-parallel processing well (see the kernel sketch below).
     [Figure: the discrete GPU architecture, with multiprocessors (each with processors P1..Pn and local memory) and device memory, connected to the CPU and main memory over PCI-E.]
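     As a concrete illustration of the data-parallel style this slide refers to, here is a minimal Python sketch using Numba's CUDA backend (not GPUQP/GDB itself): one GPU thread evaluates a filter predicate per element, with one transfer in and one out over PCI-e. The predicate, array size and launch configuration are made up.

     # Hedged sketch: element-wise predicate evaluation on the GPU with Numba CUDA.
     # Assumes a CUDA-capable GPU and the numba and numpy packages.
     import numpy as np
     from numba import cuda

     @cuda.jit
     def price_gt(values, threshold, flags):
         i = cuda.grid(1)                     # one thread per element
         if i < values.shape[0]:
             flags[i] = 1 if values[i] > threshold else 0

     values = np.random.rand(1_000_000).astype(np.float32)
     d_values = cuda.to_device(values)                       # one PCI-e transfer in
     d_flags = cuda.device_array(values.size, dtype=np.uint8)

     threads = 256
     blocks = (values.size + threads - 1) // threads
     price_gt[blocks, threads](d_values, np.float32(0.5), d_flags)

     selected = d_flags.copy_to_host().sum()                 # one transfer back
     print("rows passing the predicate:", selected)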
  15. NVIDIA GPUs
                                 Tesla K80       Tesla K40      Tesla K20      Tesla C2050
     Stream processors (cores)   2 x 2496        2880           2496           448
     Core clock                  562 MHz         745 MHz        706 MHz        1.15 GHz
     Memory clock                6 GHz GDDR5     6 GHz GDDR5    5.2 GHz GDDR5  1.5 GHz GDDR5
     VRAM                        2 x 12 GB       12 GB          5 GB           3 GB
     Single precision            8.74 TFLOPS     4.29 TFLOPS    3.52 TFLOPS    1.03 TFLOPS
     Memory bandwidth            2 x 240 GB/sec
     * GPU hardware power grows faster than Moore’s law.
  16. Our Experiences in GPGPU-based Data Management Systems
     • CUDA was released in Feb. 2007.
     • GPUQP (GDB) accepted in SIGMOD 2008 (“best papers”).
     • Mars (GPU-based MapReduce) accepted in PACT 2008 (2nd most-cited paper in PACT*); Mars has been extended to AMD GPUs and Hadoop (TPDS10).
     • GDB supports compressed column-based processing (VLDB10).
     • Transaction executions on GDB (VLDB11).
     • Medusa: GPU-based graph processing (TPDS13/14, VLDB13 best demo, CloudCom13).
     • OmniDB: relational database on coupled CPU/GPU architectures (VLDB’13/14/15, VLDB’13 demo, ...).
     * http://arnetminer.org/conference/pact-124.html
     • Thanks to my advisor, colleagues, and students.
  17. Other Relevant Research on Emerging Architectures
     • OmniDB on coupled CPU-GPU architectures (e.g., AMD APU)
       – Fine-grained query co-processing [VLDB 2015].
       – Portable query processing [VLDB 2013 demo].
       – Pipelining GPU query co-processing [in preparation].
     • PhiDB on Intel Xeon Phi
       – Improving hash join performance [VLDB 15].
     • ReconfigDB on OpenCL-based FPGAs
       – Improving hash join performance [FPL 2015].
       – FPGA-aware database design [in preparation].
  18. OmniDB: Optimized GPU Query Co-Processing on Coupled CPU-GPU Architectures
     • Jiong He*, Shuhao Zhang*, Bingsheng He. In-Cache Query Co-Processing on Coupled CPU-GPU Architectures. PVLDB/VLDB 2015.
     • Jiong He*, Mian Lu, Bingsheng He. Revisiting Co-Processing for Hash Joins on the Coupled CPU-GPU Architecture. PVLDB/VLDB 2013.
     • http://pdcc.ntu.edu.sg/xtra/proj-jiong.html
  19. The Coupled Architecture
     • Coupled CPU-GPU architecture: Intel Sandy Bridge, AMD Fusion APU, etc.
     • New opportunities
       – Remove the PCI-e data transfer overhead.
       – Enable fine-grained workload scheduling.
       – Cache reuse.
     [Figure: the CPU and the GPU sharing the cache and main memory on a single chip.]
  20. Challenges Come with Opportunities
     • Efficient data sharing between the CPU and the GPU
       – Share main memory.
       – Share the last-level cache (LLC).
     • Keep both processors busy
       – The GPU cannot dominate the performance, since its capability is limited by the chip area.
       – How to assign suitable tasks to the CPU/GPU for maximum speedup (a partitioning sketch follows this slide).
     • Our VLDB’13/15 papers study how to address those challenges with fine-grained co-processing mechanisms.
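     One simple way to keep both processors busy is sketched below: split a batch of work units in proportion to the measured throughput of each device so that CPU and GPU finish at about the same time. This is only a toy illustration of the scheduling idea, not the fine-grained mechanism of the VLDB’13/15 papers; the throughput numbers are made up and would come from calibration runs.

     # Hedged sketch: throughput-proportional CPU/GPU workload split.
     def split_workload(total_units: int, cpu_throughput: float, gpu_throughput: float):
         """Return (cpu_units, gpu_units) that roughly equalize finishing times."""
         gpu_share = gpu_throughput / (cpu_throughput + gpu_throughput)
         gpu_units = round(total_units * gpu_share)
         return total_units - gpu_units, gpu_units

     # Example with hypothetical calibrated throughputs (work units per second):
     cpu_units, gpu_units = split_workload(total_units=1_000_000,
                                           cpu_throughput=120.0e3,
                                           gpu_throughput=480.0e3)
     print(cpu_units, gpu_units)   # 200000 units on the CPU, 800000 on the GPU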
  21. Fine-grained vs. Coarse-grained in Hash Joins
     • PL outperforms OL and DD significantly for both hash joins with and without partitioning (PHJ and SHJ, respectively).
     [Figure: elapsed time (s) of SHJ and PHJ under OL (GPU-only), DD, and PL (fine-grained).]
  22. Outline
     • Motivations
       – Emerging HPC techniques
       – 3 ANYs in big data
     • Our experience on building high-performance data management systems
       – GPGPU for real-time data analytics
       – Scalable and efficient cloud infrastructures for big data
     • Summary
     • Ongoing and future work
  23. Our Experiences in Cloud-based Data Management Systems
     • Cloud infrastructures [HotCloud’10, ICPP11, SC14, ICS14, ...]
     • HPC Cloud [TPDS13, SC14, ...]
     • Scientific workflows [HPDC15, TCC13/14, CloudCom14 best Ph.D.]
     • Big data management systems [SoCC10/12, CIDR13, CLUSTER13, TCC14, ...]
     • Domain-specific applications (e.g., the water quality monitoring project) [IPSN14, SECON14 best demo]
  24. Deco: A Declarative Optimization Engine for Resource Provisioning of Scientific Workflows in IaaS Clouds
     • Amelie Chi Zhou*, Bingsheng He, Xuntao Cheng*, Chiew Tong Lau. A Declarative Optimization Engine for Resource Provisioning of Scientific Workflows in IaaS Clouds. ACM HPDC 2015. [19 out of 116]
     • http://pdcc.ntu.edu.sg/xtra/deco/index.html
  25. Scientific Workflows as Big-Data Applications
     • Workflows may handle massive input data.
       – Tasks are loosely coupled via data dependencies.
       – Tasks have very different I/O and computational behaviors.
     • Real-world workflows: Montage, Ligo, Epigenomics, water body simulation, ...
     • Common problems of workflows on the cloud: workflow scheduling, workflow ensemble execution, follow-the-cost, ... (a minimal DAG sketch of a workflow follows this slide).
     [Figure: the Montage and Ligo workflow structures.]
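     To make the data-dependency structure concrete, the sketch below models a tiny made-up workflow as a DAG and computes its makespan from per-task runtimes. It is an illustration only, not Pegasus or any system from the talk; the diamond-shaped workflow and the runtimes are invented.

     # Hedged sketch: a workflow as a DAG of tasks with data dependencies,
     # plus its makespan assuming every ready task runs immediately.
     from functools import lru_cache

     # task -> list of tasks it depends on (its parents)
     deps = {"stage_in": [], "project_a": ["stage_in"], "project_b": ["stage_in"],
             "merge": ["project_a", "project_b"]}
     runtime = {"stage_in": 5.0, "project_a": 20.0, "project_b": 12.0, "merge": 8.0}

     @lru_cache(maxsize=None)
     def finish_time(task: str) -> float:
         """Earliest finish time of a task given its parents' finish times."""
         start = max((finish_time(p) for p in deps[task]), default=0.0)
         return start + runtime[task]

     makespan = max(finish_time(t) for t in deps)
     print("makespan:", makespan)   # 5 + 20 + 8 = 33.0 along the critical path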
  26. Workflow Optimization Challenges
     • Various workflow structures and behaviors.
     • User-defined goals and constraints on budget/performance.
     • Cloud heterogeneity → over 5x difference in the monetary cost of workflow execution while satisfying the same deadline.
     • Cloud dynamics → dynamics in the performance and monetary cost of workflow executions.
     • We need a workflow management system to abstract those complexities and improve the performance/monetary-cost optimizations.
  27. Formulating Workflow Optimizations as Constrained Optimization Problems
     • Observation: many workflow resource provisioning problems can be formulated as constrained optimization problems.
     • The workflow scheduling problem decides the VM type for each task so that the monetary cost is minimized under a deadline constraint.
       – Optimization variables: x_{i,j} = 1 if task i is assigned to instance type j, and 0 otherwise.
       – Optimization goal: minimize the monetary cost, min Σ_i Σ_j x_{i,j} · t_{i,j} · p_j, where t_{i,j} is the average execution time of task i on instance type j and p_j is the unit-time price of instance type j.
       – Constraint: a probabilistic deadline requirement, P(T ≤ D) ≥ θ, where T is the overall execution time of the workflow under the chosen assignment and D is the deadline.
     (An evaluation sketch of this formulation follows this slide.)
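     A minimal sketch of how one candidate assignment could be evaluated against this formulation: the cost uses mean runtimes and unit prices, and the probabilistic deadline P(T ≤ D) ≥ θ is estimated by Monte Carlo sampling of runtime jitter. The four-task workflow, prices and jitter model below are invented for illustration and are not Deco's model.

     # Hedged sketch: cost and probabilistic-deadline check for one assignment.
     import random

     deps = {"t1": [], "t2": ["t1"], "t3": ["t1"], "t4": ["t2", "t3"]}
     mean_rt = {("t1", "small"): 10, ("t1", "large"): 6,     # mean runtime (hours)
                ("t2", "small"): 20, ("t2", "large"): 11,
                ("t3", "small"): 15, ("t3", "large"): 8,
                ("t4", "small"): 12, ("t4", "large"): 7}
     price = {"small": 0.10, "large": 0.35}                  # $ per hour

     def cost(assign):
         return sum(mean_rt[(t, vm)] * price[vm] for t, vm in assign.items())

     def makespan(assign, jitter):
         finish = {}
         for t in ("t1", "t2", "t3", "t4"):                  # topological order
             start = max((finish[p] for p in deps[t]), default=0.0)
             finish[t] = start + mean_rt[(t, assign[t])] * jitter()
         return finish["t4"]

     def meets_deadline(assign, deadline, theta=0.95, trials=2000):
         jitter = lambda: random.uniform(0.8, 1.3)           # +-30% runtime variation
         hits = sum(makespan(assign, jitter) <= deadline for _ in range(trials))
         return hits / trials >= theta

     assign = {"t1": "large", "t2": "large", "t3": "small", "t4": "large"}
     print(cost(assign), meets_deadline(assign, deadline=40))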
  28. Deco: A Declarative Optimization Engine
     • A declarative programming language, WLog (extended from ProLog)
       – Supports a probabilistic notion of performance/cost to capture cloud dynamics.
       – Keywords for optimization goals, constraints and variables.
       – Workflow- and cloud-specific facts for programmability: import(daxfile) and import(cloud).
     • Taming the large search space (a toy search sketch follows this slide)
       – An A*-search strategy to evaluate different VM types.
       – A GPU-accelerated solver.
     • Integration into Pegasus.
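     To illustrate the flavor of an A*-style search over VM types, here is a toy best-first search that expands partial assignments in order of accumulated cost plus an admissible lower bound on the remaining cost, and returns the cheapest assignment whose mean makespan meets the deadline. It reuses the same invented four-task workflow as the previous sketch (restated here so the snippet is self-contained) and is not Deco's solver.

     # Hedged sketch: best-first (A*-style) search over per-task VM types.
     import heapq, itertools

     TASKS = ("t1", "t2", "t3", "t4")
     TYPES = ("small", "large")
     deps = {"t1": [], "t2": ["t1"], "t3": ["t1"], "t4": ["t2", "t3"]}
     mean_rt = {("t1", "small"): 10, ("t1", "large"): 6, ("t2", "small"): 20,
                ("t2", "large"): 11, ("t3", "small"): 15, ("t3", "large"): 8,
                ("t4", "small"): 12, ("t4", "large"): 7}
     price = {"small": 0.10, "large": 0.35}

     def remaining_lower_bound(k):
         """Cheapest possible cost of the still-unassigned tasks TASKS[k:]."""
         return sum(min(mean_rt[(t, v)] * price[v] for v in TYPES) for t in TASKS[k:])

     def mean_makespan(assign):
         finish = {}
         for t in TASKS:                                     # topological order
             start = max((finish[p] for p in deps[t]), default=0.0)
             finish[t] = start + mean_rt[(t, assign[t])]
         return max(finish.values())

     def search(deadline):
         tie = itertools.count()        # tie-breaker so heapq never compares dicts
         frontier = [(remaining_lower_bound(0), next(tie), 0.0, {})]
         while frontier:
             est, _, cost_so_far, assign = heapq.heappop(frontier)
             k = len(assign)
             if k == len(TASKS):
                 if mean_makespan(assign) <= deadline:
                     return assign, cost_so_far   # cheapest feasible plan pops first
                 continue
             t = TASKS[k]
             for v in TYPES:
                 new = {**assign, t: v}
                 c = cost_so_far + mean_rt[(t, v)] * price[v]
                 heapq.heappush(frontier, (c + remaining_lower_bound(k + 1), next(tie), c, new))
         return None, None

     # Cheapest plan that meets the deadline (t1 on "large", the rest on "small"):
     print(search(deadline=40))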
  29. WLog Program Example: the Workflow Scheduling Problem
     import(amazonec2). import(montage).
     goal minimize Ct in totalcost(Ct).
     cons deadline(95%, 10h).
     var configs(Tid,Vid,Con) forall task(Tid) and Vm(Vid).
     r1 path(X,Y,Y,Tp) :- edge(X,Y), exetime(X,Vid,T), configs(X,Vid,Con), Con==1, Tp is T.
     r2 path(X,Y,Z,Tp) :- edge(X,Z), Z\==Y, path(Z,Y,Z2,T1), exetime(X,Vid,T), configs(X,Vid,Con), Con==1, Tp is T+T1.
     r3 maxtime(Path,T) :- setof([Z,T1], path(root,tail,Z,T1), Set), max(Set,[Path,T]).
     r4 cost(Tid,Vid,C) :- configs(Tid,Vid,Con), price(Vid,Up), exetime(Tid,Vid,T), C is T*Up*Con.
     r5 totalcost(Ct) :- findall(C, cost(Tid,Vid,C), Bag), sum(Bag,Ct).
     • ProLog conventions: a set of declarative rules, each in the form h :- c1, c2, ..., cn, with built-in predicates such as is and setof.
     • Step 1: import the cloud- and workflow-related facts.
     • Step 2: specify the optimization goal, constraint and variables of the problem.
     • Step 3: specify the derivation rules.
       – r1 to r3 calculate the overall execution time of a workflow to check the deadline constraint.
       – r4 and r5 calculate the overall monetary cost to evaluate the optimization goal.
  30. System Architecture of Deco
     • Deco is integrated into the Pegasus workflow management system, where it works as a user-defined scheduler.
  31. Evaluation Results of Workflow Scheduling on Amazon EC2
     • Monetary cost reductions (up to 52%) with the same deadline settings, in comparison with Autoscaling [Mao et al. SC12].
     • GPU accelerations achieve over 10x speedup on the optimization engine.
     Workflow                  Montage-1   Montage-4   Montage-8
     Monetary cost reduction   25-40%      45-50%      48-52%
     Speed-up by GPU           12x         10x         20x
  32. Outline
     • Motivations
       – Emerging HPC techniques
       – 3 ANYs in big data
     • Our experience on building high-performance data management systems
       – GPGPU for real-time data analytics
       – Scalable and efficient cloud infrastructures for big data
     • Summary
     • Ongoing and future work
  33. Summary
     • (Big) data management systems continue to be a challenging and exciting research area.
     • Our experiences demonstrate the system insights on performance and programmability in developing high-performance data management systems on HPC architectures.
     • Towards pervasive HPC for big data: anyone can leverage HPC to tame the big data challenges anytime and anywhere.
  34. Ongoing and Future Work
     • Main themes
       – Parallelism and heterogeneity continue to be the major research focus.
       – Besides performance and programmability, other system issues also matter (e.g., energy consumption, availability, reliability, ...).
     • Some interesting directions
       – Emerging processor/accelerator techniques
       – New memory techniques
       – Future cloud computing systems
       – Emerging data-intensive applications
  35. Approximate Hardware
     • Approximate hardware can trade off the accuracy of results for increased performance, reduced energy consumption, or both.
     • Existing studies focus on how to offer approximate computing based on approximate hardware.
     • We ask one radical question: can we use approximate hardware to accelerate precise computing?
     • Our preliminary studies:
       – A design for hybrid hardware (including both precise hardware and approximate hardware).
       – An approximate-and-refine execution paradigm.
     • More details are outlined in my VLDB’14 vision paper.
  36. An Example: Approximate Storage Can Improve Merge Sort
     • On NVRAM, writes can be much slower than reads.
     • Writes on approximate storage can be three times faster than those on precise storage.
     • Approximate-and-refine: the intermediate merge passes run on approximate storage (fast writes, slightly perturbed keys), and a final refine pass restores the precise order (see the sketch below).
     [Figure: merge sort (a) entirely on precise storage vs. (b) on hybrid storage, where the intermediate passes write approximate keys and a refine step produces the precise sorted output.]
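     A minimal sketch of the approximate-and-refine idea in ordinary Python: the expensive sorting pass works on low-precision copies of the keys (standing in for data kept on approximate storage, simulated here by rounding), and a cheap insertion-sort refine pass repairs the small local disorder against the precise keys. This is an illustration of the paradigm only, not the storage-level design of the VLDB'14 paper.

     # Hedged sketch: approximate-and-refine sorting.
     import random

     def approx(key: float) -> float:
         return round(key, 1)        # simulated precision loss on approximate storage

     def refine(records, key):
         """Insertion sort on precise keys; near-linear when input is almost sorted."""
         for i in range(1, len(records)):
             r = records[i]
             j = i - 1
             while j >= 0 and key(records[j]) > key(r):
                 records[j + 1] = records[j]
                 j -= 1
             records[j + 1] = r
         return records

     def approximate_and_refine_sort(records, key):
         nearly_sorted = sorted(records, key=lambda r: approx(key(r)))  # approximate pass
         return refine(nearly_sorted, key)                              # refine pass

     data = [random.uniform(0, 20) for _ in range(1000)]
     out = approximate_and_refine_sort(data, key=lambda x: x)
     assert out == sorted(data)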
  37. Acknowledgement
     • Singapore funds (over 3.8M SGD)
       – NTU startup
       – NTU interdisciplinary strategic fund
       – MoE (Ministry of Education, two Tier-2 grants)
       – NRF (National Research Foundation)
     • Industrial partners
       – Microsoft Research (Asia and Redmond)
       – Amazon Corp.
       – NVIDIA Corp.
  38. Our Research Group: Xtra Computing Group
     • Started in 2010 when I joined NTU.
       – Since then, we have hosted over 10 research staff, 11 Ph.D. students, 10 visiting students, and 3 visiting faculty members (for visits of over two weeks).
       – Collaborations with a number of faculty members from SCE, EEE, CEE and NBS within NTU.
       – Collaborations with overseas universities and companies.
     • Our mission is to build faster, greener and cheaper computing systems.
     • More about the Xtra Computing Group: http://pdcc.ntu.edu.sg/xtra/
  39. When Cloud Meets Water
     • Cloud-Assisted Large-scale and Real-time Water Quality Monitoring (funded by the Singapore NRF).
     • A Sensor+Cloud paradigm: the cloud is the main infrastructure for enabling large-scale and real-time water quality monitoring.
     [Figure: a pattern of interest and its search results, spanning the physical, cyber, information and human worlds.]
  40. Impacts
     • Research contributions (on the cloud part)
       – Elastic computation management (heavy rain vs. sunny days).
       – Parallelization of simulation models.
     • Practice with PUB (the Public Utilities Board)
       – Enables real-time monitoring and simulations for water quality monitoring for PUB.
       – Enables real-time decision making for ABC operations in PUB.