
When HPC Meets Big Data: Emerging HPC Technologies for Real-Time Data Analytics

SciTech
June 15, 2015

Big data has become a buzzword. Among the many big-data challenges, real-time data analytics has been identified as one of the most exciting and promising areas for both academia and industry. The challenges arise at all levels, ranging from sophisticated algorithms and procedures that mine the gold from massive data to high-performance computing (HPC) techniques and systems that deliver the useful data in time. Our research focuses on the system design and implementation of HPC technologies to address the performance requirements of real-time data analytics. Interestingly, we have also observed an interplay between HPC and real-time data analytics: real-time data analytics in turn poses significant challenges to the design and implementation of HPC technologies. In this talk, I will present our recent research efforts in developing real-time data analytics systems with GPUs and on the cloud. Finally, I will outline our research agenda. More details about our research can be found at http://pdcc.ntu.edu.sg/xtra/.


Transcript

  1. When HPC Meets Big Data: Emerging HPC Technologies for High-Performance Data Management Systems. Bingsheng He, Nanyang Technological University.
  2. The Big Picture
     • Big data is not just big.
       – High performance is a must, not an option.
       – “One size does not fit all”.
     • High performance computing (HPC) hardware & software architectures: parallelism and heterogeneity.
       – Scale up: multicore, many-core, ...
       – Scale out: cluster, cloud, ...
       – Heterogeneity is common in both hardware and software architectures.
     • We report our experience and insights on leveraging emerging HPC technologies for high-performance data management systems.
  3. Outline
     • Motivations
       – Emerging HPC techniques
       – 3 ANYs in big data
     • Our experience on building high-performance data management systems
       – GPGPU for real-time data analytics
       – Scalable and efficient cloud infrastructures for big data
     • Summary
     • Ongoing and future work
  4. Emerging HPC Hardware: Parallelism and Heterogeneity
     • Towards many cores: dual cores >> multi-core array >> scalar plus many cores >> many-core array.
     • From CPU to accelerators (co-processors): GPU, Xeon Phi, FPGA.
     (Figures are adopted from Intel, NVIDIA and Altera.)
  5. Emerging HPC Hardware: Parallelism and Heterogeneity (Cont’d)
     • Towards tightly coupled heterogeneous systems, e.g., Intel-Altera heterogeneous accelerators, the AMD APU, ...
     (Figures are adopted from AMD, Intel and Altera.)
  6. Cloud as Software Infrastructure: Parallelism and Heterogeneity
     • Pay-as-you-go virtual cluster.
     • Heterogeneous virtual machine (VM) offerings
       – Amazon offers 47 on-demand and 39 spot types.
     • Heterogeneity observations
       – Observation 1: VMs of the same type have very different actual computational capability, I/O and network performance.
     [Figure: the distribution of the average VM-to-VM network bandwidth in a virtual cluster of 200 medium VMs on Amazon EC2 over one week.]
  7. Cloud as Software Infrastructure: Parallelism and Heterogeneity
     • Pay-as-you-go virtual cluster.
     • Heterogeneous virtual machine (VM) offerings
       – Amazon offers 47 on-demand and 39 spot types.
     • Heterogeneity observations
       – Observation 1: VMs of the same type have very different actual computational capability, I/O and network performance.
       – Observation 2: the I/O and network bandwidth of the same VM fluctuate significantly (a simple measurement sketch follows this slide).
     [Figure: consecutive bandwidth measurements on the same medium VM pair for one week.]
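     The fluctuation behind Observations 1 and 2 can be quantified with a simple probe. Below is a minimal Python sketch (not the measurement tool used in the talk) that repeatedly times a fixed TCP transfer between two VMs and reports the mean bandwidth and its coefficient of variation; the sink service, port number and payload size are assumptions for illustration.

     # Hedged sketch: probe VM-to-VM TCP bandwidth repeatedly and summarize its
     # fluctuation. Assumes a sink process on the peer VM that accepts connections
     # on SINK_PORT and discards whatever it receives.
     import socket, statistics, time

     SINK_PORT = 5001            # hypothetical port of a byte-sink service on the peer VM
     PAYLOAD = b"x" * (4 << 20)  # 4 MiB per probe

     def probe_bandwidth_mbps(peer_ip: str) -> float:
         """Send a fixed payload to the peer and report achieved Mbit/s."""
         with socket.create_connection((peer_ip, SINK_PORT)) as s:
             start = time.perf_counter()
             s.sendall(PAYLOAD)
             elapsed = time.perf_counter() - start
         return (len(PAYLOAD) * 8 / 1e6) / elapsed

     def summarize(peer_ip: str, rounds: int = 20, pause_s: float = 1.0) -> dict:
         """Repeat the probe and report mean bandwidth and coefficient of variation (CV)."""
         samples = []
         for _ in range(rounds):
             samples.append(probe_bandwidth_mbps(peer_ip))
             time.sleep(pause_s)
         mean = statistics.mean(samples)
         cv = statistics.stdev(samples) / mean if len(samples) > 1 else 0.0
         return {"mean_mbps": mean, "cv": cv, "min": min(samples), "max": max(samples)}

     # Example: summarize("10.0.0.12") on one VM, with a sink listening on the other.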
  8. 3 ANYs in Big Data
     • From enterprises to anyone
       – Internet of Things, mobile, NGS (next-generation sequencing), ...
     • From structured data to any form
       – Data warehouse, text, streaming, graphs, JSON, ...
     • From SQL to any analytics/processing
       – MapReduce, R, eScience, ...
     “One size does not fit all.”
  9. When HPC Meets Big Data
     • (Emerging) data-intensive applications meet emerging hardware & software architectures.
     • System issues: performance, programmability, energy consumption, user interfaces, ...
     • Vision: pervasive HPC for big data. Anyone can leverage HPC to tame the big data challenges anytime and anywhere.
  10. Outline
     • Motivations
       – Emerging HPC techniques
       – 3 ANYs in big data
     • Our experience on building high-performance data management systems
       – GPGPU for real-time data analytics
       – Scalable and efficient cloud infrastructures for big data
     • Summary
     • Ongoing and future work
  11. Our Expeditions on Emerging HPC Technologies
     • GPGPU for real-time data analytics [SIGMOD 08/11, SC07, PACT 08, VLDB 10/11/13 (1+2 demo)/14/15 (2), TPDS (5), ...]
     • Scalable and efficient cloud infrastructures for big data [HPDC15, SC14 (2+1 poster), ICS14, IPDPS14, SoCC 10/12, CIDR13, CLUSTER13, SIGMOD 2010 demo, TPDS (4), TCC (4), ...]
  12. When HPC Meets Big Data
     [Figure: a two-axis chart of performance requirement (“low” to “high”) against data footprint (“small” to “big”). Low-requirement, small-footprint workloads are “trivial”; big footprints call for cluster and cloud computing; high performance requirements call for GPGPU and other emerging hardware; the combination of both leads to the hardware-accelerated cloud.]
  13. Outline
     • Motivations
       – Emerging HPC techniques
       – 3 ANYs in big data
     • Our experience on building high-performance data management systems
       – GPGPU for real-time data analytics
       – Scalable and efficient cloud infrastructures for big data
     • Summary
     • Ongoing and future work
  14. GPU Accelerations
     • The GPU has much higher memory bandwidth than the CPU.
     • The massive thread parallelism of the GPU fits data-parallel processing well (see the kernel sketch below).
     [Figure: the discrete GPU architecture, with multiprocessors (each with processors P1..Pn and local memory) and device memory, connected to the CPU and main memory over PCI-E.]
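     As a concrete illustration of the data-parallel style this slide refers to, here is a minimal Python sketch using Numba's CUDA backend (not GPUQP/GDB itself): one GPU thread evaluates a filter predicate per element, with one transfer in and one out over PCI-e. The predicate, array size and launch configuration are made up.

     # Hedged sketch: element-wise predicate evaluation on the GPU with Numba CUDA.
     # Assumes a CUDA-capable GPU and the numba and numpy packages.
     import numpy as np
     from numba import cuda

     @cuda.jit
     def price_gt(values, threshold, flags):
         i = cuda.grid(1)                     # one thread per element
         if i < values.shape[0]:
             flags[i] = 1 if values[i] > threshold else 0

     values = np.random.rand(1_000_000).astype(np.float32)
     d_values = cuda.to_device(values)                       # one PCI-e transfer in
     d_flags = cuda.device_array(values.size, dtype=np.uint8)

     threads = 256
     blocks = (values.size + threads - 1) // threads
     price_gt[blocks, threads](d_values, np.float32(0.5), d_flags)

     selected = d_flags.copy_to_host().sum()                 # one transfer back
     print("rows passing the predicate:", selected)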
  15. NVIDIA GPUs
                                 Tesla K80       Tesla K40      Tesla K20      Tesla C2050
     Stream processors (cores)   2 x 2496        2880           2496           448
     Core clock                  562 MHz         745 MHz        706 MHz        1.15 GHz
     Memory clock                6 GHz GDDR5     6 GHz GDDR5    5.2 GHz GDDR5  1.5 GHz GDDR5
     VRAM                        2 x 12 GB       12 GB          5 GB           3 GB
     Single precision            8.74 TFLOPS     4.29 TFLOPS    3.52 TFLOPS    1.03 TFLOPS
     Memory bandwidth            2 x 240 GB/sec
     * GPU hardware power grows faster than Moore’s law.
  16. Our Experiences in GPGPU-based Data Management Systems
     • CUDA was released in Feb. 2007.
     • GPUQP (GDB) accepted in SIGMOD 2008 (“best papers”).
     • Mars (GPU-based MapReduce) accepted in PACT 2008 (2nd most-cited paper in PACT*); Mars has been extended to AMD GPUs and Hadoop (TPDS10).
     • GDB supports compressed column-based processing (VLDB10).
     • Transaction executions on GDB (VLDB11).
     • Medusa: GPU-based graph processing (TPDS13/14, VLDB13 best demo, CloudCom13).
     • OmniDB: relational database on coupled CPU/GPU architectures (VLDB’13/14/15, VLDB’13 demo, ...).
     * http://arnetminer.org/conference/pact-124.html
     • Thanks to my advisor, colleagues, and students.
  17. Other Relevant Research on Emerging Architectures
     • OmniDB on coupled CPU-GPU architectures (e.g., AMD APU)
       – Fine-grained query co-processing [VLDB 2015].
       – Portable query processing [VLDB 2013 demo].
       – Pipelining GPU query co-processing [in preparation].
     • PhiDB on Intel Xeon Phi
       – Improving hash join performance [VLDB 15].
     • ReconfigDB on OpenCL-based FPGAs
       – Improving hash join performance [FPL 2015].
       – FPGA-aware database design [in preparation].
  18. OmniDB: Optimized GPU Query Co-Processing on Coupled CPU-GPU Architectures
     • Jiong He*, Shuhao Zhang*, Bingsheng He. In-Cache Query Co-Processing on Coupled CPU-GPU Architectures. PVLDB/VLDB 2015.
     • Jiong He*, Mian Lu, Bingsheng He. Revisiting Co-Processing for Hash Joins on the Coupled CPU-GPU Architecture. PVLDB/VLDB 2013.
     • http://pdcc.ntu.edu.sg/xtra/proj-jiong.html
  19. The Coupled Architecture
     • Coupled CPU-GPU architecture: Intel Sandy Bridge, AMD Fusion APU, etc.
     • New opportunities
       – Remove the PCI-e data transfer overhead.
       – Enable fine-grained workload scheduling.
       – Cache reuse.
     [Figure: the CPU and the GPU sharing the cache and main memory on a single chip.]
  20. Challenges Come with Opportunities
     • Efficient data sharing between the CPU and the GPU
       – Share main memory.
       – Share the last-level cache (LLC).
     • Keep both processors busy
       – The GPU cannot dominate the performance, since its capability is limited by the chip area.
       – How to assign suitable tasks to the CPU/GPU for maximum speedup (a partitioning sketch follows this slide).
     • Our VLDB’13/15 papers study how to address those challenges with fine-grained co-processing mechanisms.
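     One simple way to keep both processors busy is sketched below: split a batch of work units in proportion to the measured throughput of each device so that CPU and GPU finish at about the same time. This is only a toy illustration of the scheduling idea, not the fine-grained mechanism of the VLDB’13/15 papers; the throughput numbers are made up and would come from calibration runs.

     # Hedged sketch: throughput-proportional CPU/GPU workload split.
     def split_workload(total_units: int, cpu_throughput: float, gpu_throughput: float):
         """Return (cpu_units, gpu_units) that roughly equalize finishing times."""
         gpu_share = gpu_throughput / (cpu_throughput + gpu_throughput)
         gpu_units = round(total_units * gpu_share)
         return total_units - gpu_units, gpu_units

     # Example with hypothetical calibrated throughputs (work units per second):
     cpu_units, gpu_units = split_workload(total_units=1_000_000,
                                           cpu_throughput=120.0e3,
                                           gpu_throughput=480.0e3)
     print(cpu_units, gpu_units)   # 200000 units on the CPU, 800000 on the GPU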
  21. Fine-grained vs. Coarse-grained in Hash Joins
     • PL outperforms OL and DD significantly for both hash joins with and without partitioning (PHJ and SHJ, respectively).
     [Figure: elapsed time (s) of SHJ and PHJ under OL (GPU-only), DD, and PL (fine-grained).]
  22. Outline
     • Motivations
       – Emerging HPC techniques
       – 3 ANYs in big data
     • Our experience on building high-performance data management systems
       – GPGPU for real-time data analytics
       – Scalable and efficient cloud infrastructures for big data
     • Summary
     • Ongoing and future work
  23. Our Experiences in Cloud-based Data Management Systems
     • Cloud infrastructures [HotCloud’10, ICPP11, SC14, ICS14, ...]
     • HPC Cloud [TPDS13, SC14, ...]
     • Scientific workflows [HPDC15, TCC13/14, CloudCom14 best Ph.D.]
     • Big data management systems [SoCC10/12, CIDR13, CLUSTER13, TCC14, ...]
     • Domain-specific applications (e.g., the water quality monitoring project) [IPSN14, SECON14 best demo]
  24. Deco: A Declarative Optimization Engine for Resource Provisioning of Scientific Workflows in IaaS Clouds
     • Amelie Chi Zhou*, Bingsheng He, Xuntao Cheng*, Chiew Tong Lau. A Declarative Optimization Engine for Resource Provisioning of Scientific Workflows in IaaS Clouds. ACM HPDC 2015. [19 out of 116]
     • http://pdcc.ntu.edu.sg/xtra/deco/index.html
  25. Scientific Workflows as Big-Data Applications
     • Workflows may handle massive input data.
       – Tasks are loosely coupled via data dependencies.
       – Tasks have very different I/O and computational behaviors.
     • Real-world workflows: Montage, Ligo, Epigenomics, water body simulation, ...
     • Common problems of workflows on the cloud: workflow scheduling, workflow ensemble execution, follow-the-cost, ... (a minimal DAG sketch of a workflow follows this slide).
     [Figure: the Montage and Ligo workflow structures.]
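     To make the data-dependency structure concrete, the sketch below models a tiny made-up workflow as a DAG and computes its makespan from per-task runtimes. It is an illustration only, not Pegasus or any system from the talk; the diamond-shaped workflow and the runtimes are invented.

     # Hedged sketch: a workflow as a DAG of tasks with data dependencies,
     # plus its makespan assuming every ready task runs immediately.
     from functools import lru_cache

     # task -> list of tasks it depends on (its parents)
     deps = {"stage_in": [], "project_a": ["stage_in"], "project_b": ["stage_in"],
             "merge": ["project_a", "project_b"]}
     runtime = {"stage_in": 5.0, "project_a": 20.0, "project_b": 12.0, "merge": 8.0}

     @lru_cache(maxsize=None)
     def finish_time(task: str) -> float:
         """Earliest finish time of a task given its parents' finish times."""
         start = max((finish_time(p) for p in deps[task]), default=0.0)
         return start + runtime[task]

     makespan = max(finish_time(t) for t in deps)
     print("makespan:", makespan)   # 5 + 20 + 8 = 33.0 along the critical path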
  26. Workflow Optimization Challenges
     • Various workflow structures and behaviors.
     • User-defined goals and constraints on budget/performance.
     • Cloud heterogeneity → over 5x difference in the monetary cost of workflow execution while satisfying the same deadline.
     • Cloud dynamics → dynamics in the performance and monetary cost of workflow executions.
     • We need a workflow management system to abstract those complexities and improve the performance/monetary-cost optimizations.
  27. Formulating Workflow Optimizations as Constrained Optimization Problems
     • Observation: many workflow resource provisioning problems can be formulated as constrained optimization problems.
     • The workflow scheduling problem decides the VM type for each task so that the monetary cost is minimized under a deadline constraint.
       – Optimization variables: x_{i,j} = 1 if task i is assigned to instance type j, and 0 otherwise.
       – Optimization goal: minimize the monetary cost, min Σ_i Σ_j x_{i,j} · t_{i,j} · p_j, where t_{i,j} is the average execution time of task i on instance type j and p_j is the unit-time price of instance type j.
       – Constraint: a probabilistic deadline requirement, P(T ≤ D) ≥ θ, where T is the overall execution time of the workflow under the chosen assignment and D is the deadline.
     (An evaluation sketch of this formulation follows this slide.)
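     A minimal sketch of how one candidate assignment could be evaluated against this formulation: the cost uses mean runtimes and unit prices, and the probabilistic deadline P(T ≤ D) ≥ θ is estimated by Monte Carlo sampling of runtime jitter. The four-task workflow, prices and jitter model below are invented for illustration and are not Deco's model.

     # Hedged sketch: cost and probabilistic-deadline check for one assignment.
     import random

     deps = {"t1": [], "t2": ["t1"], "t3": ["t1"], "t4": ["t2", "t3"]}
     mean_rt = {("t1", "small"): 10, ("t1", "large"): 6,     # mean runtime (hours)
                ("t2", "small"): 20, ("t2", "large"): 11,
                ("t3", "small"): 15, ("t3", "large"): 8,
                ("t4", "small"): 12, ("t4", "large"): 7}
     price = {"small": 0.10, "large": 0.35}                  # $ per hour

     def cost(assign):
         return sum(mean_rt[(t, vm)] * price[vm] for t, vm in assign.items())

     def makespan(assign, jitter):
         finish = {}
         for t in ("t1", "t2", "t3", "t4"):                  # topological order
             start = max((finish[p] for p in deps[t]), default=0.0)
             finish[t] = start + mean_rt[(t, assign[t])] * jitter()
         return finish["t4"]

     def meets_deadline(assign, deadline, theta=0.95, trials=2000):
         jitter = lambda: random.uniform(0.8, 1.3)           # +-30% runtime variation
         hits = sum(makespan(assign, jitter) <= deadline for _ in range(trials))
         return hits / trials >= theta

     assign = {"t1": "large", "t2": "large", "t3": "small", "t4": "large"}
     print(cost(assign), meets_deadline(assign, deadline=40))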
  28. Deco: A Declarative Optimization Engine
     • A declarative programming language, WLog (extended from ProLog)
       – Supports a probabilistic notion of performance/cost to capture cloud dynamics.
       – Keywords for optimization goals, constraints and variables.
       – Workflow- and cloud-specific facts for programmability: import(daxfile) and import(cloud).
     • Taming the large search space (a toy search sketch follows this slide)
       – An A*-search strategy to evaluate different VM types.
       – A GPU-accelerated solver.
     • Integration into Pegasus.
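     To illustrate the flavor of an A*-style search over VM types, here is a toy best-first search that expands partial assignments in order of accumulated cost plus an admissible lower bound on the remaining cost, and returns the cheapest assignment whose mean makespan meets the deadline. It reuses the same invented four-task workflow as the previous sketch (restated here so the snippet is self-contained) and is not Deco's solver.

     # Hedged sketch: best-first (A*-style) search over per-task VM types.
     import heapq, itertools

     TASKS = ("t1", "t2", "t3", "t4")
     TYPES = ("small", "large")
     deps = {"t1": [], "t2": ["t1"], "t3": ["t1"], "t4": ["t2", "t3"]}
     mean_rt = {("t1", "small"): 10, ("t1", "large"): 6, ("t2", "small"): 20,
                ("t2", "large"): 11, ("t3", "small"): 15, ("t3", "large"): 8,
                ("t4", "small"): 12, ("t4", "large"): 7}
     price = {"small": 0.10, "large": 0.35}

     def remaining_lower_bound(k):
         """Cheapest possible cost of the still-unassigned tasks TASKS[k:]."""
         return sum(min(mean_rt[(t, v)] * price[v] for v in TYPES) for t in TASKS[k:])

     def mean_makespan(assign):
         finish = {}
         for t in TASKS:                                     # topological order
             start = max((finish[p] for p in deps[t]), default=0.0)
             finish[t] = start + mean_rt[(t, assign[t])]
         return max(finish.values())

     def search(deadline):
         tie = itertools.count()        # tie-breaker so heapq never compares dicts
         frontier = [(remaining_lower_bound(0), next(tie), 0.0, {})]
         while frontier:
             est, _, cost_so_far, assign = heapq.heappop(frontier)
             k = len(assign)
             if k == len(TASKS):
                 if mean_makespan(assign) <= deadline:
                     return assign, cost_so_far   # cheapest feasible plan pops first
                 continue
             t = TASKS[k]
             for v in TYPES:
                 new = {**assign, t: v}
                 c = cost_so_far + mean_rt[(t, v)] * price[v]
                 heapq.heappush(frontier, (c + remaining_lower_bound(k + 1), next(tie), c, new))
         return None, None

     # Cheapest plan that meets the deadline (t1 on "large", the rest on "small"):
     print(search(deadline=40))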
  29. WLog Program Example: the Workflow Scheduling Problem
     import(amazonec2). import(montage).
     goal minimize Ct in totalcost(Ct).
     cons deadline(95%, 10h).
     var configs(Tid,Vid,Con) forall task(Tid) and Vm(Vid).
     r1 path(X,Y,Y,Tp) :- edge(X,Y), exetime(X,Vid,T), configs(X,Vid,Con), Con==1, Tp is T.
     r2 path(X,Y,Z,Tp) :- edge(X,Z), Z\==Y, path(Z,Y,Z2,T1), exetime(X,Vid,T), configs(X,Vid,Con), Con==1, Tp is T+T1.
     r3 maxtime(Path,T) :- setof([Z,T1], path(root,tail,Z,T1), Set), max(Set,[Path,T]).
     r4 cost(Tid,Vid,C) :- configs(Tid,Vid,Con), price(Vid,Up), exetime(Tid,Vid,T), C is T*Up*Con.
     r5 totalcost(Ct) :- findall(C, cost(Tid,Vid,C), Bag), sum(Bag,Ct).
     • ProLog conventions: a set of declarative rules, each in the form h :- c1, c2, ..., cn, with built-in predicates such as is and setof.
     • Step 1: import the cloud- and workflow-related facts.
     • Step 2: specify the optimization goal, constraint and variables of the problem.
     • Step 3: specify the derivation rules.
       – r1 to r3 calculate the overall execution time of a workflow to check the deadline constraint.
       – r4 and r5 calculate the overall monetary cost to evaluate the optimization goal.
  30. System Architecture of Deco
     • Deco is integrated into the Pegasus workflow management system, where it works as a user-defined scheduler.
  31. Evaluation Results of Workflow Scheduling on Amazon EC2
     • Monetary cost reductions (up to 52%) with the same deadline settings, in comparison with Autoscaling [Mao et al. SC12].
     • GPU accelerations achieve over 10x speedup on the optimization engine.
     Workflow                  Montage-1   Montage-4   Montage-8
     Monetary cost reduction   25-40%      45-50%      48-52%
     Speed-up by GPU           12x         10x         20x
  32. Outline
     • Motivations
       – Emerging HPC techniques
       – 3 ANYs in big data
     • Our experience on building high-performance data management systems
       – GPGPU for real-time data analytics
       – Scalable and efficient cloud infrastructures for big data
     • Summary
     • Ongoing and future work
  33. Summary
     • (Big) data management systems continue to be a challenging and exciting research area.
     • Our experiences demonstrate the system insights on performance and programmability in developing high-performance data management systems on HPC architectures.
     • Towards pervasive HPC for big data: anyone can leverage HPC to tame the big data challenges anytime and anywhere.
  34. Ongoing and Future Work
     • Main themes
       – Parallelism and heterogeneity continue to be the major research focus.
       – Besides performance and programmability, other system issues also matter (e.g., energy consumption, availability, reliability, ...).
     • Some interesting directions
       – Emerging processor/accelerator techniques
       – New memory techniques
       – Future cloud computing systems
       – Emerging data-intensive applications
  35. Approximate Hardware
     • Approximate hardware can trade off the accuracy of results for increased performance, reduced energy consumption, or both.
     • Existing studies focus on how to offer approximate computing based on approximate hardware.
     • We ask one radical question: can we use approximate hardware to accelerate precise computing?
     • Our preliminary studies:
       – A design for hybrid hardware (including both precise hardware and approximate hardware).
       – An approximate-and-refine execution paradigm.
     • More details are outlined in my VLDB’14 vision paper.
  36. An Example: Approximate Storage Can Improve Merge Sort
     • On NVRAM, writes can be much slower than reads.
     • Writes on approximate storage can be three times faster than those on precise storage.
     • Approximate-and-refine: the intermediate merge passes run on approximate storage (fast writes, slightly perturbed keys), and a final refine pass restores the precise order (see the sketch below).
     [Figure: merge sort (a) entirely on precise storage vs. (b) on hybrid storage, where the intermediate passes write approximate keys and a refine step produces the precise sorted output.]
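     A minimal sketch of the approximate-and-refine idea in ordinary Python: the expensive sorting pass works on low-precision copies of the keys (standing in for data kept on approximate storage, simulated here by rounding), and a cheap insertion-sort refine pass repairs the small local disorder against the precise keys. This is an illustration of the paradigm only, not the storage-level design of the VLDB'14 paper.

     # Hedged sketch: approximate-and-refine sorting.
     import random

     def approx(key: float) -> float:
         return round(key, 1)        # simulated precision loss on approximate storage

     def refine(records, key):
         """Insertion sort on precise keys; near-linear when input is almost sorted."""
         for i in range(1, len(records)):
             r = records[i]
             j = i - 1
             while j >= 0 and key(records[j]) > key(r):
                 records[j + 1] = records[j]
                 j -= 1
             records[j + 1] = r
         return records

     def approximate_and_refine_sort(records, key):
         nearly_sorted = sorted(records, key=lambda r: approx(key(r)))  # approximate pass
         return refine(nearly_sorted, key)                              # refine pass

     data = [random.uniform(0, 20) for _ in range(1000)]
     out = approximate_and_refine_sort(data, key=lambda x: x)
     assert out == sorted(data)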
  37. Acknowledgement
     • Singapore funds (over 3.8M SGD)
       – NTU startup
       – NTU interdisciplinary strategic fund
       – MoE (Ministry of Education, two Tier-2 grants)
       – NRF (National Research Foundation)
     • Industrial partners
       – Microsoft Research (Asia and Redmond)
       – Amazon Corp.
       – NVIDIA Corp.
  38. Our Research Group: Xtra Computing Group
     • Started in 2010 when I joined NTU.
       – Since then, we have hosted over 10 research staff, 11 Ph.D. students, 10 visiting students, and 3 visiting faculty members (for visits of over two weeks).
       – Collaborations with a number of faculty members from SCE, EEE, CEE and NBS within NTU.
       – Collaborations with overseas universities and companies.
     • Our mission is to build faster, greener and cheaper computing systems.
     • More about the Xtra Computing Group: http://pdcc.ntu.edu.sg/xtra/
  39. When Cloud Meets Water
     • Cloud-Assisted Large-scale and Real-time Water Quality Monitoring (funded by the Singapore NRF).
     • A Sensor+Cloud paradigm: the cloud is the main infrastructure for enabling large-scale and real-time water quality monitoring.
     [Figure: a pattern of interest and its search results, spanning the physical, cyber, information and human worlds.]
  40. Impacts
     • Research contributions (on the cloud part)
       – Elastic computation management (heavy rain vs. sunny days).
       – Parallelization of simulation models.
     • Practice with PUB (the Public Utilities Board)
       – Enables real-time monitoring and simulations for water quality monitoring for PUB.
       – Enables real-time decision making for ABC operations in PUB.