Slide 1

High Performance Computing
Adam DeConinck, R Systems NA, Inc.

Slide 2

Development of models begins at small scale. Working on your laptop is convenient and simple. Actual analysis, however, is slow.

Slide 3

Development of models begins at small scale. Working on your laptop is convenient and simple. Actual analysis, however, is slow. "Scaling up" typically means a small server or a fast multi-core desktop. There is some speedup, but for very large models it is not significant. Single machines don't scale up forever.

Slide 4

For the largest models, a different approach is required.

Slide 5

High-performance computing involves many distinct computer processors working together on the same calculation. Large problems are divided into smaller parts and distributed among the many computers: usually clusters of quasi-independent machines coordinated by a central scheduler.
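
A minimal sketch of this divide-and-distribute pattern in Python, using mpi4py as the message-passing layer (an assumption for illustration; the slides name no framework): a root process splits the work, every process computes its own part, and the partial results are combined.

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()          # this process's ID within the job
size = comm.Get_size()          # total number of cooperating processes

# The root process divides one large problem into one chunk per process.
if rank == 0:
    work = list(range(1_000_000))
    chunks = [work[i::size] for i in range(size)]
else:
    chunks = None

chunk = comm.scatter(chunks, root=0)              # distribute the parts
partial = sum(x * x for x in chunk)               # each process works alone
total = comm.reduce(partial, op=MPI.SUM, root=0)  # combine the results

if rank == 0:
    print("sum of squares:", total)

Run with, e.g., mpiexec -n 4 python sketch.py; on a real cluster, the scheduler launches the processes across many nodes.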

Slide 6

[Diagram: typical HPC cluster. A login node with an external connection, a scheduler, a file server, and compute nodes, joined by an Ethernet network and a high-speed network (10GigE / InfiniBand).]

Slide 7

Performance test: stochastic finance model on an R Systems cluster
• High-end workstation: 8 cores. Maximum speedup of 20x: 4.5 hrs → 14 minutes
• Scale-up is heavily model-dependent: 5x to 100x in our tests; it can be faster
• No more performance gain after ~500 cores. Why? Some operations can't be parallelized.
• Additional cores? Run multiple models simultaneously.
[Chart: duration (s) vs. number of cores, showing performance gains relative to the high-end workstation.]
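
That ~500-core ceiling is Amdahl's law at work: any serial fraction of the computation bounds the achievable speedup, no matter how many cores are added. A minimal Python sketch (the 5% serial fraction below is an assumption chosen to match the ~20x ceiling above, not a measured value):

def amdahl_speedup(serial_fraction, cores):
    """Upper bound on speedup when part of the work can't be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

for n in (8, 64, 500, 5000):
    print(n, "cores:", round(amdahl_speedup(0.05, n), 1), "x")
# prints roughly 5.9x, 15.4x, 19.3x, 19.9x: the curve flattens near 1/0.05 = 20x

With 5% serial work the speedup can never exceed 20x, which is why additional cores are better spent running multiple models side by side.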

Slide 8

Performance comes at a price: complexity.
• New paradigm: real-time analysis vs. batch jobs.
• Applications must be written specifically to take advantage of distributed computing.
• Performance characteristics of applications change.
• Debugging becomes more of a challenge.

Slide 9

New paradigm: real-time analysis vs. batch jobs.
Most small analyses are done in real time:
• "At-your-desk" analysis
• Small models only
• Fast iterations
• No waiting for resources
Large jobs are typically done in a batch model:
• Submit job to a queue
• Much larger models
• Slow iterations
• May need to wait

Slide 10

Applications must be written specifically to take advantage of distributed computing.
• Explicitly split your problem into smaller "chunks"
• "Message passing" between processes
• The entire computation can be slowed by one or two slow chunks
• Exception: "embarrassingly parallel" problems
  - Easy-to-split, independent chunks of computation
  - Thankfully, many useful models fall under this heading (e.g. stochastic models)
"Embarrassingly parallel" = no inter-process communication; see the sketch below.
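
A minimal sketch of an embarrassingly parallel stochastic model using Python's standard library (illustrative only; the deck shows no code, and the toy price model is an assumption): every Monte Carlo path depends only on its own seed, so paths can be farmed out to workers with no message passing at all.

import random
from multiprocessing import Pool

def simulate_path(seed):
    # One independent path of a toy stochastic price model.
    # No communication with any other chunk is needed.
    rng = random.Random(seed)
    price = 100.0
    for _ in range(250):                        # 250 trading days
        price *= 1.0 + rng.gauss(0.0005, 0.01)  # drift + noise
    return price

if __name__ == "__main__":
    with Pool() as pool:                        # one worker per local core
        finals = pool.map(simulate_path, range(10_000))
    print("mean final price:", sum(finals) / len(finals))

On a cluster the structure is the same; the scheduler simply spreads the independent chunks across nodes instead of local cores.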

Slide 11

Performance characteristics of applications change.
On a single machine:
• CPU speed (compute)
• Cache
• Memory
• Disk
On a cluster:
• All the single-machine metrics
• Network
• File server
• Scheduler contention
• Results from other nodes

Slide 12

Debugging becomes more of a challenge.
• More complexity means more pieces that can fail
• Race conditions: the sequence of events is no longer deterministic
• Single nodes can "stall" and slow the entire computation
• The scheduler, file server, and login server all have their own challenges
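
A minimal Python sketch of that nondeterminism (illustrative; not from the deck): parallel chunks finish in a different order on every run, which is exactly what makes distributed bugs hard to reproduce.

import random
import time
from multiprocessing import Pool

def chunk(i):
    time.sleep(random.random() * 0.1)  # simulate uneven node speed
    return i

if __name__ == "__main__":
    with Pool(4) as pool:
        # imap_unordered yields results as they arrive;
        # the order varies from run to run.
        for i in pool.imap_unordered(chunk, range(8)):
            print("finished chunk", i)

Swapping imap_unordered for the order-preserving pool.imap shows the stall problem instead: one slow chunk holds up delivery of every result behind it.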

Slide 13

External resources
• One solution to handling complexity: outsource it!
• Historical HPC facilities: universities and national labs
  - Often have the most absolute compute capacity, and will sell excess capacity
  - But jobs compete with academic projects, and access typically does not include an SLA or high-level support
• Dedicated commercial HPC facilities now provide "on-demand" compute power.

Slide 14

External HPC
• Outsource HPC sysadmin
• No hardware investment
• Pay-as-you-go
• Easy to migrate to new tech
Internal HPC
• Requires in-house expertise
• Major investment in hardware
• Possible idle time
• Upgrades require new hardware

Slide 15

External HPC
• No guaranteed access
• Security arrangements complex
• Limited control of configuration
• Some licensing complex
• Outsource HPC sysadmin
• No hardware investment
• Pay-as-you-go
• Easy to migrate to new tech
Internal HPC
• No external contention
• All internal: easy security
• Full control over configuration
• Simpler licensing control
• Requires in-house expertise
• Major investment in hardware
• Possible idle time
• Upgrades require new hardware

Slide 16

"The Cloud"
• "Cloud computing": virtual machines, with dynamic allocation of resources at an external facility
• Lower performance (virtualization), higher flexibility
• Usually no contracts necessary: pay with your credit card, get 16 nodes
• Often have to do all your own sysadmin
• Low support, high control

Slide 17

CASE STUDY: Windows cluster for an actuarial application

Slide 18

Global insurance company
• Needed 500-1000 cores on a temporary basis
• Preferred a utility, "pay-as-you-go" model
• Experimenting with external resources for "burst" capacity during high-activity periods
• Commercially licensed and supported application
• Requested a proof of concept

Slide 19

Cluster configuration
• The application is embarrassingly parallel, uses small-to-medium data files, and is computationally and memory-intensive
• So: prioritize computation (processors) and access to the fileserver over inter-node communication and large storage
• Upgraded memory in compute nodes to 2 GB/core
• 128-node cluster: 3.0 GHz Intel Xeon processors, 8 cores per node, for 1024 cores total
• Windows HPC Server 2008 R2 operating system
• Application and fileserver on the login node

Slide 20

Stumbling blocks
• Application optimization: the customer had a wide variety of models which generated different usage patterns (IO-, compute-, and memory-intensive jobs). This required dynamic reconfiguration for different conditions.
• Technical issues: an iterative testing process revealed that the application was generating massive fileserver contention. We had to make changes to both software and hardware.
• Human processes: users were accustomed to the internal access model. Changes were required both from the provider (increase ease of use) and from the users (change workflow).
• Security: the customer had never worked with an external provider before. A complex internal security policy had to be reconciled with remote access.

Slide 21

Lessons learned
• Security was the biggest delaying factor. The initial security setup took over 3 months from the first expression of interest, even though cluster setup was done in less than a week.
• That only mattered the first time, though: subsequent runs started much more smoothly.
• A low-cost proof-of-concept run was important for demonstrating feasibility and for working the bugs out.
• A good relationship with the application vendor was extremely important for solving problems and properly optimizing the model for performance.

Slide 22

Recent developments: GPUs

Slide 23

Graphics processing units
• CPU: complex, general-purpose processor
• GPU: highly specialized parallel processor, optimized for the operations common in graphics routines
• Being highly specialized means many more "cores" for the same cost and space:
  - Intel Core i7: 4 cores @ 3.4 GHz: $300 = $75/core
  - NVIDIA Tesla M2070: 448 cores @ 575 MHz: $4,500 = $10/core
• Also higher memory bandwidth: 100+ GB/s for a GPU vs. 10-30 GB/s for a CPU
• The same operations can be adapted for non-graphics applications: "GPGPU"
Image from http://blogs.nvidia.com/2009/12/whats-the-difference-between-a-cpu-and-a-gpu/
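
A minimal sketch of the GPGPU idea in Python, using the CuPy library purely as an assumed stand-in (the deck names no framework; NVIDIA's native interface is CUDA C): array operations are written once and executed across the GPU's hundreds of cores.

import cupy as cp  # assumption: CuPy installed on a CUDA-capable machine

# Draw 10 million random normals and evaluate a toy pricing step,
# all on the GPU; each core handles a slice of the array.
draws = cp.random.standard_normal(10_000_000)
prices = 100.0 * cp.exp(0.0005 + 0.01 * draws)
print("mean price:", float(prices.mean()))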

Slide 24

HPC/actuarial work using GPUs
• Random-number generation
• Finite-difference modeling
• Image processing
Examples:
• Numerical Algorithms Group: GPU random-number generator
• MATLAB: operations on large arrays/matrices
• Wolfram Mathematica: symbolic math analysis
Data from http://www.nvidia.com/object/computational_finance.html