Halappanavar and Pothen, 40 pp. Submitted to Parallel Computing.
New multithreaded ordering and coloring algorithms for multicore architectures. Patwary, Gebremedhin, Pothen, 12 pp. EuroPar 2011.
Graph coloring for derivative computation and beyond: Algorithms, software and analysis. Gebremedhin, Nguyen, Pothen, and Patwary, 32 pp. Submitted to TOMS.
◦ Sun Niagara
◦ Cray XMT
A case study on multithreaded graph coloring
◦ An iterative and speculative coloring algorithm
◦ A dataflow algorithm
RMAT graphs: ER, G, and B
Experimental results
B = max back degree over the entire sequence.
• B+1 colors suffice to color G.

Proc.          Threads/Core  Cores/Socket  Threads  Cache      Clock    Multithreading, Other Details
Intel Nehalem  2             4             16       Shared L3  2.5 GHz  Simultaneous; cache coherence protocol
Sun Niagara 2  8             2             128      Shared L2  1.2 GHz  Simultaneous
Cray XMT       128           128 procs.    16,384   None       500 MHz  Interleaved; fine-grained synchronization
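The B+1 bound above is easy to check empirically: each vertex sees at most B already-colored neighbors, so at most B colors are forbidden to it. A minimal sketch (ours, not the authors' code; function names are illustrative):

```python
# For a vertex ordering v1..vn, the "back degree" of v_i is its number of
# neighbors among v1..v_{i-1}; B is the maximum back degree over the whole
# sequence. Greedy coloring along that order needs at most B+1 colors.

def greedy_color(adj, order):
    """Greedy coloring along `order`; colors are 0, 1, 2, ..."""
    color = {}
    for v in order:
        forbidden = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in forbidden:
            c += 1
        color[v] = c
    return color

def max_back_degree(adj, order):
    seen, B = set(), 0
    for v in order:
        B = max(B, sum(1 for u in adj[v] if u in seen))
        seen.add(v)
    return B

# 5-cycle in the natural order: B = 2, so greedy uses at most 3 colors.
adj = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 0]}
order = [0, 1, 2, 3, 4]
B = max_back_degree(adj, order)
coloring = greedy_color(adj, order)
assert max(coloring.values()) + 1 <= B + 1
```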
marches on
Bad news:
◦ Power limits improvement in clock speed
◦ Memory accesses are the bottleneck for high-throughput computing
Major paradigm change; huge opportunity for innovation; eventual consequences are unclear
Current response: multi- and many-core processors
One Solution: Multi-threaded computations
Low latency / high bandwidth
Latency tolerance
Light-weight synchronization mechanisms
Global address space
◦ No partitioning of the problem required
◦ Avoids the memory-consuming profusion of ghost nodes
◦ Correctness and performance are easier
threads mask memory latency if a ready thread is available when a functional unit becomes free
• Interleaved vs. simultaneous multithreading (IMT or SMT)
[Figure from Robert Golla, Sun: thread issue slots over time]
8 hw threads per core
• 1.2 GHz processors linked by an 8 x 9 crossbar to L2 cache banks
• Simultaneous multithreading: two threads from a core can be issued in a cycle
• Shallow pipeline
◦ Context switch between threads in a single clock cycle
◦ Global address space, hashed to memory banks to reduce hot-spots
◦ No cache or local memory; average latency 600 cycles
A memory request doesn't stall the processor
◦ Other threads work while the request is fulfilled
Light-weight, word-level synchronization (full/empty bits)
Notes:
◦ 500 MHz clock
◦ 128 hardware thread streams/proc.
of multithreaded algorithms for graph coloring:
◦ An iterative, coarse-grained method for generic shared-memory architectures
◦ A dataflow algorithm designed for massively multithreaded architectures with hardware support for fine-grained synchronization, such as the Cray XMT
◦ Benchmarked the algorithms on three systems: Cray XMT, Sun Niagara 2, and Intel Nehalem
◦ Excellent speedup observed on all three platforms
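The iterative, coarse-grained method can be sketched as a sequence of speculative rounds: every uncolored vertex is tentatively colored "in parallel" against a possibly stale view of its neighbors, conflicts are then detected, and only the conflicted vertices are recolored in the next round. A sequential emulation (ours; all function names are illustrative, this is not the authors' implementation):

```python
# One speculative round: phase 1 colors every working vertex against a
# snapshot taken at round start (emulating concurrent, unsynchronized
# reads); phase 2 detects same-color edges and recolors the
# higher-numbered endpoint next round.

def speculative_round(adj, color, work):
    snapshot = dict(color)          # stale view, as concurrent threads would see
    for v in work:
        forbidden = {snapshot[u] for u in adj[v] if u in snapshot}
        c = 0
        while c in forbidden:
            c += 1
        color[v] = c
    conflicts = set()
    for v in work:
        for u in adj[v]:
            if u in color and color[u] == color[v]:
                conflicts.add(max(u, v))   # deterministic tie-break
    return conflicts

def iterative_coloring(adj):
    color, work = {}, set(adj)
    while work:                     # the lowest-numbered vertex in `work`
        conflicted = speculative_round(adj, color, work)
        for v in conflicted:        # never conflicts, so `work` shrinks
            del color[v]
        work = conflicted
    return color

# Triangle plus a pendant vertex: converges to a proper coloring.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
coloring = iterative_coloring(adj)
assert all(coloring[u] != coloring[v] for v in adj for u in adj[v])
```

Because conflicts only arise between vertices colored in the same round, and the lowest-numbered vertex of each round always survives, the working set strictly shrinks and the loop terminates.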
are NP-hard. Approximating coloring to within O(n^(1-ε)) is NP-hard for any ε > 0.

GREEDY(G = (V, E))
  Order the vertices in V
  for i = 1 to |V| do
    Determine colors forbidden to v_i
    Assign v_i the smallest permissible color
  end for

A greedy heuristic usually gives a near-optimal solution. The key is to find good orderings for coloring, and many have been developed.
Ref: Gebremedhin, Tarafdar, Manne, Pothen, SIAM J. Sci. Comput. 29:1042–1072, 2007.
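GREEDY translates directly into code, and the effect of a good ordering is easy to see. A minimal sketch (ours, not the ColPack implementation), using the classic largest-first ordering as one example:

```python
# GREEDY from the slide, plus one ordering heuristic: largest-first
# (vertices sorted by decreasing degree).

def greedy(adj, order):
    color = {}
    for v in order:
        forbidden = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in forbidden:      # smallest permissible color
            c += 1
        color[v] = c
    return color

def largest_first(adj):
    return sorted(adj, key=lambda v: len(adj[v]), reverse=True)

# Star K_{1,4}: largest-first colors the hub before the leaves,
# so greedy uses the optimal 2 colors.
adj = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
coloring = greedy(adj, largest_first(adj))
assert max(coloring.values()) + 1 == 2
```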
many-core machines such that
◦ speedup is attained, and
◦ the number of colors is roughly the same as in serial.
A difficult task, since greedy is inherently sequential and the computation is small relative to
Florin Dobrian, John Feo, Assefaw Gebremedhin, Mahantesh Halappanavar, Bruce Hendrickson, Paul Hovland, Gary Kumfert, Fredrik Manne, Ali Pınar, Sivan Toledo, Jean Utke
of four DOE Scientific Discovery through Advanced Computing (SciDAC) Institutes (2006-2012); the only one in applied math
◦ Excellence in research, education and training
◦ Collaborations with science projects in SciDAC
Focus not on a specific application, but on algorithms and software for combinatorial problems
Participants from Purdue, Sandia, Argonne, Ohio State, Colorado State
CSCAPES workshops with talks, tutorials on software, and discussions on collaborations
◦ Adaptive, unstructured data structures
◦ Complex, multiphysics simulations
◦ Multiscale computations in space and time
◦ Complex synchronizations (e.g., discrete events)
Significant parallelization challenges on today's machines
• 128 hw thread streams / proc.
• In each cycle a proc. issues one ready thread
• Deeply pipelined; M, A, C functional units
• Cache-less, globally shared memory
• Efficient hardware synchronization via full/empty bits
• Data mapped randomly in 8-byte blocks; no locality
• Average memory latency 600 cycles
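The full/empty bits mentioned above can be emulated in software to show the semantics: a word's read blocks until the word is full and leaves it empty; a write blocks until it is empty and leaves it full. A sketch (ours) using a condition variable per cell, with methods named after the XMT-style readfe/writeef generics; on the XMT this is a hardware tag bit, not a lock:

```python
import threading

class FullEmptyCell:
    """Software emulation of one memory word with a full/empty bit."""

    def __init__(self):
        self._cond = threading.Condition()
        self._full = False
        self._value = None

    def writeef(self, value):        # write when Empty, leave Full
        with self._cond:
            while self._full:
                self._cond.wait()
            self._value, self._full = value, True
            self._cond.notify_all()

    def readfe(self):                # read when Full, leave Empty
        with self._cond:
            while not self._full:
                self._cond.wait()
            self._full = False
            self._cond.notify_all()
            return self._value

# Producer/consumer hand-off: the reader blocks until the writer fills the cell.
cell, out = FullEmptyCell(), []
t = threading.Thread(target=lambda: out.append(cell.readfe()))
t.start()
cell.writeef(42)
t.join()
assert out == [42]
```

This word-level blocking is what lets the dataflow coloring algorithm synchronize per vertex without coarse locks.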
greedy coloring such that
◦ speedup is attained
◦ the number of colors used is roughly the same as in serial.
A difficult task, since greedy is inherently sequential, the computation is small relative to communication, and data accesses are irregular.
D1 coloring: approaches based on Luby's algorithm
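The Luby-style baseline mentioned above repeatedly extracts an independent set in parallel (a vertex joins when its random priority beats all of its uncolored neighbors') and gives the whole set the next color. A simplified sequential sketch (ours; Luby's original algorithm additionally iterates within a round to make each set maximal):

```python
import random

def luby_style_coloring(adj, seed=0):
    rng = random.Random(seed)
    color, uncolored, c = {}, set(adj), 0
    while uncolored:
        prio = {v: rng.random() for v in uncolored}
        # All comparisons use one priority snapshot, so the winners form
        # an independent set and could be selected concurrently.
        indep = {v for v in uncolored
                 if all(prio[v] > prio[u] for u in adj[v] if u in uncolored)}
        for v in indep:
            color[v] = c
        uncolored -= indep
        c += 1                      # one new color per round
    return color

# Triangle with a tail: same-colored vertices are never adjacent.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
coloring = luby_style_coloring(adj)
assert all(coloring[u] != coloring[v] for v in adj for u in adj[v])
```

Each round colors at least the globally highest-priority uncolored vertex, so the loop always terminates; the price, compared with greedy, is typically more colors.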
Designed specialized parallel algorithms for distance-1 coloring
◦ Experimentally studied how to tune "parameters" according to
  - size, density, and distribution of the input graph
  - number of processors
  - computational platform
Extending the framework (SISC, under review)
◦ Designed parallel algorithms for D2 and restricted star coloring (to support Hessian computation)
◦ Designed parallel algorithms for D2 coloring of bipartite
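Distance-2 (D2) coloring, which the extensions above target, requires every pair of vertices within two hops to receive different colors. A sequential greedy sketch (ours, not the authors' parallel algorithm):

```python
# Greedy distance-2 coloring: a vertex's color must differ from that of
# every colored vertex within two hops. This is the constraint used for
# derivative-matrix compression on the associated graphs.

def greedy_d2_color(adj, order):
    color = {}
    for v in order:
        forbidden = set()
        for u in adj[v]:                 # distance-1 neighbors
            if u in color:
                forbidden.add(color[u])
            for w in adj[u]:             # distance-2 neighbors
                if w != v and w in color:
                    forbidden.add(color[w])
        c = 0
        while c in forbidden:
            c += 1
        color[v] = c
    return color

# Path 0-1-2: all three vertices are pairwise within distance 2,
# so a D2 coloring needs 3 colors even though D1 needs only 2.
adj = {0: [1], 1: [0, 2], 2: [1]}
d2 = greedy_d2_color(adj, [0, 1, 2])
assert len(set(d2.values())) == 3
```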
Jacobian computation via distance-2 coloring algorithms on bipartite graphs
◦ Developed novel algorithms for acyclic, star, distance-k (k = 1, 2) and other coloring problems; developed associated matrix recovery algorithms
◦ Delivered implementations via the software package ColPack (released Oct. 2008)
◦ Interfaced ColPack with the AD tool ADOL-C
Application highlights
◦ Enabled Jacobian computation in Simulated Moving Beds
is your Jacobian? Graph coloring for computing derivatives. SIAM Review 47(4):627–705, 2005.
Gebremedhin, Tarafdar, Manne and Pothen. New acyclic and star coloring algorithms with applications to computing Hessians. SIAM J. Sci. Comput. 29:1042–1072, 2007.
Gebremedhin, Pothen and Walther.
increasingly include:
◦ Adaptive, unstructured data structures
◦ Complex, multiphysics simulations
◦ Multiscale computations in space and time
◦ Complex synchronizations (e.g., discrete events)
Significant parallelization challenges on today's machines
latency
◦ Particularly true for data-centric applications
◦ Random accesses to the global address space
◦ Perhaps many at once: fine-grained parallelism
Essentially no computation to hide access time
Access pattern is data dependent
◦ Prefetching is unlikely to help
◦ Usually only a small part of a cache line is wanted