Slide 1

Slide 1 text

1 Iterative sparse matrix partitioning Supervisor: Prof. dr. Rob H. Bisseling Davide Taviani October 7th, 2013

Slide 2

Slide 2 text

2 Parallel sparse matrix-vector multiplication At the core of many iterative solvers (e.g. the conjugate gradient method) lies a simple operation: sparse matrix-vector multiplication. Given: an m × n sparse matrix A (N nonzeros, N ≪ mn) and an n × 1 vector v, we want to compute u = Av
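As a point of reference, here is a minimal sequential sketch of u = Av, with A stored as a list of nonzero triplets (the COO-style storage and the names spmv/nonzeros are illustrative assumptions, not notation from the talk):

def spmv(m, nonzeros, v):
    # nonzeros: list of (i, j, a_ij) triplets of the sparse matrix A
    # v: dense input vector of length n; returns u = A v of length m
    u = [0.0] * m
    for i, j, a in nonzeros:
        u[i] += a * v[j]      # only the N nonzeros contribute: O(N) work
    return u

# Example: a 2 x 3 matrix with 3 nonzeros
print(spmv(2, [(0, 0, 2.0), (0, 2, 1.0), (1, 1, 3.0)], [1.0, 1.0, 1.0]))  # [3.0, 3.0]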

Slide 3

Slide 3 text

3 Parallel sparse matrix-vector multiplication Usually A is fairly large and a lot of computation is required: O(mn) following the definition of matrix-vector multiplication; O(N) when only the nonzero elements are considered. We split the computation among p processors to improve speed: we partition the set of nonzeros of A, obtaining p disjoint sets A0 , . . . , Ap−1 . Furthermore, the input vector v and the final output u can also be divided among those p processors (their distributions need not be the same).

Slide 4

Slide 4 text

4 Matrix partitioning Example of a partition of a 9 × 9 matrix with 18 nonzeros, with p = 2.

Slide 5

Slide 5 text

5 Matrix partitioning Local view of the matrix for every processor: P(0) P(1)

Slide 6

Slide 6 text

6 Parallel matrix-vector multiplication algorithm Parallel sparse matrix-vector multiplication consists (essentially) of 3 phases: I) fan-out II) local multiplication III) fan-in

Slide 7

Slide 7 text

7 Parallel matrix-vector multiplication algorithm A is partitioned along with u and v u v A

Slide 8

Slide 8 text

7 Parallel matrix-vector multiplication algorithm Fan-out: each processor receives the required elements of v from the others (according to its distribution) u v A

Slide 9

Slide 9 text

7 Parallel matrix-vector multiplication algorithm Local multiplication: where the actual computation is performed u v A

Slide 10

Slide 10 text

7 Parallel matrix-vector multiplication algorithm Fan-in: where each processor sends its contributions to the other processors according to the distribution of u u v A
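To make the three phases concrete, here is a small sketch that simulates them for p processors and also counts the communication volume; the per-processor dictionaries and the owner arrays are illustrative assumptions about the data layout, not the actual implementation used in the thesis.

def parallel_spmv(m, parts, v_owner, u_owner, v):
    # parts[s]: list of (i, j, a_ij) nonzeros assigned to processor s
    # v_owner[j] / u_owner[i]: processor owning v[j] / u[i]
    p = len(parts)
    volume = 0
    # Phase I (fan-out): each processor fetches the entries of v it needs
    local_v = [dict() for _ in range(p)]
    for s in range(p):
        for _, j, _ in parts[s]:
            if j not in local_v[s]:
                local_v[s][j] = v[j]
                if v_owner[j] != s:
                    volume += 1        # one word communicated during fan-out
    # Phase II (local multiplication): independent partial sums, no communication
    partial_u = [dict() for _ in range(p)]
    for s in range(p):
        for i, j, a in parts[s]:
            partial_u[s][i] = partial_u[s].get(i, 0.0) + a * local_v[s][j]
    # Phase III (fan-in): partial sums are sent to the owner of u[i] and added
    u = [0.0] * m
    for s in range(p):
        for i, contrib in partial_u[s].items():
            u[i] += contrib
            if u_owner[i] != s:
                volume += 1            # one word communicated during fan-in
    return u, volume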

Slide 11

Slide 11 text

8 Matrix partitioning To optimize this process: phases I and III involve communication: it has to be minimized; phase II is a computation step: we need balance in the size of the partitions. Optimization problem: partition the nonzeros such that the balance constraint is satisfied and the communication volume is minimized.
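In the standard formulation used by Mondriaan-style partitioners (the load-imbalance parameter \varepsilon is introduced here for illustration and is not on the slide), the problem reads

\min V(A_0, \dots, A_{p-1}) \quad \text{subject to} \quad \max_{0 \le s < p} \mathrm{nz}(A_s) \le (1 + \varepsilon)\,\frac{N}{p},

where V is the communication volume of phases I and III and nz(A_s) is the number of nonzeros assigned to processor s.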

Slide 12

Slide 12 text

9 Matrix partitioning As a last example, a 6 × 6 “checkerboard” matrix:

Slide 13

Slide 13 text

10 Matrix partitioning Two different partitionings result in extremely different communication volumes. (a) Rows and columns are not split, therefore there is no need for communication. (b) Every row and column is split and causes communication during fan-in and fan-out.

Slide 14

Slide 14 text

11 Hypergraph partitioning Exact modeling of the matrix partitioning problem through hypergraph partitioning. A partition of a hypergraph is simply a partition of the set of vertices V into V0 , . . . , Vp−1 . A hyperedge e = {v1 , . . . , vk } is cut if at least two of its vertices belong to different sets of the partition.

Slide 15

Slide 15 text

12 Hypergraph partitioning Hypergraph: a graph in which a hyperedge can connect more than two vertices (i.e. a hyperedge is a subset of the vertex set V ). Example with vertices v1 , . . . , v7 and hyperedges e1 = {v1 , v3 , v4 }, e2 = {v4 , v7 }, e3 = {v5 , v7 }, e4 = {v2 }

Slide 16

Slide 16 text

13 Hypergraph partitioning There are several models to translate matrix partitioning into hypergraph partitioning: 1-dimensional row-net: each column of A is a vertex in the hypergraph, each row a hyperedge. If aij ≠ 0, then column j is placed in hyperedge i. column-net: identical to the previous one, with the roles of columns and rows exchanged. As hypergraph partitioning consists of assigning the vertices, the columns (row-net) or rows (column-net) are never cut. Advantage of completely eliminating one source of communication, but the 1-dimensional restriction is often too strong.
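A small sketch of the row-net construction, together with the standard (lambda − 1) connectivity count that the hypergraph models use to measure the communication volume (function and variable names are illustrative):

def row_net_hypergraph(nonzeros):
    # Row-net model: one vertex per column, one net (hyperedge) per row;
    # column j joins net i whenever a_ij != 0.
    nets = {}
    for i, j, _ in nonzeros:
        nets.setdefault(i, set()).add(j)
    return nets

def communication_volume(nets, part_of_column):
    # (lambda - 1) metric: a net connecting lambda parts costs lambda - 1.
    vol = 0
    for cols in nets.values():
        lam = len({part_of_column[j] for j in cols})
        vol += lam - 1
    return vol

# Example: row 0 has nonzeros in columns 0 and 1, which end up in different parts
nets = row_net_hypergraph([(0, 0, 1.0), (0, 1, 1.0), (1, 1, 1.0), (2, 2, 1.0)])
print(communication_volume(nets, {0: 0, 1: 1, 2: 1}))   # 1: only row 0 is cut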

Slide 17

Slide 17 text

14 Hypergraph partitioning 2-dimensional fine grain: nonzeros of A are vertices, rows and columns are hyperedges. The nonzero aij is placed in the row hyperedge i and the column hyperedge j. A lot of freedom in partitioning (each nonzero can be assigned individually), but the size of the hypergraph (N vertices) is often too large. medium grain: middle ground between the 1-dimensional models and fine-grain. Good compromise between the size of the hypergraph and freedom during the partitioning.

Slide 18

Slide 18 text

15 Medium grain (Daniel M. Pelt and Rob Bisseling, 2013, to appear) A

Slide 19

Slide 19 text

15 Medium grain (Daniel M. Pelt and Rob Bisseling, 2013, to appear) A Ac Ar Initial split of A into Ac and Ar

Slide 20

Slide 20 text

15 Medium grain (Daniel M. Pelt and Rob Bisseling, 2013, to appear) A Ac Ar B Initial split of A into Ac and Ar Construction of the (m + n) × (m + n) matrix B (with dummy diagonal elements)
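A plausible way to picture B (the exact block placement below is my reconstruction and should be checked against the Pelt-Bisseling paper; the slide only states that B is (m + n) × (m + n) with dummy diagonal elements):

B = \begin{pmatrix} I_n & (A_r)^T \\ A_c & I_m \end{pmatrix}

With such a layout, each column of B collects either one column of Ac or one row of Ar (plus a dummy diagonal entry), which is exactly what lets the row-net partitioning of B keep those nonzeros together.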

Slide 21

Slide 21 text

16 Medium grain

Slide 22

Slide 22 text

16 Medium grain row-net model Partitioning of B with the row-net model (columns are kept together)


Slide 24

Slide 24 text

17 Medium grain B

Slide 25

Slide 25 text

17 Medium grain Ac Ar B Retrieval of Ar and Ac with the new partitioning

Slide 26

Slide 26 text

17 Medium grain A Ac Ar B Retrieval of Ar and Ac with the new partitioning Reassembling of A

Slide 27

Slide 27 text

18 Medium grain A Clusters of nonzeros are grouped together: in Ar we kept together elements of the same row; in Ac elements of the same column.

Slide 28

Slide 28 text

19 Research directions Two research directions: Improving the initial partitioning of A Development of a fully iterative scheme: lowering the communication volume by using information on the previous partitioning These directions can be combined: we can try to find efficient ways of splitting A into Ar and Ac , distinguishing between: partition-oblivious heuristics: no prior information is required partition-aware heuristics: require A to be already partitioned

Slide 29

Slide 29 text

20 General remarks A few general principles to guide us in the construction of the heuristics: short rows/columns (w.r.t. the number of nonzeros) are more likely to be uncut in a good partitioning if a row/column is uncut, the partitioner decided at the previous iteration that it was convenient to do so. We shall try, as much as possible, to keep those rows/columns uncut again.

Slide 30

Slide 30 text

21 Individual assignment of nonzeros A simple heuristic is the extension of the original algorithm used in medium-grain. Partition-oblivious version:
for all aij ∈ A do
  if nzr(i) < nzc(j) then
    assign aij to Ar
  else if nzc(j) < nzr(i) then
    assign aij to Ac
  else
    assign aij according to the tie-breaker
  end if
end for
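A runnable sketch of this partition-oblivious split; the tie-breaker shown (preferring Ac) is just one possible choice, as the next slides illustrate:

def split_oblivious(nonzeros, tie_breaker="Ac"):
    # nz_r(i) / nz_c(j): number of nonzeros in row i / column j
    nz_r, nz_c = {}, {}
    for i, j, _ in nonzeros:
        nz_r[i] = nz_r.get(i, 0) + 1
        nz_c[j] = nz_c.get(j, 0) + 1
    Ar, Ac = [], []
    for i, j, a in nonzeros:
        if nz_r[i] < nz_c[j]:
            Ar.append((i, j, a))          # row has fewer nonzeros: group by row
        elif nz_c[j] < nz_r[i]:
            Ac.append((i, j, a))          # column has fewer nonzeros: group by column
        else:
            (Ar if tie_breaker == "Ar" else Ac).append((i, j, a))
    return Ar, Ac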

Slide 31

Slide 31 text

22 Individual assignment of nonzeros Ac Ar tie-breaking: Ac


Slide 33

Slide 33 text

23 Individual assignment of nonzeros Partition-aware version:
for all aij ∈ A do
  if row i is uncut and column j is cut then
    assign aij to Ar
  else if row i is cut and column j is uncut then
    assign aij to Ac
  else
    assign aij as in the partition-oblivious variant
  end if
end for
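The partition-aware variant only needs to know which rows and columns are currently cut; a self-contained sketch (cut_rows and cut_cols are assumed to be sets derived from the previous partitioning):

def split_aware(nonzeros, cut_rows, cut_cols, tie_breaker="Ar"):
    nz_r, nz_c = {}, {}
    for i, j, _ in nonzeros:
        nz_r[i] = nz_r.get(i, 0) + 1
        nz_c[j] = nz_c.get(j, 0) + 1
    Ar, Ac = [], []
    for i, j, a in nonzeros:
        if i not in cut_rows and j in cut_cols:
            Ar.append((i, j, a))          # keep the uncut row together
        elif i in cut_rows and j not in cut_cols:
            Ac.append((i, j, a))          # keep the uncut column together
        elif nz_r[i] < nz_c[j]:           # otherwise: the partition-oblivious rule
            Ar.append((i, j, a))
        elif nz_c[j] < nz_r[i]:
            Ac.append((i, j, a))
        else:
            (Ar if tie_breaker == "Ar" else Ac).append((i, j, a))
    return Ar, Ac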

Slide 34

Slide 34 text

24 Individual assignment of nonzeros Ac Ar tie-breaking: Ar


Slide 38

Slide 38 text

25 Assignment of blocks of nonzeros Separated Block Diagonal (SBD) form of a partitioned matrix: we separate uncut and cut rows and columns. [Figure: the example 9 × 9 matrix before and after the SBD permutation of its rows and columns.]

Slide 39

Slide 39 text

26 Assignment of blocks of nonzeros The SBD form is a 3 × 3 block matrix
\begin{pmatrix} \dot{A}_{00} & \dot{A}_{01} & \\ \dot{A}_{10} & \dot{A}_{11} & \dot{A}_{12} \\ & \dot{A}_{21} & \dot{A}_{22} \end{pmatrix}
\dot{A}_{01}, \dot{A}_{10}, \dot{A}_{12}, \dot{A}_{21} can be easily assigned in our framework.

Slide 40

Slide 40 text

26 Assignment of blocks of nonzeros
\begin{pmatrix} A_r/A_c & A_r & \\ A_c & M & A_c \\ & A_r & A_r/A_c \end{pmatrix}
\dot{A}_{01}, \dot{A}_{10}, \dot{A}_{12}, \dot{A}_{21} can be easily assigned in our framework. Ar/Ac means that the size of the block determines whether it is assigned to Ar or Ac; the nonzeros in the middle block are assigned individually (M stands for “mixed” assignment)

Slide 41

Slide 41 text

27 Assignment of blocks of nonzeros [Figure: the SBD-permuted matrix with its blocks assigned to Ar and Ac as above.]

Slide 42

Slide 42 text

27 Assignment of blocks of nonzeros We reverse the permutations of rows and columns, obtaining A back, with the new assignment. [Figure: the matrix in its original row and column order.]

Slide 43

Slide 43 text

28 Assignment of blocks of nonzeros Separated Block Diagonal form of order 2 (SBD2) of a matrix: we split the top, bottom, left and right blocks, separating the empty and nonempty parts. [Figure: the SBD form and its further SBD2 permutation.]

Slide 44

Slide 44 text

29 Assignment of blocks of nonzeros The SBD2 form of a matrix is the following 5 × 5 block matrix:
\begin{pmatrix} \ddot{A}_{00} & \ddot{A}_{01} & & & \\ \ddot{A}_{10} & \ddot{A}_{11} & \ddot{A}_{12} & & \\ & \ddot{A}_{21} & \ddot{A}_{22} & \ddot{A}_{23} & \\ & & \ddot{A}_{32} & \ddot{A}_{33} & \ddot{A}_{34} \\ & & & \ddot{A}_{43} & \ddot{A}_{44} \end{pmatrix}
In this form, besides having information on the nonzeros themselves (rows/columns cut/uncut), we also have information on their neighbors (nonzeros in the same row and column).

Slide 45

Slide 45 text

29 Assignment of blocks of nonzeros
\begin{pmatrix} A_r & A_r & & & \\ A_c & A_r/A_c & A_r & & \\ & A_c & M & A_c & \\ & & A_r & A_r/A_c & A_c \\ & & & A_r & A_c \end{pmatrix}
In this form, besides having information on the nonzeros themselves (rows/columns cut/uncut), we also have information on their neighbors (nonzeros in the same row and column).

Slide 46

Slide 46 text

30 Individual assignment of blocks of nonzeros [Figure: the SBD2-permuted matrix with the nonzeros assigned per block.]

Slide 47

Slide 47 text

30 Individual assignment of blocks of nonzeros We reverse the permutations of rows and columns, obtaining A back, with the new assignment.

Slide 48

Slide 48 text

31 Partial assignment of rows and columns Main idea: every time we assign a nonzero to either Ar or Ac , all the other nonzeros in the same row (for Ar) or column (for Ac) should be assigned to the same part as well, to prevent communication. Main issue: it is hard to assign complete rows/columns: a nonzero cannot be assigned to both Ar and Ac . We need to reason in terms of partial assignment: computation of a priority vector: a permutation of the indices {0, . . . , m + n − 1} (in decreasing priority), where {0, . . . , m − 1} correspond to rows and {m, . . . , m + n − 1} to columns; an overpainting algorithm.

Slide 49

Slide 49 text

32 Overpainting algorithm
Require: priority vector v, matrix A
Ensure: Ar , Ac
Ar := Ac := ∅
for i = m + n − 1, . . . , 0 do
  if vi < m then
    add the nonzeros of row vi to Ar
  else
    add the nonzeros of column vi − m to Ac
  end if
end for
In this formulation of the algorithm, every nonzero is assigned twice (once via its row and once via its column; the later, higher-priority assignment overwrites the earlier one); the algorithm is completely deterministic: Ar and Ac depend entirely on the priority vector v.
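A sketch of the overpainting loop; indices below m are rows and indices from m on are columns, as on the slide, and the per-nonzero dictionary is just one way of recording the current colour:

def overpaint(priority, nonzeros, m):
    # priority: permutation of {0, ..., m+n-1}, highest priority first
    assignment = {}                        # (i, j) -> "Ar" or "Ac"
    for idx in reversed(priority):         # paint from lowest to highest priority,
        if idx < m:                        # so later (higher-priority) rows/columns
            for i, j, _ in nonzeros:       # overwrite earlier assignments
                if i == idx:
                    assignment[(i, j)] = "Ar"
        else:
            for i, j, _ in nonzeros:
                if j == idx - m:
                    assignment[(i, j)] = "Ac"
    Ar = [nz for nz in nonzeros if assignment[(nz[0], nz[1])] == "Ar"]
    Ac = [nz for nz in nonzeros if assignment[(nz[0], nz[1])] == "Ac"]
    return Ar, Ac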

Slide 50

Slide 50 text

33 Overpainting algorithm Example: let v := {0, 9, 1, 10, 2, 11, 3, 12, 4, 13, 5, 14, 6, 15, 7, 16, 8, 17} Ac Ar


Slide 69

Slide 69 text

34 Computation of the priority vector v We used a structured approach for the construction of v: 30 different heuristics. Generating schemes with three steps: 1. Usage of previous partitioning 2. Sorting (w.r.t the number of nonzeros, in ascending order) 3. Internal order of indices

Slide 70

Slide 70 text

34 Computation of the priority vector v We used a structured approach for the construction of v: 30 different heuristics. Generating schemes with three steps: 1. Usage of previous partitioning partition-oblivious partition-aware 2. Sorting (w.r.t the number of nonzeros, in ascending order) 3. Internal order of indices

Slide 71

Slide 71 text

34 Computation of the priority vector v We used a structured approach for the construction of v: 30 different heuristics. Generating schemes with three steps: 1. Usage of previous partitioning 2. Sorting (w.r.t the number of nonzeros, in ascending order) sorted (with or without refinement) unsorted 3. Internal order of indices

Slide 72

Slide 72 text

34 Computation of the priority vector v We used a structured approach for the construction of v: 30 different heuristics. Generating schemes with three steps: 1. Usage of previous partitioning 2. Sorting (w.r.t the number of nonzeros, in ascending order) 3. Internal order of indices concatenation mixing (either alternation or spread) random (only when not sorting) simple (only when sorting)

Slide 73

Slide 73 text

35 Independent set formulation Partial assignment of rows and columns seems to be an interesting idea, but we want to reduce, as much as possible, the number of cut rows/columns. Goal: find the largest subset of {0, . . . , m + n − 1} which can be assigned completely (i.e. full rows and full columns) without causing communication. Graph theory approach: by translating the sparsity pattern of A into a suitable graph, we are looking for a maximum independent set.

Slide 74

Slide 74 text

36 Construction of the graph We construct the bipartite graph G = (L ∪ R, E) as follows: Rows and columns are vertices: L = {r0 , . . . , rm−1 }, R = {c0 , . . . , cn−1 } Edges correspond to nonzeros: e = (ri , cj ) ∈ E ⇐⇒ aij ≠ 0

Slide 75

Slide 75 text

37 Construction of the graph [Figure: the bipartite graph of the example matrix, with row vertices r0 , . . . , r8 on the left and column vertices c0 , . . . , c8 on the right.]

Slide 76

Slide 76 text

38 Maximum independent set Definition An independent set is a subset V′ ⊆ V such that ∀u, v ∈ V′, (u, v) ∉ E. A maximum independent set is an independent set of G with maximum cardinality. Our desired object is the maximum independent set. The complement of a (maximum) independent set is a (minimum) vertex cover.

Slide 77

Slide 77 text

39 Maximum independent set In general, computing a maximum independent set is as hard as partitioning the matrix (both are NP-hard problems). But, luckily, our graph is bipartite: Kőnig's theorem: on bipartite graphs, maximum matchings and minimum vertex covers have the same size; Hopcroft–Karp algorithm: an O(N √(m + n)) algorithm to compute a maximum matching on a bipartite graph. In our case it is not very demanding to compute a maximum independent set.
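A self-contained sketch of the whole chain on the bipartite graph of the previous slides: a simple augmenting-path matching (Kuhn's algorithm rather than Hopcroft–Karp, for brevity), Kőnig's construction of a minimum vertex cover from the matching, and its complement as a maximum independent set. All names here (adj, match_col, ...) are illustrative; adj[i] lists the columns with a nonzero in row i.

def maximum_independent_set(m, n, adj):
    # Maximum matching via augmenting paths.
    match_col = [-1] * n                       # column -> matched row, or -1

    def try_augment(i, seen):
        for j in adj[i]:
            if j not in seen:
                seen.add(j)
                if match_col[j] == -1 or try_augment(match_col[j], seen):
                    match_col[j] = i
                    return True
        return False

    for i in range(m):
        try_augment(i, set())
    match_row = [-1] * m                       # row -> matched column, or -1
    for j, i in enumerate(match_col):
        if i != -1:
            match_row[i] = j

    # Koenig's construction: Z = vertices reachable from unmatched rows by
    # alternating paths (non-matching edge row -> column, matching edge column -> row).
    Z_rows = {i for i in range(m) if match_row[i] == -1}
    Z_cols = set()
    frontier = list(Z_rows)
    while frontier:
        i = frontier.pop()
        for j in adj[i]:
            if j not in Z_cols and match_row[i] != j:
                Z_cols.add(j)
                i2 = match_col[j]
                if i2 != -1 and i2 not in Z_rows:
                    Z_rows.add(i2)
                    frontier.append(i2)
    # Minimum vertex cover = (rows not in Z) + (columns in Z);
    # the maximum independent set is its complement.
    return Z_rows, set(range(n)) - Z_cols

For the matrix itself, adj can be built directly from the nonzeros: start from adj = [[] for _ in range(m)] and do adj[i].append(j) for every nonzero a_ij.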

Slide 78

Slide 78 text

40 Maximum independent set Given a set of indices I, let SI denote the maximum independent set computed on the matrix A(I). One partition-oblivious heuristic to compute v: 1. let I = {0, . . . , m + n − 1}, then v := (SI , I \ SI ). For partition-aware heuristics, let U be the set of uncut indices and C the set of cut indices; we have three possibilities: 1. we compute SU and set v := (SU , U \ SU , C); 2. we compute SU and SC and set v := (SU , U \ SU , SC , C \ SC ); 3. we compute SU , define U′ := U \ SU and compute SC∪U′ , giving v := (SU , SC∪U′ , (C ∪ U′) \ SC∪U′ ).
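Continuing the sketch above, the first partition-oblivious heuristic v := (SI , I \ SI ) could be assembled as follows (priority_from_independent_set is a hypothetical helper, not a name from the thesis):

def priority_from_independent_set(m, n, ind_rows, ind_cols):
    # Independent rows/columns (full rows and columns assignable without
    # communication) come first; all remaining indices follow.
    first = sorted(ind_rows) + [m + j for j in sorted(ind_cols)]
    chosen = set(first)
    rest = [k for k in range(m + n) if k not in chosen]
    return first + rest

The resulting vector can be fed directly to the overpainting algorithm of slide 32.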

Slide 79

Slide 79 text

41 General framework for experiments
Require: sparse matrix A
Ensure: partitioning of the matrix A
Partition A with Mondriaan using the medium-grain method
for i = 1, . . . , itermax do
  Use any of the heuristics described previously to compute Ar and Ac
  Construct B from Ar and Ac
  Partition B with Mondriaan using the row-net model
  Reconstruct A with the new partitioning
end for
A single framework for both partition-oblivious and partition-aware heuristics.

Slide 80

Slide 80 text

42 Implementation All of the heuristics have been implemented following these steps: 1. MATLAB prototyping 2. Core C implementation (MATLAB compatibility through MEX files) 3. Full C implementation The Hopcroft-Karp algorithm for the maximum independent set computation was implemented in the Python programming language.

Slide 81

Slide 81 text

43 Implementation Randomness is involved during the computation of Ar and Ac and during the actual partitioning. To obtain meaningful results: 20 independent initial partitionings; for each, 5 independent runs of the heuristic and the subsequent partitioning (itermax = 1). 18 matrices used for the tests: rectangular vs. square; 10th DIMACS Implementation Challenge.

Slide 82

Slide 82 text

44 Preliminary selection Wide selection of heuristics, so a preliminary selection is necessary. 5 matrices used: dfl001; tbdlinux; nug30; rgg_n_2_18_s0; bcsstk30.

Slide 83

Slide 83 text

45 Preliminary selection Partition-oblivious heuristics (17 different algorithms): In general, results are much worse than the medium-grain method. Mixing rows and columns in the partial assignment is a bad idea. Individual assignment of nonzeros (po_localview) is the best strategy (7% worse than medium-grain). Maximum independent set computation (po_is) yields interesting results (16% lower communication volume on one matrix, but in general 12% worse than medium-grain). 2 heuristics selected for deeper investigation.

Slide 84

Slide 84 text

46 Preliminary selection Partition-aware heuristics (21 different algorithms): Results closer to medium-grain efficiency. The SBD and SBD2 forms are not worthwhile, nor is the individual assignment of nonzeros. Refinement in sorting does not yield a substantial difference. Unsorted concatenation of rows and columns (pa_row and pa_col) produces good results on rectangular matrices: they can be combined into a localbest heuristic (which tries both and picks the best). Maximum independent set strategies (pa_is_1 and pa_is_3) are very close to medium-grain, even better on a few matrices. 5 heuristics selected for deeper investigation.

Slide 85

Slide 85 text

47 Number of iterations [Plot: communication volume (normalized w.r.t. iteration 0) over iterations 0–10 for dfl001, for the heuristics pa_row, pa_col, pa_simple, pa_is_1, pa_is_3.] We are developing a fully iterative scheme: how many iterations do we have to execute? We run the 5 selected partition-aware heuristics for 10 consecutive iterations on each matrix.

Slide 86

Slide 86 text

47 Number of iterations [Plot: communication volume (normalized w.r.t. iteration 0) over iterations 0–10 for nug30, for the heuristics pa_row, pa_col, pa_simple, pa_is_1, pa_is_3.] We are developing a fully iterative scheme: how many iterations do we have to execute? We run the 5 selected partition-aware heuristics for 10 consecutive iterations on each matrix.

Slide 87

Slide 87 text

47 Number of iterations [Plot: communication volume (normalized w.r.t. iteration 0) over iterations 0–10 for bcsstk30, for the heuristics pa_row, pa_col, pa_simple, pa_is_1, pa_is_3.] We are developing a fully iterative scheme: how many iterations do we have to execute? We run the 5 selected partition-aware heuristics for 10 consecutive iterations on each matrix.

Slide 88

Slide 88 text

47 Number of iterations [Plot: communication volume (normalized w.r.t. iteration 0) over iterations 0–10 for tbdlinux, for the heuristics pa_row, pa_col, pa_simple, pa_is_1, pa_is_3.] We are developing a fully iterative scheme: how many iterations do we have to execute? We run the 5 selected partition-aware heuristics for 10 consecutive iterations on each matrix.

Slide 89

Slide 89 text

47 Number of iterations [Plot: communication volume (normalized w.r.t. iteration 0) over iterations 0–10 for delaunay_n15, for the heuristics pa_row, pa_col, pa_simple, pa_is_1, pa_is_3.] We are developing a fully iterative scheme: how many iterations do we have to execute? We run the 5 selected partition-aware heuristics for 10 consecutive iterations on each matrix.

Slide 90

Slide 90 text

47 Number of iterations Usually 1 iteration is enough to show improvements, if any. More iterations can worsen the communication volume. We are developing a fully iterative scheme: how many iterations do we have to execute? We run the 5 selected partition-aware heuristics for 10 consecutive iterations on each matrix.

Slide 91

Slide 91 text

48 Analysis of the performance of the best heuristics Partition-oblivious heuristics: No devised heuristic was able to improve on medium-grain. The preliminary results were confirmed: Individual assignment of nonzeros 7% worse than medium-grain Computing the maximum independent set 22% worse than medium-grain

Slide 92

Slide 92 text

49 Analysis of the performance of the best heuristics Partition-aware heuristics: Concatenation is an interesting strategy: pa_row and pa_col are 8% worse than medium-grain; the localbest method takes the best of both: only 4% worse than medium-grain. Similarly good results for the other strategies (between 4% and 8% higher communication volume than medium-grain). No algorithm was able to beat medium-grain. Considering only rectangular matrices, our methods work better: they improve on medium-grain, even if only by a little (1-2%).

Slide 93

Slide 93 text

50 Iterative refinement Is there something we can do to improve the results? Medium-grain employs a procedure of iterative refinement:
1. A is partitioned into two sets (A0 and A1)
2. we create the matrix B of the medium-grain method again (for example Ar = A0 and Ac = A1)
3. we retain the communication volume: the first n columns of B are assigned to a single processor, and similarly for the other m
4. we create the hypergraph from this B and a single run of Kernighan-Lin is performed
5. we repeat steps 1-4 until no improvement is found, then we swap the roles of A0 and A1 for the creation of Ar and Ac
6. we repeat step 5 until no other improvement is found

Slide 94

Slide 94 text

51 Iterative refinement Kernighan-Lin method is monotonically non-increasing: during iterative refinement, the communication volume is either lowered or remains at the same value. Example of the construction of B: A

Slide 95

Slide 95 text

51 Iterative refinement Kernighan-Lin method is monotonically non-increasing: during iterative refinement, the communication volume is either lowered or remains at the same value. Example of the construction of B: A Ac Ar

Slide 96

Slide 96 text

51 Iterative refinement Kernighan-Lin method is monotonically non-increasing: during iterative refinement, the communication volume is either lowered or remains at the same value. Example of the construction of B: A Ac Ar B


Slide 98

Slide 98 text

52 Iterative refinement With iterative refinement, results are in general better: partition-oblivious algorithms: po_localview still 7% worse than medium-grain; po_is now 6% worse than medium-grain (down from 22%). partition-aware algorithms: pa_row and pa_col now 2% and 1% worse than medium-grain (down from 8%); localbest now 1% better than medium-grain (down from 4% worse); pa_simple now 2% worse than medium-grain (down from 4%); pa_is_1 and pa_is_3 now 1% worse than medium-grain (down from 5% and 8%). Now, with rectangular matrices, computing the independent set produces an average communication volume 4% lower than medium-grain.

Slide 99

Slide 99 text

53 Conclusions We originally had two research directions: Improving the quality of the initial partitioning Developing a fully-iterative scheme

Slide 100

Slide 100 text

53 Conclusions We originally had two research directions: Improving the quality of the initial partitioning We were not able to outperform medium-grain Developing a fully-iterative scheme

Slide 101

Slide 101 text

53 Conclusions We originally had two research directions: Improving the quality of the initial partitioning Developing a fully-iterative scheme We were able to outperform medium-grain only by a small margin. Computing the independent set is worthwhile; the results for the concatenation of rows and columns can also be explained by it. Our approach works well with rectangular matrices.

Slide 102

Slide 102 text

54 Further research A number of possibilities for further development: Keep testing other strategies to gain more confidence in medium-grain. Maximum weighted independent set to maximize the number of nonzeros completely assigned. If our approach is confirmed to work consistently well with rectangular matrices, it could be added to Mondriaan: 1. the program detects that the matrix is strongly rectangular and asks the user for input; 2. the user decides whether to sacrifice computation time for a better partitioning and execute our approach.