# Parallel Matrix Multiplication

Discuss mapping the matrix multiplication algorithm from a linear network to a mesh network. Assume that the elements are already distributed through the network as needed, and that the algorithm is only concerned with calculating the matrix product, not with routing the input matrices into the network. Assume static allocation of the matrix elements at the memories of the network nodes.

What is stored statically, and where is it stored?

What would be the algorithm to calculate the matrix product with that data allocation?

What is the memory overhead?

What is the calculation overhead?

What is the communication overhead?

Traditional matrix multiply algorithm for:

A(N,M) X B(M,Z) = C(N,Z)

```
C = 0
For i=1 to N Do:
    For j=1 to Z Do:
        For k=1 to M Do:
            C(i,j) = C(i,j) + [ A(i,k) * B(k,j) ]
        CONTINUE
    CONTINUE
CONTINUE
```

Breaking this up into some functions:

```
Define Dot(Vector X, Vector Y):
    M = X.Length
    R = 0
    For k=1 to M Do:
        R = R + [ X(k) * Y(k) ]
    CONTINUE
    return R
```

```
Define VecMatProd(Vector X, Matrix Y):
    Z = Y.Width
    R = []
    For j=1 to Z Do:
        R(j) = Dot( X, Y(..., j) )
    CONTINUE
    return R
```

Then the algorithm becomes:

```
MMult(A, B):
    N = A.Height
    C = 0
    For i=1 to N Do:
        C(i, ...) = VecMatProd( A(i, ...), B )
    CONTINUE
    return C
```
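These pseudocode functions can be written as a runnable sketch in plain Python (0-indexed lists in place of the 1-indexed notation above):

```python
def dot(x, y):
    # Dot: sum of element-wise products of two equal-length vectors.
    r = 0
    for k in range(len(x)):
        r += x[k] * y[k]
    return r

def vec_mat_prod(x, y):
    # VecMatProd: row vector x times matrix y, one dot product per column of y.
    z = len(y[0])
    return [dot(x, [row[j] for row in y]) for j in range(z)]

def m_mult(a, b):
    # MMult: one VecMatProd per row of a.
    return [vec_mat_prod(row, b) for row in a]

A = [[1, 2], [3, 4], [5, 6]]   # 3 x 2
B = [[1, 0, 2], [0, 1, 3]]     # 2 x 3
print(m_mult(A, B))            # -> [[1, 2, 8], [3, 4, 18], [5, 6, 28]]
```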

On a single processor, executing this algorithm takes N*Z*M multiply-accumulate steps, i.e. O(N*Z*M) primitive (add/mul) operations.

Suppose we have N processors (indexed i=1 to N), each with its own private memory, interconnected in a linear array network with wrap-around. We discussed 3 ways to parallelize the matrix multiplication: **global knowledge**, **storage by rows of A and columns of B**, and **storage by rows of A, B, and C**. Define a ‘stage’ of execution as the period in which all processors in the network fully receive all information inbound from their neighbors, perform a calculation, and send any information they need to send to their neighbors.

**Global Knowledge:**

In global knowledge, all N processors keep local copies of A and B, and each produces one row of the result matrix C. Aside from communicating the initial problem information and the end result, there is no communication between the processors.

Processor i will calculate row i of result matrix C.

At stage j (j ranges 1 to Z), processor i takes the dot product of the i’th row of A with the j’th column of B, producing the value C(i,j). Each dot product takes O(M) instructions to compute, and each processor calculates Z such dot products (forming one row of the result matrix C). At the end of Z stages each of the N processors will have calculated 1 row of C, and the result matrix C is complete.
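A minimal simulation of the global-knowledge scheme (plain Python, 0-indexed; each “processor” is modeled as an independent function call over its private copies of A and B, with no communication between calls):

```python
def processor_row(i, a, b):
    # Processor i holds full copies of A and B and computes row i of C:
    # Z dot products, each O(M) adds/muls, with no inter-processor traffic.
    m, z = len(b), len(b[0])
    return [sum(a[i][k] * b[k][j] for k in range(m)) for j in range(z)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [processor_row(i, A, B) for i in range(len(A))]  # one call per processor
print(C)  # -> [[19, 22], [43, 50]]
```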

**Rows of A, Columns of B:**

In storage by rows of A and columns of B, each processor i permanently keeps a copy of row i of matrix A in local memory. At the beginning of the first stage, processor i also contains column i of matrix B. Suppose that at stage j (j ranges 1 to Z) processor i has a copy of column k of matrix B stored locally. Processor i takes the dot product of row i of A and column k of B, calculating the scalar C(i,k) in O(M) adds/muls; it then transmits the index k and the k’th column vector of B to its right neighbor, while receiving a column vector (and its index) from its left neighbor. After Z such stages, each of the Z terms of the i’th row of matrix C has been calculated at processor i, and therefore the result matrix C has been calculated by the network.
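The column-rotation stages can be simulated as follows (a sketch in plain Python, 0-indexed, assuming Z = N so that every processor starts with exactly one column of B; the list rotation stands in for the simultaneous neighbor-to-neighbor transfers):

```python
def linear_rows_cols(a, b):
    # Simulate N ring-connected processors; assumes Z == N.
    n, m, z = len(a), len(b), len(b[0])
    assert z == n, "this sketch assumes as many processors as columns of B"
    # Processor i permanently stores row i of A; held[i] is the
    # (index, column-of-B) pair it currently holds.
    held = [(j, [b[k][j] for k in range(m)]) for j in range(n)]
    c = [[0] * z for _ in range(n)]
    for stage in range(z):
        for i in range(n):  # conceptually all processors act in parallel
            k, col = held[i]
            c[i][k] = sum(a[i][q] * col[q] for q in range(m))  # C(i,k), O(M) ops
        # every processor sends its column right and receives from the left
        held = held[-1:] + held[:-1]
    return c

print(linear_rows_cols([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # -> [[19, 22], [43, 50]]
```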

**Rows of A, Rows of B:**

In storage by rows of A and rows of B, each processor i permanently keeps a copy of row i of matrix A in local memory, and calculates row i of the result matrix C (initialized to all 0’s). At the beginning of the first stage, processor i also contains row i of matrix B. Suppose that at stage j (j ranges 1 to M) processor i has a copy of row k of matrix B stored locally. Processor i takes the **scalar** product with the k’th element of row i of A as the scalar and row k of B as the vector, then adds the result vector to the stored vector of partial sums for row i of the result matrix C ( C(i,q) += A(i,k)*B(k,q) for q=1 to Z ). Each such stage takes O(Z) add/mul operations per processor. At the end of each stage processor i transmits the index k and row k of B to its right neighbor, while simultaneously receiving row q of matrix B (and its index q) from its left neighbor. After M such stages, each of the Z terms of the i’th row of matrix C has been calculated at processor i, and therefore the result matrix C has been calculated by the network.
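The row-rotation variant can be simulated the same way (a sketch in plain Python, 0-indexed, assuming M = N so that every processor starts with exactly one row of B):

```python
def linear_rows_rows(a, b):
    # Simulate N ring-connected processors; assumes M == N.
    n, m, z = len(a), len(b), len(b[0])
    assert m == n, "this sketch assumes as many processors as rows of B"
    # Processor i permanently stores row i of A and accumulates row i of C;
    # held[i] is the (index, row-of-B) pair it currently holds.
    held = [(k, list(b[k])) for k in range(n)]
    c = [[0] * z for _ in range(n)]
    for stage in range(m):
        for i in range(n):  # conceptually all processors act in parallel
            k, row = held[i]
            for q in range(z):                # O(Z) multiply-accumulates
                c[i][q] += a[i][k] * row[q]   # C(i,q) += A(i,k) * B(k,q)
        # every processor sends its row of B right and receives from the left
        held = held[-1:] + held[:-1]
    return c

print(linear_rows_rows([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # -> [[19, 22], [43, 50]]
```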

**Extending These Algorithms to the Mesh**

Suppose we have N*Z processors (indexed <x=1,y=1> to <N,Z>), each with its own private memory and interconnected in a mesh network with wrap-around.

**Global Knowledge:**

Each processor with some index <x,y> will calculate the dot product of the x’th row of A and the y’th column of B. This calculation takes O(M) mul/add instructions. The processor then stores the result as element C(x,y) of the result matrix C. After 1 such calculation processor <x,y> will contain the value of element C(x,y), and all that remains is to collect and return matrix C from the mesh.

**Rows of A, Columns of B:**

The parallel algorithm for the mesh in this case is almost identical to the algorithm with global knowledge. Each processor with some index <x,y> will calculate the dot product of the x’th row of A and the y’th column of B, in O(M) mul/add instructions, and store the result as element C(x,y) of the result matrix C. After 1 such calculation processor <x,y> will contain the value of element C(x,y), and all that remains is to collect and return matrix C from the mesh. However, we can observe that we don’t need to store the full matrices at each node: we store only the x’th row of A and the y’th column of B at the processor with index <x,y>.
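A sketch of the mesh calculation in plain Python (one list-comprehension entry per processor <x,y>; with storage by rows of A and columns of B, each entry reads only the data stored at that processor):

```python
def mesh_dot(a, b):
    # One processor per element of C: processor <x,y> stores row x of A and
    # column y of B, and computes C(x,y) with a single O(M) dot product.
    n, m, z = len(a), len(b), len(b[0])
    return [[sum(a[x][k] * b[k][y] for k in range(m)) for y in range(z)]
            for x in range(n)]

print(mesh_dot([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # -> [[19, 22], [43, 50]]
```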

**Rows of A, Rows of B:**

Before explaining how to adapt this algorithm to the torus, I want to make some observations. It will be difficult to get a performance boost from a network modeled on the structure of result matrix C using this algorithm if no constraints are put on the sizes of A and B. Suppose A is a 1 X M matrix and B is an M X 1 matrix: 2M-1 primitive (add/mul) operations are needed to calculate the result matrix C, but since N=1 and Z=1, our network is only 1 processor ! If we are parallelizing the multiply-accumulate (rows of A, rows of B) method we need a network whose size is related to the number of multiply-accumulate operations that need to be done. For this reason I make the assumption that N=Z=M (A and B are square matrices).

Each processor with some index <x,y> will contain the C(y,x) element of the result matrix C. Initially each processor <x,y> will contain the values A(y,k) and B(k,x), where k = N+2-x-y if x+y <= N+1 and k = 2N+2-x-y otherwise (equivalently, k = ((N+1-x-y) mod N) + 1, so that k always lies in the range 1 to N). At each stage processor <x,y> performs the calculation C(y,x) = C(y,x) + A(y,k)*B(k,x). After this calculation processor <x,y> sends the value A(y,k) to the processor below it, and B(k,x) to the processor to its right. After N=M such stages the algorithm completes, and the N X N result matrix C has been calculated.
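This staged computation can be simulated in plain Python (0-indexed; the sketch assumes the initial index is k = ((N+1-x-y) mod N)+1, i.e. whichever of N+2-x-y and 2N+2-x-y falls in the range 1..N, and uses whole-grid shifts with wrap-around in place of the per-neighbor transfers; the first processor coordinate is taken as the mesh row):

```python
def torus_mmult(a, b):
    # Simulate an N x N torus; processor <x,y> accumulates C(y,x).
    # 0-indexed initial placement: <x,y> holds A[y][k] and B[k][x]
    # with k = (N-1-x-y) mod N (0-indexed form of ((N+1-x-y) mod N)+1).
    n = len(a)
    ag = [[a[y][(n - 1 - x - y) % n] for y in range(n)] for x in range(n)]
    bg = [[b[(n - 1 - x - y) % n][x] for y in range(n)] for x in range(n)]
    c = [[0] * n for _ in range(n)]
    for stage in range(n):
        for x in range(n):
            for y in range(n):
                c[y][x] += ag[x][y] * bg[x][y]   # C(y,x) += A(y,k) * B(k,x)
        # A values move down the mesh, B values move right, with wrap-around
        ag = [ag[-1]] + ag[:-1]
        bg = [[bg[x][(y - 1) % n] for y in range(n)] for x in range(n)]
    return c

print(torus_mmult([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # -> [[19, 22], [43, 50]]
```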

This is the example solution of Thomas Leighton, “Introduction to Parallel Algorithms and Architectures” (pp. 64-65).

Using Leighton’s solution as a template, I came up with a slightly different alternative that has the advantage that at the end of the algorithm processor <x,y> will contain element C(x,y) of the result matrix (rather than C(y,x) as in Leighton’s algorithm).

Again starting with an N X N matrix of processors, define the initial data contents of the processors as follows:

processor <x,y> will contain element A( x, ((x+y-2) mod N)+1 ) from matrix A, and element B( ((x+y-2) mod N)+1, y ) from matrix B.

Suppose that at some k’th stage of the algorithm processor <x,y> contains element A(g,h) of A and element B(s,w) of B. Processor <x,y> performs the calculation C(x,y) = C(x,y) + A(g,h) * B(s,w). After completing this calculation processor <x,y> sends A(g,h) to its east neighbor and B(s,w) to its south neighbor. After N such stages each processor <x,y> will contain the value C(x,y) of result matrix C.
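This variant can be simulated the same way (a sketch in plain Python, 0-indexed: processor <x,y> starts with k = (x+y) mod N, the 0-indexed form of ((x+y-2) mod N)+1, sends its A value east and its B value south, and accumulates C(x,y) in place):

```python
def torus_mmult_alt(a, b):
    # Simulate an N x N torus; processor <x,y> accumulates C(x,y).
    # 0-indexed initial placement: <x,y> holds A[x][k] and B[k][y]
    # with k = (x + y) % N.
    n = len(a)
    ag = [[a[x][(x + y) % n] for y in range(n)] for x in range(n)]
    bg = [[b[(x + y) % n][y] for y in range(n)] for x in range(n)]
    c = [[0] * n for _ in range(n)]
    for stage in range(n):
        for x in range(n):
            for y in range(n):
                c[x][y] += ag[x][y] * bg[x][y]   # C(x,y) += A(x,k) * B(k,y)
        # A values move east along mesh rows, B values move south, wrap-around
        ag = [[ag[x][(y - 1) % n] for y in range(n)] for x in range(n)]
        bg = [bg[-1]] + bg[:-1]
    return c

print(torus_mmult_alt([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # -> [[19, 22], [43, 50]]
```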

In general I noticed that several similar algorithms could be devised with the only difference being the starting locations of the different elements of A and B.

The only requirement is that at each stage each processor calculates some result C(i,j) = C(i,j) + A(i,k)*B(k,j), and that throughout the algorithm’s execution each processor accumulates one fixed C(i,j) while receiving the pairs A(i,k), B(k,j) for all k from 1 to N. Each k is served by one stage of the algorithm, and so in N stages every C(i,j), for all i,j from 1 to N, will be calculated. I noted that if different processors calculated results for different C(i,j) values at different stages there would be both significant communication overhead and a significantly more complicated algorithm needed to route all the partial sums to the correct nodes in the network.

All such algorithms will have the same overhead values.

**If in 1 dimension we can eliminate 1 loop from the classic matrix multiplication algorithm, can we eliminate 2 loops in 2D ?**

Yes, each of the above algorithms eliminates 2 loops.

**Can we eliminate n loops in n dimensions ?**

No, there will be a communication and data-storage bottleneck. For example, in the last multiply-add-shift network, even if we added another dimension of processors we could not do the operation in 1 instruction on the network, because of the bottleneck of summing all the terms calculated in a single network instruction. Each group of N terms needs to be summed together and stored in the same memory location, and the same memory location cannot correctly take part in N additions simultaneously; this is a natural limit on the parallelization.

It seems that for each problem there is some degree of parallelization at which virtually all the work is done by the communication in the network, and past which adding another dimension of processing power is unhelpful.

[Figure: An example of another method for processing a multiplication on a linear array.]

[Figure: An example of the by-rows-and-columns method for multiplication on a linear array of size 2.]