### 8.3 Solving a System of Linear Equations

This section discusses the problem of solving a system of linear equations of the form

In matrix notation, this system is written as Ax = b. Here A is a dense n x n matrix of coefficients such that A[i, j] = ai,j, b is an n x 1 vector [b0, b1, ..., bn-1]T, and x is the desired solution vector [x0, x1, ..., xn-1]T. We will make all subsequent references to ai,j by A[i, j] and xi by x [i].

A system of equations Ax = b is usually solved in two stages. First, through a series of algebraic manipulations, the original system of equations is reduced to an upper-triangular system of the form

We write this as Ux = y, where U is a unit upper-triangular matrix - one in which all subdiagonal entries are zero and all principal diagonal entries are equal to one. Formally, U[i, j] = 0 if i > j, otherwise U[i, j] = ui,j. Furthermore, U[i, i] = 1 for 0 i < n. In the second stage of solving a system of linear equations, the upper-triangular system is solved for the variables in reverse order from x[n - 1] to x[0] by a procedure known as back-substitution (Section 8.3.3).

We discuss parallel formulations of the classical Gaussian elimination method for upper-triangularization in Sections 8.3.1 and 8.3.2. In Section 8.3.1, we describe a straightforward Gaussian elimination algorithm assuming that the coefficient matrix is nonsingular, and its rows and columns are permuted in a way that the algorithm is numerically stable. Section 8.3.2 discusses the case in which a numerically stable solution of the system of equations requires permuting the columns of the matrix during the execution of the Gaussian elimination algorithm.

Although we discuss Gaussian elimination in the context of upper-triangularization, a similar procedure can be used to factorize matrix A as the product of a lower-triangular matrix L and a unit upper-triangular matrix U so that A = L x U. This factorization is commonly referred to as LU factorization. Performing LU factorization (rather than upper-triangularization) is particularly useful if multiple systems of equations with the same left-hand side Ax need to be solved. Algorithm 3.3 gives a procedure for column-oriented LU factorization.

#### 8.3.1 A Simple Gaussian Elimination Algorithm

The serial Gaussian elimination algorithm has three nested loops. Several variations of the algorithm exist, depending on the order in which the loops are arranged. Algorithm 8.4 shows one variation of Gaussian elimination, which we will adopt for parallel implementation in the remainder of this section. This program converts a system of linear equations Ax = b to a unit upper-triangular system Ux = y. We assume that the matrix U shares storage with A and overwrites the upper-triangular portion of A. The element A[k, j] computed on line 6 of Algorithm 8.4 is actually U [k, j]. Similarly, the element A[k, k] equated to 1 on line 8 is U [k, k]. Algorithm 8.4 assumes that A[k, k] 0 when it is used as a divisor on lines 6 and 7.

##### Algorithm 8.4 A serial Gaussian elimination algorithm that converts the system of linear equations Ax = b to a unit upper-triangular system Ux = y. The matrix U occupies the upper-triangular locations of A. This algorithm assumes that A[k, k] 0 when it is used as a divisor on lines 6 and 7.
```1.   procedure GAUSSIAN_ELIMINATION (A, b, y)
2.   begin
3.      for k := 0 to n - 1 do           /* Outer loop */
4.      begin
5.         for j := k + 1 to n - 1 do
6.             A[k, j] := A[k, j]/A[k, k];  /* Division step */
7.         y[k] := b[k]/A[k, k];
8.         A[k, k] := 1;
9.         for i := k + 1 to n - 1 do
10.        begin
11.           for j := k + 1 to n - 1 do
12.               A[i, j] := A[i, j] - A[i, k] x A[k, j]; /* Elimination step */
13.           b[i] := b[i] - A[i, k] x y[k];
14.           A[i, k] := 0;
15.        endfor;          /* Line 9 */
16.     endfor;             /* Line 3 */
17.  end GAUSSIAN_ELIMINATION ```

In this section, we will concentrate only on the operations on matrix A in Algorithm 8.4. The operations on vector b on lines 7 and 13 of the program are straightforward to implement. Hence, in the rest of the section, we will ignore these steps. If the steps on lines 7, 8, 13, and 14 are not performed, then Algorithm 8.4 leads to the LU factorization of A as a product L x U. After the termination of the procedure, L is stored in the lower-triangular part of A, and U occupies the locations above the principal diagonal.

For k varying from 0 to n - 1, the Gaussian elimination procedure systematically eliminates variable x[k] from equations k + 1 to n - 1 so that the matrix of coefficients becomes upper-triangular. As shown in Algorithm 8.4, in the kth iteration of the outer loop (starting on line 3), an appropriate multiple of the kth equation is subtracted from each of the equations k + 1 to n - 1 (loop starting on line 9). The multiples of the kth equation (or the kth row of matrix A) are chosen such that the kth coefficient becomes zero in equations k + 1 to n - 1 eliminating x [k] from these equations. A typical computation of the Gaussian elimination procedure in the kth iteration of the outer loop is shown in Figure 8.5. The kth iteration of the outer loop does not involve any computation on rows 1 to k - 1 or columns 1 to k - 1. Thus, at this stage, only the lower-right (n - k) x (n - k) submatrix of A (the shaded portion in Figure 8.5) is computationally active.

##### Figure 8.5. A typical computation in Gaussian elimination.

Gaussian elimination involves approximately n2/2 divisions (line 6) and approximately (n3/3) - (n2/2) subtractions and multiplications (line 12). In this section, we assume that each scalar arithmetic operation takes unit time. With this assumption, the sequential run time of the procedure is approximately 2n3/3 (for large n); that is,

Equation 8.17

##### Parallel Implementation with 1-D Partitioning

We now consider a parallel implementation of Algorithm 8.4, in which the coefficient matrix is rowwise 1-D partitioned among the processes. A parallel implementation of this algorithm with columnwise 1-D partitioning is very similar, and its details can be worked out based on the implementation using rowwise 1-D partitioning (Problems 8.8 and 8.9).

We first consider the case in which one row is assigned to each process, and the n x n coefficient matrix A is partitioned along the rows among n processes labeled from P0 to Pn-1. In this mapping, process Pi initially stores elements A[i, j] for 0 j < n. Figure 8.6 illustrates this mapping of the matrix onto the processes for n = 8. The figure also illustrates the computation and communication that take place in the iteration of the outer loop when k = 3.

##### Figure 8.6. Gaussian elimination steps during the iteration corresponding to k = 3 for an 8 x 8 matrix partitioned rowwise among eight processes.

Algorithm 8.4 and Figure 8.5 show that A[k, k + 1], A[k, k + 2], ..., A[k, n - 1] are divided by A[k, k] (line 6) at the beginning of the kth iteration. All matrix elements participating in this operation (shown by the shaded portion of the matrix in Figure 8.6(a)) belong to the same process. So this step does not require any communication. In the second computation step of the algorithm (the elimination step of line 12), the modified (after division) elements of the kth row are used by all other rows of the active part of the matrix. As Figure 8.6(b) shows, this requires a one-to-all broadcast of the active part of the kth row to the processes storing rows k + 1 to n - 1. Finally, the computation A[i, j] := A[i, j] - A[i, k] x A[k, j] takes place in the remaining active portion of the matrix, which is shown shaded in Figure 8.6(c).

The computation step corresponding to Figure 8.6(a) in the kth iteration requires n - k - 1 divisions at process Pk. Similarly, the computation step of Figure 8.6(c) involves n - k - 1 multiplications and subtractions in the kth iteration at all processes Pi , such that k < i < n. Assuming a single arithmetic operation takes unit time, the total time spent in computation in the kth iteration is 3(n - k - 1). Note that when Pk is performing the divisions, the remaining p - 1 processes are idle, and while processes Pk + 1, ..., Pn-1 are performing the elimination step, processes P0, ..., Pk are idle. Thus, the total time spent during the computation steps shown in Figures 8.6(a) and (c) in this parallel implementation of Gaussian elimination is , which is equal to 3n(n - 1)/2.

The communication step of Figure 8.6(b) takes time (ts +tw(n -k -1)) log n (Table 4.1). Hence, the total communication time over all iterations is log n, which is equal to ts n log n + tw(n(n - 1)/2) log n. The overall parallel run time of this algorithm is

Equation 8.18

Since the number of processes is n, the cost, or the process-time product, is Q(n3 log n) due to the term associated with tw in Equation 8.18. This cost is asymptotically higher than the sequential run time of this algorithm (Equation 8.17). Hence, this parallel implementation is not cost-optimal.

Pipelined Communication and Computation We now present a parallel implementation of Gaussian elimination that is cost-optimal on n processes.

In the parallel Gaussian elimination algorithm just presented, the n iterations of the outer loop of Algorithm 8.4 execute sequentially. At any given time, all processes work on the same iteration. The (k + 1)th iteration starts only after all the computation and communication for the kth iteration is complete. The performance of the algorithm can be improved substantially if the processes work asynchronously; that is, no process waits for the others to finish an iteration before starting the next one. We call this the asynchronous or pipelined version of Gaussian elimination. Figure 8.7 illustrates the pipelined Algorithm 8.4 for a 5 x 5 matrix partitioned along the rows onto a logical linear array of five processes.

##### Figure 8.7. Pipelined Gaussian elimination on a 5 x 5 matrix partitioned with one row per process.

During the kth iteration of Algorithm 8.4, process Pk broadcasts part of the kth row of the matrix to processes Pk+1, ..., Pn-1 (Figure 8.6(b)). Assuming that the processes form a logical linear array, and Pk+1 is the first process to receive the kth row from process Pk. Then process Pk+1 must forward this data to Pk+2. However, after forwarding the kth row to Pk+2, process Pk+1 need not wait to perform the elimination step (line 12) until all the processes up to Pn-1 have received the kth row. Similarly, Pk+2 can start its computation as soon as it has forwarded the kth row to Pk+3, and so on. Meanwhile, after completing the computation for the kth iteration, Pk+1 can perform the division step (line 6), and start the broadcast of the (k + 1)th row by sending it to Pk+2.

In pipelined Gaussian elimination, each process independently performs the following sequence of actions repeatedly until all n iterations are complete. For the sake of simplicity, we assume that steps (1) and (2) take the same amount of time (this assumption does not affect the analysis):

1. If a process has any data destined for other processes, it sends those data to the appropriate process.

2. If the process can perform some computation using the data it has, it does so.

3. Otherwise, the process waits to receive data to be used for one of the above actions.

Figure 8.7 shows the 16 steps in the pipelined parallel execution of Gaussian elimination for a 5 x 5 matrix partitioned along the rows among five processes. As Figure 8.7(a) shows, the first step is to perform the division on row 0 at process P0. The modified row 0 is then sent to P1 (Figure 8.7(b)), which forwards it to P2 (Figure 8.7(c)). Now P1 is free to perform the elimination step using row 0 (Figure 8.7(d)). In the next step (Figure 8.7(e)), P2 performs the elimination step using row 0. In the same step, P1, having finished its computation for iteration 0, starts the division step of iteration 1. At any given time, different stages of the same iteration can be active on different processes. For instance, in Figure 8.7(h), process P2 performs the elimination step of iteration 1 while processes P3 and P4 are engaged in communication for the same iteration. Furthermore, more than one iteration may be active simultaneously on different processes. For instance, in Figure 8.7(i), process P2 is performing the division step of iteration 2 while process P3 is performing the elimination step of iteration 1.

We now show that, unlike the synchronous algorithm in which all processes work on the same iteration at a time, the pipelined or the asynchronous version of Gaussian elimination is cost-optimal. As Figure 8.7 shows, the initiation of consecutive iterations of the outer loop of Algorithm 8.4 is separated by a constant number of steps. A total of n such iterations are initiated. The last iteration modifies only the bottom-right corner element of the coefficient matrix; hence, it completes in a constant time after its initiation. Thus, the total number of steps in the entire pipelined procedure is Q(n) (Problem 8.7). In any step, either O(n) elements are communicated between directly-connected processes, or a division step is performed on O(n) elements of a row, or an elimination step is performed on O(n) elements of a row. Each of these operations take O(n) time. Hence, the entire procedure consists of Q(n) steps of O(n) complexity each, and its parallel run time is O(n2). Since n processes are used, the cost is O(n3), which is of the same order as the sequential complexity of Gaussian elimination. Hence, the pipelined version of parallel Gaussian elimination with 1-D partitioning of the coefficient matrix is cost-optimal.

Block 1-D Partitioning with Fewer than n Processes The preceding pipelined implementation of parallel Gaussian elimination can be easily adapted for the case in which n > p. Consider an n x n matrix partitIoned among p processes (p < n) such that each process is assigned n/p contiguous rows of the matrix. Figure 8.8 illustrates the communication steps in a typical iteration of Gaussian elimination with such a mapping. As the figure shows, the kth iteration of the algorithm requires that the active part of the kth row be sent to the processes storing rows k + 1, k + 2, ..., n - 1.

##### Figure 8.8. The communication in the Gaussian elimination iteration corresponding to k = 3 for an 8 x 8 matrix distributed among four processes using block 1-D partitioning.

Figure 8.9(a) shows that, with block 1-D partitioning, a process with all rows belonging to the active part of the matrix performs (n - k - 1)n/p multiplications and subtractions during the elimination step of the kth iteration. Note that in the last (n/p) - 1 iterations, no process has all active rows, but we ignore this anomaly. If the pipelined version of the algorithm is used, then the number of arithmetic operations on a maximally-loaded process in the kth iteration (2(n - k - 1)n/p) is much higher than the number of words communicated (n - k - 1) by a process in the same iteration. Thus, for sufficiently large values of n with respect to p, computation dominates communication in each iteration. Assuming that each scalar multiplication and subtraction pair takes unit time, the total parallel run time of this algorithm (ignoring communication overhead) is , which is approximately equal to n3/p.

##### Figure 8.9. Computation load on different processes in block and cyclic 1-D partitioning of an 8 x 8 matrix on four processes during the Gaussian elimination iteration corresponding to k = 3.

The process-time product of this algorithm is n3, even if the communication costs are ignored. Thus, the cost of the parallel algorithm is higher than the sequential run time (Equation 8.17) by a factor of 3/2. This inefficiency of Gaussian elimination with block 1-D partitioning is due to process idling resulting from an uneven load distribution. As Figure 8.9(a) shows for an 8 x 8 matrix and four processes, during the iteration corresponding to k = 3 (in the outer loop of Algorithm 8.4), one process is completely idle, one is partially loaded, and only two processes are fully active. By the time half of the iterations of the outer loop are over, only half the processes are active. The remaining idle processes make the parallel algorithm costlier than the sequential algorithm.

This problem can be alleviated if the matrix is partitioned among the processes using cyclic 1-D mapping as shown in Figure 8.9(b). With the cyclic 1-D partitioning, the difference between the computational loads of a maximally loaded process and the least loaded process in any iteration is of at most one row (that is, O(n) arithmetic operations). Since there are n iterations, the cumulative overhead due to process idling is only O(n2 p) with a cyclic mapping, compared to Q(n3) with a block mapping (Problem 8.12).

##### Parallel Implementation with 2-D Partitioning

We now describe a parallel implementation of Algorithm 8.4 in which the n x n matrix A is mapped onto an n x n mesh of processes such that process Pi,j initially stores A[i, j]. The communication and computation steps in the iteration of the outer loop corresponding to k = 3 are illustrated in Figure 8.10 for n = 8. Algorithm 8.4 and Figures 8.5 and 8.10 show that in the kth iteration of the outer loop, A[k, k] is required by processes Pk,k+1, Pk,k+2, ..., Pk,n-1 to divide A[k, k + 1], A[k, k + 2], ..., A[k, n - 1], respectively. After the division on line 6, the modified elements of the kth row are used to perform the elimination step by all the other rows in the active part of the matrix. The modified (after the division on line 6) elements of the kth row are used by all other rows of the active part of the matrix. Similarly, the elements of the kth column are used by all other columns of the active part of the matrix for the elimination step. As Figure 8.10 shows, the communication in the kth iteration requires a one-to-all broadcast of A[i, k] along the i th row (Figure 8.10(a)) for k i < n, and a one-to-all broadcast of A[k, j] along the j th column (Figure 8.10(c)) for k < j < n. Just like the 1-D partitioning case, a non-cost-optimal parallel formulation results if these broadcasts are performed synchronously on all processes (Problem 8.11).

##### Figure 8.10. Various steps in the Gaussian elimination iteration corresponding to k = 3 for an 8 x 8 matrix on 64 processes arranged in a logical two-dimensional mesh.

Pipelined Communication and Computation Based on our experience with Gaussian elimination using 1-D partitioning of the coefficient matrix, we develop a pipelined version of the algorithm using 2-D partitioning.

As Figure 8.10 shows, in the kth iteration of the outer loop (lines 3-16 of Algorithm 8.4), A[k, k] is sent to the right from Pk,k to Pk,k+1 to Pk,k+2, and so on, until it reaches Pk,n-1. Process Pk,k+1 performs the division A[k, k + 1]/A[k, k] as soon as it receives A[k, k] from Pk,k. It does not have to wait for A[k, k] to reach all the way up to Pk,n-1 before performing its local computation. Similarly, any subsequent process Pk,j of the kth row can perform its division as soon as it receives A[k, k]. After performing the division, A[k, j] is ready to be communicated downward in the j th column. As A[k, j] moves down, each process it passes is free to use it for computation. Processes in the j th column need not wait until A[k, j] reaches the last process of the column. Thus, Pi,j performs the elimination step A[i, j] := A[i, j] - A[i, k] x A[k, j] as soon as A[i, k] and A[k, j] are available. Since some processes perform the computation for a given iteration earlier than other processes, they start working on subsequent iterations sooner.

The communication and computation can be pipelined in several ways. We present one such scheme in Figure 8.11. In Figure 8.11(a), the iteration of the outer loop for k = 0 starts at process P0,0, when P0,0 sends A[0, 0] to P0,1. Upon receiving A[0, 0], P0,1 computes A[0, 1] := A[0, 1]/A[0, 0] (Figure 8.11(b)). Now P0,1 forwards A[0, 0] to P0,2 and also sends the updated A[0, 1] down to P1,1 (Figure 8.11(c)). At the same time, P1,0 sends A[1, 0] to P1,1. Having received A[0, 1] and A[1, 0], P1,1 performs the elimination step A[1, 1] := A[1, 1] - A[1, 0] x A[0, 1], and having received A[0, 0], P0,2 performs the division step A[0, 2] := A[0, 2]/A[0, 0] (Figure 8.11(d)). After this computation step, another set of processes (that is, processes P0,2, P1,1, and P2,0) is ready to initiate communication (Figure 8.11(e)).

##### Figure 8.11. Pipelined Gaussian elimination for a 5 x 5 matrix with 25 processes.

All processes performing communication or computation during a particular iteration lie along a diagonal in the bottom-left to top-right direction (for example, P0,2, P1,1, and P2,0 performing communication in Figure 8.11(e) and P0,3, P1,2, and P2,1 performing computation in Figure 8.11(f)). As the parallel algorithm progresses, this diagonal moves toward the bottom-right corner of the logical 2-D mesh. Thus, the computation and communication for each iteration moves through the mesh from top-left to bottom-right as a "front." After the front corresponding to a certain iteration passes through a process, the process is free to perform subsequent iterations. For instance, in Figure 8.11(g), after the front for k = 0 has passed P1,1, it initiates the iteration for k = 1 by sending A[1, 1] to P1,2. This initiates a front for k = 1, which closely follows the front for k = 0. Similarly, a third front for k = 2 starts at P2,2 (Figure 8.11(m)). Thus, multiple fronts that correspond to different iterations are active simultaneously.

Every step of an iteration, such as division, elimination, or transmitting a value to a neighboring process, is a constant-time operation. Therefore, a front moves a single step closer to the bottom-right corner of the matrix in constant time (equivalent to two steps of Figure 8.11). The front for k = 0 takes time Q(n) to reach Pn-1,n-1 after its initiation at P0,0. The algorithm initiates n fronts for the n iterations of the outer loop. Each front lags behind the previous one by a single step. Thus, the last front passes the bottom-right corner of the matrix Q(n) steps after the first one. The total time elapsed between the first front starting at P0,0 and the last one finishing is Q(n). The procedure is complete after the last front passes the bottom-right corner of the matrix; hence, the total parallel run time is Q(n). Since n2 process are used, the cost of the pipelined version of Gaussian elimination is Q(n3), which is the same as the sequential run time of the algorithm. Hence, the pipelined version of Gaussian elimination with 2-D partitioning is cost-optimal.

2-D Partitioning with Fewer than n2 Processes Consider the case in which p processes are used so that p < n2 and the matrix is mapped onto a mesh by using block 2-D partitioning. Figure 8.12 illustrates that a typical parallel Gaussian iteration involves a rowwise and a columnwise communication of values. Figure 8.13(a) illustrates the load distribution in block 2-D mapping for n = 8 and p = 16.

##### Figure 8.13. Computational load on different processes in block and cyclic 2-D mappings of an 8 x 8 matrix onto 16 processes during the Gaussian elimination iteration corresponding to k = 3.

Figures 8.12 and 8.13(a) show that a process containing a completely active part of the matrix performs n2/p multiplications and subtractions, and communicates words along its row and its column (ignoring the fact that in the last iterations, the active part of the matrix becomes smaller than the size of a block, and no process contains a completely active part of the matrix). If the pipelined version of the algorithm is used, the number of arithmetic operations per process (2n2/p) is an order of magnitude higher than the number of words communicated per process () in each iteration. Thus, for sufficiently large values of n2 with respect to p, the communication in each iteration is dominated by computation. Ignoring the communication cost and assuming that each scalar arithmetic operation takes unit time, the total parallel run time of this algorithm is (2n2/p) x n, which is equal to 2n3/p. The process-time product is 2n3, which is three times the cost of the serial algorithm (Equation 8.17). As a result, there is an upper bound of 1/3 on the efficiency of the parallel algorithm.

As in the case of a block 1-D mapping, the inefficiency of Gaussian elimination with a block 2-D partitioning of the matrix is due to process idling resulting from an uneven load distribution. Figure 8.13(a) shows the active part of an 8 x 8 matrix of coefficients in the iteration of the outer loop for k = 3 when the matrix is block 2-D partitioned among 16 processes. As shown in the figure, seven out of 16 processes are fully idle, five are partially loaded, and only four are fully active. By the time half of the iterations of the outer loop have been completed, only one-fourth of the processes are active. The remaining idle processes make the parallel algorithm much costlier than the sequential algorithm.

This problem can be alleviated if the matrix is partitioned in a 2-D cyclic fashion as shown in Figure 8.13(b). With the cyclic 2-D partitioning, the maximum difference in computational load between any two processes in any iteration is that of one row and one column update. For example, in Figure 8.13(b), n2/p matrix elements are active in the bottom-right process, and (n - 1)2/p elements are active in the top-left process. The difference in workload between any two processes is at most Q() in any iteration, which contributes Q() to the overhead function. Since there are n iterations, the cumulative overhead due to process idling is only Q with cyclic mapping in contrast to Q(n3) with block mapping (Problem 8.12). In practical parallel implementations of Gaussian elimination and LU factorization, a block-cyclic mapping is used to reduce the overhead due to message startup time associated with a pure cyclic mapping and to obtain better serial CPU utilization by performing block-matrix operations (Problem 8.15).

From the discussion in this section, we conclude that pipelined parallel Gaussian elimination for an n x n matrix takes time Q(n3/p) on p processes with both 1-D and 2-D partitioning schemes. 2-D partitioning can use more processes (O(n2)) than 1-D partitioning (O(n)) for an n x n coefficient matrix. Hence, an implementation with 2-D partitioning is more scalable.

#### 8.3.2 Gaussian Elimination with Partial Pivoting

The Gaussian elimination algorithm in Algorithm 8.4 fails if any diagonal entry A[k, k] of the matrix of coefficients is close or equal to zero. To avoid this problem and to ensure the numerical stability of the algorithm, a technique called partial pivoting is used. At the beginning of the outer loop in the kth iteration, this method selects a column i (called the pivot column) such that A[k, i] is the largest in magnitude among all A[k, j] such that k j < n. It then exchanges the kth and the i th columns before starting the iteration. These columns can either be exchanged explicitly by physically moving them into each other's locations, or they can be exchanged implicitly by simply maintaining an n x 1 permutation vector to keep track of the new indices of the columns of A. If partial pivoting is performed with an implicit exchange of column indices, then the factors L and U are not exactly triangular matrices, but columnwise permutations of triangular matrices.

Assuming that columns are exchanged explicitly, the value of A[k, k] used as the divisor on line 6 of Algorithm 8.4 (after exchanging columns k and i) is greater than or equal to any A[k, j] that it divides in the kth iteration. Partial pivoting in Algorithm 8.4 results in a unit upper-triangular matrix in which all elements above the principal diagonal have an absolute value of less than one.

##### 1-D Partitioning

Performing partial pivoting is straightforward with rowwise partitioning as discussed in Section 8.3.1. Before performing the divide operation in the kth iteration, the process storing the kth row makes a comparison pass over the active portion of this row, and selects the element with the largest absolute value as the divisor. This element determines the pivot column, and all processes must know the index of this column. This information can be passed on to the rest of the processes along with the modified (after the division) elements of the kth row. The combined pivot-search and division step takes time Q(n - k - 1) in the kth iteration, as in case of Gaussian elimination without pivoting. Thus, partial pivoting has no significant effect on the performance of Algorithm 8.4 if the coefficient matrix is partitioned along the rows.

Now consider a columnwise 1-D partitioning of the coefficient matrix. In the absence of pivoting, parallel implementations of Gaussian elimination with rowwise and columnwise 1-D partitioning are almost identical (Problem 8.9). However, the two are significantly different if partial pivoting is performed.

The first difference is that, unlike rowwise partitioning, the pivot search is distributed in columnwise partitioning. If the matrix size is n x n and the number of processes is p, then the pivot search in columnwise partitioning involves two steps. During pivot search for the kth iteration, first each process determines the maximum of the n/p (or fewer) elements of the kth row that it stores. The next step is to find the maximum of the resulting p (or fewer) values, and to distribute the maximum among all processes. Each pivot search takes time Q(n/p) + Q(log p). For sufficiently large values of n with respect to p, this is less than the time Q(n) it takes to perform a pivot search with rowwise partitioning. This seems to suggest that a columnwise partitioning is better for partial pivoting that a rowwise partitioning. However, the following factors favor rowwise partitioning.

Figure 8.7 shows how communication and computation "fronts" move from top to bottom in the pipelined version of Gaussian elimination with rowwise 1-D partitioning. Similarly, the communication and computation fronts move from left to right in case of columnwise 1-D partitioning. This means that the (k + 1)th row is not ready for pivot search for the (k + 1)th iteration (that is, it is not fully updated) until the front corresponding to the kth iteration reaches the rightmost process. As a result, the (k + 1)th iteration cannot start until the entire kth iteration is complete. This effectively eliminates pipelining, and we are therefore forced to use the synchronous version with poor efficiency.

While performing partial pivoting, columns of the coefficient matrix may or may not be explicitly exchanged. In either case, the performance of Algorithm 8.4 is adversely affected with columnwise 1-D partitioning. Recall that cyclic or block-cyclic mappings result in a better load balance in Gaussian elimination than a block mapping. A cyclic mapping ensures that the active portion of the matrix is almost uniformly distributed among the processes at every stage of Gaussian elimination. If pivot columns are not exchanged explicitly, then this condition may cease to hold. After a pivot column is used, it no longer stays in the active portion of the matrix. As a result of pivoting without explicit exchange, columns are arbitrarily removed from the different processes' active portions of the matrix. This randomness may disturb the uniform distribution of the active portion. On the other hand, if columns belonging to different processes are exchanged explicitly, then this exchange requires communication between the processes. A rowwise 1-D partitioning neither requires communication for exchanging columns, nor does it lose the load-balance if columns are not exchanged explicitly.

##### 2-D Partitioning

In the case of 2-D partitioning of the coefficient matrix, partial pivoting seriously restricts pipelining, although it does not completely eliminate it. Recall that in the pipelined version of Gaussian elimination with 2-D partitioning, fronts corresponding to various iterations move from top-left to bottom-right. The pivot search for the (k + 1)th iteration can commence as soon as the front corresponding to the kth iteration has moved past the diagonal of the active matrix joining its top-right and bottom-left corners.

Thus, partial pivoting may lead to considerable performance degradation in parallel Gaussian elimination with 2-D partitioning. If numerical considerations allow, it may be possible to reduce the performance loss due to partial pivoting. We can restrict the search for the pivot in the kth iteration to a band of q columns (instead of all n - k columns). In this case, the i th column is selected as the pivot in the kth iteration if A[k, i] is the largest element in a band of q elements of the active part of the i th row. This restricted partial pivoting not only reduces the communication cost, but also permits limited pipelining. By restricting the number of columns for pivot search to q, an iteration can start as soon as the previous iteration has updated the first q + 1 columns.

Another way to get around the loss of pipelining due to partial pivoting in Gaussian elimination with 2-D partitioning is to use fast algorithms for one-to-all broadcast, such as those described in Section 4.7.1. With 2-D partitioning of the n x n coefficient matrix on p processes, a process spends time Q() in communication in each iteration of the pipelined version of Gaussian elimination. Disregarding the message startup time ts, a non-pipelined version that performs explicit one-to-all broadcasts using the algorithm of Section 4.1 spends time Q(() log p) communicating in each iteration. This communication time is higher than that of the pipelined version. The one-to-all broadcast algorithms described in Section 4.7.1 take time Q() in each iteration (disregarding the startup time). This time is asymptotically equal to the per-iteration communication time of the pipelined algorithm. Hence, using a smart algorithm to perform one-to-all broadcast, even non-pipelined parallel Gaussian elimination can attain performance comparable to that of the pipelined algorithm. However, the one-to-all broadcast algorithms described in Section 4.7.1 split a message into smaller parts and route them separately. For these algorithms to be effective, the sizes of the messages should be large enough; that is, n should be large compared to p.

Although pipelining and pivoting do not go together in Gaussian elimination with 2-D partitioning, the discussion of 2-D partitioning in this section is still useful. With some modification, it applies to the Cholesky factorization algorithm (Algorithm 8.6 in Problem 8.16), which does not require pivoting. Cholesky factorization applies only to symmetric, positive definite matrices. A real n x n matrix A is positive definite if xT Ax > 0 for any n x 1 nonzero, real vector x. The communication pattern in Cholesky factorization is quite similar to that of Gaussian elimination (Problem 8.16), except that, due to symmetric lower and upper-triangular halves in the matrix, Cholesky factorization uses only one triangular half of the matrix.

#### 8.3.3 Solving a Triangular System: Back-Substitution

We now briefly discuss the second stage of solving a system of linear equations. After the full matrix A has been reduced to an upper-triangular matrix U with ones along the principal diagonal, we perform back-substitution to determine the vector x. A sequential back-substitution algorithm for solving an upper-triangular system of equations Ux = y is shown in Algorithm 8.5.

Starting with the last equation, each iteration of the main loop (lines 3-8) of Algorithm 8.5 computes the values of a variable and substitutes the variable's value back into the remaining equations. The program performs approximately n2/2 multiplications and subtractions. Note that the number of arithmetic operations in back-substitution is less than that in Gaussian elimination by a factor of Q(n). Hence, if back-substitution is used in conjunction with Gaussian elimination, it is best to use the matrix partitioning scheme that is the most efficient for parallel Gaussian elimination.

##### Algorithm 8.5 A serial algorithm for back-substitution. U is an upper-triangular matrix with all entries of the principal diagonal equal to one, and all subdiagonal entries equal to zero.
```1.   procedure BACK_SUBSTITUTION (U, x, y)
2.   begin
3.      for k := n - 1 downto 0 do /* Main loop */
4.          begin
5.             x[k] := y[k];
6.             for i := k - 1 downto 0 do
7.                 y[i] := y[i] - x[k] x U[i, k];
8.          endfor;
9.   end BACK_SUBSTITUTION ```

Consider a rowwise block 1-D mapping of the n x n matrix U onto p processes. Let the vector y be distributed uniformly among all the processes. The value of the variable solved in a typical iteration of the main loop (line 3) must be sent to all the processes with equations involving that variable. This communication can be pipelined (Problem 8.22). If so, the time to perform the computations of an iteration dominates the time that a process spends in communication in an iteration. In every iteration of a pipelined implementation, a process receives (or generates) the value of a variable and sends that value to another process. Using the value of the variable solved in the current iteration, a process also performs up to n/p multiplications and subtractions (lines 6 and 7). Hence, each step of a pipelined implementation requires a constant amount of time for communication and time Q(n/p) for computation. The algorithm terminates in Q(n) steps (Problem 8.22), and the parallel run time of the entire algorithm is Q(n2/p).

If the matrix is partitioned by using 2-D partitioning on a logical mesh of processes, and the elements of the vector are distributed along one of the columns of the process mesh, then only the processes containing the vector perform any computation. Using pipelining to communicate the appropriate elements of U to the process containing the corresponding elements of y for the substitution step (line 7), the algorithm can be executed in time Q (Problem 8.22). Thus, the cost of parallel back-substitution with 2-D mapping is Q. The algorithm is not cost-optimal because its sequential cost is only Q(n2). However, the entire process of solving the linear system, including upper-triangularization using Gaussian elimination, is still cost-optimal for because the sequential complexity of the entire process is Q(n3).

#### 8.3.4 Numerical Considerations in Solving Systems of Linear Equations

A system of linear equations of the form Ax = b can be solved by using a factorization algorithm to express A as the product of a lower-triangular matrix L, and a unit upper-triangular matrix U. The system of equations is then rewritten as LU x = b, and is solved in two steps. First, the lower-triangular system Ly = b is solved for y. Second, the upper-triangular system Ux = y is solved for x.

The Gaussian elimination algorithm given in Algorithm 8.4 effectively factorizes A into L and U. However, it also solves the lower-triangular system Ly = b on the fly by means of steps on lines 7 and 13. Algorithm 8.4 gives what is called a row-oriented Gaussian elimination algorithm. In this algorithm, multiples of rows are subtracted from other rows. If partial pivoting, as described in Section 8.3.2, is incorporated into this algorithm, then the resulting upper-triangular matrix U has all its elements less than or equal to one in magnitude. The lower-triangular matrix L, whether implicit or explicit, may have elements with larger numerical values. While solving the system Ax = b, the triangular system Ly = b is solved first. If L contains large elements, then rounding errors can occur while solving for y due to the finite precision of floating-point numbers in the computer. These errors in y are propagated through the solution of Ux = y.

An alternate form of Gaussian elimination is the column-oriented form that can be obtained from Algorithm 8.4 by reversing the roles of rows and columns. In the column-oriented algorithm, multiples of columns are subtracted from other columns, pivot search is also performed along the columns, and numerical stability is guaranteed by row interchanges, if needed. All elements of the lower-triangular matrix L generated by the column-oriented algorithm have a magnitude less than or equal to one. This minimizes numerical error while solving Ly = b, and results in a significantly smaller error in the overall solution than the row-oriented algorithm. Algorithm 3.3 gives a procedure for column-oriented LU factorization.

From a practical point of view, the column-oriented Gaussian elimination algorithm is more useful than the row-oriented algorithm. We have chosen to present the row-oriented algorithm in detail in this chapter because it is more intuitive. It is easy to see that the system of linear equations resulting from the subtraction of a multiple of an equation from other equations is equivalent to the original system. The entire discussion on the row-oriented algorithm of Algorithm 8.4 presented in this section applies to the column-oriented algorithm with the roles of rows and columns reversed. For example, columnwise 1-D partitioning is more suitable than rowwise 1-D partitioning for the column-oriented algorithm with partial pivoting.