## 8.3 Solving a System of Linear EquationsThis section discusses the problem of solving a system of linear equations of the form In matrix notation, this system is written as Ax = b. Here A is a dense n x n matrix of coefficients such that A[i, j] = a A system of equations Ax = b is usually solved in two stages. First, through a series of algebraic manipulations, the original system of equations is reduced to an upper-triangular system of the form We write this as Ux = y, where U is a unit upper-triangular matrix - one in which all subdiagonal entries are zero and all principal diagonal entries are equal to one. Formally, U[i, j] = 0 if i > j, otherwise U[i, j] = u We discuss parallel formulations of the classical Gaussian elimination method for upper-triangularization in Sections 8.3.1 and 8.3.2. In Section 8.3.1, we describe a straightforward Gaussian elimination algorithm assuming that the coefficient matrix is nonsingular, and its rows and columns are permuted in a way that the algorithm is numerically stable. Section 8.3.2 discusses the case in which a numerically stable solution of the system of equations requires permuting the columns of the matrix during the execution of the Gaussian elimination algorithm. Although we discuss Gaussian elimination in the context of upper-triangularization, a similar procedure can be used to factorize matrix A as the product of a lower-triangular matrix L and a unit upper-triangular matrix U so that A = L x U. This factorization is commonly referred to as LU factorization. Performing LU factorization (rather than upper-triangularization) is particularly useful if multiple systems of equations with the same left-hand side Ax need to be solved. Algorithm 3.3 gives a procedure for column-oriented LU factorization. ## 8.3.1 A Simple Gaussian Elimination AlgorithmThe serial Gaussian elimination algorithm has three nested loops. Several variations of the algorithm exist, depending on the order in which the loops are arranged. Algorithm 8.4 shows one variation of Gaussian elimination, which we will adopt for parallel implementation in the remainder of this section. This program converts a system of linear equations Ax = b to a unit upper-triangular system Ux = y. We assume that the matrix U shares storage with A and overwrites the upper-triangular portion of A. The element A[k, j] computed on line 6 of Algorithm 8.4 is actually U [k, j]. Similarly, the element A[k, k] equated to 1 on line 8 is U [k, k]. Algorithm 8.4 assumes that A[k, k] 0 when it is used as a divisor on lines 6 and 7. ## Algorithm 8.4 A serial Gaussian elimination algorithm that converts the system of linear equations Ax = b to a unit upper-triangular system Ux = y. The matrix U occupies the upper-triangular locations of A. This algorithm assumes that A[k, k] 0 when it is used as a divisor on lines 6 and 7.1. procedure GAUSSIAN_ELIMINATION (A, b, y) 2. begin 3. for k := 0 to n - 1 do /* Outer loop */ 4. begin 5. for j := k + 1 to n - 1 do 6. A[k, j] := A[k, j]/A[k, k]; /* Division step */ 7. y[k] := b[k]/A[k, k]; 8. A[k, k] := 1; 9. for i := k + 1 to n - 1 do 10. begin 11. for j := k + 1 to n - 1 do 12. A[i, j] := A[i, j] - A[i, k] x A[k, j]; /* Elimination step */ 13. b[i] := b[i] - A[i, k] x y[k]; 14. A[i, k] := 0; 15. endfor; /* Line 9 */ 16. endfor; /* Line 3 */ 17. end GAUSSIAN_ELIMINATION In this section, we will concentrate only on the operations on matrix A in Algorithm 8.4. The operations on vector b on lines 7 and 13 of the program are straightforward to implement. Hence, in the rest of the section, we will ignore these steps. If the steps on lines 7, 8, 13, and 14 are not performed, then Algorithm 8.4 leads to the LU factorization of A as a product L x U. After the termination of the procedure, L is stored in the lower-triangular part of A, and U occupies the locations above the principal diagonal. For k varying from 0 to n - 1, the Gaussian elimination procedure systematically eliminates variable x[k] from equations k + 1 to n - 1 so that the matrix of coefficients becomes upper-triangular. As shown in Algorithm 8.4, in the kth iteration of the outer loop (starting on line 3), an appropriate multiple of the kth equation is subtracted from each of the equations k + 1 to n - 1 (loop starting on line 9). The multiples of the kth equation (or the kth row of matrix A) are chosen such that the kth coefficient becomes zero in equations k + 1 to n - 1 eliminating x [k] from these equations. A typical computation of the Gaussian elimination procedure in the kth iteration of the outer loop is shown in Figure 8.5. The kth iteration of the outer loop does not involve any computation on rows 1 to k - 1 or columns 1 to k - 1. Thus, at this stage, only the lower-right (n - k) x (n - k) submatrix of A (the shaded portion in Figure 8.5) is computationally active. ## Figure 8.5. A typical computation in Gaussian elimination.Gaussian elimination involves approximately n ## Parallel Implementation with 1-D PartitioningWe now consider a parallel implementation of Algorithm 8.4, in which the coefficient matrix is rowwise 1-D partitioned among the processes. A parallel implementation of this algorithm with columnwise 1-D partitioning is very similar, and its details can be worked out based on the implementation using rowwise 1-D partitioning (Problems 8.8 and 8.9). We first consider the case in which one row is assigned to each process, and the n x n coefficient matrix A is partitioned along the rows among n processes labeled from P ## Figure 8.6. Gaussian elimination steps during the iteration corresponding to k = 3 for an 8 x 8 matrix partitioned rowwise among eight processes.Algorithm 8.4 and Figure 8.5 show that A[k, k + 1], A[k, k + 2], ..., A[k, n - 1] are divided by A[k, k] (line 6) at the beginning of the kth iteration. All matrix elements participating in this operation (shown by the shaded portion of the matrix in Figure 8.6(a)) belong to the same process. So this step does not require any communication. In the second computation step of the algorithm (the elimination step of line 12), the modified (after division) elements of the kth row are used by all other rows of the active part of the matrix. As Figure 8.6(b) shows, this requires a one-to-all broadcast of the active part of the kth row to the processes storing rows k + 1 to n - 1. Finally, the computation A[i, j] := A[i, j] - A[i, k] x A[k, j] takes place in the remaining active portion of the matrix, which is shown shaded in Figure 8.6(c). The computation step corresponding to Figure 8.6(a) in the kth iteration requires n - k - 1 divisions at process P The communication step of Figure 8.6(b) takes time (t Since the number of processes is n, the cost, or the process-time product, is Q(n3 log n) due to the term associated with t Pipelined Communication and Computation We now present a parallel implementation of Gaussian elimination that is cost-optimal on n processes. In the parallel Gaussian elimination algorithm just presented, the n iterations of the outer loop of Algorithm 8.4 execute sequentially. At any given time, all processes work on the same iteration. The (k + 1)th iteration starts only after all the computation and communication for the kth iteration is complete. The performance of the algorithm can be improved substantially if the processes work asynchronously; that is, no process waits for the others to finish an iteration before starting the next one. We call this the asynchronous or pipelined version of Gaussian elimination. Figure 8.7 illustrates the pipelined Algorithm 8.4 for a 5 x 5 matrix partitioned along the rows onto a logical linear array of five processes. ## Figure 8.7. Pipelined Gaussian elimination on a 5 x 5 matrix partitioned with one row per process.During the kth iteration of Algorithm 8.4, process P In pipelined Gaussian elimination, each process independently performs the following sequence of actions repeatedly until all n iterations are complete. For the sake of simplicity, we assume that steps (1) and (2) take the same amount of time (this assumption does not affect the analysis): If a process has any data destined for other processes, it sends those data to the appropriate process. If the process can perform some computation using the data it has, it does so. Otherwise, the process waits to receive data to be used for one of the above actions.
Figure 8.7 shows the 16 steps in the pipelined parallel execution of Gaussian elimination for a 5 x 5 matrix partitioned along the rows among five processes. As Figure 8.7(a) shows, the first step is to perform the division on row 0 at process P We now show that, unlike the synchronous algorithm in which all processes work on the same iteration at a time, the pipelined or the asynchronous version of Gaussian elimination is cost-optimal. As Figure 8.7 shows, the initiation of consecutive iterations of the outer loop of Algorithm 8.4 is separated by a constant number of steps. A total of n such iterations are initiated. The last iteration modifies only the bottom-right corner element of the coefficient matrix; hence, it completes in a constant time after its initiation. Thus, the total number of steps in the entire pipelined procedure is Q(n) (Problem 8.7). In any step, either O(n) elements are communicated between directly-connected processes, or a division step is performed on O(n) elements of a row, or an elimination step is performed on O(n) elements of a row. Each of these operations take O(n) time. Hence, the entire procedure consists of Q(n) steps of O(n) complexity each, and its parallel run time is O(n Block 1-D Partitioning with Fewer than n Processes The preceding pipelined implementation of parallel Gaussian elimination can be easily adapted for the case in which n > p. Consider an n x n matrix partitIoned among p processes (p < n) such that each process is assigned n/p contiguous rows of the matrix. Figure 8.8 illustrates the communication steps in a typical iteration of Gaussian elimination with such a mapping. As the figure shows, the kth iteration of the algorithm requires that the active part of the kth row be sent to the processes storing rows k + 1, k + 2, ..., n - 1. ## Figure 8.8. The communication in the Gaussian elimination iteration corresponding to k = 3 for an 8 x 8 matrix distributed among four processes using block 1-D partitioning.Figure 8.9(a) shows that, with block 1-D partitioning, a process with all rows belonging to the active part of the matrix performs (n - k - 1)n/p multiplications and subtractions during the elimination step of the kth iteration. Note that in the last (n/p) - 1 iterations, no process has all active rows, but we ignore this anomaly. If the pipelined version of the algorithm is used, then the number of arithmetic operations on a maximally-loaded process in the kth iteration (2(n - k - 1)n/p) is much higher than the number of words communicated (n - k - 1) by a process in the same iteration. Thus, for sufficiently large values of n with respect to p, computation dominates communication in each iteration. Assuming that each scalar multiplication and subtraction pair takes unit time, the total parallel run time of this algorithm (ignoring communication overhead) is , which is approximately equal to n ## Figure 8.9. Computation load on different processes in block and cyclic 1-D partitioning of an 8 x 8 matrix on four processes during the Gaussian elimination iteration corresponding to k = 3.The process-time product of this algorithm is n This problem can be alleviated if the matrix is partitioned among the processes using cyclic 1-D mapping as shown in Figure 8.9(b). With the cyclic 1-D partitioning, the difference between the computational loads of a maximally loaded process and the least loaded process in any iteration is of at most one row (that is, O(n) arithmetic operations). Since there are n iterations, the cumulative overhead due to process idling is only O(n ## Parallel Implementation with 2-D PartitioningWe now describe a parallel implementation of Algorithm 8.4 in which the n x n matrix A is mapped onto an n x n mesh of processes such that process P ## Figure 8.10. Various steps in the Gaussian elimination iteration corresponding to k = 3 for an 8 x 8 matrix on 64 processes arranged in a logical two-dimensional mesh.Pipelined Communication and Computation Based on our experience with Gaussian elimination using 1-D partitioning of the coefficient matrix, we develop a pipelined version of the algorithm using 2-D partitioning. As Figure 8.10 shows, in the kth iteration of the outer loop (lines 3-16 of Algorithm 8.4), A[k, k] is sent to the right from P The communication and computation can be pipelined in several ways. We present one such scheme in Figure 8.11. In Figure 8.11(a), the iteration of the outer loop for k = 0 starts at process P ## Figure 8.11. Pipelined Gaussian elimination for a 5 x 5 matrix with 25 processes.All processes performing communication or computation during a particular iteration lie along a diagonal in the bottom-left to top-right direction (for example, P Every step of an iteration, such as division, elimination, or transmitting a value to a neighboring process, is a constant-time operation. Therefore, a front moves a single step closer to the bottom-right corner of the matrix in constant time (equivalent to two steps of Figure 8.11). The front for k = 0 takes time Q(n) to reach P
2-D Partitioning with Fewer than n ## Figure 8.12. The communication steps in the Gaussian elimination iteration corresponding to k = 3 for an 8 x 8 matrix on 16 processes of a two-dimensional mesh.## Figure 8.13. Computational load on different processes in block and cyclic 2-D mappings of an 8 x 8 matrix onto 16 processes during the Gaussian elimination iteration corresponding to k = 3.Figures 8.12 and 8.13(a) show that a process containing a completely active part of the matrix performs n As in the case of a block 1-D mapping, the inefficiency of Gaussian elimination with a block 2-D partitioning of the matrix is due to process idling resulting from an uneven load distribution. Figure 8.13(a) shows the active part of an 8 x 8 matrix of coefficients in the iteration of the outer loop for k = 3 when the matrix is block 2-D partitioned among 16 processes. As shown in the figure, seven out of 16 processes are fully idle, five are partially loaded, and only four are fully active. By the time half of the iterations of the outer loop have been completed, only one-fourth of the processes are active. The remaining idle processes make the parallel algorithm much costlier than the sequential algorithm. This problem can be alleviated if the matrix is partitioned in a 2-D cyclic fashion as shown in Figure 8.13(b). With the cyclic 2-D partitioning, the maximum difference in computational load between any two processes in any iteration is that of one row and one column update. For example, in Figure 8.13(b), n From the discussion in this section, we conclude that pipelined parallel Gaussian elimination for an n x n matrix takes time Q(n ## 8.3.2 Gaussian Elimination with Partial PivotingThe Gaussian elimination algorithm in Algorithm 8.4 fails if any diagonal entry A[k, k] of the matrix of coefficients is close or equal to zero. To avoid this problem and to ensure the numerical stability of the algorithm, a technique called partial pivoting is used. At the beginning of the outer loop in the kth iteration, this method selects a column i (called the pivot column) such that A[k, i] is the largest in magnitude among all A[k, j] such that k j < n. It then exchanges the kth and the i th columns before starting the iteration. These columns can either be exchanged explicitly by physically moving them into each other's locations, or they can be exchanged implicitly by simply maintaining an n x 1 permutation vector to keep track of the new indices of the columns of A. If partial pivoting is performed with an implicit exchange of column indices, then the factors L and U are not exactly triangular matrices, but columnwise permutations of triangular matrices. Assuming that columns are exchanged explicitly, the value of A[k, k] used as the divisor on line 6 of Algorithm 8.4 (after exchanging columns k and i) is greater than or equal to any A[k, j] that it divides in the kth iteration. Partial pivoting in Algorithm 8.4 results in a unit upper-triangular matrix in which all elements above the principal diagonal have an absolute value of less than one. ## 1-D PartitioningPerforming partial pivoting is straightforward with rowwise partitioning as discussed in Section 8.3.1. Before performing the divide operation in the kth iteration, the process storing the kth row makes a comparison pass over the active portion of this row, and selects the element with the largest absolute value as the divisor. This element determines the pivot column, and all processes must know the index of this column. This information can be passed on to the rest of the processes along with the modified (after the division) elements of the kth row. The combined pivot-search and division step takes time Q(n - k - 1) in the kth iteration, as in case of Gaussian elimination without pivoting. Thus, partial pivoting has no significant effect on the performance of Algorithm 8.4 if the coefficient matrix is partitioned along the rows. Now consider a columnwise 1-D partitioning of the coefficient matrix. In the absence of pivoting, parallel implementations of Gaussian elimination with rowwise and columnwise 1-D partitioning are almost identical (Problem 8.9). However, the two are significantly different if partial pivoting is performed. The first difference is that, unlike rowwise partitioning, the pivot search is distributed in columnwise partitioning. If the matrix size is n x n and the number of processes is p, then the pivot search in columnwise partitioning involves two steps. During pivot search for the kth iteration, first each process determines the maximum of the n/p (or fewer) elements of the kth row that it stores. The next step is to find the maximum of the resulting p (or fewer) values, and to distribute the maximum among all processes. Each pivot search takes time Q(n/p) + Q(log p). For sufficiently large values of n with respect to p, this is less than the time Q(n) it takes to perform a pivot search with rowwise partitioning. This seems to suggest that a columnwise partitioning is better for partial pivoting that a rowwise partitioning. However, the following factors favor rowwise partitioning. Figure 8.7 shows how communication and computation "fronts" move from top to bottom in the pipelined version of Gaussian elimination with rowwise 1-D partitioning. Similarly, the communication and computation fronts move from left to right in case of columnwise 1-D partitioning. This means that the (k + 1)th row is not ready for pivot search for the (k + 1)th iteration (that is, it is not fully updated) until the front corresponding to the kth iteration reaches the rightmost process. As a result, the (k + 1)th iteration cannot start until the entire kth iteration is complete. This effectively eliminates pipelining, and we are therefore forced to use the synchronous version with poor efficiency. While performing partial pivoting, columns of the coefficient matrix may or may not be explicitly exchanged. In either case, the performance of Algorithm 8.4 is adversely affected with columnwise 1-D partitioning. Recall that cyclic or block-cyclic mappings result in a better load balance in Gaussian elimination than a block mapping. A cyclic mapping ensures that the active portion of the matrix is almost uniformly distributed among the processes at every stage of Gaussian elimination. If pivot columns are not exchanged explicitly, then this condition may cease to hold. After a pivot column is used, it no longer stays in the active portion of the matrix. As a result of pivoting without explicit exchange, columns are arbitrarily removed from the different processes' active portions of the matrix. This randomness may disturb the uniform distribution of the active portion. On the other hand, if columns belonging to different processes are exchanged explicitly, then this exchange requires communication between the processes. A rowwise 1-D partitioning neither requires communication for exchanging columns, nor does it lose the load-balance if columns are not exchanged explicitly. ## 2-D PartitioningIn the case of 2-D partitioning of the coefficient matrix, partial pivoting seriously restricts pipelining, although it does not completely eliminate it. Recall that in the pipelined version of Gaussian elimination with 2-D partitioning, fronts corresponding to various iterations move from top-left to bottom-right. The pivot search for the (k + 1)th iteration can commence as soon as the front corresponding to the kth iteration has moved past the diagonal of the active matrix joining its top-right and bottom-left corners. Thus, partial pivoting may lead to considerable performance degradation in parallel Gaussian elimination with 2-D partitioning. If numerical considerations allow, it may be possible to reduce the performance loss due to partial pivoting. We can restrict the search for the pivot in the kth iteration to a band of q columns (instead of all n - k columns). In this case, the i th column is selected as the pivot in the kth iteration if A[k, i] is the largest element in a band of q elements of the active part of the i th row. This restricted partial pivoting not only reduces the communication cost, but also permits limited pipelining. By restricting the number of columns for pivot search to q, an iteration can start as soon as the previous iteration has updated the first q + 1 columns. Another way to get around the loss of pipelining due to partial pivoting in Gaussian elimination with 2-D partitioning is to use fast algorithms for one-to-all broadcast, such as those described in Section 4.7.1. With 2-D partitioning of the n x n coefficient matrix on p processes, a process spends time Q() in communication in each iteration of the pipelined version of Gaussian elimination. Disregarding the message startup time t Although pipelining and pivoting do not go together in Gaussian elimination with 2-D partitioning, the discussion of 2-D partitioning in this section is still useful. With some modification, it applies to the Cholesky factorization algorithm (Algorithm 8.6 in Problem 8.16), which does not require pivoting. Cholesky factorization applies only to symmetric, positive definite matrices. A real n x n matrix A is positive definite if x ## 8.3.3 Solving a Triangular System: Back-SubstitutionWe now briefly discuss the second stage of solving a system of linear equations. After the full matrix A has been reduced to an upper-triangular matrix U with ones along the principal diagonal, we perform back-substitution to determine the vector x. A sequential back-substitution algorithm for solving an upper-triangular system of equations Ux = y is shown in Algorithm 8.5. Starting with the last equation, each iteration of the main loop (lines 3-8) of Algorithm 8.5 computes the values of a variable and substitutes the variable's value back into the remaining equations. The program performs approximately n ## Algorithm 8.5 A serial algorithm for back-substitution. U is an upper-triangular matrix with all entries of the principal diagonal equal to one, and all subdiagonal entries equal to zero.1. procedure BACK_SUBSTITUTION (U, x, y) 2. begin 3. for k := n - 1 downto 0 do /* Main loop */ 4. begin 5. x[k] := y[k]; 6. for i := k - 1 downto 0 do 7. y[i] := y[i] - x[k] x U[i, k]; 8. endfor; 9. end BACK_SUBSTITUTION Consider a rowwise block 1-D mapping of the n x n matrix U onto p processes. Let the vector y be distributed uniformly among all the processes. The value of the variable solved in a typical iteration of the main loop (line 3) must be sent to all the processes with equations involving that variable. This communication can be pipelined (Problem 8.22). If so, the time to perform the computations of an iteration dominates the time that a process spends in communication in an iteration. In every iteration of a pipelined implementation, a process receives (or generates) the value of a variable and sends that value to another process. Using the value of the variable solved in the current iteration, a process also performs up to n/p multiplications and subtractions (lines 6 and 7). Hence, each step of a pipelined implementation requires a constant amount of time for communication and time Q(n/p) for computation. The algorithm terminates in Q(n) steps (Problem 8.22), and the parallel run time of the entire algorithm is Q(n If the matrix is partitioned by using 2-D partitioning on a logical mesh of processes, and the elements of the vector are distributed along one of the columns of the process mesh, then only the processes containing the vector perform any computation. Using pipelining to communicate the appropriate elements of U to the process containing the corresponding elements of y for the substitution step (line 7), the algorithm can be executed in time Q (Problem 8.22). Thus, the cost of parallel back-substitution with 2-D mapping is Q. The algorithm is not cost-optimal because its sequential cost is only Q(n ## 8.3.4 Numerical Considerations in Solving Systems of Linear EquationsA system of linear equations of the form Ax = b can be solved by using a factorization algorithm to express A as the product of a lower-triangular matrix L, and a unit upper-triangular matrix U. The system of equations is then rewritten as LU x = b, and is solved in two steps. First, the lower-triangular system Ly = b is solved for y. Second, the upper-triangular system Ux = y is solved for x. The Gaussian elimination algorithm given in Algorithm 8.4 effectively factorizes A into L and U. However, it also solves the lower-triangular system Ly = b on the fly by means of steps on lines 7 and 13. Algorithm 8.4 gives what is called a row-oriented Gaussian elimination algorithm. In this algorithm, multiples of rows are subtracted from other rows. If partial pivoting, as described in Section 8.3.2, is incorporated into this algorithm, then the resulting upper-triangular matrix U has all its elements less than or equal to one in magnitude. The lower-triangular matrix L, whether implicit or explicit, may have elements with larger numerical values. While solving the system Ax = b, the triangular system Ly = b is solved first. If L contains large elements, then rounding errors can occur while solving for y due to the finite precision of floating-point numbers in the computer. These errors in y are propagated through the solution of Ux = y. An alternate form of Gaussian elimination is the column-oriented form that can be obtained from Algorithm 8.4 by reversing the roles of rows and columns. In the column-oriented algorithm, multiples of columns are subtracted from other columns, pivot search is also performed along the columns, and numerical stability is guaranteed by row interchanges, if needed. All elements of the lower-triangular matrix L generated by the column-oriented algorithm have a magnitude less than or equal to one. This minimizes numerical error while solving Ly = b, and results in a significantly smaller error in the overall solution than the row-oriented algorithm. Algorithm 3.3 gives a procedure for column-oriented LU factorization. From a practical point of view, the column-oriented Gaussian elimination algorithm is more useful than the row-oriented algorithm. We have chosen to present the row-oriented algorithm in detail in this chapter because it is more intuitive. It is easy to see that the system of linear equations resulting from the subtraction of a multiple of an equation from other equations is equivalent to the original system. The entire discussion on the row-oriented algorithm of Algorithm 8.4 presented in this section applies to the column-oriented algorithm with the roles of rows and columns reversed. For example, columnwise 1-D partitioning is more suitable than rowwise 1-D partitioning for the column-oriented algorithm with partial pivoting. |