10.4 All-Pairs Shortest PathsInstead of finding the shortest paths from a single vertex v to every other vertex, we are sometimes interested in finding the shortest paths between all pairs of vertices. Formally, given a weighted graph G(V, E, w), the all-pairs shortest paths problem is to find the shortest paths between all pairs of vertices v_{i}, v_{j} V such that i j. For a graph with n vertices, the output of an all-pairs shortest paths algorithm is an n x n matrix D = (d_{i}_{,}_{j}) such that d_{i}_{,}_{j} is the cost of the shortest path from vertex v_{i} to vertex v_{j}. The following sections present two algorithms to solve the all-pairs shortest paths problem. The first algorithm uses Dijkstra's single-source shortest paths algorithm, and the second uses Floyd's algorithm. Dijkstra's algorithm requires non-negative edge weights (Problem 10.4), whereas Floyd's algorithm works with graphs having negative-weight edges provided they contain no negative-weight cycles. 10.4.1 Dijkstra's AlgorithmIn Section 10.3 we presented Dijkstra's algorithm for finding the shortest paths from a vertex v to all the other vertices in a graph. This algorithm can also be used to solve the all-pairs shortest paths problem by executing the single-source algorithm on each process, for each vertex v. We refer to this algorithm as Dijkstra's all-pairs shortest paths algorithm. Since the complexity of Dijkstra's single-source algorithm is Q(n^{2}), the complexity of the all-pairs algorithm is Q(n^{3}). Parallel FormulationsDijkstra's all-pairs shortest paths problem can be parallelized in two distinct ways. One approach partitions the vertices among different processes and has each process compute the single-source shortest paths for all vertices assigned to it. We refer to this approach as the source-partitioned formulation. Another approach assigns each vertex to a set of processes and uses the parallel formulation of the single-source algorithm (Section 10.3) to solve the problem on each set of processes. We refer to this approach as the source-parallel formulation. The following sections discuss and analyze these two approaches. Source-Partitioned Formulation The source-partitioned parallel formulation of Dijkstra's algorithm uses n processes. Each process P_{i} finds the shortest paths from vertex v_{i} to all other vertices by executing Dijkstra's sequential single-source shortest paths algorithm. It requires no interprocess communication (provided that the adjacency matrix is replicated at all processes). Thus, the parallel run time of this formulation is given by Since the sequential run time is W = Q(n^{3}), the speedup and efficiency are as follows: It might seem that, due to the absence of communication, this is an excellent parallel formulation. However, that is not entirely true. The algorithm can use at most n processes. Therefore, the isoefficiency function due to concurrency is Q(p^{3}), which is the overall isoefficiency function of the algorithm. If the number of processes available for solving the problem is small (that is, n = Q(p)), then this algorithm has good performance. However, if the number of processes is greater than n, other algorithms will eventually outperform this algorithm because of its poor scalability. Source-Parallel Formulation The major problem with the source-partitioned parallel formulation is that it can keep only n processes busy doing useful work. Performance can be improved if the parallel formulation of Dijkstra's single-source algorithm (Section 10.3) is used to solve the problem for each vertex v. The source-parallel formulation is similar to the source-partitioned formulation, except that the single-source algorithm runs on disjoint subsets of processes. Specifically, p processes are divided into n partitions, each with p/n processes (this formulation is of interest only if p > n). Each of the n single-source shortest paths problems is solved by one of the n partitions. In other words, we first parallelize the all-pairs shortest paths problem by assigning each vertex to a separate set of processes, and then parallelize the single-source algorithm by using the set of p/n processes to solve it. The total number of processes that can be used efficiently by this formulation is O (n^{2}). The analysis presented in Section 10.3 can be used to derive the performance of this formulation of Dijkstra's all-pairs algorithm. Assume that we have a p-process message-passing computer such that p is a multiple of n. The p processes are partitioned into n groups of size p/n each. If the single-source algorithm is executed on each p/n process group, the parallel run time is Notice the similarities between Equations 10.3 and 10.2. These similarities are not surprising because each set of p/n processes forms a different group and carries out the computation independently. Thus, the time required by each set of p/n processes to solve the single-source problem determines the overall run time. Since the sequential run time is W = Q(n^{3}), the speedup and efficiency are as follows: From Equation 10.4 we see that for a cost-optimal formulation p log p/n^{2} = O (1). Hence, this formulation can use up to O (n^{2}/log n) processes efficiently. Equation 10.4 also shows that the isoefficiency function due to communication is Q((p log p)^{1.5}). The isoefficiency function due to concurrency is Q(p^{1.5}). Thus, the overall isoefficiency function is Q((p log p)^{1.5}). Comparing the two parallel formulations of Dijkstra's all-pairs algorithm, we see that the source-partitioned formulation performs no communication, can use no more than n processes, and solves the problem in time Q(n^{2}). In contrast, the source-parallel formulation uses up to n^{2}/log n processes, has some communication overhead, and solves the problem in time Q(n log n) when n^{2}/log n processes are used. Thus, the source-parallel formulation exploits more parallelism than does the source-partitioned formulation. 10.4.2 Floyd's AlgorithmFloyd's algorithm for solving the all-pairs shortest paths problem is based on the following observation. Let G = (V, E, w) be the weighted graph, and let V = {v_{1},v_{2},..., v_{n}} be the vertices of G. Consider a subset {v_{1}, v_{2},..., v_{k}} of vertices for some k where k n. For any pair of vertices v_{i} , v_{j} V, consider all paths from v_{i} to v_{j} whose intermediate vertices belong to the set {v_{1}, v_{2},..., vk }. Let be the minimum-weight path among them, and let be the weight of . If vertex v_{k} is not in the shortest path from v_{i} to v _{j}, then is the same as . However, if v_{k} is in , then we can break into two paths - one from v_{i} to v_{k} and one from v_{k} to v_{j}. Each of these paths uses vertices from {v_{1}, v_{2},..., v_{k}_{-1}}. Thus, . These observations are expressed in the following recurrence equation: The length of the shortest path from v_{i} to v_{j} is given by . In general, the solution is a matrix . Floyd's algorithm solves Equation 10.5 bottom-up in the order of increasing values of k. Algorithm 10.3 shows Floyd's all-pairs algorithm. The run time of Floyd's algorithm is determined by the triple-nested for loops in lines 4-7. Each execution of line 7 takes time Q(1); thus, the complexity of the algorithm is Q(n^{3}). Algorithm 10.3 seems to imply that we must store n matrices of size n xn. However, when computing matrix D^{(}^{k}^{)}, only matrix D^{(}^{k}^{-1)} is needed. Consequently, at most two n x n matrices must be stored. Therefore, the overall space complexity is Q(n^{2}). Furthermore, the algorithm works correctly even when only one copy of D is used (Problem 10.6). Algorithm 10.3 Floyd's all-pairs shortest paths algorithm. This program computes the all-pairs shortest paths of the graph G = (V, E ) with adjacency matrix A.1. procedure FLOYD_ALL_PAIRS_SP( A) 2. begin 3. D^{(0)} = A; 4. for k := 1 to n do 5. for i := 1 to n do 6. for j := 1 to n do 7. ; 8. end FLOYD_ALL_PAIRS_SP Parallel FormulationA generic parallel formulation of Floyd's algorithm assigns the task of computing matrix D^{(}^{k}^{)} for each value of k to a set of processes. Let p be the number of processes available. Matrix D^{(}^{k}^{)} is partitioned into p parts, and each part is assigned to a process. Each process computes the D^{(}^{k}^{)} values of its partition. To accomplish this, a process must access the corresponding segments of the k^{th} row and column of matrix D^{(}^{k}^{-1)}. The following section describes one technique for partitioning matrix D^{(}^{k}^{)}. Another technique is considered in Problem 10.8. 2-D Block Mapping One way to partition matrix D^{(}^{k}^{)} is to use the 2-D block mapping (Section 3.4.1). Specifically, matrix D^{(}^{k}^{)} is divided into p blocks of size , and each block is assigned to one of the p processes. It is helpful to think of the p processes as arranged in a logical grid of size . Note that this is only a conceptual layout and does not necessarily reflect the actual interconnection network. We refer to the process on the i^{th} row and j^{th} column as P_{i}_{,}_{j}. Process P_{i}_{,}_{j} is assigned a subblock of D^{(}^{k}^{)} whose upper-left corner is and whose lower-right corner is . Each process updates its part of the matrix during each iteration. Figure 10.7(a) illustrates the 2-D block mapping technique. Figure 10.7. (a) Matrix D^{(}^{k}^{)} distributed by 2-D block mapping into subblocks, and (b) the subblock of D^{(}^{k}^{)} assigned to process P_{i}_{,}_{j}.During the k^{th} iteration of the algorithm, each process P_{i}_{,}_{j} needs certain segments of the k^{th} row and k^{th} column of the D^{(}^{k}^{-1)} matrix. For example, to compute it must get and . As Figure 10.8 illustrates, resides on a process along the same row, and element resides on a process along the same column as P_{i}_{,}_{j}. Segments are transferred as follows. During the k^{th} iteration of the algorithm, each of the processes containing part of the k^{th} row sends it to the processes in the same column. Similarly, each of the processes containing part of the k^{th} column sends it to the processes in the same row. Figure 10.8. (a) Communication patterns used in the 2-D block mapping. When computing , information must be sent to the highlighted process from two other processes along the same row and column. (b) The row and column of processes that contain the k^{th} row and column send them along process columns and rows.Algorithm 10.4 shows the parallel formulation of Floyd's algorithm using the 2-D block mapping. We analyze the performance of this algorithm on a p-process message-passing computer with a cross-bisection bandwidth of Q(p). During each iteration of the algorithm, the k^{th} row and k^{th} column of processes perform a one-to-all broadcast along a row or a column of processes. Each such process has elements of the k^{th} row or column, so it sends elements. This broadcast requires time Q. The synchronization step on line 7 requires time Q(log p). Since each process is assigned n^{2}/p elements of the D^{(}^{k}^{)} matrix, the time to compute corresponding D^{(}^{k}^{)} values is Q(n^{2}/p). Therefore, the parallel run time of the 2-D block mapping formulation of Floyd's algorithm is Since the sequential run time is W = Q(n^{3}), the speedup and efficiency are as follows: From Equation 10.6 we see that for a cost-optimal formulation ; thus, 2-D block mapping can efficiently use up to O(n^{2}/log^{2}n) processes. Equation 10.6 can also be used to derive the isoefficiency function due to communication, which is Q(p^{1.5} log^{3} p). The isoefficiency function due to concurrency is Q(p^{1}^{.}^{5}). Thus, the overall isoefficiency function is Q(p^{1.5} log^{3} p). Speeding Things Up In the 2-D block mapping formulation of Floyd's algorithm, a synchronization step ensures that all processes have the appropriate segments of matrix D^{(}^{k}^{-1)} before computing elements of matrix D^{(}^{k}^{)} (line 7 in Algorithm 10.4). In other words, the k^{th} iteration starts only when the (k - 1)^{th} iteration has completed and the relevant parts of matrix D^{(}^{k}^{-1)} have been transmitted to all processes. The synchronization step can be removed without affecting the correctness of the algorithm. To accomplish this, a process starts working on the k^{th} iteration as soon as it has computed the (k -1)^{th} iteration and has the relevant parts of the D^{(}^{k}^{-1)} matrix. This formulation is called pipelined 2-D block mapping. A similar technique is used in Section 8.3 to improve the performance of Gaussian elimination. Algorithm 10.4 Floyd's parallel formulation using the 2-D block mapping. P_{*,}_{j} denotes all the processes in the j^{th} column, and P_{i}_{,*} denotes all the processes in the i^{th} row. The matrix D^{(0)} is the adjacency matrix.1. procedure FLOYD_2DBLOCK(D^{(0)}) 2. begin 3. for k := 1 to n do 4. begin 5. each process P_{i}_{,}_{j} that has a segment of the k^{th} row of D^{(}^{k}^{-1)}; broadcasts it to the P_{*,}_{j} processes; 6. each process P_{i}_{,}_{j} that has a segment of the k^{th} column of D^{(}^{k}^{-1)}; broadcasts it to the P_{i}_{,*} processes; 7. each process waits to receive the needed segments; 8. each process P_{i}_{,}_{j} computes its part of the D^{(}^{k}^{)} matrix; 9. end 10. end FLOYD_2DBLOCK Consider a p-process system arranged in a two-dimensional topology. Assume that process P_{i}_{,}_{j} starts working on the k^{th} iteration as soon as it has finished the (k - 1)^{th} iteration and has received the relevant parts of the D^{(}^{k}^{-1)} matrix. When process P_{i}_{,}_{j} has elements of the k^{th} row and has finished the (k - 1)^{th} iteration, it sends the part of matrix D^{(}^{k}^{-1)} stored locally to processes P_{i}_{,}_{j} _{-1} and P_{i}_{,} _{j} _{+1}. It does this because that part of the D^{(}^{k}^{-1)} matrix is used to compute the D^{(}^{k}^{)} matrix. Similarly, when process P_{i}_{,}_{j} has elements of the k^{th} column and has finished the (k - 1)^{th} iteration, it sends the part of matrix D^{(}^{k}^{-1)} stored locally to processes P_{i} _{-1,} _{j} and P_{i} _{+1,} _{j}. When process P_{i}_{,}_{j} receives elements of matrix D^{(}^{k}^{)} from a process along its row in the logical mesh, it stores them locally and forwards them to the process on the side opposite from where it received them. The columns follow a similar communication protocol. Elements of matrix D^{(}^{k}^{)} are not forwarded when they reach a mesh boundary. Figure 10.9 illustrates this communication and termination protocol for processes within a row (or a column). Figure 10.9. Communication protocol followed in the pipelined 2-D block mapping formulation of Floyd's algorithm. Assume that process 4 at time t has just computed a segment of the k^{th} column of the D^{(}^{k}^{-1)} matrix. It sends the segment to processes 3 and 5. These processes receive the segment at time t + 1 (where the time unit is the time it takes for a matrix segment to travel over the communication link between adjacent processes). Similarly, processes farther away from process 4 receive the segment later. Process 1 (at the boundary) does not forward the segment after receiving it.Consider the movement of values in the first iteration. In each step, elements of the first row are sent from process P_{i}_{,}_{j} to P_{i} _{+1,}_{j}. Similarly, elements of the first column are sent from process P_{i}_{,}_{j} to process P_{i}_{,}_{j}_{+1}. Each such step takes time Q(). After Q steps, process gets the relevant elements of the first row and first column in time Q(n). The values of successive rows and columns follow after time Q(n^{2}/p) in a pipelined mode. Hence, process finishes its share of the shortest path computation in time Q(n^{3}/p) + Q(n). When process has finished the (n - 1)^{th} iteration, it sends the relevant values of the n^{th} row and column to the other processes. These values reach process P_{1,1} in time Q(n). The overall parallel run time of this formulation is Since the sequential run time is W = Q(n^{3}), the speedup and efficiency are as follows:
From Equation 10.7 we see that for a cost-optimal formulation p/n^{2} = O (1). Thus, the pipelined formulation of Floyd's algorithm uses up to O (n^{2}) processes efficiently. Also from Equation 10.7, we can derive the isoefficiency function due to communication, which is Q(p^{1.5}). This is the overall isoefficiency function as well. Comparing the pipelined formulation to the synchronized 2-D block mapping formulation, we see that the former is significantly faster. 10.4.3 Performance ComparisonsThe performance of the all-pairs shortest paths algorithms previously presented is summarized in Table 10.1 for a parallel architecture with O (p) bisection bandwidth. Floyd's pipelined formulation is the most scalable and can use up to Q(n^{2}) processes to solve the problem in time Q(n). Moreover, this parallel formulation performs equally well even on architectures with bisection bandwidth O , such as a mesh-connected computer. Furthermore, its performance is independent of the type of routing (store-and-forward or cut-through). |