12.2 Serial Monadic DP Formulations

We can solve many problems by using serial monadic DP formulations. This section discusses the shortest-path problem for a multistage graph and the 0/1 knapsack problem. We present parallel algorithms for both and point out the specific properties that influence these parallel formulations.

12.2.1 The Shortest-Path Problem

Consider a weighted multistage graph of r + 1 levels, as shown in Figure 12.3. Each node at level i is connected to every node at level i + 1. Levels zero and r contain only one node, and every other level contains n nodes. We refer to the node at level zero as the starting node S and the node at level r as the terminating node R. The objective of this problem is to find the shortest path from S to R. The i-th node at level l in the graph is labeled $v_i^l$. The cost of an edge connecting $v_i^l$ to node $v_j^{l+1}$ is labeled $c_{i,j}^l$. The cost of reaching the goal node R from any node $v_i^l$ is represented by $C_i^l$. If there are n nodes at level l, the vector $[C_0^l, C_1^l, \ldots, C_{n-1}^l]^T$ is referred to as $C^l$. The shortest-path problem reduces to computing $C^0$. Since the graph has only one starting node, $C^0 = [C_0^0]$.

The structure of the graph is such that any path from $v_i^l$ to R includes a node $v_j^{l+1}$ ($0 \le j \le n-1$). The cost of any such path is the sum of the cost of the edge between $v_i^l$ and $v_j^{l+1}$ and the cost of the shortest path between $v_j^{l+1}$ and R (which is given by $C_j^{l+1}$). Thus, $C_i^l$, the cost of the shortest path between $v_i^l$ and R, is equal to the minimum cost over all paths through each node in level l + 1. Therefore,

Equation 12.2:
$$C_i^l = \min_{0 \le j \le n-1} \left\{ c_{i,j}^l + C_j^{l+1} \right\}$$

Figure 12.3. An example of a serial monadic DP formulation for finding the shortest path in a graph whose nodes can be organized into levels.

Since each node $v_j^{r-1}$ has only one edge connecting it to the goal node R at level r, the cost $C_j^{r-1}$ is equal to $c_{j,R}^{r-1}$. Hence,

Equation 12.3:
$$C^{r-1} = \left[ c_{0,R}^{r-1},\, c_{1,R}^{r-1},\, \ldots,\, c_{n-1,R}^{r-1} \right]^T$$

Because Equation 12.2 contains only one recursive term in its right-hand side, it is a monadic formulation. Note that the solution to a subproblem requires solutions to subproblems only at the immediately preceding level.
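The level-by-level recursion above can be sketched in Python (an illustrative choice; the input layout — a list `M` of interior edge-cost matrices, a vector `last` of level-(r-1)-to-R costs, and a vector `first` of S-to-level-1 costs — and all names are assumptions, not from the text):

```python
# Minimal sketch of the backward recursion C^l from C^{l+1}.
# M[l][i][j] plays the role of the edge cost c_{i,j}^l for interior levels;
# `last` is C^{r-1} (Equation 12.3); `first` holds the costs out of S.

def multistage_shortest_path(M, last, first):
    """Return C_0^0, the shortest S-to-R path cost."""
    C = list(last)                      # start from C^{r-1}
    for Ml in reversed(M):              # compute C^l from C^{l+1}
        C = [min(Ml[i][j] + C[j] for j in range(len(C)))
             for i in range(len(Ml))]   # Equation 12.2, one entry per node
    # level 0 contains only the starting node S
    return min(first[j] + C[j] for j in range(len(C)))
```

Each pass of the loop solves all subproblems at one level using only the immediately following level, which is exactly the serial structure of the formulation.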
Consequently, this is a serial monadic formulation.

Using this recursive formulation of the shortest-path problem, the cost of reaching the goal node R from any node at level l (0 < l < r - 1) is

$$C_i^l = \min\left\{ (c_{i,0}^l + C_0^{l+1}),\; (c_{i,1}^l + C_1^{l+1}),\; \ldots,\; (c_{i,n-1}^l + C_{n-1}^{l+1}) \right\}$$

Now consider the operation of multiplying a matrix with a vector. In the matrix-vector product, if the addition operation is replaced by minimization and the multiplication operation is replaced by addition, the preceding set of equations is equivalent to

Equation 12.4:
$$C^l = M_{l,l+1} \times C^{l+1}$$

where $C^l$ and $C^{l+1}$ are n x 1 vectors representing the cost of reaching the goal node from each node at levels l and l + 1, and $M_{l,l+1}$ is an n x n matrix in which entry (i, j) stores the cost of the edge connecting node i at level l to node j at level l + 1. This matrix is

$$M_{l,l+1} = \begin{bmatrix} c_{0,0}^l & c_{0,1}^l & \cdots & c_{0,n-1}^l \\ c_{1,0}^l & c_{1,1}^l & \cdots & c_{1,n-1}^l \\ \vdots & \vdots & \ddots & \vdots \\ c_{n-1,0}^l & c_{n-1,1}^l & \cdots & c_{n-1,n-1}^l \end{bmatrix}$$

The shortest-path problem has thus been reformulated as a sequence of matrix-vector multiplications. On a sequential computer, the DP formulation starts by computing $C^{r-1}$ from Equation 12.3, and then computes $C^{r-k-1}$ for k = 1, 2, ..., r - 2 using Equation 12.4. Finally, $C^0$ is computed using Equation 12.2. Since there are n nodes at each level, the cost of computing each vector $C^l$ is Θ(n^2). The parallel algorithm for this problem can be derived using the parallel algorithms for the matrix-vector product discussed in Section 8.1. For example, Θ(n) processing elements can compute each vector $C^l$ in time Θ(n) and solve the entire problem in time Θ(rn). Recall that r is the number of levels in the graph.

Many serial monadic DP formulations with dependency graphs identical to the one considered here can be parallelized using a similar parallel algorithm. For certain dependency graphs, however, this formulation is unsuitable. Consider a graph in which each node at a level can be reached from only a small fraction of the nodes at the previous level. Then matrix $M_{l,l+1}$ contains many elements with value ∞.
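Under this min-plus reading of the matrix-vector product, one step of the reformulation can be sketched as follows (Python for illustration; the function name is an assumption, and missing edges would be represented by `float("inf")` entries):

```python
# One "matrix-vector product" in which addition is replaced by min and
# multiplication by +: computes C^l from M_{l,l+1} and C^{l+1}.

def min_plus_matvec(M, C_next):
    """Return C^l given the n x n cost matrix M and the vector C^{l+1}."""
    return [min(M[i][j] + C_next[j] for j in range(len(C_next)))
            for i in range(len(M))]
```

Applying this operation once per level, from level r - 1 back to level 0, reproduces the sequence of matrix-vector multiplications described above.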
In this case, matrix $M_{l,l+1}$ is considered to be a sparse matrix with respect to the minimization and addition operations. This is because, for all x, x + ∞ = ∞, and min{x, ∞} = x. Therefore, the addition and minimization operations need not be performed for entries whose value is ∞. If we use a regular dense matrix-vector multiplication algorithm, the computational complexity of each matrix-vector multiplication becomes significantly higher than that of the corresponding sparse matrix-vector multiplication. Consequently, we must use a sparse matrix-vector multiplication algorithm to compute each vector.

12.2.2 The 0/1 Knapsack Problem

A one-dimensional 0/1 knapsack problem is defined as follows. We are given a knapsack of capacity c and a set of n objects numbered 1, 2, ..., n. Each object i has weight $w_i$ and profit $p_i$. Object profits and weights are integers. Let $v = [v_1, v_2, \ldots, v_n]$ be a solution vector in which $v_i = 0$ if object i is not in the knapsack, and $v_i = 1$ if it is in the knapsack. The goal is to find a subset of objects to put into the knapsack so that $\sum_{i=1}^{n} w_i v_i \le c$ (that is, the objects fit into the knapsack) and $\sum_{i=1}^{n} p_i v_i$ is maximized (that is, the profit is maximized).

A straightforward method to solve this problem is to consider all $2^n$ possible subsets of the n objects and choose the one that fits into the knapsack and maximizes the profit. Here we provide a DP formulation that is faster than this simple method when c = O(2^n / n). Let F[i, x] be the maximum profit for a knapsack of capacity x using only objects {1, 2, ..., i}. Then F[n, c] is the solution to the problem. The DP formulation for this problem is as follows:

$$F[i, x] = \begin{cases} 0 & \text{if } i = 0 \text{ or } x = 0, \\ F[i-1, x] & \text{if } 0 < x < w_i, \\ \max\{F[i-1, x],\; F[i-1, x - w_i] + p_i\} & \text{if } x \ge w_i. \end{cases}$$

This recursive equation yields a knapsack of maximum profit.
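As a concrete check of the recursion, here is a hedged serial sketch (Python for illustration; the bottom-up table layout with zero-valued boundary row and column is an implementation choice, not part of the formulation):

```python
# Minimal sketch of the 0/1 knapsack recursion.
# F[i][x] is the maximum profit using objects 1..i with capacity x;
# w and p are 0-indexed lists of weights and profits.

def knapsack(w, p, c):
    """Return F[n, c] for integer weights w, profits p, capacity c."""
    n = len(w)
    F = [[0] * (c + 1) for _ in range(n + 1)]   # F[0][*] = F[*][0] = 0
    for i in range(1, n + 1):                   # one level per object
        for x in range(1, c + 1):
            if x >= w[i - 1]:                   # object i fits: best of the two choices
                F[i][x] = max(F[i - 1][x], F[i - 1][x - w[i - 1]] + p[i - 1])
            else:                               # object i does not fit
                F[i][x] = F[i - 1][x]
    return F[n][c]
```

Each entry is derived from exactly two entries of the previous row, the dependency pattern discussed next.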
When the current capacity of the knapsack is x, the decision to include object i can lead to one of two situations: (i) the object is not included, the knapsack capacity remains x, and the profit is unchanged; (ii) the object is included, the knapsack capacity becomes x - $w_i$, and the profit increases by $p_i$. The DP algorithm decides whether or not to include an object based on which choice leads to maximum profit.

The sequential algorithm for this DP formulation maintains a table F of size n x c. The table is constructed in row-major order. The algorithm first determines the maximum profit obtainable by using only the first object with knapsacks of different capacities. This corresponds to filling the first row of the table. Filling an entry in a subsequent row requires two entries from the previous row: one from the same column and one from the column offset by the weight of the current object. Thus, the computation of an arbitrary entry F[i, j] requires F[i - 1, j] and F[i - 1, j - $w_i$]. This is illustrated in Figure 12.4. Computing each entry takes constant time, so the sequential run time of this algorithm is Θ(nc).

Figure 12.4. Computing entries of table F for the 0/1 knapsack problem. The computation of entry F[i, j] requires communication with processing elements containing entries F[i - 1, j] and F[i - 1, j - $w_i$].

This formulation is a serial monadic formulation. The subproblems F[i, x] are organized into n levels for i = 1, 2, ..., n. The computation of the subproblems in level i depends only on the subproblems at level i - 1; hence the formulation is serial. The formulation is monadic because each of the two alternate solutions of F[i, x] depends on only one subproblem. Furthermore, the dependencies between levels are sparse because a problem at one level depends on only two subproblems from the previous level.

Consider a parallel formulation of this algorithm on a CREW PRAM with c processing elements labeled $P_0$ to $P_{c-1}$.
Processing element $P_{r-1}$ computes the r-th column of matrix F. When computing F[j, r] during iteration j, processing element $P_{r-1}$ requires the values F[j - 1, r] and F[j - 1, r - $w_j$]. Processing element $P_{r-1}$ can read any element of matrix F in constant time, so computing F[j, r] also requires constant time. Therefore, each iteration takes constant time. Since there are n iterations, the parallel run time is Θ(n). The formulation uses c processing elements; hence its processor-time product is Θ(nc). Therefore, the algorithm is cost-optimal.

Let us now consider its formulation on a distributed-memory machine with c processing elements. Table F is distributed among the processing elements so that each processing element is responsible for one column. This is illustrated in Figure 12.4. Each processing element locally stores the weights and profits of all objects. In the j-th iteration, for computing F[j, r] at processing element $P_{r-1}$, F[j - 1, r] is available locally, but F[j - 1, r - $w_j$] must be fetched from another processing element. This corresponds to the circular $w_j$-shift operation described in Section 4.6. The time taken by this circular shift operation on p processing elements is bounded by $(t_s + t_w m) \log p$ for a message of size m on a network with adequate bandwidth. Since the size of the message is one word and we have p = c, this time is given by $(t_s + t_w) \log c$. If the sum and maximization operations take time $t_c$, then each iteration takes time $t_c + (t_s + t_w) \log c$. Since there are n such iterations, the total time is given by O(n log c). The processor-time product for this formulation is O(nc log c); therefore, the algorithm is not cost-optimal.

Let us see what happens to this formulation as we increase the number of elements per processor. Using p processing elements, each processing element computes c/p elements of the table in each iteration.
In the j-th iteration, processing element $P_0$ computes the values of elements F[j, 1], ..., F[j, c/p], processing element $P_1$ computes the values of elements F[j, c/p + 1], ..., F[j, 2c/p], and so on. Computing the value of F[j, k], for any k, requires the values F[j - 1, k] and F[j - 1, k - $w_j$]. The required values of the F table can be fetched from remote processing elements by performing a circular shift. Depending on the values of $w_j$ and p, the required nonlocal values may be available from one or two processing elements. Note that the total number of words communicated via these messages is c/p, irrespective of whether they come from one or two processing elements. The time for this operation is at most $(2t_s + t_w c/p)$, assuming that c/p is large and the network has enough bandwidth (Section 4.6). Since each processing element computes c/p such elements, the total time for each iteration is $t_c c/p + 2t_s + t_w c/p$. Therefore, the parallel run time of the algorithm for n iterations is $n(t_c c/p + 2t_s + t_w c/p)$. In asymptotic terms, this algorithm's parallel run time is O(nc/p). Its processor-time product is O(nc), which is cost-optimal. There is an upper bound on the efficiency of this formulation because the amount of data that needs to be communicated is of the same order as the amount of computation. This upper bound is determined by the values of $t_w$ and $t_c$ (Problem 12.1).
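The blocked formulation can be illustrated with a small shared-memory simulation (Python for illustration; this only mimics the data movement of the circular $w_j$-shift and is not a message-passing implementation — reads of `prev` stand in for values delivered by the shift):

```python
# Illustrative simulation of the blocked distributed formulation:
# p "processing elements" each own a contiguous block of c/p columns.
# In iteration j, each element updates its block of row F[j, *] from
# row F[j-1, *]; the read of prev[k - wj] models the circularly
# shifted value that would arrive from a remote element.

def knapsack_blocked(w, profits, c, p):
    assert c % p == 0
    b = c // p                                    # columns per element
    prev = [0] * (c + 1)                          # row F[j-1, *]; F[0, *] = 0
    for j in range(len(w)):
        wj, pj = w[j], profits[j]
        cur = [0] * (c + 1)
        for pe in range(p):                       # each element's block of columns
            for k in range(pe * b + 1, (pe + 1) * b + 1):
                if k >= wj:                       # prev[k - wj] may be nonlocal
                    cur[k] = max(prev[k], prev[k - wj] + pj)
                else:
                    cur[k] = prev[k]
        prev = cur
    return prev[c]
```

Because each element touches c/p entries per iteration, the per-iteration work in this sketch matches the $t_c c/p$ computation term of the analysis above.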