11.7 Bibliographic Remarks
Extensive literature is available on search algorithms for discrete optimization techniques such as branch-and-bound and heuristic search [KK88a, LW66, Pea84]. The relationship between branch-and-bound search, dynamic programming, and heuristic search techniques in artificial intelligence is explored by Kumar and Kanal [KK83, KK88b]. The average time complexity of heuristic search algorithms for many problems is shown to be polynomial by Smith [Smi84] and Wilf [Wil86]. Extensive work has been done on parallel formulations of search algorithms. We briefly outline some of these contributions.
Parallel Depth-First Search Algorithms
Many parallel algorithms for DFS have been formulated [AJM88, FM87, KK94, KGR94, KR87b, MV87, Ran91, Rao90, SK90, SK89, Vor87a]. Load balancing is the central issue in parallel DFS. In this chapter, distribution of work in parallel DFS was done using stack splitting [KGR94, KR87b]. An alternative scheme for work distribution is node splitting, in which only a single node is given out at a time [FK88, FTI90, Ran91].
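The stack-splitting idea can be sketched in a few lines. This is a minimal illustrative variant, not the exact scheme of [KGR94, KR87b]: the donor's DFS stack is viewed as a list of levels of untried sibling nodes, and roughly half the untried nodes at each level are given away. Donating from every level tends to hand over nodes near the root, which usually represent larger subtrees.

```python
# Illustrative sketch of stack splitting for parallel DFS (not the exact
# scheme of the cited papers). A DFS stack is modeled as a list of levels,
# each holding the untried sibling nodes at that depth.

def split_stack(stack):
    """Split a DFS stack between the donor and a requesting processor.

    stack: list of levels, each a list of untried sibling nodes.
    Returns (kept, donated), two stacks of the same depth.
    """
    kept, donated = [], []
    for level in stack:
        half = len(level) // 2
        donated.append(level[:half])  # give away about half of each level
        kept.append(level[half:])     # donor keeps the rest
    return kept, donated
```

Node splitting, by contrast, would transfer only a single untried node per request, which keeps transfers cheap but may force many more of them.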
This chapter discussed formulations of state-space search in which a processor requests work when it goes idle. Such load-balancing schemes are called receiver-initiated schemes. In other load-balancing schemes, a processor that has work gives away part of its work to another processor (with or without receiving a request). These schemes are called sender-initiated schemes.
Several researchers have used receiver-initiated load-balancing schemes in parallel DFS [FM87, KR87b, KGR94]. Kumar et al. [KGR94] analyze these load-balancing schemes, including global round robin, random polling, asynchronous round robin, and nearest neighbor. The description and analysis of these schemes in Section 11.4 are based on the papers by Kumar et al. [KGR94, KR87b].
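The schemes differ only in how an idle processor chooses which processor to ask for work. A hedged sketch of three of the donor-selection rules follows; the function names and interface are illustrative, not taken from the cited papers, and nearest neighbor is omitted since it depends on the interconnection topology. In practice a requester also skips itself, which is left out here for brevity.

```python
import random

# Illustrative donor-selection rules for receiver-initiated load balancing.
# p is the number of processors; next_target(i) returns the processor that
# idle processor i should poll for work.

def make_selector(scheme, p, seed=0):
    rng = random.Random(seed)
    global_counter = [0]       # one shared counter: global round robin
    local_counter = [0] * p    # one counter per processor: asynchronous RR

    def next_target(i):
        if scheme == "global_round_robin":
            t = global_counter[0]
            global_counter[0] = (t + 1) % p      # shared pointer advances
        elif scheme == "asynchronous_round_robin":
            t = local_counter[i]
            local_counter[i] = (t + 1) % p       # each requester cycles alone
        elif scheme == "random_polling":
            t = rng.randrange(p)                 # uniformly random target
        else:
            raise ValueError(scheme)
        return t
    return next_target
```

Global round robin spreads requests evenly but needs a shared counter (a contention point); asynchronous round robin and random polling avoid shared state at the cost of less even request distribution.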
Parallel DFS using sender-initiated load balancing has been proposed by some researchers [FK88, FTI90, PFK90, Ran91, SK89]. Furuichi et al. propose the single-level and multilevel sender-based schemes [FTI90]. Kimura and Nobuyuki [KN91] present a scalability analysis of these schemes. Ferguson and Korf [FK88, PFK90] present a load-balancing scheme called distributed tree search (DTS).
Other techniques using randomized allocation have been presented for parallel DFS of state-space trees [KP92, Ran91, SK89, SK90]. Issues relating to granularity control in parallel DFS have also been explored [RK87, SK89].
Saletore and Kale [SK90] present a formulation of parallel DFS in which nodes are assigned priorities and are expanded accordingly. They show that the search overhead factor of this prioritized DFS formulation is very close to one, allowing it to yield consistently increasing speedups with an increasing number of processors for sufficiently large problems.
In some parallel formulations of depth-first search, the state space is searched independently in a random order by different processors [JAM87, JAM88]. Challou et al. [CGK93] and Ertel [Ert92] show that such methods are useful for solving robot motion planning and theorem proving problems, respectively.
Most generic DFS formulations apply to depth-first branch-and-bound and IDA*. Some researchers have specifically studied parallel formulations of depth-first branch-and-bound [AKR89, AKR90, EDH80]. Many parallel formulations of IDA* have been proposed [RK87, RKR87, KS91a, PKF92, MD92].
Most of the parallel DFS formulations are suited only for MIMD computers. Due to the nature of the search problem, SIMD computers were considered inherently unsuitable for parallel search. However, work by Frye and Myczkowski [FM92], Powley et al. [PKF92], and Mahanti and Daniels [MD92] showed that parallel depth-first search techniques can be developed even for SIMD computers. Karypis and Kumar [KK94] presented parallel DFS schemes for SIMD computers that are as scalable as the schemes for MIMD computers.
Several researchers have experimentally evaluated parallel DFS. Finkel and Manber [FM87] present performance results for problems such as the traveling salesman problem and the knight's tour for the Crystal multicomputer developed at the University of Wisconsin. Monien and Vornberger [MV87] show linear speedups on a network of transputers for a variety of combinatorial problems. Kumar et al. [AKR89, AKR90, AKRS91, KGR94] show linear speedups for problems such as the 15-puzzle, tautology verification, and automatic test pattern generation for various architectures such as a 128-processor BBN Butterfly, a 128-processor Intel iPSC, a 1024-processor nCUBE 2, and a 128-processor Symult 2010. Kumar, Grama, and Rao [GKR91, KGR94, KR87b, RK87] have investigated the scalability and performance of many of these schemes for hypercubes, meshes, and networks of workstations. Experimental results in Section 11.4.5 are taken from the paper by Kumar, Grama, and Rao [KGR94].
Parallel formulations of DFBB have also been investigated by several researchers. Many of these formulations are based on maintaining a current best solution, which is used as a global bound. It has been shown that the overhead for maintaining the current best solution tends to be a small fraction of the overhead for dynamic load balancing. Parallel formulations of DFBB have been shown to yield near linear speedups for many problems and architectures [ST95, LM97, Eck97, Eck94, AKR89].
Many researchers have proposed termination detection algorithms for use in parallel search. Dijkstra [DSG83] proposed the ring termination detection algorithm. The termination detection algorithm based on weights, discussed in Section 11.4.4, is similar to the one proposed by Rokusawa et al. [RICN88]. Dutt and Mahapatra [DM93] discuss the termination detection algorithm based on minimum spanning trees.
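The weight-based idea can be illustrated with a short sketch. This is a simplified model in the spirit of Section 11.4.4, not the exact protocol of [RICN88]: all weight (1) starts at processor 0; part of a sender's weight travels with every piece of work; a processor that goes idle returns its weight to processor 0; and termination is declared when processor 0 is idle and again holds weight 1. Exact fractions avoid the floating-point underflow that a naive implementation would suffer.

```python
from fractions import Fraction

# Simplified model of weight-based termination detection for parallel search.
class Weights:
    def __init__(self, p):
        self.w = [Fraction(0)] * p
        self.w[0] = Fraction(1)          # all weight starts at processor 0

    def send_work(self, src, dst):
        half = self.w[src] / 2           # half the weight travels with the work
        self.w[src] -= half
        self.w[dst] += half

    def go_idle(self, i):
        if i != 0:                       # idle processors return their weight
            self.w[0] += self.w[i]
            self.w[i] = Fraction(0)

    def terminated(self):
        return self.w[0] == 1            # all weight back home => no work left
```

Since weight is conserved, processor 0 holding weight 1 implies that no work (and no weight-carrying message) is outstanding anywhere in the system.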
Parallel Formulations of Alpha-Beta Search
Alpha-beta search is essentially a depth-first branch-and-bound search technique that finds an optimal solution tree of an AND/OR graph [KK83, KK88b]. Many researchers have developed parallel formulations of alpha-beta search [ABJ82, Bau78, FK88, FF82, HB88, Lin83, MC82, MFMV90, MP85, PFK90]. Some of these methods have shown reasonable speedups on dozens of processors [FK88, MFMV90, PFK90].
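Since alpha-beta is a depth-first technique, a minimal sequential version fits in a few lines; the parallel formulations cited above essentially partition this recursion among processors. The sketch below uses nested lists as a toy game-tree encoding (an assumption for illustration, not a representation from the cited papers), with numbers as leaf evaluations.

```python
# Minimal sequential alpha-beta search on a toy game tree.
# A node is either a number (leaf value) or a list of child nodes.

def alphabeta(node, alpha=-float("inf"), beta=float("inf"), maximizing=True):
    if not isinstance(node, list):
        return node                      # leaf: static evaluation
    if maximizing:
        value = -float("inf")
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:            # beta cutoff: opponent avoids this line
                break
    else:
        value = float("inf")
        for child in node:
            value = min(value, alphabeta(child, alpha, beta, True))
            beta = min(beta, value)
            if alpha >= beta:            # alpha cutoff
                break
    return value
```

The cutoffs are exactly what makes parallelization hard: the bounds that prune one subtree are produced while searching another, so naive parallel expansion does wasted work that a sequential search would have pruned.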
The utility of parallel processing has been demonstrated in the context of a number of games, and in particular, chess. Work on large-scale parallel alpha-beta search led to the development of Deep Thought [Hsu90] in 1990. This program was capable of playing chess at grandmaster level. Subsequent advances in the use of dedicated hardware, parallel processing, and algorithms resulted in the development of IBM's Deep Blue [HCH95, HG97], which beat the reigning world champion Garry Kasparov. Feldmann et al. [FMM94] developed a distributed chess program that is acknowledged to be one of the best computer chess players based entirely on general-purpose hardware.
Parallel Best-First Search
Many researchers have investigated parallel formulations of A* and branch-and-bound algorithms [KK84, KRR88, LK85, MV87, Qui89, HD89a, Vor86, WM84, Rao90, GKP92, AM88, CJP83, KB57, LP92, Rou87, PC89, PR89, PR90, PRV88, Ten90, MRSR92, Vor87b, Moh83, MV85, HD87]. All these formulations use different data structures to store the open list. Some formulations use the centralized strategy [Moh83, HD87]; others use distributed strategies such as the random communication strategy [Vor87b, Dal87, KRR88], the ring communication strategy [Vor86, WM84], and the blackboard communication strategy [KRR88]. Kumar et al. [KRR88] experimentally evaluated the centralized strategy and some distributed strategies in the context of the traveling salesman problem, the vertex cover problem, and the 15-puzzle. Dutt and Mahapatra [DM93, MD93] have proposed and evaluated a number of other communication strategies.
Manzini analyzed the hashing technique for distributing nodes in parallel graph search [MS90]. Evett et al. [EHMN90] proposed parallel retracting A* (PRA*), which operates under limited-memory conditions. In this formulation, each node is hashed to a unique processor. If a processor receives more nodes than it can store locally, it retracts nodes with poorer heuristic values. These retracted nodes are reexpanded when more promising nodes fail to yield a solution.
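The node-placement idea in PRA* can be sketched briefly. This is an illustrative reconstruction, not the implementation of [EHMN90]: each generated node is hashed to a unique owner processor, and an owner whose local store is full retracts the node with the poorest heuristic value. The class and function names are hypothetical.

```python
import heapq

# Illustrative sketch of PRA*-style node placement under limited memory.

def owner(node, p):
    """Hash a node to its unique owner among p processors."""
    return hash(node) % p

class BoundedStore:
    """A processor's bounded local node store; evicts the worst node."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.heap = []                    # max-heap on h via negated keys

    def insert(self, node, h):
        heapq.heappush(self.heap, (-h, node))
        if len(self.heap) > self.capacity:
            # retract the node with the poorest (largest) heuristic value;
            # in PRA* it would be reexpanded later if needed
            _, evicted = heapq.heappop(self.heap)
            return evicted
        return None
```

Hashing every node to a fixed owner doubles as duplicate detection: all copies of a state land on the same processor, where they can be merged.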
Karp and Zhang [KZ88] analyze the performance of parallel best-first branch-and-bound (that is, A*) by using a random distribution of nodes for a specific model of search trees. Renolet et al. [RDK89] use Monte Carlo simulations to model the performance of parallel best-first search. Wah and Yu [WY85] present stochastic models to analyze the performance of parallel formulations of depth-first branch-and-bound and best-first branch-and-bound search.
Bixby [Bix91] presents a parallel branch-and-cut algorithm to solve the symmetric traveling salesman problem. He also presents solutions of the LP relaxations of airline crew-scheduling models. Miller et al. [Mil91] present parallel formulations of the best-first branch-and-bound technique for solving the asymmetric traveling salesman problem on heterogeneous network computer architectures. Roucairol [Rou91] presents parallel best-first branch-and-bound formulations for shared-address-space computers and uses them to solve the multiknapsack and quadratic-assignment problems.
Speedup Anomalies in Parallel Formulations of Search Algorithms
Many researchers have analyzed speedup anomalies in parallel search algorithms [IYF79, LS84, Kor81, LW86, MVS86, RKR87]. Lai and Sahni [LS84] present early work quantifying speedup anomalies in best-first search. Lai and Sprague [LS86] present enhancements and extensions to this work. Lai and Sprague [LS85] also present an analytical model and derive characteristics of the lower-bound function for which anomalies are guaranteed not to occur as the number of processors is increased. Li and Wah [LW84, LW86] and Wah et al. [WLY84] investigate dominance relations and heuristic functions and their effect on detrimental (speedup of less than one using p processors) and acceleration anomalies. Quinn and Deo [QD86] derive an upper bound on the speedup attainable by any parallel formulation of the branch-and-bound algorithm using the best-bound search strategy. Rao and Kumar [RK88b, RK93] analyze the average speedup in parallel DFS for two separate models with and without heuristic ordering information. They show that the search overhead factor in these cases is at most one. Section 11.6.1 is based on the results of Rao and Kumar [RK93].
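An acceleration anomaly is easy to reproduce in a toy setting. In the sketch below (an illustrative construction, not an example from the cited papers), the goal sits at the front of the right subtree, so serial DFS must exhaust the left subtree first; with two processors, one per subtree, the goal is found almost immediately, and the speedup on two processors far exceeds two. Node expansions stand in for time, and the nested-list tree encoding is an assumption of the sketch.

```python
# Toy demonstration of an acceleration (superlinear-speedup) anomaly in
# parallel DFS. A node is either a leaf value or a list of child nodes.

def dfs_steps(tree, goal):
    """Count node expansions until depth-first search reaches `goal`."""
    steps, stack = 0, [tree]
    while stack:
        node = stack.pop()
        steps += 1
        if node == goal:
            return steps
        if isinstance(node, list):
            stack.extend(reversed(node))  # preserve left-to-right order
    return steps

left = list(range(1000))    # large, goal-free left subtree
right = ["G", "x"]          # goal at the front of the right subtree
t_serial = dfs_steps([left, right], "G")   # one processor searches everything
t_parallel = dfs_steps(right, "G")         # processor assigned the right subtree
assert t_serial / t_parallel > 2           # speedup > 2 on 2 processors
```

The anomaly arises because the parallel run expands nodes a serial run would only reach much later; a detrimental anomaly is the mirror case, where the parallel run wastes effort on nodes the serial run never visits.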
Role of Heuristics
Heuristics form the most important component of search techniques, and parallel formulations of search algorithms must be viewed in the context of these heuristics. In BFS techniques, heuristics focus search by lowering the effective branching factor. In DFBB methods, heuristics provide better bounds, and thus serve to prune the search space.
Often, there is a tradeoff between the strength of the heuristic and the effective size of the search space. Better heuristics result in smaller search spaces but are also more expensive to compute. For example, an important application of strong heuristics is in the computation of bounds for mixed integer programming (MIP). Mixed integer programming has seen significant advances over the years [JNS97]. Whereas 15 years ago, MIP problems with 100 integer variables were considered challenging, today, many problems with up to 1000 integer variables can be solved on workstation-class machines using branch-and-cut methods. The largest known instances of TSPs and QAPs have been solved using branch-and-bound with powerful heuristics [BMCP98, MP93]. The presence of effective heuristics may prune the search space considerably. For example, when Padberg and Rinaldi introduced the branch-and-cut algorithm in 1987, they used it to solve a 532-city TSP, which was the largest TSP solved optimally at that time. Subsequent improvements to the method led to the solution of a 2392-city problem [PR91]. More recently, using cutting planes, problems with over 7000 cities have been solved [JNS97] on serial machines. However, for many problems of interest, the reduced search space still requires the use of parallelism [BMCP98, MP93, Rou87, MMR95]. Use of powerful heuristics combined with effective parallel processing has enabled the solution of extremely large problems [MP93]. For example, QAP problems from the Nugent and Eschermann test suites with up to 4.8 x 10^10 nodes (Nugent22 with Gilmore-Lawler bound) were solved on a NEC Cenju-3 parallel computer in under nine days [BMCP98]. Another striking demonstration of this was IBM's Deep Blue, which used a combination of dedicated hardware for generating and evaluating board positions and parallel search of the game tree using an IBM SP2 to beat the reigning world chess champion, Garry Kasparov [HCH95, HG97].
Heuristics have several important implications for exploiting parallelism. Strong heuristics narrow the state space and thus reduce the concurrency available in the search space. Use of powerful heuristics poses other computational challenges for parallel processing as well. For example, in branch-and-cut methods, a cut generated at a certain state may be required by other states. Therefore, in addition to balancing load, the parallel branch-and-cut formulation must also partition cuts among processors so that processors working in certain LP domains have access to the desired cuts [BCCL95, LM97, Eck97].
In addition to the inter-node parallelism that has been discussed up to this point, intra-node parallelism can become a viable option if the heuristic is computationally expensive. For example, the assignment heuristic of Pekny et al. has been effectively parallelized for solving large instances of TSPs [MP93]. If the degree of inter-node parallelism is small, this form of parallelism provides a desirable alternative. Another example of this is in the solution of MIP problems using branch-and-cut methods. Branch-and-cut methods rely on LP relaxation for generating lower bounds of partially instantiated solutions, followed by generation of valid inequalities [JNS97]. These LP relaxations constitute a major part of the overall computation time. Many of the industrial codes rely on the simplex method to solve the LP problem, since it can adapt the solution to added rows and columns. While interior point methods are better suited to parallelism, they tend to be less efficient for reoptimizing LP solutions with added rows and columns (in branch-and-cut methods). LP relaxation using simplex has been shown to parallelize well on small numbers of processors, but efforts to scale to larger numbers of processors have not been successful. LP-based branch-and-bound methods have also been used for solving quadratic assignment problems, using iterative solvers such as preconditioned conjugate gradient to approximately compute the interior point directions [PLRR94]. These methods have been used to compute lower bounds using linear programs with over 150,000 constraints and 300,000 variables for solving QAPs. These and other iterative solvers parallelize very effectively to a large number of processors. A general parallel framework for computing heuristic solutions to problems is presented by Pramanick and Kuhl [PK95].