4.8 Summary
Table 4.1 summarizes the communication times for the various collective communication operations discussed in this chapter. The time for one-to-all broadcast, all-to-one reduction, and the all-reduce operation is the minimum of two expressions. This is because, depending on the message size m, either the algorithms described in Sections 4.1 and 4.3 or the ones described in Section 4.7 are faster. Table 4.1 assumes that the algorithm most suitable for the given message size is chosen. The communication-time expressions in Table 4.1 have been derived in the earlier sections of this chapter in the context of a hypercube interconnection network with cut-through routing. However, these expressions and the corresponding algorithms are valid for any architecture with a Θ(p) cross-section bandwidth (Section 2.4.4). In fact, the terms associated with t_{w} in the expressions for all operations listed in Table 4.1, except all-to-all personalized communication and circular shift, would remain unchanged even on ring and mesh networks (or any k-d mesh network), provided that the logical processes are mapped onto the physical nodes of the network appropriately. The last column of Table 4.1 gives the asymptotic cross-section bandwidth required to perform an operation in the time given by the second column of the table, assuming an optimal mapping of processes to nodes. For large messages, only all-to-all personalized communication and circular shift require the full Θ(p) cross-section bandwidth. Therefore, as discussed in Section 2.5.1, when applying the expressions for the time of these operations on a network with a smaller cross-section bandwidth, the t_{w} term must reflect the effective bandwidth. For example, the bisection width of a p-node square mesh is Θ(√p) and that of a p-node ring is Θ(1). Therefore, while performing all-to-all personalized communication on a square mesh, the effective per-word transfer time would be Θ(√p) times the t_{w} of individual links, and on a ring, it would be Θ(p) times the t_{w} of individual links.
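The effective per-word transfer time described above can be sketched as a small helper that scales the link t_{w} by the ratio of the required Θ(p) cross-section bandwidth to the network's bisection width. This is an illustrative sketch (the function and its string topology names are assumptions, not from the text), using the asymptotic factors as exact multipliers:

```python
import math

def effective_tw(tw, p, topology):
    """Illustrative effective per-word transfer time for all-to-all
    personalized communication, scaling the per-link tw by the gap
    between the required Theta(p) bandwidth and the bisection width.
    (Hypothetical helper; topology names are assumed labels.)"""
    if topology == "hypercube":
        return tw               # bisection width Theta(p): full bandwidth
    if topology == "mesh":
        return tw * math.sqrt(p)  # square mesh: bisection width Theta(sqrt(p))
    if topology == "ring":
        return tw * p           # ring: bisection width Theta(1)
    raise ValueError(f"unknown topology: {topology}")
```

For instance, with 16 nodes, a square mesh inflates t_{w} by a factor of 4, while a ring inflates it by a factor of 16.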
Table 4.1. Summary of communication times of various operations discussed in Sections 4.1-4.7 on a hypercube interconnection network. The message size for each operation is m and the number of nodes is p.

Operation                    Hypercube Time                                          Cross-section B/W
One-to-all broadcast,
All-to-one reduction         min((t_{s} + t_{w}m) log p, 2(t_{s} log p + t_{w}m))    Θ(1)
All-to-all broadcast,
All-to-all reduction         t_{s} log p + t_{w}m(p - 1)                             Θ(1)
All-reduce                   min((t_{s} + t_{w}m) log p, 2(t_{s} log p + t_{w}m))    Θ(1)
Scatter, Gather              t_{s} log p + t_{w}m(p - 1)                             Θ(1)
All-to-all personalized      (t_{s} + t_{w}m)(p - 1)                                 Θ(p)
Circular shift               t_{s} + t_{w}m                                          Θ(p)
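The selection of the faster broadcast algorithm in Table 4.1 can be sketched numerically. The following fragment (an illustrative sketch; the function name and parameter values are assumptions, not from the text) evaluates both expressions and returns the minimum, showing that the tree algorithm of Section 4.1 wins for short messages while the scatter/all-to-all-broadcast algorithm of Section 4.7 wins for long ones:

```python
import math

def bcast_time(ts, tw, m, p):
    """One-to-all broadcast time on a p-node hypercube per Table 4.1:
    the minimum of the Section 4.1 tree algorithm and the Section 4.7
    scatter-then-all-to-all-broadcast algorithm. (Hypothetical helper.)"""
    lg = math.log2(p)
    tree = (ts + tw * m) * lg                 # favored for small m
    scatter_allgather = 2 * (ts * lg + tw * m)  # favored for large m
    return min(tree, scatter_allgather)

# Assumed parameters for illustration: ts = 100, tw = 1, p = 64.
t_short = bcast_time(100, 1, m=8, p=64)       # tree expression is smaller
t_long = bcast_time(100, 1, m=10**6, p=64)    # scatter/allgather is smaller
```

The crossover point depends on the ratio of t_{s} to t_{w}m, which is why Table 4.1 reports the minimum of the two expressions rather than a single formula.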
Table 4.2. MPI names of the various operations discussed in this chapter.

Operation                    MPI Name
One-to-all broadcast         MPI_Bcast
All-to-one reduction         MPI_Reduce
All-to-all broadcast         MPI_Allgather
All-to-all reduction         MPI_Reduce_scatter
All-reduce                   MPI_Allreduce
Gather                       MPI_Gather
Scatter                      MPI_Scatter
All-to-all personalized      MPI_Alltoall
The collective communication operations discussed in this chapter occur frequently in many parallel algorithms. In order to facilitate the speedy and portable design of efficient parallel programs, most parallel computer vendors provide prepackaged software for performing these collective communication operations. The most commonly used standard API for these operations is known as the Message Passing Interface, or MPI. Table 4.2 gives the names of the MPI functions that correspond to the communication operations described in this chapter.
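As a minimal sketch of how two of the Table 4.2 operations are invoked (assuming an MPI installation; compile with mpicc and launch with mpirun, so it is not runnable standalone), a program might broadcast a message from node 0 and then reduce a sum back to it:

```c
/* Sketch: one-to-all broadcast (MPI_Bcast) and all-to-one reduction
   (MPI_Reduce) from Table 4.2. Assumes an MPI implementation is
   installed; run with, e.g., mpirun -np 4 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* One-to-all broadcast: node 0 sends a message of m = 4 ints. */
    int msg[4] = {0, 0, 0, 0};
    if (rank == 0) { msg[0] = 1; msg[1] = 2; msg[2] = 3; msg[3] = 4; }
    MPI_Bcast(msg, 4, MPI_INT, 0, MPI_COMM_WORLD);

    /* All-to-one reduction: sum one int per node at node 0. */
    int local = rank + 1, sum = 0;
    MPI_Reduce(&local, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum over %d nodes = %d\n", p, sum);

    MPI_Finalize();
    return 0;
}
```

Note that every node in the communicator calls the same collective function; the root argument (here 0) distinguishes the sending or receiving node for the asymmetric operations.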
