### 4.6 Circular Shift

Circular shift is a member of a broader class of global communication operations known as permutation. A permutation is a simultaneous, one-to-one data redistribution operation in which each node sends a packet of m words to a unique node. We define a circular q-shift as the operation in which node i sends a data packet to node (i + q) mod p in a p-node ensemble (0 < q < p). The shift operation finds application in some matrix computations and in string and image pattern matching.

#### 4.6.1 Mesh

The implementation of a circular q-shift is fairly intuitive on a ring or a bidirectional linear array. It can be performed by min{q , p - q} neighbor-to-neighbor communications in one direction. Mesh algorithms for circular shift can be derived by using the ring algorithm.

If the nodes of the mesh have row-major labels, a circular q-shift can be performed on a p-node square wraparound mesh in two stages. This is illustrated in Figure 4.22 for a circular 5-shift on a 4 x 4 mesh. First, the entire set of data is shifted simultaneously by (q mod ) steps along the rows. Then it is shifted by steps along the columns. During the circular row shifts, some of the data traverse the wraparound connection from the highest to the lowest labeled nodes of the rows. All such data packets must shift an additional step forward along the columns to compensate for the distance that they lost while traversing the backward edge in their respective rows. For example, the 5-shift in Figure 4.22 requires one row shift, a compensatory column shift, and finally one column shift.

##### Figure 4.22. The communication steps in a circular 5-shift on a 4 x 4 mesh. In practice, we can choose the direction of the shifts in both the rows and the columns to minimize the number of steps in a circular shift. For instance, a 3-shift on a 4 x 4 mesh can be performed by a single backward row shift. Using this strategy, the number of unit shifts in a direction cannot exceed .

Cost Analysis Taking into account the compensating column shift for some packets, the total time for any circular q-shift on a p-node mesh using packets of size m has an upper bound of #### 4.6.2 Hypercube

In developing a hypercube algorithm for the shift operation, we map a linear array with 2d nodes onto a d-dimensional hypercube. We do this by assigning node i of the linear array to node j of the hypercube such that j is the d-bit binary reflected Gray code (RGC) of i. Figure 4.23 illustrates this mapping for eight nodes. A property of this mapping is that any two nodes at a distance of 2i on the linear array are separated by exactly two links on the hypercube. An exception is i = 0 (that is, directly-connected nodes on the linear array) when only one hypercube link separates the two nodes.

##### Figure 4.23. The mapping of an eight-node linear array onto a three-dimensional hypercube to perform a circular 5-shift as a combination of a 4-shift and a 1-shift. To perform a q-shift, we expand q as a sum of distinct powers of 2. The number of terms in the sum is the same as the number of ones in the binary representation of q. For example, the number 5 can be expressed as 22 + 20. These two terms correspond to bit positions 0 and 2 in the binary representation of 5, which is 101. If q is the sum of s distinct powers of 2, then the circular q-shift on a hypercube is performed in s phases.

In each phase of communication, all data packets move closer to their respective destinations by short cutting the linear array (mapped onto the hypercube) in leaps of the powers of 2. For example, as Figure 4.23 shows, a 5-shift is performed by a 4-shift followed by a 1-shift. The number of communication phases in a q-shift is exactly equal to the number of ones in the binary representation of q. Each phase consists of two communication steps, except the 1-shift, which, if required (that is, if the least significant bit of q is 1), consists of a single step. For example, in a 5-shift, the first phase of a 4-shift (Figure 4.23(a)) consists of two steps and the second phase of a 1-shift (Figure 4.23(b)) consists of one step. Thus, the total number of steps for any q in a p-node hypercube is at most 2 log p - 1.

All communications in a given time step are congestion-free. This is ensured by the property of the linear array mapping that all nodes whose mutual distance on the linear array is a power of 2 are arranged in disjoint subarrays on the hypercube. Thus, all nodes can freely communicate in a circular fashion in their respective subarrays. This is shown in Figure 4.23(a), in which nodes labeled 0, 3, 4, and 7 form one subarray and nodes labeled 1, 2, 5, and 6 form another subarray.

The upper bound on the total communication time for any shift of m-word packets on a p-node hypercube is

Equation 4.11 We can reduce this upper bound to (ts + twm) log p by performing both forward and backward shifts. For example, on eight nodes, a 6-shift can be performed by a single backward 2-shift instead of a forward 4-shift followed by a forward 2-shift.

We now show that if the E-cube routing introduced in Section 4.5 is used, then the time for circular shift on a hypercube can be improved by almost a factor of log p for large messages. This is because with E-cube routing, each pair of nodes with a constant distance l (i l < p) has a congestion-free path (Problem 4.22) in a p-node hypercube with bidirectional channels. Figure 4.24 illustrates the non-conflicting paths of all the messages in circular q -shift operations for 1 q < 8 on an eight-node hypercube. In a circular q-shift on a p-node hypercube, the longest path contains log p - g(q) links, where g(q) is the highest integer j such that q is divisible by 2j (Problem 4.23). Thus, the total communication time for messages of length m is

Equation 4.12 ##### Figure 4.24. Circular q-shifts on an 8-node hypercube for 1 q < 8. 