13.1 Let the serial run time of an n-point FFT computation be t_{c}n log n. Consider its implementation on an architecture on which the parallel run time is (t_{c}n log n)/p + (t_{w}n log p)/p. Assume that t_{c} = 1and t_{w} = 0.2.

Write expressions for the speedup and efficiency.

What is the isoefficiency function if an efficiency of 0.6 is desired?

How will the isoefficiency function change (if at all) if an efficiency of 0.4 is desired?

Repeat parts 1 and 2 for the case in which t_{w} = 1 and everything else is the same.

13.2 [Tho83] Show that, while performing FFT on a square mesh of p processes by using any mapping of data onto the processes, there is at least one iteration in which the pairs of processes that need to communicate are at least links apart.

13.3 Describe the communication pattern of the binary-exchange algorithm on a linear array of p processes. What are the parallel run time, speedup, efficiency, and isoefficiency function of the binary-exchange algorithm on a linear array?

13.4 Show that, if t_{s} = 0, the isoefficiency function of the binary-exchange algorithm on a mesh is given by

Hint: Use Equation 13.10.

13.5 Prove that the maximum number of twiddle factors computed by any process in the parallel implementation of an n-point FFT using p processes is given by the recurrence relation given in Section 13.2.3.

13.6 Derive expressions for the parallel run time, speedup, and efficiency of the two-dimensional transpose algorithm described in Section 13.3.1 for an n-point FFT on a p-node two-dimensional mesh and a linear array of p nodes.

13.7 Ignoring t_{s}, by what factor should the communication bandwidth of a p-node mesh be increased so that it yields the same performance on the two-dimensional transpose algorithm for an n-point FFT on a p-node hypercube?

13.8 You are given the following sets of communication-related constants for a hypercube network: (i) t_{s} = 250, t_{w} = 1, (ii) t_{s} = 50, t_{w} = 1, (iii) t_{s} = 10, t_{w} =1, (iv) t_{s} = 2, t_{w} =1, and (v) t_{s} = 0, t_{w} = 1.

Given a choice among the binary-exchange algorithm and the two-, three-, four-, and five-dimensional transpose algorithms, which one would you use for n = 2^{15} and p = 2^{12} for each of the preceding sets of values of t_{s} and t_{w}?

Repeat part 1 for (a) n = 2^{12}, p = 2^{6}, and (b) n = 2^{20}, p = 2^{12}.

13.9 [GK93b] Consider computing an n-point FFT on a mesh. If the channel bandwidth grows at a rate of Q(p^{x}) (x > 0) with the number of nodes p in the mesh, show that the isoefficiency function due to communication overhead is Q(p^{0.5-}^{x}2^{2(}^{tw/tc}^{)}p^{0.5-}^{x}) and that due to concurrency is Q(p^{1+x} log p). Also show that the best possible isoefficiency for FFT on a mesh is Q(p^{1.5} log p), even if the channel bandwidth increases arbitrarily with the number of nodes in the network.