E-Book Overview
The most comprehensive treatment of FFTs to date. Van Loan captures the interplay between mathematics and the design of effective numerical algorithms--a critical connection as more advanced machines become available. A stylized Matlab notation, which is familiar to those engaged in high-performance computing, is used. The Fast Fourier Transform (FFT) family of algorithms has revolutionized many areas of scientific computation. The FFT is one of the most widely used algorithms in science and engineering, with applications in almost every discipline. This volume is essential for professionals interested in linear algebra as well as those working with numerical methods. The FFT is also a great vehicle for teaching key aspects of scientific computing.
E-Book Content
Kronecker Product Properties
Kron1
If A, B, C, and D are matrices, then (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD), assuming that the ordinary products AC and BD are defined.
Kron2
If A and B are nonsingular matrices, then A ⊗ B is nonsingular and (A ⊗ B)^(-1) = A^(-1) ⊗ B^(-1).
Kron3
If A and B are matrices, then (A ⊗ B)^T = A^T ⊗ B^T.
Kron4
If P and Q are permutations, then so is P ⊗ Q.
Kron5
Suppose n = rc.
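The book states these properties without code and writes its algorithms in a stylized Matlab notation. As an illustrative check only (in Python/NumPy, not the book's notation), the sketch below verifies Kron1 through Kron4 on small random matrices; the matrix sizes and the particular permutations are arbitrary choices for the demonstration.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3)); B = rng.standard_normal((4, 2))
C = rng.standard_normal((3, 5)); D = rng.standard_normal((2, 3))

# Kron1: (A kron B)(C kron D) = (AC) kron (BD)
assert np.allclose(np.kron(A, B) @ np.kron(C, D), np.kron(A @ C, B @ D))

# Kron2: (A kron B)^(-1) = A^(-1) kron B^(-1) for (generically) nonsingular A, B
A2 = rng.standard_normal((3, 3)); B2 = rng.standard_normal((4, 4))
assert np.allclose(np.linalg.inv(np.kron(A2, B2)),
                   np.kron(np.linalg.inv(A2), np.linalg.inv(B2)))

# Kron3: (A kron B)^T = A^T kron B^T
assert np.allclose(np.kron(A, B).T, np.kron(A.T, B.T))

# Kron4: a Kronecker product of permutation matrices is a permutation matrix
P = np.eye(3)[[2, 0, 1]]; Q = np.eye(4)[[1, 3, 0, 2]]
PQ = np.kron(P, Q)
assert np.all((PQ == 0) | (PQ == 1))
assert np.allclose(PQ.sum(axis=0), 1) and np.allclose(PQ.sum(axis=1), 1)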
If n > 2, then F_n is not Hermitian. Indeed, F_n^H is the conjugate of F_n, which differs from F_n. However, if n = 1 or 2, then F_n is real and so F_1 and F_2 are Hermitian. For two vectors x, y ∈ ℂ^n we have x^H y = Σ_{k=0}^{n-1} conj(x_k) y_k, and so the (p, q) entry of F_n^H F_n is

μ_pq = Σ_{k=0}^{n-1} ω^{k(q-p)},    ω = ω_n.

If q = p, then μ_pq = n. Otherwise, ω^{q-p} ≠ 1, and it follows from the equation

1 + z + z^2 + ··· + z^{n-1} = (z^n − 1)/(z − 1)

that μ_pq = 0 whenever p ≠ q, since (ω^{q-p})^n = 1. Thus F_n^H F_n = n·I_n (Theorem 1.1.2). We say that Q ∈ ℂ^{n×n} is unitary if Q^{-1} = Q^H. From Theorem 1.1.2 we see that F_n/√n is unitary. Thus, the DFT is a scaled unitary transformation on ℂ^n.
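A small numerical sketch of this fact, again in Python/NumPy rather than the book's Matlab-style notation: it builds F_n directly from its definition, assuming the convention ω_n = exp(−2πi/n), and checks Theorem 1.1.2. The helper name dft_matrix and the choice n = 8 are illustrative, not the book's.

import numpy as np

def dft_matrix(n):
    """F_n with (p, q) entry omega_n^(p*q), omega_n = exp(-2*pi*i/n)."""
    p = np.arange(n).reshape(-1, 1)
    q = np.arange(n).reshape(1, -1)
    return np.exp(-2j * np.pi * p * q / n)

n = 8
F = dft_matrix(n)

# Theorem 1.1.2: F_n^H F_n = n * I_n, so F_n / sqrt(n) is unitary.
assert np.allclose(F.conj().T @ F, n * np.eye(n))

# The DFT of x is F_n @ x; it agrees with NumPy's FFT under this convention.
x = np.random.default_rng(1).standard_normal(n)
assert np.allclose(F @ x, np.fft.fft(x))

# F_n is symmetric but (for n > 2) not Hermitian.
assert np.allclose(F, F.T) and not np.allclose(F, F.conj().T)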
If a vector operation involves vectors of length d, then the time required is approximately given by t(d) ≈ α + βd, where α is a start-up overhead and β is the incremental time per vector component. Let us apply this model to (1.4.8). However, before we do this we clarify the exact nature of the underlying vector operations carried out by the inner loop, by rewriting the algorithm so that each pass through the inner loop is expressed entirely in terms of vector operations (a sketch in this spirit is given after the timing estimate below).
Notice that there are ten length-L_* vector operations to perform each pass through the loop. It follows that the whole process requires time of approximately

Σ_{q=1}^{t} (n/2^q) · 10 · (α + β·2^{q-1}) = 10(n − 1)α + 5 n t β,    t = log_2 n,

for completion, since stage q involves n/L = n/2^q passes through the loop, each acting on vectors of length L_* = 2^{q-1}. Notice that the start-up factor α has a greater bearing upon performance when L_* is small.
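The rewritten listing referred to above is in the book's Matlab-style notation and is not reproduced in this extract. As a rough illustration of where the count of ten comes from, the following NumPy sketch performs one pass of a radix-2 butterfly on a single length-L block using ten real vector operations of length L_* (four multiplies and six additions/subtractions). The array layout, the helper name butterfly_block, and the assumption that the weights are the first L_* powers of ω_L are illustrative, not taken from the book's listing.

import numpy as np

def butterfly_block(xr, xi, wr, wi):
    """One radix-2 butterfly pass on a length-L block, written as ten
    length-L_* real vector operations (L_* = L/2).

    xr, xi : real and imaginary parts of the block (length L)
    wr, wi : real and imaginary parts of the weights (1, w_L, ..., w_L^(L_*-1))
    """
    L = xr.size
    Ls = L // 2
    ar, ai = xr[:Ls], xi[:Ls]          # top half
    br, bi = xr[Ls:], xi[Ls:]          # bottom half
    # tau = w .* x(bottom): 4 vector multiplies + 2 vector add/subtract
    tr = wr * br - wi * bi
    ti = wr * bi + wi * br
    # x(bottom) = x(top) - tau, x(top) = x(top) + tau: 4 vector add/subtract
    xr[Ls:], xi[Ls:] = ar - tr, ai - ti
    xr[:Ls], xi[:Ls] = ar + tr, ai + ti

# quick check against a complex-arithmetic reference for L = 8
rng = np.random.default_rng(2)
L = 8; Ls = L // 2
x = rng.standard_normal(L) + 1j * rng.standard_normal(L)
w = np.exp(-2j * np.pi * np.arange(Ls) / L)
xr, xi = x.real.copy(), x.imag.copy()
butterfly_block(xr, xi, w.real.copy(), w.imag.copy())
tau = w * x[Ls:]
assert np.allclose(xr + 1j * xi, np.concatenate([x[:Ls] + tau, x[:Ls] - tau]))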
1.4.11 The Long Weight Vector
When we consider the overall Cooley–Tukey procedure, we see that the weights associated with A_q are a subset of the weights associated with A_{q+1}. In particular, since ω_L = ω_n^{n/L}, each weight vector w_{L_*} is a subvector of w_{n/2}, and so a single precomputed copy of w_{n/2} could serve every stage. Unfortunately, this results in a power-of-two stride access into w_{n/2}: the scaling operation that arises in either Algorithm 1.4.8 or 1.4.9 transforms into one that references w_{n/2} with stride j = n/L. A way around this difficulty, suggested by Bailey (1987), is to precompute the long weight vector w_n^(long) defined by

w_n^(long) = [ w_1 ; w_2 ; w_4 ; ··· ; w_{n/2} ].
This (n − 1)-vector is just a stacking of the weight vectors w_{L_*}. In particular, if L = 2^q and L_* = 2^{q-1}, then w_{L_*} = w_n^(long)(L_* − 1 : L − 2). With w_n^(long) available, the above τ computation instead references the contiguous block w_n^(long)(L_* − 1 : L − 2),
and unit stride access is achieved. It is clear that w_n^(long) can be constructed in O(n) flops using any of Algorithms 1.4.1–1.4.6.
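As an illustration of this bookkeeping (a NumPy sketch with hypothetical helper names, assuming the 0-based stacking described above and the convention ω_L = exp(−2πi/L)), the following code builds w_n^(long) and checks that each w_{L_*} can be read either with stride n/L out of w_{n/2} or with unit stride out of w_n^(long):

import numpy as np

def weights(L):
    """w_{L_*} = (1, w_L, ..., w_L^(L_*-1)) with w_L = exp(-2*pi*i/L)."""
    Ls = L // 2
    return np.exp(-2j * np.pi * np.arange(Ls) / L)

def long_weight_vector(n):
    """Stack w_1, w_2, w_4, ..., w_{n/2} into one (n-1)-vector."""
    t = int(np.log2(n))
    return np.concatenate([weights(2 ** q) for q in range(1, t + 1)])

n = 32
wlong = long_weight_vector(n)
assert wlong.size == n - 1

whalf = weights(n)                       # w_{n/2}: first n/2 powers of w_n
for q in range(1, int(np.log2(n)) + 1):
    L, Ls = 2 ** q, 2 ** (q - 1)
    strided = whalf[:: n // L][:Ls]      # stride-(n/L) access into w_{n/2}
    contiguous = wlong[Ls - 1 : L - 1]   # unit-stride access into w_n^(long)
    assert np.allclose(strided, weights(L))
    assert np.allclose(contiguous, weights(L))

The point of the long weight vector is exactly what the loop verifies: every stage's weights sit in one contiguous, unit-stride block, at the cost of storing n − 1 rather than n/2 weights.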
Problems

P1.4.1 Using the identities, prove (1.4.2) and (1.4.5).

P1.4.2 Show that if j = (b_{t-1} ··· b_1 b_0)_2, then w_{L_*}(j) is the consequence of b_{t-1} + ··· + b_1 + b_0 complex multiplications in Algorithm 1.4.3.

P1.4.3 Develop a complete algorithm for w_n^(long) that uses Algorithm 1.4.3 to compute the individual weight vectors. How many flops are required?

P1.4.4 Using the concepts of §1.4.10, analyze the performance of Algorithm 1.4.3.

Notes and References for Section 1.4

Interleaved memory systems, vectorization models, and various pipelining issues are discussed in

J.J. Dongarra, I.S. Duff, D.C. Sorensen, and H.A. van der Vorst (1991). Solving Linear Systems on Vector and Shared Memory Computers, SIAM, Philadelphia, PA.