Some processors have special hardware accelerators or co-processors specifically designed to accelerate FFT computations. For example, AMI Semiconductor's Toccata ultra-low-power DSP microprocessor family, which is widely used in digital hearing aids, has on-chip FFT accelerators; it is always faster and more power-efficient to use such accelerators and whatever radix they prefer.
In a surprising number of applications, almost all of the computations are FFTs. A number of special-purpose chips are designed specifically to compute FFTs, and are used in specialized high-performance applications such as radar systems. Other systems, such as OFDM-based communications receivers, have special FFT hardware built into the digital receiver circuit. Such hardware can run many times faster, with much less power consumption, than FFT programs on general-purpose processors.
Cache misses or excessive data movement between registers and memory can greatly slow down an FFT computation. Efficient programs such as the FFTW package are carefully designed to minimize these inefficiencies. In-place algorithms reuse the same data memory throughout the transform, which can reduce cache misses for longer lengths.
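To make the idea of in-place computation concrete, here is a minimal sketch (not from the text; illustrative only) of an iterative radix-2 decimation-in-time FFT in Python. The entire transform overwrites its input array; no auxiliary data array is allocated, so the working set is just the signal itself.

```python
import cmath

def fft_in_place(x):
    """In-place iterative radix-2 DIT FFT; len(x) must be a power of two.
    The transform overwrites x, reusing the same memory throughout."""
    n = len(x)
    # Bit-reversal permutation so the in-place butterflies yield
    # naturally ordered output.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            x[i], x[j] = x[j], x[i]
    # Butterfly stages: each stage reads and writes the same array.
    m = 2
    while m <= n:
        wm = cmath.exp(-2j * cmath.pi / m)
        for k in range(0, n, m):
            w = 1.0 + 0j
            for t in range(m // 2):
                u = x[k + t]
                v = w * x[k + t + m // 2]
                x[k + t] = u + v          # butterfly output, top
                x[k + t + m // 2] = u - v  # butterfly output, bottom
                w *= wm
        m *= 2
    return x
```

Because each butterfly's two outputs replace its two inputs, the same N complex words are reused at every stage, which is the memory-locality property the text refers to.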
FFTs of real-valued signals require only half as many computations as FFTs of complex-valued data. There are several methods for reducing the computation, which are described in more detail in Sorensen et al.
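The halving comes from conjugate symmetry: for a real input of length N, the DFT satisfies X[N-k] = conj(X[k]), so only about the first N/2 + 1 output bins need to be computed. A small sketch (illustrative only, using a direct O(N^2) DFT rather than an FFT) verifies this property:

```python
import cmath

def dft(x):
    """Direct DFT, O(N^2); enough to demonstrate the symmetry."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

# For real input, X[N-k] is the complex conjugate of X[k], so computing
# the first N/2 + 1 bins determines the rest for free.
x = [0.5, 1.0, -0.25, 2.0]   # made-up real-valued signal
X = dft(x)
n = len(x)
for k in range(1, n):
    assert abs(X[n - k] - X[k].conjugate()) < 1e-9
```

The methods in Sorensen et al. exploit exactly this redundancy inside the FFT itself rather than after the fact.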
Occasionally only certain DFT frequencies are needed, the input signal values are mostly zero, the signal is real-valued (as discussed above), or other special conditions exist for which faster algorithms can be developed. Sorensen and Burrus describe slightly faster algorithms for pruned or zero-padded data. Goertzel's algorithm is useful when only a few DFT outputs are needed. The running FFT can be faster when DFTs of highly overlapped blocks of data are needed, as in a spectrogram.
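Goertzel's algorithm computes a single DFT bin with a second-order recursion costing roughly one real multiply per input sample, which beats a full FFT when only a handful of bins are wanted. A minimal sketch (illustrative; the function name is my own):

```python
import math
import cmath

def goertzel(x, k):
    """Goertzel's algorithm: compute the single DFT bin X[k] of x
    using one coefficient multiply per sample in the recursion."""
    n = len(x)
    w = 2.0 * math.pi * k / n
    coeff = 2.0 * math.cos(w)
    s_prev, s_prev2 = 0.0, 0.0
    for sample in x:
        # Second-order recursion: s[t] = x[t] + 2cos(w) s[t-1] - s[t-2]
        s = sample + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    # Combine the final two state variables into the complex bin:
    # X[k] = e^{jw} s[N-1] - s[N-2]
    return cmath.exp(1j * w) * s_prev - s_prev2
```

For M desired bins, the total cost is roughly M*N multiplies, so Goertzel wins over an N log N FFT only when M is small compared with log N.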
Higher-radix algorithms, such as the radix-4, radix-8, or split-radix FFTs, require fewer computations and can produce modest but worthwhile savings. Even the split-radix FFT reduces the multiplications by only 33% and the additions by a much lesser amount relative to the radix-2 FFTs; significant improvements in program speed are often due more to implicit loop unrolling or other compiler benefits than to the computational reduction itself!
Bit-reversing the input or output data can consume several percent of the total run-time of an FFT program. Several fast bit-reversal algorithms have been developed that can reduce this to two percent or less, including the method published by D.M.W. Evans.
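One family of fast methods avoids reversing the bits of every index from scratch: the reversed index rev[i] can be derived in O(1) from the already-known rev[i >> 1], so the whole permutation costs O(N) simple bit operations. A sketch of that idea (illustrative; this is the incremental-table trick, not necessarily Evans's exact method):

```python
def bit_reverse_permute(x):
    """Permute x (length a power of two) into bit-reversed order.
    rev[i] is built incrementally from rev[i >> 1], avoiding a
    per-index bit-by-bit reversal loop."""
    n = len(x)
    rev = [0] * n
    for i in range(1, n):
        # Shift the parent's reversal right and bring in the new
        # low-order bit of i as the new high-order bit.
        rev[i] = (rev[i >> 1] >> 1) | ((i & 1) * (n >> 1))
        if i < rev[i]:          # swap each pair exactly once
            x[i], x[rev[i]] = x[rev[i]], x[i]
    return x
```

The `i < rev[i]` test guarantees each pair is exchanged once, so the permutation is its own inverse applied in place.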
When FFTs first became widely used, hardware multipliers were relatively rare on digital computers, and multiplications generally required many more cycles than additions. Methods to reduce multiplications, even at the expense of a substantial increase in additions, were often beneficial. The prime factor algorithms and the Winograd Fourier transform algorithms, which required fewer multiplies and considerably more additions than the power-of-two-length algorithms, were developed during this period. Current processors generally have high-speed pipelined hardware multipliers, so trading multiplies for additions is often no longer beneficial. In particular, most machines now support single-cycle multiply-accumulate (MAC) operations, so balancing the number of multiplies and adds and combining them into single-cycle MACs generally results in the fastest code. Thus, the prime-factor and Winograd FFTs are rarely used today unless the application requires FFTs of a specific length.
It is possible to implement a complex multiply with 3 real multiplies and 5 real adds rather than the usual 4 real multiplies and 2 real adds. In an FFT, one factor in each such multiply is a twiddle factor known in advance, so the required sums and differences of its real and imaginary parts can be precomputed and stored in a look-up table. This reduces the cost of the complex twiddle-factor multiply to 3 real multiplies and 3 real adds, or one less and one more, respectively, than the conventional 4/2 computation.
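A sketch of this trick (function names are my own): for z = a + jb and twiddle factor w = c + jd, use the identities real = c(a+b) - (c+d)b and imag = c(a+b) + (d-c)a, where c + d and d - c are tabulated once per twiddle factor.

```python
import cmath

def twiddle_table(w):
    """Precompute (c, c+d, d-c) for a twiddle factor w = c + jd.
    These sums depend only on the twiddle factor, so they can live
    in a look-up table built once per transform length."""
    c, d = w.real, w.imag
    return (c, c + d, d - c)

def mul_3m3a(a, b, tw):
    """Multiply (a + jb) by a tabulated twiddle factor using
    3 real multiplies and 3 real adds."""
    c, c_plus_d, d_minus_c = tw
    t = c * (a + b)             # multiply 1, add 1
    re = t - c_plus_d * b       # multiply 2, add 2
    im = t + d_minus_c * a      # multiply 3, add 3
    return complex(re, im)

# Sanity check against the ordinary 4-multiply product.
w = cmath.exp(-2j * cmath.pi / 8)   # an example twiddle factor
assert abs(mul_3m3a(0.3, -1.2, twiddle_table(w))
           - complex(0.3, -1.2) * w) < 1e-12
```

Without the table, computing c + d and d - c on the fly would cost the full 5 adds; the precomputation is what brings the count down to 3.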
Certain twiddle factors, namely 1, -1, j, -j, and the odd eighth roots of unity such as (1 - j)/sqrt(2), can be implemented with no additional operations, or with fewer real operations than a general complex multiply. Programs that specially implement such butterflies in the most efficient manner throughout the algorithm can reduce the computational cost by up to several multiplies and additions per point in a length-N FFT.
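For instance (a sketch, with my own function names), multiplying by -j is just a swap and a sign change, and multiplying by (1 - j)/sqrt(2) takes only 2 real multiplies and 2 real adds instead of the general 4 and 2:

```python
import math
import cmath

def mul_minus_j(z):
    """Multiply z by -j with no arithmetic multiplies at all:
    (a + jb)(-j) = b - ja."""
    return complex(z.imag, -z.real)

SQRT1_2 = math.sqrt(0.5)

def mul_w8(z):
    """Multiply z by W_8 = (1 - j)/sqrt(2) with 2 real multiplies
    and 2 real adds: (a + jb)(1 - j) = (a + b) + j(b - a)."""
    return complex(SQRT1_2 * (z.real + z.imag),
                   SQRT1_2 * (z.imag - z.real))

# Both agree with the general complex multiply.
z = complex(0.7, -0.4)
assert abs(mul_minus_j(z) - z * (-1j)) < 1e-12
assert abs(mul_w8(z) - z * cmath.exp(-2j * cmath.pi / 8)) < 1e-12
```

Optimized FFT programs dispatch to such special-case butterflies wherever these twiddle values occur, rather than calling the general routine.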
When optimizing FFTs for speed, it can be important to maintain perspective on the benefits that can be expected fromany given optimization. The following list categorizes the various techniques by potentialbenefit; these will be somewhat situation- and machine-dependent, but clearlyone should begin with the most significant and put the most effort where the pay-off is likely to be largest.