Some processors have special hardware accelerators or co-processors specifically designed to accelerate FFT computations. For example, AMI Semiconductor's Toccata ultra-low-power DSP microprocessor family, which is widely used in digital hearing aids, has on-chip FFT accelerators; it is always faster and more power-efficient to use such accelerators and whatever radix they prefer.
In a surprising number of applications, almost all of the computations are FFTs. A number of special-purpose chips are designed specifically to compute FFTs, and are used in specialized high-performance applications such as radar systems. Other systems, such as OFDM-based communications receivers, have special FFT hardware built into the digital receiver circuit. Such hardware can run many times faster, with much less power consumption, than FFT programs on general-purpose processors.
Cache misses or excessive data movement between registers and memory can greatly slow down an FFT computation. Efficient programs such as the FFTW package are carefully designed to minimize these inefficiencies. In-place algorithms reuse the same data memory throughout the transform, which can reduce cache misses for longer lengths.
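To make the idea of in-place computation concrete, here is a minimal sketch (not from the text; illustrative only) of an iterative radix-2 decimation-in-time FFT in Python. The entire transform overwrites its input array; no auxiliary data array is allocated, so the working set is just the signal itself.

```python
import cmath

def fft_in_place(x):
    """In-place iterative radix-2 DIT FFT; len(x) must be a power of two.
    The transform overwrites x, reusing the same memory throughout."""
    n = len(x)
    # Bit-reversal permutation so the in-place butterflies yield
    # naturally ordered output.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            x[i], x[j] = x[j], x[i]
    # Butterfly stages: each stage reads and writes the same array.
    m = 2
    while m <= n:
        wm = cmath.exp(-2j * cmath.pi / m)
        for k in range(0, n, m):
            w = 1.0 + 0j
            for t in range(m // 2):
                u = x[k + t]
                v = w * x[k + t + m // 2]
                x[k + t] = u + v          # butterfly output, top
                x[k + t + m // 2] = u - v  # butterfly output, bottom
                w *= wm
        m *= 2
    return x
```

Because each butterfly's two outputs replace its two inputs, the same N complex words are reused at every stage, which is the memory-locality property the text refers to.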
FFTs of real-valued signals require only half as many computations as FFTs of complex-valued data. There are several methods for reducing the computation, which are described in more detail in Sorensen et al.
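The halving comes from conjugate symmetry: for a real input of length N, the DFT satisfies X[N-k] = conj(X[k]), so only about the first N/2 + 1 output bins need to be computed. A small sketch (illustrative only, using a direct O(N^2) DFT rather than an FFT) verifies this property:

```python
import cmath

def dft(x):
    """Direct DFT, O(N^2); enough to demonstrate the symmetry."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

# For real input, X[N-k] is the complex conjugate of X[k], so computing
# the first N/2 + 1 bins determines the rest for free.
x = [0.5, 1.0, -0.25, 2.0]   # made-up real-valued signal
X = dft(x)
n = len(x)
for k in range(1, n):
    assert abs(X[n - k] - X[k].conjugate()) < 1e-9
```

The methods in Sorensen et al. exploit exactly this redundancy inside the FFT itself rather than after the fact.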
Occasionally only certain DFT frequencies are needed, the input signal values are mostly zero, the signal is real-valued (as discussed above), or other special conditions exist for which faster algorithms can be developed. Sorensen and Burrus describe slightly faster algorithms for pruned or zero-padded data. Goertzel's algorithm is useful when only a few DFT outputs are needed. The running FFT can be faster when DFTs of highly overlapped blocks of data are needed, as in a spectrogram.
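Goertzel's algorithm computes a single DFT bin with a second-order recursion costing roughly one real multiply per input sample, which beats a full FFT when only a handful of bins are wanted. A minimal sketch (illustrative; the function name is my own):

```python
import math
import cmath

def goertzel(x, k):
    """Goertzel's algorithm: compute the single DFT bin X[k] of x
    using one coefficient multiply per sample in the recursion."""
    n = len(x)
    w = 2.0 * math.pi * k / n
    coeff = 2.0 * math.cos(w)
    s_prev, s_prev2 = 0.0, 0.0
    for sample in x:
        # Second-order recursion: s[t] = x[t] + 2cos(w) s[t-1] - s[t-2]
        s = sample + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    # Combine the final two state variables into the complex bin:
    # X[k] = e^{jw} s[N-1] - s[N-2]
    return cmath.exp(1j * w) * s_prev - s_prev2
```

For M desired bins, the total cost is roughly M*N multiplies, so Goertzel wins over an N log N FFT only when M is small compared with log N.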
Higher-radix algorithms, such as the radix-4, radix-8, or split-radix FFTs, require fewer computations and can produce modest but worthwhile savings. Even the split-radix FFT reduces the multiplications by only 33% and the additions by a much lesser amount relative to the radix-2 FFTs; significant improvements in program speed are often due more to implicit loop unrolling or other compiler benefits than to the computational reduction itself!
Bit-reversing the input or output data can consume several percent of the total run-time of an FFT program. Several fast bit-reversal algorithms have been developed that can reduce this to two percent or less, including the method published by D.M.W. Evans.
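One family of fast methods avoids reversing the bits of every index from scratch: the reversed index rev[i] can be derived in O(1) from the already-known rev[i >> 1], so the whole permutation costs O(N) simple bit operations. A sketch of that idea (illustrative; this is the incremental-table trick, not necessarily Evans's exact method):

```python
def bit_reverse_permute(x):
    """Permute x (length a power of two) into bit-reversed order.
    rev[i] is built incrementally from rev[i >> 1], avoiding a
    per-index bit-by-bit reversal loop."""
    n = len(x)
    rev = [0] * n
    for i in range(1, n):
        # Shift the parent's reversal right and bring in the new
        # low-order bit of i as the new high-order bit.
        rev[i] = (rev[i >> 1] >> 1) | ((i & 1) * (n >> 1))
        if i < rev[i]:          # swap each pair exactly once
            x[i], x[rev[i]] = x[rev[i]], x[i]
    return x
```

The `i < rev[i]` test guarantees each pair is exchanged once, so the permutation is its own inverse applied in place.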
When FFTs first became widely used, hardware multipliers were relatively rare on digital computers, and multiplications generally required many more cycles than additions. Methods to reduce multiplications, even at the expense of a substantial increase in additions, were often beneficial. The prime factor algorithms and the Winograd Fourier transform algorithms, which required fewer multiplies and considerably more additions than the power-of-two-length algorithms, were developed during this period. Current processors generally have high-speed pipelined hardware multipliers, so trading multiplies for additions is often no longer beneficial. In particular, most machines now support single-cycle multiply-accumulate (MAC) operations, so balancing the number of multiplies and adds and combining them into single-cycle MACs generally results in the fastest code. Thus, the prime-factor and Winograd FFTs are rarely used today unless the application requires FFTs of a specific length.
It is possible to implement a complex multiply with 3 real multiplies and 5 real adds rather than the usual 4 real multiplies and 2 real adds. In an FFT, one factor in each such multiply is a twiddle factor known in advance, so the required sums and differences of its real and imaginary parts can be precomputed and stored in a look-up table. This reduces the cost of the complex twiddle-factor multiply to 3 real multiplies and 3 real adds, or one less and one more, respectively, than the conventional 4/2 computation.
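A sketch of this trick (function names are my own): for z = a + jb and twiddle factor w = c + jd, use the identities real = c(a+b) - (c+d)b and imag = c(a+b) + (d-c)a, where c + d and d - c are tabulated once per twiddle factor.

```python
import cmath

def twiddle_table(w):
    """Precompute (c, c+d, d-c) for a twiddle factor w = c + jd.
    These sums depend only on the twiddle factor, so they can live
    in a look-up table built once per transform length."""
    c, d = w.real, w.imag
    return (c, c + d, d - c)

def mul_3m3a(a, b, tw):
    """Multiply (a + jb) by a tabulated twiddle factor using
    3 real multiplies and 3 real adds."""
    c, c_plus_d, d_minus_c = tw
    t = c * (a + b)             # multiply 1, add 1
    re = t - c_plus_d * b       # multiply 2, add 2
    im = t + d_minus_c * a      # multiply 3, add 3
    return complex(re, im)

# Sanity check against the ordinary 4-multiply product.
w = cmath.exp(-2j * cmath.pi / 8)   # an example twiddle factor
assert abs(mul_3m3a(0.3, -1.2, twiddle_table(w))
           - complex(0.3, -1.2) * w) < 1e-12
```

Without the table, computing c + d and d - c on the fly would cost the full 5 adds; the precomputation is what brings the count down to 3.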
Certain twiddle factors, namely 1, -1, j, -j, and the odd eighth roots of unity such as (1 - j)/sqrt(2), can be implemented with no additional operations, or with fewer real operations than a general complex multiply. Programs that specially implement such butterflies in the most efficient manner throughout the algorithm can reduce the computational cost by up to several multiplies and additions per point in a length-N FFT.
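For instance (a sketch, with my own function names), multiplying by -j is just a swap and a sign change, and multiplying by (1 - j)/sqrt(2) takes only 2 real multiplies and 2 real adds instead of the general 4 and 2:

```python
import math
import cmath

def mul_minus_j(z):
    """Multiply z by -j with no arithmetic multiplies at all:
    (a + jb)(-j) = b - ja."""
    return complex(z.imag, -z.real)

SQRT1_2 = math.sqrt(0.5)

def mul_w8(z):
    """Multiply z by W_8 = (1 - j)/sqrt(2) with 2 real multiplies
    and 2 real adds: (a + jb)(1 - j) = (a + b) + j(b - a)."""
    return complex(SQRT1_2 * (z.real + z.imag),
                   SQRT1_2 * (z.imag - z.real))

# Both agree with the general complex multiply.
z = complex(0.7, -0.4)
assert abs(mul_minus_j(z) - z * (-1j)) < 1e-12
assert abs(mul_w8(z) - z * cmath.exp(-2j * cmath.pi / 8)) < 1e-12
```

Optimized FFT programs dispatch to such special-case butterflies wherever these twiddle values occur, rather than calling the general routine.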
When optimizing FFTs for speed, it can be important to maintain perspective on the benefits that can be expected fromany given optimization. The following list categorizes the various techniques by potentialbenefit; these will be somewhat situation- and machine-dependent, but clearlyone should begin with the most significant and put the most effort where the pay-off is likely to be largest.