Even for one-dimensional DFTs, there is a common misperception that one should always choose power-of-two sizes if one cares about efficiency. Thanks to FFTW's code generator (described in "Generating Small FFT Kernels"), we could afford to devote equal optimization effort to any size with small factors (2, 3, 5, and 7 are good), instead of mostly optimizing powers of two like many high-performance FFTs. As a result, to pick a typical example on the 3 GHz Core Duo processor of [link], sizes such as 3600 = 2^4 · 3^2 · 5^2 and 3840 = 2^8 · 3 · 5 both execute faster than the power of two 4096 = 2^12. (And if there are factors one particularly cares about, one can generate code for them too.)
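For illustration, here is a hypothetical C helper (not part of FFTW's API) that rounds a requested transform length up to the nearest size whose only prime factors are 2, 3, 5, and 7, the kind of size that such generated kernels handle efficiently:

/* Hypothetical helper (illustrative only, not an FFTW function):
 * returns 1 if n factors entirely into 2, 3, 5, and 7, else 0. */
static int has_small_factors(unsigned long n)
{
    static const unsigned long primes[] = { 2, 3, 5, 7 };
    for (int i = 0; i < 4; ++i)
        while (n % primes[i] == 0)
            n /= primes[i];
    return n == 1;
}

/* Round a desired length up to the next "fast" size. */
unsigned long next_fast_size(unsigned long n)
{
    while (!has_small_factors(n))
        ++n;
    return n;
}

For example, next_fast_size(4001) returns 4032 = 2^6 · 3^2 · 7, a size with only small factors, whereas 4001 itself is prime.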
One initially missing feature was efficient support for large prime sizes; the conventional wisdom was that large-prime algorithms were mainly of academic interest, since in real applications (including ours) one has enough freedom to choose a highly composite transform size. However, the prime-size algorithms are fascinating, so we implemented Rader's algorithm for prime sizes [link] purely for fun, including it in FFTW 2.0 (released in 1998) as a bonus feature. The response was astonishingly positive—even though users are (probably) never forced by their application to compute a prime-size DFT, it is rather inconvenient to always worry that collecting an unlucky number of data points will slow down one's analysis by a factor of a million. The prime-size algorithms are certainly slower than algorithms for nearby composite sizes, but in interactive data-analysis situations the difference between 1 ms and 10 ms means little, while educating users to avoid large prime factors is hard.
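For the curious, the heart of Rader's algorithm is a reindexing trick (standard material, sketched here in generic notation rather than FFTW's, with the convention that the DFT is X_k = sum over n of x_n e^{-2 pi i nk/p}). For a prime size p, the nonzero indices form a cyclic group under multiplication modulo p, generated by some primitive root g. Writing n = g^q and k = g^{-r} (mod p), the outputs with k not equal to 0 become

\[
  X_{g^{-r}} \;=\; x_0 \;+\; \sum_{q=0}^{p-2} x_{g^{q}}\, e^{-2\pi i\, g^{\,q-r}/p},
  \qquad r = 0, \dots, p-2,
\]

which is a cyclic convolution of length p - 1. That convolution can in turn be evaluated by FFTs of length p - 1 (which is composite, or can be zero-padded to a convenient size), while X_0 is simply the sum of all the inputs.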
Another form of flexibility that deserves comment has to do with a purely technical aspect of computer software. FFTW's implementation involves some unusual language choices internally (the FFT-kernel generator, described in "Generating Small FFT Kernels", is written in Objective Caml, a functional language especially suited for compiler-like programs), but its user-callable interface is purely in C with lowest-common-denominator datatypes (arrays of floating-point values). The advantage of this is that FFTW can be (and has been) called from almost any other programming language, from Java to Perl to Fortran 77. Similar lowest-common-denominator interfaces are apparent in many other popular numerical libraries, such as LAPACK [link]. Language preferences arouse strong feelings, but this technical constraint means that modern programming dialects are best hidden from view for a numerical library.
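As a concrete illustration of this lowest-common-denominator style, a minimal complex one-dimensional transform using FFTW's C interface (shown here with the FFTW 3 API, without error checking) looks like the following:

#include <fftw3.h>

int main(void)
{
    const int n = 1024;                  /* any size works, not just powers of two */
    fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * n);
    fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * n);

    /* Create a plan first; FFTW analyzes the problem and chooses an algorithm. */
    fftw_plan plan = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_ESTIMATE);

    for (int i = 0; i < n; ++i) {        /* fill the input after planning */
        in[i][0] = (double) i;           /* real part */
        in[i][1] = 0.0;                  /* imaginary part */
    }

    fftw_execute(plan);                  /* compute the DFT: in -> out */

    fftw_destroy_plan(plan);
    fftw_free(in);
    fftw_free(out);
    return 0;
}

The interface exposes nothing beyond C arrays of floating-point values and an opaque plan object, which is precisely why wrappers in Java, Perl, Fortran, and many other languages are straightforward to write.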
Ultimately, very few scientific-computing applications should have performance as their top priority. Flexibility is often far more important,because one wants to be limited only by one's imagination, rather than by one's software, in the kinds of problems that can be studied.
There are many complexities of computer architectures that impact the optimization of FFT implementations, but one of the most pervasive is the memory hierarchy. On any modern general-purpose computer, memory is arranged into a hierarchy of storage devices with increasing size and decreasing speed: the fastest and smallest memory being the CPU registers, then two or three levels of cache, then the main-memory RAM, then external storage such as hard disks. A hard disk is utilized by “out-of-core” FFT algorithms for very large sizes [link], but these algorithms appear to have been largely superseded in practice by both the gigabytes of memory now common on personal computers and, for extremely large sizes, by algorithms for distributed-memory parallel computers. Most of these levels are managed automatically by the hardware to hold the most-recently-used data from the next level in the hierarchy. This includes the registers: on current “x86” processors, the user-visible instruction set (with a small number of floating-point registers) is internally translated at runtime to RISC-like “μ-ops” with a much larger number of physical rename registers that are allocated automatically. There are many complications, however, such as limited cache associativity (which means that certain locations in memory cannot be cached simultaneously) and cache lines (which optimize the cache for contiguous memory access), which are reviewed in numerous textbooks on computer architecture. In this section, we focus on the simplest abstract principles of memory hierarchies in order to grasp their fundamental impact on FFTs.
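To make the cache-line point concrete, here is a small, self-contained C sketch (not from FFTW) that sums the same array with different access strides; the total arithmetic is identical, but on typical hardware the larger strides run noticeably slower because each cache line that is fetched is only partially used before being evicted:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 23)   /* 8M doubles = 64 MB, far larger than any cache */

/* Sum all n elements, visiting them with the given stride. */
static double sum_stride(const double *a, size_t n, size_t stride)
{
    double s = 0.0;
    for (size_t start = 0; start < stride; ++start)
        for (size_t i = start; i < n; i += stride)
            s += a[i];
    return s;
}

int main(void)
{
    double *a = malloc(N * sizeof(double));
    for (size_t i = 0; i < N; ++i)
        a[i] = 1.0;

    for (size_t stride = 1; stride <= 16; stride *= 4) {
        clock_t t0 = clock();
        volatile double s = sum_stride(a, N, stride);
        clock_t t1 = clock();
        (void) s;
        printf("stride %2zu: %.3f s\n", stride,
               (double) (t1 - t0) / CLOCKS_PER_SEC);
    }
    free(a);
    return 0;
}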