In “estimate” mode, FFTW minimizes a heuristic cost that is a function of a particular configuration's count of floating-point operations and extraneous memory operations (for buffering and transposes). Compared to patient mode, the runtime of the planner is reduced by several orders of magnitude, at the expense of runtime performance while executing the plan. For plans of 1D complex transforms executed on a PowerPC G5, the median and peak differences in runtime performance between patient and estimate modes were 20% and 72%, respectively. Frigo and Johnson use this result to support the hypothesis that there is no longer any correlation between operation counts and runtime performance on modern machines [link] .
Frigo and Johnson discuss a small number of planner solutions in their 2005 paper on the design of FFTW3 [link] , and conclude that “we do not really understand the planner's choices because we cannot predict what plans will be produced. Indeed, this is the whole point of implementing a planner.” They do not mention the use of more rigorous methods, such as machine learning, for predicting performance.
FFTW supports computation of complex DFTs with SIMD extensions by means of two-way parallel computation of real DFTs [link] . The Vienna MAP vectorizer [link] , [link] , [link] has also been coupled with FFTW to produce a high-performance FFT library for the IBM Blue Gene/L supercomputer [link] that is up to 80% faster than the best-performing scalar FFT codes generated by FFTW [link] .
In 1997, Daniel Bernstein noticed that it was not difficult to write code that outperformed FFTW [link] . He had written 86 lines of unscheduled code that computed a size-256 single-precision transform in about 35,000 Pentium cycles – faster than FFTW. After spending a few more days doing “some casual instruction scheduling,” he could compute the same transform in about 24,000 Pentium cycles (ibid.).
These performance results directly contradicted the assumption on which FFTW was predicated: that the performance of FFT code on modern microprocessors is too hard to predict. Development of djbfft continued until 1999, by which time it had become the fastest library for computing FFTs on most Pentium and SPARC machines.
Bernstein's FFT is notable for having been the first publicly available library to exploit the advantages of the conjugate-pair or “-1 exponent” algorithm. After Bernstein demonstrated the advantages of the algorithm in djbfft, Frigo and Johnson followed with an implementation in FFTW [link] .
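The advantage of the conjugate-pair formulation is that the two odd sub-transforms are combined with the twiddle factors ω<sup>k</sup> and ω<sup>−k</sup>, which are complex conjugates of each other, so only one twiddle factor needs to be loaded per pair. A minimal recursive sketch of this decomposition in Python follows; it is illustrative only, with names and structure of my own invention rather than those of djbfft or FFTW, and it assumes the input length is a power of two:

```python
import cmath

def fft_conjpair(x):
    """Conjugate-pair split-radix FFT (recursive, unoptimized sketch).

    Assumes len(x) is a power of two. The size-n DFT is split into one
    size-n/2 sub-transform of the even samples and two size-n/4
    sub-transforms of x[4m+1] and x[4m-1 mod n]; the latter pair is
    combined with the conjugate twiddles w^k and w^-k.
    """
    n = len(x)
    if n == 1:
        return list(x)
    if n == 2:
        return [x[0] + x[1], x[0] - x[1]]
    u = fft_conjpair(x[0::2])                                  # even samples
    z = fft_conjpair(x[1::4])                                  # x[4m+1]
    zp = fft_conjpair([x[(4 * m - 1) % n] for m in range(n // 4)])  # x[4m-1 mod n]
    X = [0j] * n
    for k in range(n // 4):
        w = cmath.exp(-2j * cmath.pi * k / n)   # w^k; w.conjugate() is w^-k
        a = w * z[k] + w.conjugate() * zp[k]
        b = 1j * (w * z[k] - w.conjugate() * zp[k])
        X[k] = u[k] + a
        X[k + n // 2] = u[k] - a
        X[k + n // 4] = u[k + n // 4] - b
        X[k + 3 * n // 4] = u[k + n // 4] + b
    return X
```

A production implementation would additionally unroll small base cases and precompute the twiddle factors; the sketch above only shows the structure of the recursion.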
SPIRAL [link] , [link] , [link] attempts to automatically optimize code for signal processing functions such as the discrete Fourier transform. SPIRAL's goal is to optimize signal processing functions at the push of a button, producing code that is competitive with hand-optimized implementations.
In contrast to FFTW, SPIRAL's optimization is performed at compile time, and thus the generated code is less portable. Another point of difference is in thesearch methods: while FFTW uses dynamic programming, SPIRAL uses a wide range of techniques that include machine learning [link] , [link] .
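The dynamic-programming idea can be illustrated with a toy model: the best plan for a size-n transform is found by recursively combining the best plans for its sub-problems, with each sub-problem's result memoized. The sketch below is not FFTW's planner; the codelet costs and the combine-cost model are invented purely for illustration:

```python
from functools import lru_cache

# Invented per-size costs for small "codelet" base transforms.
# In a real planner these would come from measurement, not a table.
CODELET_COST = {2: 4, 4: 16, 8: 40}

@lru_cache(maxsize=None)
def best_plan(n):
    """Return (cost, description) of the cheapest decomposition of a
    size-n transform. Assumes n is a power of two. Dynamic programming:
    each sub-problem size is solved once and memoized via lru_cache."""
    if n in CODELET_COST:
        return CODELET_COST[n], f"codelet({n})"
    best = None
    for r in (2, 4, 8):
        if n % r == 0 and n // r > 1:
            sub_cost, sub_plan = best_plan(n // r)
            # r sub-transforms of size n/r, plus a combine pass over n
            # points -- a crude cost model, for illustration only.
            cost = r * sub_cost + n
            if best is None or cost < best[0]:
                best = (cost, f"radix-{r} -> {sub_plan}")
    return best
```

For example, under this cost model a size-16 transform is cheapest as a radix-8 step over size-2 codelets. The essential property FFTW relies on is that the number of distinct sub-problem sizes is small, so memoization collapses the exponentially large space of plans.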