In the simplification phase, genfft applies local rewriting rules to each node of the dag in order to simplify it. This phase performs algebraic transformations (such as eliminating multiplications by 1) and common-subexpression elimination. Although such transformations can be performed by a conventional compiler to some degree, they can be carried out here to a greater extent because genfft can exploit the specific problem domain. For example, two equivalent subexpressions can always be detected, even if the subexpressions are written in algebraically different forms, because all subexpressions compute linear functions. Also, genfft can exploit the property that network transposition (reversing the direction of every edge) computes the transposed linear operation [link], in order to transpose the network, simplify, and then transpose back; this turns out to expose additional common subexpressions [link]. In total, these simplifications are sufficiently powerful to derive DFT algorithms specialized for real and/or symmetric data automatically from the complex algorithms. For example, it is known that when the input of a DFT is real (and the output is hence conjugate-symmetric), one can save a little over a factor of two in arithmetic cost by specializing FFT algorithms for this case. With genfft, this specialization can be done entirely automatically, pruning the redundant operations from the dag, to match the lowest known operation count for a real-input FFT starting only from the complex-data algorithm [link], [link]. We take advantage of this property to help us implement real-data DFTs [link], [link], to exploit machine-specific “SIMD” instructions (see "SIMD instructions") [link], and to generate codelets for the discrete cosine (DCT) and sine (DST) transforms [link], [link]. Furthermore, by experimentation we have discovered additional simplifications that improve the speed of the generated code.
One interesting example is the elimination of negative constants [link]: multiplicative constants in FFT algorithms often come in positive/negative pairs, but every C compiler we are aware of will generate separate load instructions for positive and negative versions of the same constants. Floating-point constants must be stored explicitly in memory; they cannot be embedded directly into the CPU instructions like integer “immediate” constants. We thus obtained a 10–15% speedup by making all constants positive, which involves propagating minus signs to change additions into subtractions or vice versa elsewhere in the dag (a daunting task if it had to be done manually for tens of thousands of lines of code).
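As a rough illustration of the local rewriting described above, the following Python sketch builds a toy expression dag through "smart constructors" that drop multiplications by 1, keep every stored constant positive, and push minus signs outward so that additions turn into subtractions. It is not genfft's code (genfft is written in OCaml), and every name in it (`num`, `neg`, `mul`, `add`, `sub`, `nodes`, `constants`, `evaluate`) is hypothetical. Sharing here relies only on structural equality after ordering the operands of commutative operations, which is far weaker than the equivalence detection genfft can perform on linear functions.

```python
# Toy dag simplifier: nodes are plain tuples, so structurally identical
# subexpressions compare equal and are counted once (a crude form of CSE).
# This is an illustrative sketch, not genfft's implementation.

ZERO = ("num", 0.0)
ONE = ("num", 1.0)

def var(name):
    return ("var", name)

def num(c):
    # Store only non-negative constants; a negative one becomes neg(positive).
    c = float(c)
    return ("num", c) if c >= 0 else neg(("num", -c))

def neg(a):
    if a[0] == "neg":          # -(-a) = a
        return a[1]
    if a == ZERO:              # -0 = 0
        return a
    return ("neg", a)

def mul(a, b):
    if a[0] == "neg":          # (-a) * b = -(a * b)
        return neg(mul(a[1], b))
    if b[0] == "neg":          # a * (-b) = -(a * b)
        return neg(mul(a, b[1]))
    if a == ONE:               # 1 * b = b
        return b
    if b == ONE:               # a * 1 = a
        return a
    if ZERO in (a, b):         # 0 * b = 0
        return ZERO
    a, b = sorted((a, b), key=repr)   # canonical order: a*b and b*a coincide
    return ("mul", a, b)

def add(a, b):
    if b[0] == "neg":          # a + (-b) = a - b
        return sub(a, b[1])
    if a[0] == "neg":          # (-a) + b = b - a
        return sub(b, a[1])
    if a == ZERO:
        return b
    if b == ZERO:
        return a
    a, b = sorted((a, b), key=repr)
    return ("add", a, b)

def sub(a, b):
    if b[0] == "neg":          # a - (-b) = a + b
        return add(a, b[1])
    if a[0] == "neg":          # (-a) - b = -(a + b)
        return neg(add(a[1], b))
    if b == ZERO:
        return a
    if a == ZERO:
        return neg(b)
    return ("sub", a, b)

def nodes(*roots):
    # Collect the distinct nodes reachable from the given roots.
    seen, stack = set(), list(roots)
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(c for c in n[1:] if isinstance(c, tuple))
    return seen

def constants(*roots):
    return {n[1] for n in nodes(*roots) if n[0] == "num"}

def evaluate(node, env):
    tag = node[0]
    if tag == "num":
        return node[1]
    if tag == "var":
        return env[node[1]]
    if tag == "neg":
        return -evaluate(node[1], env)
    l, r = evaluate(node[1], env), evaluate(node[2], env)
    return {"add": l + r, "sub": l - r, "mul": l * r}[tag]

# Two outputs written with a +0.5 / -0.5 constant pair.
x, y = var("x"), var("y")
t1 = add(mul(num(0.5), x), mul(num(-0.5), y))
t2 = add(mul(num(-0.5), x), mul(y, num(0.5)))
print(t1)
print(t2)
print(constants(t1, t2))
```

In this example both outputs end up as subtractions of the same two products, and the only constant left in the dag is 0.5, so a compiler would need to load a single constant instead of a positive/negative pair.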
In the scheduling phase, genfft produces a topological sort of the dag (a schedule). The goal of this phase is to find a schedule such that a C compiler can subsequently perform a good register allocation. The scheduling algorithm used by genfft offers certain theoretical guarantees because it has its foundations in the theory of cache-oblivious algorithms [link] (here, the registers are viewed as a form of cache), as described in "Memory strategies in FFTW". As a practical matter, one consequence of this scheduler is that FFTW's machine-independent codelets are no slower than machine-specific codelets generated by SPIRAL [link].
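To make the notion of a schedule concrete, here is a minimal Python sketch that produces one topological sort of a small dag by depth-first traversal. It only illustrates what a schedule is: this is an ordinary depth-first sort, not genfft's cache-oblivious scheduler, and the function name `schedule`, the `deps` mapping, and the node names are all hypothetical.

```python
# A schedule is an ordering of the dag's nodes in which every node appears
# after all the nodes it depends on. Plain depth-first topological sort;
# this is NOT the scheduling algorithm used by genfft.

def schedule(deps, outputs):
    """deps maps each computed node to the nodes it reads; inputs have no entry."""
    order, done = [], set()

    def visit(n):
        if n in done:
            return
        done.add(n)
        for d in deps.get(n, ()):   # emit operands first
            visit(d)
        order.append(n)             # then the node itself

    for out in outputs:
        visit(out)
    return order

# Data flow of a small butterfly-like dag (real additions/subtractions only).
deps = {
    "a": ["x0", "x2"], "b": ["x0", "x2"],
    "c": ["x1", "x3"], "d": ["x1", "x3"],
    "y0": ["a", "c"], "y2": ["a", "c"],
    "y1": ["b", "d"], "y3": ["b", "d"],
}
order = schedule(deps, ["y0", "y2", "y1", "y3"])
print(order)
```

Many different orderings are valid topological sorts of the same dag; they differ in how many intermediate values are live at once, which is what determines how well a compiler can allocate registers.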