Another example of the effect of loop reordering is a style of plan that we sometimes call vector recursion (unrelated to “vector-radix” FFTs [link] ). The basic idea is that, if one has a loop (vector rank 1) of transforms, where the vector stride is smaller than the transform size, it is advantageous to push the loop towards the leaves of the transform decomposition, while otherwise maintaining recursive depth-first ordering, rather than looping “outside” the transform; i.e., apply the usual FFT to “vectors” rather than numbers. Limited forms of this idea have appeared for computing multiple FFTs on vector processors (where the loop in question maps directly to a hardware vector) [link] . For example, Cooley-Tukey produces a unit input-stride vector loop at the top-level DIT decomposition, but with a large output stride; this difference in strides makes it non-obvious whether vector recursion is advantageous for the sub-problem, but for large transforms we often observe the planner to choose this possibility.
In-place 1d transforms (with no separate bit-reversal pass) can be obtained as follows by a combination of DIT and DIF plans ("Cooley-Tukey plans") with transposes ("Rank-0 plans"). First, a transform of size n = rm is decomposed via a radix-r DIT plan into a vector of r transforms of size m, then these are decomposed in turn by a radix-m DIF plan into a vector (rank 2) of transforms of size 1. These size-1 transforms have input and output at different places/strides in the original array, and so cannot be solved independently. Instead, an indirect plan ("Indirect plans") is used to express the sub-problem as in-place transforms of size 1, followed or preceded by an m × r rank-0 transform. The latter sub-problem is easily seen to be in-place transposes (ideally square, i.e. m = r). Related strategies for in-place transforms based on small transposes were described in [link] , [link] , [link] , [link] ; alternating DIT/DIF, without concern for in-place operation, was also considered in [link] , [link] .
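The square in-place transpose that serves as the rank-0 building block can be sketched as follows (a hypothetical helper, not FFTW's actual routine). Swapping a[i][j] with a[j][i] only for i < j visits each off-diagonal pair exactly once, so no scratch array is needed and the operation is genuinely in-place:

```c
#include <complex.h>

/* In-place transpose of a square r-by-r block stored with row stride r. */
static void transpose_in_place(double complex *a, int r)
{
    for (int i = 0; i < r; ++i)
        for (int j = i + 1; j < r; ++j) {   /* off-diagonal pairs only */
            double complex t = a[i*r + j];
            a[i*r + j] = a[j*r + i];
            a[j*r + i] = t;
        }
}
```

The non-square case is harder: an in-place m × r transpose with m ≠ r permutes elements along cycles of varying length, which is why the square case is the one the decomposition ideally targets.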
Given a problem and a set of possible plans, the basic principle behind the FFTW planner is straightforward: construct a plan for each applicable algorithmic step, time the execution of these plans, and select the fastest one. Each algorithmic step may break the problem into sub-problems, and the fastest plan for each sub-problem is constructed in the same way. These timing measurements can be performed at run time, or the plans for a given set of sizes can be precomputed and loaded at a later time.
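The measure-and-select principle can be sketched as follows; this is an illustrative sketch with hypothetical names, not FFTW's API. Each candidate is a complete "plan" for the same problem; we time each one on scratch data and keep the fastest:

```c
#include <time.h>

typedef void (*plan_fn)(double *data, int n);

/* Time each candidate plan and return the fastest one. */
static plan_fn select_fastest(plan_fn *candidates, int ncand,
                              double *scratch, int n)
{
    plan_fn best = 0;
    double best_t = 1e300;
    for (int i = 0; i < ncand; ++i) {
        clock_t t0 = clock();
        for (int rep = 0; rep < 100; ++rep)  /* repeat to reduce timer noise */
            candidates[i](scratch, n);
        double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
        if (t < best_t) { best_t = t; best = candidates[i]; }
    }
    return best;
}

/* Two dummy candidates standing in for real plans: same interface,
 * very different amounts of work. */
static void plan_cheap(double *data, int n)
{
    volatile double s = 0.0;        /* volatile defeats dead-code removal */
    for (int i = 0; i < n; ++i) s += data[i];
}

static void plan_costly(double *data, int n)
{
    volatile double s = 0.0;
    for (int r = 0; r < 500; ++r)
        for (int i = 0; i < n; ++i) s += data[i];
}
```

In a real planner the timing loop must be careful about cache warm-up and timer granularity, but the selection logic itself is exactly this simple.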
A direct implementation of this approach, however, faces an exponential explosion of the number of possible plans, and hence of the planning time, as the transform size n increases. In order to reduce the planning time to a manageable level, we employ several heuristics to reduce the space of possible plans that must be compared. The most important of these heuristics is dynamic programming [link] : it optimizes each sub-problem locally, independently of the larger context (so that the “best” plan for a given sub-problem is re-used whenever that sub-problem is encountered). Dynamic programming is not guaranteed to find the fastest plan, because the performance of plans is context-dependent on real machines (e.g., the contents of the cache depend on the preceding computations); however, this approximation works reasonably well in practice and greatly reduces the planning time. Other approximations, such as restrictions on the types of loop reorderings that are considered ("Plans for higher vector ranks"), are described in [link] .
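The effect of dynamic programming can be sketched with a toy cost model (hypothetical, not FFTW's data structures or its real cost measurements): a size-n problem is either solved directly at cost n², or split Cooley-Tukey style into n = r · m at cost n plus the memoized costs of the sub-problems. Because the best cost for each size is cached, each sub-problem is evaluated once no matter how many decompositions reach it, turning an exponential search into a polynomial one:

```c
#define MAXN 4096
static double memo[MAXN + 1];   /* 0 = not yet computed; costs are > 0 */

/* Best cost over all decompositions of a size-n problem. */
static double best_cost(int n)
{
    if (memo[n] != 0.0) return memo[n];   /* re-use cached sub-plan */
    double best = (double)n * n;          /* direct "base-case" plan */
    for (int r = 2; r * r <= n; ++r)      /* try each radix-r split  */
        if (n % r == 0) {
            int m = n / r;
            /* r transforms of size m, m of size r, plus twiddle cost n;
             * the cost is symmetric in r and m, so r <= sqrt(n) suffices */
            double c = n + r * best_cost(m) + m * best_cost(r);
            if (c < best) best = c;
        }
    return memo[n] = best;
}
```

Dropping the memo table makes the same search re-time every sub-problem in every context, which is exactly the exponential blow-up described above.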