Because access to memory is in many cases the slowest part of the computer, especially compared to arithmetic, one wishes to load as much data as possible into the faster levels of the hierarchy, and then perform as much computation as possible before going back to the slower memory devices. This is called temporal locality: if a given datum is used more than once, we arrange the computation so that these usages occur as close together as possible in time.
To understand temporal-locality strategies at a basic level, in this section we will employ an idealized model of a cache in a two-level memory hierarchy, as defined in [link]. This ideal cache stores Z data items from main memory (e.g. complex numbers for our purposes): when the processor loads a datum from memory, the access is quick if the datum is already in the cache (a cache hit) and slow otherwise (a cache miss, which requires the datum to be fetched into the cache). When a datum is loaded into the cache, it must replace some other datum, and the ideal-cache model assumes that the optimal replacement strategy is used [link]: the new datum replaces the datum that will not be needed for the longest time in the future; in practice, this can be simulated to within a factor of two by replacing the least-recently used datum [link], but ideal replacement is much simpler to analyze. (More generally, one can assume that a cache line of L consecutive data items is loaded into the cache at once, in order to exploit spatial locality; the ideal-cache model in this case requires that the cache be tall [link].) Armed with this ideal-cache model, we can now understand some basic features of FFT implementations that remain essentially true even on real cache architectures. In particular, we want to know the cache complexity, the number Q(n; Z) of cache misses for an FFT of size n with an ideal cache of size Z, and what algorithm choices reduce this complexity.
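As a concrete illustration of the least-recently-used strategy mentioned above, the following Python sketch counts cache misses for a given sequence of accesses under LRU replacement in the unit-line-size model (one datum per cache line). The function name and trace representation are our own, chosen only for illustration.

```python
from collections import OrderedDict

def lru_misses(trace, Z):
    """Count cache misses for an access trace under least-recently-used
    replacement, with an ideal cache holding Z data items (one datum
    per cache line)."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    misses = 0
    for addr in trace:
        if addr in cache:
            cache.move_to_end(addr)  # cache hit: mark as most recently used
        else:
            misses += 1              # cache miss: fetch the datum
            if len(cache) >= Z:
                cache.popitem(last=False)  # evict the least-recently used
            cache[addr] = None
    return misses
```

For example, sweeping repeatedly over 4 items with Z = 4 misses only on the first sweep, whereas the same sweeps with Z = 2 miss on every access.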
First, consider a textbook radix-2 algorithm, which divides n by 2 at each stage and operates breadth-first as in [link] (left), performing all butterflies of a given size at a time. If n > Z, then each pass over the array incurs Θ(n) cache misses to reload the data, and there are log₂ n passes, for Θ(n log₂ n) cache misses in total: no temporal locality at all is exploited!
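The structure responsible for this behavior can be seen in a minimal Python sketch of such a breadth-first radix-2 decimation-in-time FFT (the function name is ours). Each iteration of the outer while loop is one stage that sweeps the entire array, so when n > Z every stage must reload all n items from slow memory.

```python
import cmath

def fft_breadth_first(x):
    """Breadth-first (iterative) radix-2 DIT FFT of length n = 2^k.
    Each of the log2(n) stages sweeps the whole array once."""
    n = len(x)
    bits = n.bit_length() - 1
    # bit-reversal permutation of the input
    a = [x[int(format(i, f'0{bits}b')[::-1], 2)] for i in range(n)]
    m = 1
    while m < n:  # one full pass over the array per stage
        w_m = cmath.exp(-2j * cmath.pi / (2 * m))  # twiddle-factor step
        for k in range(0, n, 2 * m):  # all butterflies of size 2m
            w = 1.0
            for j in range(m):
                t = w * a[k + j + m]
                u = a[k + j]
                a[k + j] = u + t
                a[k + j + m] = u - t
                w *= w_m
        m *= 2
    return a
```

For instance, transforming the constant input [1, 1, 1, 1] yields [4, 0, 0, 0], as expected for a DFT.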
One traditional solution to this problem is blocking: the computation is divided into maximal blocks that fit into the cache, and the computations for each block are completed before moving on to the next block. Here, a block of Z numbers can fit into the cache, and thus the natural unit of computation is a sub-FFT of size Z. (Of course, additional storage may be required for twiddle factors, the output data if the FFT is not in-place, and so on, but these only affect the Z that fits into cache by a constant factor and hence do not impact cache-complexity analysis; we won't worry about such constant factors in this section.) Since each of these blocks involves Θ(Z log Z) arithmetic operations, and there are Θ(n log n) operations overall, there must be Θ(n log n / (Z log Z)) such blocks. More explicitly, one could use a radix-Z Cooley-Tukey algorithm, breaking down n by factors of Z [or Θ(Z)] until a size ≤ Z is reached: each stage requires Θ(n/Z) blocks, and there are Θ(log_Z n) stages, again giving Θ(n log n / (Z log Z)) blocks overall. Since each block requires Θ(Z) cache misses to load it into cache, the cache complexity of such a blocked algorithm is Θ(n log_Z n) = Θ(n log n / log Z).
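The gap between the two strategies is easy to quantify numerically. The sketch below evaluates the two miss counts up to constant factors, under the assumption that n and Z are powers of two and n is a power of Z; the function names are ours.

```python
import math

def breadth_first_misses(n):
    """Cache misses (up to constant factors) for the textbook
    breadth-first radix-2 FFT when n > Z: log2(n) passes, each
    reloading all n data items."""
    return n * int(math.log2(n))

def blocked_misses(n, Z):
    """Cache misses (up to constant factors) for a blocked radix-Z
    FFT, assuming n is a power of Z: log_Z(n) stages of n/Z blocks,
    each costing Z misses to load, i.e. n * log_Z(n) in total."""
    stages = (n.bit_length() - 1) // (Z.bit_length() - 1)  # log_Z(n)
    return stages * (n // Z) * Z
```

With n = 2^20 and Z = 2^10, blocking reduces the miss count by a factor of log₂ Z = 10, matching the Θ(n log n / log Z) bound.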