<< Chapter < Page | Chapter >> Page > |
The pattern exhibited by [link] can be exploited to access the data stored in the input array with better locality, as Figures [link] and [link] show. [link] depicts the memory access pattern of an FFT with size-16 hard-coded leaves, while [link] depicts the same FFT with sorted hard-coded leaves.
To compute the FFT with sorted leaves, the leaf sub-transforms and the body sub-transforms are split into two separate lists, and the entire list of leaf sub-transforms is computed before any of the body sub-transforms. There is, however, a cost associated with this re-arrangement: each leaf sub-transform's offset into the output array is not easy to compute because the offsets are now essentially decimated-in-frequency, and thus they are now pre-computed. Overall, the trade-off is justified because the output memory accesses within each leaf sub-transform are still linear.
The leaf transforms can be computed in three loops. The first and third loops compute size- sub-transforms, while the second loop computes size- sub-transforms. The size of the three loops , and are:
and
The transform can now be elaborated without leaf nodes, and the code for the three loops emitted in the place of calls to individual leaf sub-transforms.
[link] is the main function for the FFT that corresponds to the leaf node list in [link] . The first and third loops invoke size-16 sub-transforms at lines 8 and 16, and the second loop invokes 2x size-8 sub-transforms at line 12. Following the leaf sub-transforms, the body sub-transforms are called at lines 19-23.
void sfft_dcf128_shl16_4(sfft_plan_t *p,const void *vin,void *vout){
const SFFT_D *in = vin; SFFT_D *out = vout;
offset_t *is = p->is_base;
offset_t *offsets = p->offsets_base;
int i; for(i=3;i>0;--i) {
sfft_dcf128_shl16_4_e(is, in, out+offsets[0]);
is += 16; offsets += 1; }
for(i=3;i>0;--i) {
sfft_dcf128_shl16_4_o(is, in, out+offsets[0]);
is += 16; offsets += 1; }
for(i=2;i>0;--i) {
sfft_dcf128_shl16_4_e(is, in, out+offsets[0]);
is += 16; offsets += 1; }
sfft_dcf128_shl16_4_X_4(out+0, 32, p->ws[0]); sfft_dcf128_shl16_4_X_4(out+0, 64, p->ws[1]); sfft_dcf128_shl16_4_X_4(out+128, 32, p->ws[0]); sfft_dcf128_shl16_4_X_4(out+192, 32, p->ws[0]); sfft_dcf128_shl16_4_X_4(out+0, 128, p->ws[2]);}
Hard-coded VL-1 size-128 FFT with size-16 leaves (sub-transforms omitted)
In terms of code size, computing the leaf sub-transforms with three loops is economical. As the size of the transform grows, the code size attributed to the leaf sub-transforms remains constant. However, as the size of the transform begins to grow large (e.g., ), the instructions required for the body sub-transform calls (lines 19-23 in [link] ) begins to dominate the overall program size. "Optimizing the hierarchical structure" describes a method for compressing the code size of the body sub-transform calls while maintaining performance.
Notification Switch
Would you like to follow the 'Computing the fast fourier transform on simd microprocessors' conversation and receive update notifications?