<< Chapter < Page | Chapter >> Page > |
void sfft_dcf8192_shl16_8_4(sfft_plan_t *p, SFFT_D *out) {
X_4(out+0, 32, p->ws[0]); X_4(out+128, 32, p->ws[0]); X_4(out+192, 32, p->ws[0]); X_8(out+0, 128, p->ws[2]);}
void sfft_dcf8192_shl16_8_5(sfft_plan_t *p, SFFT_D *out) { X_8(out+0, 64, p->ws[1]); X_8(out+128, 64, p->ws[1]);}
void sfft_dcf8192_shl16_8_9(sfft_plan_t *p, SFFT_D *out) { X_8(out+0, 64, p->ws[1]); X_4(out+128, 32, p->ws[0]); X_4(out+192, 32, p->ws[0]); sfft_dcf8192_shl16_8_5(p, out+256);
X_8(out+0, 256, p->ws[3]);}
void sfft_dcf8192_shl16_8_13(sfft_plan_t *p, SFFT_D *out) { sfft_dcf8192_shl16_8_4(p, out+0);
sfft_dcf8192_shl16_8_5(p, out+256); sfft_dcf8192_shl16_8_4(p, out+512);
sfft_dcf8192_shl16_8_4(p, out+768); X_8(out+0, 512, p->ws[4]);}
void sfft_dcf8192_shl16_8_14(sfft_plan_t *p, SFFT_D *out) { sfft_dcf8192_shl16_8_9(p, out+0);
sfft_dcf8192_shl16_8_9(p, out+512);}
void sfft_dcf8192_shl16_8_18(sfft_plan_t *p, SFFT_D *out) { sfft_dcf8192_shl16_8_9(p, out+0);
sfft_dcf8192_shl16_8_4(p, out+512); sfft_dcf8192_shl16_8_4(p, out+768);
sfft_dcf8192_shl16_8_14(p, out+1024); X_8(out+0, 1024, p->ws[5]);}
void sfft_dcf8192_shl16_8_22(sfft_plan_t *p, SFFT_D *out) { sfft_dcf8192_shl16_8_13(p, out+0);
sfft_dcf8192_shl16_8_14(p, out+1024); sfft_dcf8192_shl16_8_13(p, out+2048);
sfft_dcf8192_shl16_8_13(p, out+3072); X_8(out+0, 2048, p->ws[6]);}
void sfft_dcf8192_shl16_8_1(sfft_plan_t *p, SFFT_D *out) { sfft_dcf8192_shl16_8_22(p, out+0);
sfft_dcf8192_shl16_8_18(p, out+4096); sfft_dcf8192_shl16_8_18(p, out+6144);
sfft_dcf8192_shl16_8_22(p, out+8192); sfft_dcf8192_shl16_8_22(p, out+12288);
X_8(out+0, 8192, p->ws[8]);}
Optimized body sub-transforms for size-8192 FFT
A size-8192 hard-coded leaf FFT requires 229 calls to radix-2/4 and size-8 body sub-transforms. After optimizing the sequence of calls with Sequitur, the compact set of functions shown in [link] replaces a sequence of 229 calls.
Compared to the full list of statically elaborated calls, the optimized set of functions requires less code space while achieving better performance; and compared to a recursive program, the optimized set of function calls is faster (due to lower call and stack overhead) while trading off an acceptably small amount of code space.
The technique presented in this section has been verified for transforms ranging in size from through to (32 mega) points. The technique works well up until sizes of about points, but for larger transforms the elaboration and compile times begin to exceed 1 second or so, and the code size again begins to grow large. For transforms larger than points, a recursive program can be used until leaves of size points are reached, at which point the technique presented in this section is used.
The method of vectorizing the hard-coded leaf FFT is similar to that of the hard-coded FFT in "Other vector lengths" ; the only difference here is the level of scale.
The hard-coded FFT was vectorized by collecting together primitive leaf operations that loaded data from adjacent memory locations. The hard-coded leaf FFT has already been sorted such that consecutive leaf sub-transforms load data from adjacent memory locations (see "Improving memory locality in the leaves" ), so the task is easier in this case – at least in one respect.
Notification Switch
Would you like to follow the 'Computing the fast fourier transform on simd microprocessors' conversation and receive update notifications?