<< Chapter < Page | Chapter >> Page > |
[link] is a VL-1 size-64 hard-coded four-step FFT. Before it can be used, an initialization procedure (not shown) allocates and populates the LUT at line 1 with the twiddle factors that are required for the step 2 multiplications. Line 44 shows the main function that executes the first sub-transform on the first column (line 49), and the second sub-transform on all remaining columns (line 50). Finally, the sub-transforms corresponding to step 4 of the four-step algorithm are executed on all columns in line 51.
The twiddle factor multiplication that corresponds to step 2 of the four-step algorithm takes place in lines 21-23 and lines 26-29. The first register is not multiplied with a twiddle factor because the first row of twiddle factors are (i.e., unity). The other registers are multiplied with two registers loaded from the LUT, which are the unpacked real and imaginary parts (see Fast interleaved complex multiplication for details about unpacked complex multiplication).
const SFFT_D __attribute__ ((aligned(32))) *LUT;
const SFFT_D *pLUT;void sfft_fcf64_fs_x1(sfft_plan_t *p, const void *vin, void *vout) {
const SFFT_D *in = vin; SFFT_D *out = vout;
SFFT_R r0,r1,r2,r3,r4,r5,r6,r7; L_4(in+0,in+64,in+32,in+96,&r0,&r1,&r2,&r3);
L_2(in+16,in+80,in+112,in+48,&r4,&r5,&r6,&r7);
K_0(&r0,&r2,&r4,&r6);
K_N(VLIT4(0.7071,0.7071,0.7071,0.7071), VLIT4(0.7071,-0.7071,0.7071,-0.7071),&r1,&r3,&r5,&r7);
r1 = MUL(r1,LOAD(pLUT+0),LOAD(pLUT+4)); TX2(r0,r1);
r2 = MUL(r2,LOAD(pLUT+8),LOAD(pLUT+12)); r3 = MUL(r3,LOAD(pLUT+16),LOAD(pLUT+20));
TX2(r2,r3); r4 = MUL(r4,LOAD(pLUT+24),LOAD(pLUT+28));
r5 = MUL(r5,LOAD(pLUT+32),LOAD(pLUT+36)); TX2(r4,r5);
r6 = MUL(r6,LOAD(pLUT+40),LOAD(pLUT+44)); r7 = MUL(r7,LOAD(pLUT+48),LOAD(pLUT+52));
TX2(r6,r7); S_4(r0,r2,r4,r6,out+0,out+4,out+8,out+12);
S_4(r1,r3,r5,r7,out+16,out+20,out+24,out+28); pLUT += 56;
}void sfft_fcf64_fs_x2(sfft_plan_t *p, const void *vin, void *vout) {
const SFFT_D *in = vin; SFFT_D *out = vout;
SFFT_R r0,r1,r2,r3,r4,r5,r6,r7; L_4(in+0,in+64,in+32,in+96,&r0,&r1,&r2,&r3);
L_2(in+16,in+80,in+112,in+48,&r4,&r5,&r6,&r7);
K_0(&r0,&r2,&r4,&r6);
K_N(VLIT4(0.7071,0.7071,0.7071,0.7071), VLIT4(0.7071,-0.7071,0.7071,-0.7071),&r1,&r3,&r5,&r7);
S_4(r0,r2,r4,r6,out+0,out+32,out+64,out+96); S_4(r1,r3,r5,r7,out+16,out+48,out+80,out+112);
}void sfft_fcf64_fs(sfft_plan_t *p, const void *vin, void *vout) {
const SFFT_D *in = vin; SFFT_D *out = vout;
pLUT = LUT; int i;
for(i=0;i<4;i++) sfft_fcf64_fs_x1(p, in+(i*4), out+(i*32));
for(i=0;i<4;i++) sfft_fcf64_fs_x2(p, out+(i*4), out+(i*4));
}
Hard-coded four-step VL-2 size-64 FFT
For , the FFTs along the columns are computed in parallel. Thus, in step 1, FFTs are computed along the columns of the array with stride , and in step 4, FFTs are computed along the columns with stride .
An implication of computing the first column in parallel with other columns is that the first column is now multiplied by unity twiddle factors, and thus only two sub-transforms are used instead of three.
Notification Switch
Would you like to follow the 'Computing the fast fourier transform on simd microprocessors' conversation and receive update notifications?