<< Chapter < Page | Chapter >> Page > |
void
sfft_fcf128_shl16_8_eo(offset_t *is,const SFFT_D *in,SFFT_D *out){ SFFT_R r0_1,r2_3,r4_5,r6_7,r8_9,r10_11,r12_13,r14_15,
r16_17,r18_19,r20_21,r22_23,r24_25,r26_27,r28_29,r30_31; L_4_4(in+is[0],in+is[1],in+is[2],in+is[3],
&r0_1,&r2_3,&r16_17,&r18_19);
L_2_2(in+is[4],in+is[5],in+is[6],in+is[7], &r4_5,&r6_7,&r20_21,&r22_23);
K_N(VLIT4(0.7071,0.7071,1,1),VLIT4(0.7071,-0.7071,0,-0), &r0_1,&r2_3,&r4_5,&r6_7);
L_4_2(in+is[8],in+is[9],in+is[10],in+is[11], &r8_9,&r10_11,&r28_29,&r30_31);
L_4_4(in+is[12],in+is[13],in+is[14],in+is[15], &r12_13,&r14_15,&r24_25,&r26_27);
K_N(VLIT4(0.9239,0.9239,1,1),VLIT4(0.3827,-0.3827,0,-0), &r0_1,&r4_5,&r8_9,&r12_13);
S_4(r0_1,r4_5,r8_9,r12_13,out0+0,out0+8,out0+16,out0+24); K_N(VLIT4(0.3827,0.3827,0.7071,0.7071),
VLIT4(0.9239,-0.9239,0.7071,-0.7071), &r2_3,&r6_7,&r10_11,&r14_15);
S_4(r2_3,r6_7,r10_11,r14_15,out0+4,out0+12,out0+20,out0+28); K_N(VLIT4(0.7071,0.7071,1,1),VLIT4(0.7071,-0.7071,0,-0),
&r16_17,&r18_19,&r20_21,&r22_23);
S_4(r16_17,r18_19,r20_21,r22_23,out1+0,out1+4,out1+8,out1+12); K_N(VLIT4(0.7071,0.7071,1,1),VLIT4(0.7071,-0.7071,0,-0),
&r24_25,&r26_27,&r28_29,&r30_31);
S_4(r24_25,r26_27,r28_29,r30_31,out1+16,out1+20,out1+24,out1+28);}
Heterogeneous size-16 leaf sub-transform for VL-2 size-128 hard-coded leaf FFT
In the case of a vector comprising heterogeneous leaf sub-transforms, the data is transposed into separate sub-transforms following the primitive leaf operations. The remainder of the computation is carried out separately for each leaf sub-transform in the vector, and no further transposes are required.
When elaborating and generating code for VL-2 transforms, there are only two heterogeneous leaf sub-transforms that might be required, but for other vector lengths the combinations are more complex. During the elaboration process, each unique combination that is encountered in the sorted list of leaf sub-transforms is elaborated into a function with repeated calls to the
elaborate
function, as was done in
"Vector length 1" in order to elaborate a sub-transform composed of two size
sub-transforms.
[link] is an example of a heterogeneous size-16 VL-2 leaf sub-transform, where one size-16 leaf sub-transform is loaded into the lower halves of the vector registers, and the data from another leaf sub-transform composed of two size-8 sub-transforms is loaded into the upper halves. The primitive leaf operations at lines 5, 7, 11 and 13 transpose each sub-transform's data into separate vector registers, and the remainder of the computation is performed on each sub-transform separately. The size-16 sub-transform is stored to sequential locations in memory at lines 17 and 21, while the sub-transform composed of two size-8 leaf sub-transforms is stored to memory at lines 24 and 27.
Some machines support streaming store or non-temporal store instructions; these instructions are used to store data to locations that do not have temporal locality, and thus the cache can be bypassed. The hard-coded leaf FFT described in the previous sections splits the computation into a pass of leaf sub-transforms and several passes of body sub-transforms. For large transforms where the size of the data exceeds the outermost level of cache, the non-temporal store instructions can be used in the leaf sub-transforms to bypass the cache when storing data to the output array; this can greatly improve performance by keeping other data in cache. The Intel SSE and AVX vector extensions both support streaming stores.
Notification Switch
Would you like to follow the 'Computing the fast fourier transform on simd microprocessors' conversation and receive update notifications?