<< Chapter < Page | Chapter >> Page > |
Size | Input array addresses |
16 | {0, 64, 32, 96, 16, 80, 112, 48, 8, 72, 40, 104, 120, 56, 24, 88} |
16 | {1, 65, 33, 97, 17, 81, 113, 49, 9, 73, 41, 105, 121, 57, 25, 89} |
16 | {2, 66, 34, 98, 18, 82, 114, 50, 10, 74, 42, 106, 122, 58, 26, 90} |
8(x2) | {3, 67, 35, 99, 19, 83, 115, 51, 123, 59, 27, 91, 11, 75, 107, 43} |
8(x2) | {4, 68, 36, 100, 20, 84, 116, 52, 124, 60, 28, 92, 12, 76, 108, 44} |
8(x2) | {5, 69, 37, 101, 21, 85, 117, 53, 125, 61, 29, 93, 13, 77, 109, 45} |
16 | {126, 62, 30, 94, 14, 78, 110, 46, 6, 70, 38, 102, 118, 54, 22, 86} |
16 | {127, 63, 31, 95, 15, 79, 111, 47, 7, 71, 39, 103, 119, 55, 23, 87} |
[link] shows the sorted size-16 leaf sub-transforms for a size-128 transform with the rows divided into VL-2 groups. Because each group of two leaf sub-transforms loads data from adjacent memory locations, the group of sub-transforms can be loaded in parallel with vector memory operations, and all (or some) of the computation done in parallel. The first, third and fourth groups in [link] contain leaf nodes of the same size/type; these are the easiest vector leaf sub-transforms to compute, as described in "Homogeneous leaf sub-transform vectors" . The second group of rows contains leaf sub-transforms of differing size/type, and computing these sub-transforms is covered separately in "Heterogeneous leaf sub-transform vectors" .
void
sfft_fcf128_shl16_8_ee(offset_t *is,const SFFT_D *in,SFFT_D *out){ SFFT_R r0,r1,r2,r3,r4,r5,r6,r7,r8,r9,r10,r11,r12,r13,r14,r15;
L_4(in+is[0],in+is[1],in+is[2],in+is[3],&r0,&r1,&r2,&r3);
L_2(in+is[4],in+is[5],in+is[6],in+is[7],&r4,&r5,&r6,&r7);
K_0(&r0,&r2,&r4,&r6);
K_N(VLIT4(0.7071,0.7071,0.7071,0.7071), VLIT4(0.7071,-0.7071,0.7071,-0.7071),
&r1,&r3,&r5,&r7);
L_4(in+is[8],in+is[9],in+is[10],in+is[11],&r8,&r9,&r10,&r11);
L_4(in+is[12],in+is[13],in+is[14],in+is[15],&r12,&r13,&r14,&r15);
K_0(&r0,&r4,&r8,&r12);
K_N(VLIT4(0.9239,0.9239,0.9239,0.9239), VLIT4(0.3827,-0.3827,0.3827,-0.3827),
&r1,&r5,&r9,&r13);
TX2(&r0,&r1); TX2(&r4,&r5); TX2(&r8,&r9); TX2(&r12,&r13);
S_4(r0,r4,r8,r12,out0+0,out0+8,out0+16,out0+24); S_4(r1,r5,r9,r13,out1+0,out1+8,out1+16,out1+24);
K_N(VLIT4(0.7071,0.7071,0.7071,0.7071), VLIT4(0.7071,-0.7071,0.7071,-0.7071),
&r2,&r6,&r10,&r14);
K_N(VLIT4(0.3827,0.3827,0.3827,0.3827), VLIT4(0.9239,-0.9239,0.9239,-0.9239),
&r3,&r7,&r11,&r15);
TX2(&r2,&r3); TX2(&r6,&r7); TX2(&r10,&r11); TX2(&r14,&r15);
S_4(r2,r6,r10,r14,out0+4,out0+12,out0+20,out0+28); S_4(r3,r7,r11,r15,out1+4,out1+12,out1+20,out1+28);
}
Homogeneous size-16 leaf sub-transform for VL-2 size-128 hard-coded leaf FFT
The vector leaf sub-transforms of a single size/type are handled in the same way as a VL-1 sub-transform, with one difference: the vector registers must be transposed before the data is stored to memory in the output array. In the example shown in [link] , the transposes take place at lines 16 and 25.
Prior to the store operations, each position of the vector register (each position being a whole complex word) contains an element belonging to each of the leaf sub-transforms composing the vectorized sub-transform. Because each leaf sub-transform is stored sequentially to different locations in memory with aligned vector store operations, sets of registers are transposed such that each vector register contains elements from only one leaf sub-transform.
Notification Switch
Would you like to follow the 'Computing the fast fourier transform on simd microprocessors' conversation and receive update notifications?