<< Chapter < Page | Chapter >> Page > |
[link] contains references to several types, functions and macros that use upper-case identifiers – these are
primitive functions or types that have been predefined as inline functions or macros. A benefit of using primitives in this way is that the details specific to numerical representation and the underlying machine have been abstracted away; thus, the same function can be compiled for a variety of types and machines by simply including a different header file with different primitives.
[link] , for example, could be compiled for double-precision arithmetic on an SSE2 machine by including
sse_double.h
, or it could be compiled with much slower scalar arithmetic by including
scalar.h
. The same code can even be used, without modification, to compute forward and backwards transforms, by using C preprocessor directives to conditionally alter the macros.
In order to accommodate mixed numerical representations, the signature of the outermost function references data with void pointers. In the case of the double-precision example in
[link] ,
SFFT_D
would be defined to be
double
in the appropriate header file, and the void pointers are then cast to
SFFT_D
pointers.
The size-8 transform in
[link] uses 8 registers, and thus a declaration of 8 registers of type
SFFT_R
has been emitted at line 4 in
[link] . In the case of double-precision arithmetic on a SSE2 machine,
SFFT_R
is defined as
__m128d
in
sse_double.h
.
The first two rows of
[link] correspond to lines 6 and 7 of
[link] , respectively. The
L_4
primitive is used to compute the size-4 leaf node in the first row of the table. The second row is a load/leaf node of size 2(x2), indicating two size-2 nodes in parallel, which is computed with the
L_2
primitive. The input addresses in the table are the addresses of complex words, while the addresses in the generated code refer to the real and imaginary parts of a complex word, and thus the addresses from
[link] are multiplied by a factor of 2 to obtain the addresses in
[link] .
The final two
CNodeBfly
nodes of
[link] correspond to the
K_0
and
K_N
sub-transform (a.k.a. butterfly) primitives at lines 8 and 10, respectively. Because the node in the third row of
[link] has a twiddle factor of
(i.e., unity), the computation requires no multiplication, and the
K_0
primitive is used for this special case. The
K_N
primitive at line 10 does require a twiddle factor, which is passed to
K_N
as two vector literals that represent the twiddle factor in unpacked form.
For the purposes of brevity, the precision has been truncated to only a few decimal places.
Fast interleaved complex multiplication describes how interleaved complex multiplication is faster if one operand is pre-unpacked.
After each node is processed, the registers that have been used by it are checked in a map (
rmap
) that maps each register to the last node to have used that register. If the current node is the last node to have used a register, the register is stored to memory. In the case of the transform in
[link] , four registers are stored with an instance of the
S_4
primitive at lines 9 and 11. In contrast to the load operations at the leaves, which are decimated-in-time and thus effectively pseudo-random memory accesses, the store operations are to linear regions of memory, the addresses of which can be determined from each register's integer identifier. The store address offset for data in register
is simply
.
Notification Switch
Would you like to follow the 'Computing the fast fourier transform on simd microprocessors' conversation and receive update notifications?