<< Chapter < Page | Chapter >> Page > |
Appendix 2 contains listings of source code that augment each of the simple implementations from the previous section with LUTs ofprecomputed coefficients. The modifications are fairly minor: each implementation now has an initialization function that populates theLUT(s) based on the size of the transform to be computed, and each transform function now has a parameter of , so as to economically index the twiddle factors with little computation.
As [link] shows, the speedup resulting from the precomputed twiddle LUT is dramatic – sometimes more than a factor of 6(cf. [link] ). Interestingly, the ordinary split-radix algorithm is now faster than the conjugate-pair algorithm, and inspection ofthe compiler output shows that this is due to the more complicated addressing scheme at the leaves of the computation, and because the compiler lacks goodheuristics for complex multiplication by a conjugate. The performance of the tangent FFT is hampered by the same problem, yet the tangent FFT hasbetter performance, which can be attributed to the tangent FFT having larger straight line blocks of code at the leaves of the computation (thetangent FFT has leaves of size 4, while the split-radix and conjugate-pair FFTs have leaves of size 2).
The performance of the programs in the previous section may be further improved by explicitly describing the computation with SIMD intrinsics. Auto-vectorizing compilers, such as theIntel C compiler used to compile the previous examples, can extract some data-level parallelism and generate SIMD codefrom a scalar description of a computation, but better results can be obtained when using vector intrinsics to explicitly specify the parallel computation.
Intrinsics are an alternative to inline assembly code when the compiler fails to meet performance constraints. In most cases an intrinsic function directly maps to a single instruction on the underlying machine, and so intrinsics provide many of the advantages of inline assembler code. Butin contrast to inline assembler code, the compiler uses its detailed knowledge of the intrinsic semantics to providebetter optimizations and handle tasks such as register allocation.
Almost all desktop and handheld machines now have processors that implement some sort of SIMD extension to the instruction set. All major Intelprocessors since the Pentium III have implemented SSE, an extension to the x86 architecture that introduced 4-way single precision floating pointcomputation with a new register file consisting of eight 128-bit SIMD registers – known as XMM registers. The AMD64 architecture doubled the numberof XMM registers to 16, and Intel followed by implementing 16 XMM registers in the Intel 64 architecture. SSE has since been expanded with support forother data types and new instructions with the introduction of SSE2, SSE3, SSSE3 and SSE4. Most notably, SSE2 introduced support for double precisionfloating point arithmetic and thus Intel's first attempt at SIMD extensions, MMX, was effectively deprecated. Intel's recent introductionof the sandybridge micro-architecture heralded the first implementation of AVX – a major upgrade to SSE that doubled the size of XMM registersto 256 bits (and renamed them YMM registers), enabling 8-way single precision and 4-way double precision floating point arithmetic.
Notification Switch
Would you like to follow the 'Computing the fast fourier transform on simd microprocessors' conversation and receive update notifications?