Scope
In investigating the hypotheses, the scope of this work has been limited in several ways:
- It is limited to single-threaded complex 1D FFTs, because
multi-dimensional, multi-threaded or multi-processor FFTs (or anycombination thereof) are ultimately decomposed into 1D components running on
a single core, and all other things being equal, it is the performance ofthese 1D components running on a single microprocessor core that determines
the overall performance of a given multi-threaded implementation;
- It is limited to transforms that operate on vectors of length
where
, because these are the easiest to compute on
machines, and consequently the most often used by applications. This excludesthe prime-factor algorithm
[link] ,
[link] ,
and the Radar
[link] and
Bluestein
[link] ,
[link] ,
[link] algorithms for prime sizes;
- It is limited to the
split-radix
[link] ,
[link] ,
[link] ,
[link] ,
[link] and
conjugate-pair
[link] ,
[link] ,
[link] ,
[link] algorithms. The Winograd
algorithm
[link] ,
[link] ,
[link] ,
[link] is excluded because of its low performance on systems where multiplication
costs about the same as addition;
- It is limited to out-of-place transforms, because they are generally
faster than in-place transforms, except at the boundaries of thecache
[link] ;
- The benchmark experiments are limited to the Intel x86 and ARM
machines, because it is estimated that 92% of the microprocessors in therapidly expanding mobile market are ARM devices
[link] , while
Intel's share of the worldwide PC and mobile PC microprocessors markets isestimated to be 79.3% and 84.4%,
respectively
[link] .
Contributions
The contributions of this work are summarized as follows:
- Three methods of computing the conjugate-pair algorithm on SIMD
microprocessors are described in
Streaming FFT ;
- The source code for the high-performance SIMD FFT library
developed in this thesis is publicly available under a permissive open sourcelicence
on github.
Organization
This work is divided into two parts. The first part,
Chapters 1-4, encompasses therelevant background, while the second part,
Chapters 5-8, is concerned withcontributions that challenge the state of the art.
A brief overview of the contents of each chapter:
-
Algorithms provides an overview of FFT algorithms from the mathematical perspective;
-
Implementation details complements the mathematical perspective
of the previous chapter with a more focused view of the low level detailsthat are relevant to efficient implementation on SIMD microprocessors;
-
Existing libraries reviews existing state of the art libraries,
with reference to algorithms and implementation details of the previouschapters;
-
Streaming FFT describes SFFT, a library for
SIMD microprocessors that is, in many cases, faster than the state ofthe art FFT libraries reviewed in
Existing libraries ;
-
Benchmark methods describes the benchmarking methods used to
evaluate performance and accuracy of various FFT implementations throughoutthis work;
-
Results and discussion presents the results of benchmarks on 18
different machines, as well as the results of model-based optimizationexperiments, with reference to earlier chapters and other related work;
-
Conclusions and future work concludes the work with a review of
the hypotheses, a summary of the contributions, and some idea for directionsthat future work might take.