address(A) + (I-1)*sizeof_datatype(A)
+ (J-1)*sizeof_datatype(A) * column_dimension(A)
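To make the arithmetic concrete (with hypothetical numbers), consider a REAL*8 array declared as A(100,100): each element occupies eight bytes and each column holds 100 elements, so A(I,J) lives at address(A) + (I-1)*8 + (J-1)*8*100.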
If A(I,J) is used more than once, we have multiple copies of the same address computation. Common subexpression elimination will (hopefully) discover the redundant computations and group them together.
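The same idea applies to ordinary arithmetic. As a minimal sketch (the variables here are invented for illustration), the subexpression (A + B) below is computed twice:
D = C * (A + B)
E = (A + B) / 2.0
After common subexpression elimination, the compiler in effect computes the sum once and reuses it:
TEMP = A + B
D = C * TEMP
E = TEMP / 2.0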
Loops are where many high performance computing programs spend a majority of their time. The compiler looks for every opportunity to move calculations out of a loop body and into the surrounding code. Expressions that don't change after the loop is entered (loop-invariant expressions) are prime targets. The following loop has two loop-invariant expressions:
DO I=1,N
  A(I) = B(I) + C * D
  E = G(K)
ENDDO
Below, we have modified the expressions to show how they can be moved to the outside:
temp = C * D
DO I=1,N
  A(I) = B(I) + temp
ENDDO
E = G(K)
It is possible to move code before or after the loop body. As with common subexpression elimination, address arithmetic is a particularly important target for loop-invariant code motion. Slowly changing portions of index calculations can be pushed into the suburbs, to be executed only when needed.
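As a hedged illustration (the arrays and scalars here are invented), the same transformation applies to expressions that are invariant with respect to only one loop of a nest. In the loop below, X(J)*S depends on J but not on I:
DO J=1,N
  DO I=1,N
    A(I,J) = B(I,J) + X(J)*S
  ENDDO
ENDDO
The compiler can hoist that product out of the inner loop, computing it once per column instead of once per element:
DO J=1,N
  T = X(J) * S
  DO I=1,N
    A(I,J) = B(I,J) + T
  ENDDO
ENDDO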
Loops can contain what are called induction variables. Their value changes as a linear function of the loop iteration count. For example, K is an induction variable in the following loop. Its value is tied to the loop index:
DO I=1,N
  K = I*4 + M
  ...
ENDDO
Induction variable simplification replaces calculations for variables like K with simpler ones. Given a starting point and the expression's first derivative, you can arrive at K's value for the nth iteration by stepping through the n-1 intervening iterations:
K = M
DO I=1,N
  K = K + 4
  ...
ENDDO
The two forms of the loop aren't equivalent; the second won't give you the value of K for an arbitrary value of I. Because you can't jump into the middle of the loop on the nth iteration, K always takes on the same values it would have if we had kept the original expression. For instance, with M = 10, both versions produce K = 14, 18, 22, and so on over successive iterations; only the arithmetic used to get there differs.
Induction variable simplification probably wouldn't be a very important optimization, except that array address calculations look very much like the calculation for K in the example above. For instance, the address calculation for A(I) within a loop iterating on the variable I looks like this:
address = base_address(A) + (I-1) * sizeof_datatype(A)
Performing all that math is unnecessary. The compiler can create a new induction variable for references to A and simplify the address calculations:
outside the loop...
address = base_address(A) - (1 * sizeof_datatype(A))
inside the loop...
address = address + sizeof_datatype(A)
Induction variable simplification is especially useful on processors that can automatically increment a register each time it is used as a pointer for a memory reference. While stepping through a loop, the memory reference and the address arithmetic can both be squeezed into a single instruction—a great savings.
Precompilation, lexical analysis, parsing, and many optimization techniques are somewhat portable, but code generation is very specific to the target processor. In some ways this phase is where compilers earn their keep on single-processor RISC systems.
Anything that isn’t handled in hardware has to be addressed in software. That means if the processor can’t resolve resource conflicts, such as overuse of a register or pipeline, then the compiler is going to have to take care of it. Allowing the compiler to take care of it isn’t necessarily a bad thing — it’s a design decision. A complicated compiler and simple, fast hardware might be cost effective for certain applications. Two processors at opposite ends of this spectrum are the MIPS R2000 and the HP PA-8000. The first depends heavily on the compiler to schedule instructions and fairly distribute resources. The second manages both things at runtime, though both depend on the compiler to provide a balanced instruction mix.
In all computers, register selection is a challenge because, in spite of their numbers, registers are precious. You want to be sure that the most active variables become register resident at the expense of others. On machines without register renaming (discussed earlier), you have to be sure that the compiler doesn't try to recycle registers too quickly; otherwise the processor has to delay computations as it waits for one to be freed.
Some instructions in the repertoire also save your compiler from having to issue others. Examples are auto-increment for registers being used as array indices or conditional assignments in lieu of branches. These both save the processor from extra calculations and make the instruction stream more compact.
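As a small sketch of the second case (the variables are invented), a branchy assignment like the one below can be emitted as a single conditional-move or similarly predicated instruction on processors that support one, leaving no branch in the instruction stream:
IF (A .GT. B) THEN
  X = A
ELSE
  X = B
ENDIF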
Lastly, there are opportunities for increased parallelism. Programmers generally think serially, specifying steps in logical succession. Unfortunately, serial source code makes serial object code. A compiler that hopes to efficiently use the parallelism of the processor will have to be able to move instructions around and find operations that can be issued side by side. This is one of the biggest challenges for compiler writers today. As superscalar and very long instruction word (VLIW) designs become capable of executing more instructions per clock cycle, the compiler will have to dig deeper for operations that can execute at the same time.
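As a simple illustration (again with invented statements), the two assignments below have no data dependence on each other, so a compiler for a superscalar machine is free to schedule them for issue in the same cycle:
A = B + C
D = E * F
By contrast, if the second statement read D = A * F, it would have to wait for the first to produce A, and the opportunity for parallel issue would be lost.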