This is very tight code and bears little resemblance to the original FORTRAN code.
The next examples were produced with a FORTRAN compiler on a SPARC architecture system. SPARC is a classic RISC architecture with load-store access to memory, many registers, and delayed branching. We first examine the code at the lowest optimization level:
.L18: ! Top of the loop
ld [%fp-4],%l2 ! Address of B
sethi %hi(GPB.addem.i),%l0 ! Address of I in %l0
or %l0,%lo(GPB.addem.i),%l0
ld [%l0+0],%l0 ! Load I
sll %l0,2,%l1 ! Multiply by 4
add %l2,%l1,%l0 ! Figure effective address of B(I)
ld [%l0+0],%f3 ! Load B(I)
ld [%fp-8],%l2 ! Address of C
sethi %hi(GPB.addem.i),%l0 ! Address of I in %l0
or %l0,%lo(GPB.addem.i),%l0
ld [%l0+0],%l0 ! Load I
sll %l0,2,%l1 ! Multiply by 4
add %l2,%l1,%l0 ! Figure effective address of C(I)
ld [%l0+0],%f2 ! Load C(I)
fadds %f3,%f2,%f2 ! Do the Floating Point Add
ld [%fp-12],%l2 ! Address of A
sethi %hi(GPB.addem.i),%l0 ! Address of I in %l0
or %l0,%lo(GPB.addem.i),%l0
ld [%l0+0],%l0 ! Load I
sll %l0,2,%l1 ! Multiply by 4
add %l2,%l1,%l0 ! Figure effective address of A(I)
st %f2,[%l0+0] ! Store A(I)
sethi %hi(GPB.addem.i),%l0 ! Address of I in %l0
or %l0,%lo(GPB.addem.i),%l0
ld [%l0+0],%l0 ! Load I
add %l0,1,%l1 ! Increment I
sethi %hi(GPB.addem.i),%l0 ! Address of I in %l0
or %l0,%lo(GPB.addem.i),%l0
st %l1,[%l0+0] ! Store I
sethi %hi(GPB.addem.i),%l0 ! Address of I in %l0
or %l0,%lo(GPB.addem.i),%l0
ld [%l0+0],%l1 ! Load I
ld [%fp-20],%l0 ! Load N
cmp %l1,%l0 ! Compare
ble .L18
nop ! Branch Delay Slot
This is some pretty poor code. We don’t need to go through it line by line, but there are a few quick observations we can make. The value for I is loaded from memory five times in the loop. The address of I is computed six times throughout the loop (each time takes two instructions). There are no tricky memory addressing modes, so multiplying I by 4 to get a byte offset is done explicitly three times (at least the compiler uses a shift). To add insult to injury, it even puts a NO-OP in the branch delay slot.
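The counts quoted above can be checked mechanically. The following is a minimal sketch in Python, not part of the original toolchain: it embeds its own copy of the loop, one instruction per line with the comments dropped, and tallies the patterns of interest. The variable names (`LISTING`, `instrs`, and the counters) are made up for this illustration, and the script assumes that every `sethi`/`or` pair forming the address of I is immediately followed by the load or store that uses it, which holds for this listing.

```python
# Copy of the unoptimized loop body, one instruction per line, comments dropped.
LISTING = """
ld [%fp-4],%l2
sethi %hi(GPB.addem.i),%l0
or %l0,%lo(GPB.addem.i),%l0
ld [%l0+0],%l0
sll %l0,2,%l1
add %l2,%l1,%l0
ld [%l0+0],%f3
ld [%fp-8],%l2
sethi %hi(GPB.addem.i),%l0
or %l0,%lo(GPB.addem.i),%l0
ld [%l0+0],%l0
sll %l0,2,%l1
add %l2,%l1,%l0
ld [%l0+0],%f2
fadds %f3,%f2,%f2
ld [%fp-12],%l2
sethi %hi(GPB.addem.i),%l0
or %l0,%lo(GPB.addem.i),%l0
ld [%l0+0],%l0
sll %l0,2,%l1
add %l2,%l1,%l0
st %f2,[%l0+0]
sethi %hi(GPB.addem.i),%l0
or %l0,%lo(GPB.addem.i),%l0
ld [%l0+0],%l0
add %l0,1,%l1
sethi %hi(GPB.addem.i),%l0
or %l0,%lo(GPB.addem.i),%l0
st %l1,[%l0+0]
sethi %hi(GPB.addem.i),%l0
or %l0,%lo(GPB.addem.i),%l0
ld [%l0+0],%l1
ld [%fp-20],%l0
cmp %l1,%l0
ble .L18
nop
"""

# Each entry is [opcode, operands]; "nop" has an opcode only.
instrs = [line.split() for line in LISTING.strip().splitlines()]

address_of_i = 0   # sethi/or pairs that form the address of I
loads_of_i = 0     # loads that use that address
stores_of_i = 0    # stores that use that address
shifts = 0         # explicit multiply-by-4 shifts

for k, ins in enumerate(instrs):
    opcode = ins[0]
    if opcode == "sethi" and "GPB.addem.i" in ins[1]:
        address_of_i += 1
        # instrs[k + 1] is the matching "or"; instrs[k + 2] uses the address.
        use = instrs[k + 2][0]
        if use == "ld":
            loads_of_i += 1
        elif use == "st":
            stores_of_i += 1
    elif opcode == "sll":
        shifts += 1

print("instructions per iteration:", len(instrs))
print("address of I computed:", address_of_i)
print("loads of I:", loads_of_i, " stores of I:", stores_of_i)
print("explicit shifts:", shifts)
```

Run as written, the tally reports 36 instructions per iteration, six address computations for I (five feeding a load and one feeding the store), and three shifts. Only one of those 36 instructions, the `fadds`, does the floating-point work the loop exists for.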
One might ask, “Why would a compiler ever generate code this bad?” Well, it’s not because the compiler isn’t capable of generating efficient code, as we shall see below. One explanation is that at this optimization level, the compiler simply does a one-to-one translation of the tuples (intermediate code) into machine language. You can almost draw lines in the above example and precisely identify which instructions came from which tuples.
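To make the one-to-one translation idea concrete, here is a small sketch in Python of a translator that expands each tuple in isolation, remembering nothing about what earlier tuples left in registers. Everything in it is hypothetical: the tuple names and layout, the helper functions, and the pseudo-instruction syntax are invented for illustration and are not the compiler's actual intermediate form or real SPARC assembly.

```python
# Hypothetical tuples for one trip through the loop A(I) = B(I) + C(I).
# The tuple names and layout are invented for illustration only.
TUPLES = [
    ("index_load", "f3", "B", "I"),    # f3 = B(I)
    ("index_load", "f2", "C", "I"),    # f2 = C(I)
    ("fadd", "f2", "f3", "f2"),        # f2 = f3 + f2
    ("index_store", "A", "I", "f2"),   # A(I) = f2
    ("increment", "I"),                # I = I + 1
    ("branch_le", "I", "N", ".L18"),   # if I <= N goto .L18
]


def address_of(name):
    """Two pseudo-instructions that form the address of a scalar in r0."""
    return [f"sethi hi({name}) -> r0", f"or r0, lo({name}) -> r0"]


def load_scalar(name, reg):
    """Form the address of a scalar, then load its value into reg."""
    return address_of(name) + [f"ld [r0] -> {reg}  ! load {name}"]


def element_address(array, index):
    """Base address + 4 * index, recomputed from scratch; result in r0."""
    return ([f"ld base({array}) -> r2"]
            + load_scalar(index, "r0")
            + ["sll r0, 2 -> r1", "add r2, r1 -> r0"])


def expand(t):
    """Translate one tuple with no knowledge of the tuples around it."""
    kind = t[0]
    if kind == "index_load":
        _, dest, array, index = t
        return element_address(array, index) + [f"ld [r0] -> {dest}"]
    if kind == "fadd":
        _, dest, left, right = t
        return [f"fadds {left}, {right} -> {dest}"]
    if kind == "index_store":
        _, array, index, src = t
        return element_address(array, index) + [f"st {src} -> [r0]"]
    if kind == "increment":
        _, name = t
        return (load_scalar(name, "r0") + ["add r0, 1 -> r1"]
                + address_of(name) + ["st r1 -> [r0]"])
    if kind == "branch_le":
        _, name, limit, label = t
        return (load_scalar(name, "r1")
                + [f"ld {limit} -> r0", "cmp r1, r0", f"ble {label}", "nop"])
    raise ValueError(f"unknown tuple kind: {kind}")


# Expanding the tuples one at a time yields the whole loop body.
code = [ins for t in TUPLES for ins in expand(t)]
loads_of_i = sum(ins.endswith("! load I") for ins in code)

for ins in code:
    print(ins)
print(len(code), "pseudo-instructions,", loads_of_i, "loads of I")
```

Because `expand` never looks at neighboring tuples, every tuple that mentions I rebuilds its address and reloads it, so five of the six tuples each contribute one load of I. Avoiding that waste requires the translator to track values across tuples, which is the kind of work done at higher optimization levels.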