But one bright spot is the branch delay slot. For the first iteration, the load was done before the loop started. For the successive iterations, the first load was done in the branch delay slot at the bottom of the loop.
Comparing this code to the moderate optimization code on the MC68020, you can begin to get a sense of why RISC was not an overnight sensation. It turned out that an unsophisticated compiler could generate much tighter code for a CISC processor than a RISC processor. RISC processors are always executing extra instructions here and there to compensate for the lack of slick features in their instruction set. If a processor has a faster clock rate but has to execute more instructions, it does not always have better performance than a slower, more efficient processor.
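The trade-off between clock rate and instruction count described above can be made concrete with the usual execution-time relation. The numbers below are hypothetical, chosen only to illustrate the arithmetic; they are not measurements of the MC68020 or of any SPARC system.

```latex
% Execution time = instruction count x cycles per instruction / clock rate
\[
  T = \frac{N_{\mathrm{instr}} \times \mathrm{CPI}}{f_{\mathrm{clock}}}
\]
% Hypothetical processor X: more instructions, faster clock
\[
  T_X = \frac{12 \times 1.0}{40\ \mathrm{MHz}} = 300\ \mathrm{ns}
\]
% Hypothetical processor Y: fewer instructions, slower clock
\[
  T_Y = \frac{4 \times 1.5}{25\ \mathrm{MHz}} = 240\ \mathrm{ns}
\]
```

With these assumed figures, the processor with the slower clock finishes the loop body first because it executes so many fewer instructions.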
But as we shall soon see, this CISC advantage evaporates in this particular example.
We now increase the optimization to -O2. Now the compiler generates much better code. It’s important you remember that this is the same compiler being used for all three examples.
At this optimization level, the compiler looked through the code sufficiently well to know it didn’t even need to rotate the register windows (no save instruction). Clearly the compiler looked at the register usage of the entire routine:
! Note, didn’t even rotate the register Window
! We just use the %o registers from the caller

! %o0 = Address of first element of A (from calling convention)
! %o1 = Address of first element of B (from calling convention)
! %o2 = Address of first element of C (from calling convention)
! %o3 = Address of N (from calling convention)

addem_:
        ld      [%o3],%g2       ! Load N
        cmp     %g2,1           ! Check to see if it is < 1
        bl      .L77000006      ! Check for zero trip loop
        or      %g0,1,%g1       ! Delay slot - Set I to 1
.L77000003:
        ld      [%o1],%f0       ! Load B(I) First time Only
.L900000109:
        ld      [%o2],%f1       ! Load C(I)
        fadds   %f0,%f1,%f0     ! Add
        add     %g1,1,%g1       ! Increment I
        add     %o1,4,%o1       ! Increment Address of B
        add     %o2,4,%o2       ! Increment Address of C
        cmp     %g1,%g2         ! Check Loop Termination
        st      %f0,[%o0]       ! Store A(I)
        add     %o0,4,%o0       ! Increment Address of A
        ble,a   .L900000109     ! Branch w/ annul
        ld      [%o1],%f0       ! Load the B(I)
.L77000006:
        retl                    ! Leaf Return (No window)
        nop                     ! Branch Delay Slot
This is tight code. The registers o0, o1, and o2 contain the addresses of the first elements of A, B, and C respectively. They already point to the right value for the first iteration of the loop. The value for I is never stored in memory; it is kept in global register g1. Instead of multiplying I by 4, we simply advance the three addresses by 4 bytes each iteration.
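For readers more comfortable with a high-level language, the address-advancing scheme can be sketched in C. This is a hand-written analogue of the assembly above, not compiler output. The function names addem_indexed and addem_pointers are hypothetical, and passing N by address is an assumption made to mirror the calling convention noted in the listing's comments.

```c
/* Indexed form: each access conceptually computes base + i * sizeof(float). */
void addem_indexed(float *a, const float *b, const float *c, const int *n)
{
    for (int i = 0; i < *n; i++)
        a[i] = b[i] + c[i];
}

/* Pointer-advancing form, laid out to follow the -O2 assembly step by step. */
void addem_pointers(float *a, const float *b, const float *c, const int *n)
{
    int limit = *n;             /* ld [%o3],%g2 : load N once              */
    if (limit < 1)              /* cmp / bl     : zero-trip check          */
        return;

    int i = 1;                  /* I stays in a register, never in memory  */
    float bval = *b;            /* first B load, done once before the loop */

    for (;;) {
        float sum = bval + *c;  /* ld C(I); fadds                          */
        i++;                    /* increment I                             */
        b++;                    /* advance address of B by 4 bytes         */
        c++;                    /* advance address of C by 4 bytes         */
        *a = sum;               /* st A(I)                                 */
        a++;                    /* advance address of A by 4 bytes         */
        if (i > limit)          /* cmp / ble,a  : loop termination         */
            break;
        bval = *b;              /* next B load; like the annulled delay
                                   slot, it is skipped on the final pass   */
    }
}
```

Both functions produce the same results. The second one removes the index multiplication and issues the load of the next B element at the bottom of the loop, which is the role the annulled branch delay slot plays in the assembly.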