One reason to generate the code using this simplistic approach is to guarantee that the program will produce the correct results. Looking at the above code, it’s pretty easy to argue that it indeed does exactly what the FORTRAN code does. You can track every single assembly statement directly back to part of a FORTRAN statement.
It’s pretty clear that you don’t want to execute this code in a high performance production environment without some more optimization.
In this example, we enable some optimization (-O1):
save %sp,-120,%sp ! Rotate the register window
add %i0,-4,%o0 ! Address of A(0)
st %o0,[%fp-12] ! Store on the stack
add %i1,-4,%o0 ! Address of B(0)
st %o0,[%fp-4] ! Store on the stack
add %i2,-4,%o0 ! Address of C(0)
st %o0,[%fp-8] ! Store on the stack
sethi %hi(GPB.addem.i),%o0 ! Address of I (top portion)
add %o0,%lo(GPB.addem.i),%o2 ! Address of I (lower portion)
ld [%i3],%o0 ! %o0 = N (fourth parameter)
or %g0,1,%o1 ! %o1 = 1 (for addition)
st %o0,[%fp-20] ! store N on the stack
st %o1,[%o2] ! Set memory copy of I to 1
ld [%o2],%o1 ! o1 = I (kind of redundant)
cmp %o1,%o0 ! Check I>N (zero-trip?)
bg .L12 ! Don’t do loop at all
nop ! Delay Slot
ld [%o2],%o0 ! Pre-load for Branch Delay Slot
.L900000110: ! Top of the loop
ld [%fp-4],%o1 ! o1 = Address of B(0)
sll %o0,2,%o0 ! Multiply I by 4
ld [%o1+%o0],%f2 ! f2 = B(I)
ld [%o2],%o0 ! Load I from memory
ld [%fp-8],%o1 ! o1 = Address of C(0)
sll %o0,2,%o0 ! Multiply I by 4
ld [%o1+%o0],%f3 ! f3 = C(I)
fadds %f2,%f3,%f2 ! Register-to-register add
ld [%o2],%o0 ! Load I from memory (not again!)
ld [%fp-12],%o1 ! o1 = Address of A(0)
sll %o0,2,%o0 ! Multiply I by 4 (yes, again)
st %f2,[%o1+%o0] ! A(I) = f2
ld [%o2],%o0 ! Load I from memory
add %o0,1,%o0 ! Increment I in register
st %o0,[%o2] ! Store I back into memory
ld [%o2],%o0 ! Load I back into a register
ld [%fp-20],%o1 ! Load N into a register
cmp %o0,%o1 ! I>N ??
ble,a .L900000110
ld [%o2],%o0 ! Branch Delay Slot
This is a significant improvement over the previous example. Some loop-constant computations (subtracting 4) were hoisted out of the loop. We only loaded I 4 times during a loop iteration. Strangely, the compiler didn’t choose to store the addresses of A(0), B(0), and C(0) in registers at all, even though there were plenty of registers. Even more perplexing is the fact that it loaded a value from memory immediately after it had stored it from the exact same register!
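To make the cost of a memory-resident loop index more concrete, here is a rough C sketch of the same loop, A(I) = B(I) + C(I) for I = 1 to N, as the comments in the listing describe it. This is an illustration, not the compiler's output or the original FORTRAN: the function names addem_memory_index and addem_register_index are hypothetical, and volatile is used only as a way to force a C compiler to reload I from memory at every use, which loosely resembles what the -O1 listing does.

```c
/* Hypothetical C rendering of the loop A(I) = B(I) + C(I), I = 1..N.
   All names here are illustrative assumptions, not from the original text. */

/* The loop index lives in memory. 'volatile' makes every read of i_mem a
   real load and every write a real store, loosely like the -O1 listing. */
static volatile int i_mem;

void addem_memory_index(float *a, const float *b, const float *c, int n)
{
    for (i_mem = 1; i_mem <= n; i_mem = i_mem + 1) {
        /* i_mem is reloaded for each array reference below */
        float sum = b[i_mem - 1] + c[i_mem - 1];
        a[i_mem - 1] = sum;
    }
}

/* The loop index is an ordinary local, so the compiler is free to keep it
   (and the three array addresses) in registers for the whole loop. */
void addem_register_index(float *a, const float *b, const float *c, int n)
{
    for (int i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }
}
```

Both functions compute the same result for the same inputs, and both skip the loop body entirely when N is less than 1. The difference is only in how often the index travels between memory and a register, which is the overhead a higher optimization level would be expected to remove.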