Like the Intel 8088, this is not a load-store architecture. The fadds instruction adds a value from memory to a value in a register (fp0) and leaves the result of the addition in the register. Unlike the Intel 8088, we have enough registers to keep quite a few of the values used throughout the loop (I, N, and the addresses of A, B, and C) in registers, which saves memory operations.
In the next example, we compiled the C version of the loop with the normal optimization (-O) turned on. We see the C perspective on arrays in this code. C views arrays as extensions to pointers; the loop index advances as an offset from a pointer to the beginning of the array:
! d3 = I
! d1 = Address of A
! d2 = Address of B
! d0 = Address of C
! a6@(20) = N
        moveq   #0,d3             ! Initialize I
        bras    L5                ! Jump to End of the loop
L1:     movl    d3,a1             ! Make copy of I
        movl    a1,d4             ! Again
        asll    #2,d4             ! Multiply by 4 (word size)
        movl    d4,a1             ! Put back in an address register
        fmoves  a1@(0,d2:l),fp0   ! Load B(I)
        movl    a6@(16),d0        ! Get address of C
        fadds   a1@(0,d0:l),fp0   ! Add C(I)
        fmoves  fp0,a1@(0,d1:l)   ! Store into A(I)
        addql   #1,d3             ! Increment I
L5:
        cmpl    a6@(20),d3
        blts    L1
We first see the value of I being copied into several registers and multiplied by 4 (using a left shift of 2, which is strength reduction). Interestingly, the value in register a1 is I multiplied by 4. Registers d0, d1, and d2 are the addresses of C, A, and B respectively. In the load, add, and store, a1 is the base of the address computation and d0, d1, and d2 are added as an offset to a1 to compute each address.
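The address arithmetic described above can be sketched back in C. This is only an illustration of what the listing computes, not compiler output: the function name add_indexed and the variable off are hypothetical, and the sketch assumes 4-byte floats, as the shift by 2 in the listing does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper mirroring the -O listing: I is scaled to a byte
   offset with a left shift, and each array base is added to that offset. */
void add_indexed(float *a, const float *b, const float *c, size_t n)
{
    assert(sizeof(float) == 4);      /* the shift by 2 assumes 4-byte floats */

    for (size_t i = 0; i < n; i++) {
        size_t off = i << 2;         /* asll #2,d4: I * 4 without a multiply */

        /* Base address plus byte offset, as in a1@(0,d2:l) and friends. */
        const float *pb = (const float *)((const char *)b + off);
        const float *pc = (const float *)((const char *)c + off);
        float *pa = (float *)((char *)a + off);

        *pa = *pb + *pc;             /* load B(I), add C(I), store A(I) */
    }
}
```

Note that the offset is recomputed from I on every iteration, which is the work the more aggressive optimization removes.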
This is a simple optimization whose main goal is to keep as many values as possible in registers during loop execution. Overall, it is a relatively literal translation of the C language semantics into assembly. In many ways, C was designed to generate relatively efficient code without requiring a highly sophisticated optimizer.
In this example, we are back to the FORTRAN version on the MC68020. We have compiled it with the highest level of optimization (-OLM) available on this compiler. Now we see a much more aggressive approach to the loop:
! a0 = Address of C(I)
! a1 = Address of B(I)
! a2 = Address of A(I)
L3:     fmoves  a1@,fp0           ! Load B(I)
        fadds   a0@,fp0           ! Add C(I)
        fmoves  fp0,a2@           ! Store A(I)
        addql   #4,a0             ! Advance by 4
        addql   #4,a1             ! Advance by 4
        addql   #4,a2             ! Advance by 4
        subql   #1,d0             ! Decrement I
        tstl    d0
        bnes    L3
First off, the compiler is smart enough to do all of its address adjustment outside the loop and store the adjusted addresses of A, B, and C in registers. We do the load, add, and store in quick succession. Then we advance the array addresses by 4 and perform the subtraction to determine when the loop is complete.
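Written back in C, the shape of this loop might look like the sketch below. It is an approximation for illustration, not what the compiler generated: the function name add_pointer_bump is hypothetical, and because the listing shows only the loop body, the sketch assumes the trip count is at least 1 when the loop is entered.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper mirroring the -OLM listing: three pointers walk the
   arrays and a counter runs down to zero, so no index is scaled per pass. */
void add_pointer_bump(float *a, const float *b, const float *c, size_t n)
{
    assert(n > 0);       /* assumption: the body runs before the test */

    do {
        *a = *b + *c;    /* fmoves a1@,fp0 / fadds a0@,fp0 / fmoves fp0,a2@ */
        a++;             /* addql #4,a2 */
        b++;             /* addql #4,a1 */
        c++;             /* addql #4,a0 */
        n--;             /* subql #1,d0 */
    } while (n != 0);    /* tstl d0 / bnes L3 */
}
```

Compared with the indexed form, each iteration replaces a shift and three base-plus-offset calculations with three simple pointer increments.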