The branch delay slots are utilized for both branches. The branch at the bottom of the loop uses the annul feature to cancel the following load if the branch falls through.
The most interesting observation regarding this code is its striking similarity to the code generated for the MC68020 at its top optimization level:
L3:
    fmoves a1@,fp0 ! Load B(I)
    fadds a0@,fp0 ! Add C(I)
    fmoves fp0,a2@ ! Store A(I)
    addql #4,a0 ! Advance by 4
    addql #4,a1 ! Advance by 4
    addql #4,a2 ! Advance by 4
    subql #1,d0 ! Decrement I
    tstl d0
    bnes L3
The two code sequences are nearly identical! The SPARC does an extra load because of its load-store architecture. On the SPARC, I is incremented and compared to N, while on the MC68020, I is decremented and compared to zero.
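The two loop-control styles can be sketched in C. This is only an illustration, not the source the compilers were given: the function names `add_count_up` and `add_count_down` are hypothetical, and the arrays are assumed to hold single-precision values, as the `fmoves`/`fadds` instructions suggest. Note that the MC68020 listing tests at the bottom of the loop; the sketch tests at the top so that a zero-length call is safe.

```c
#include <assert.h>
#include <stddef.h>

/* SPARC-style control: count i up and compare it with n on every trip. */
void add_count_up(float *a, const float *b, const float *c, size_t n)
{
    assert(a && b && c);
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

/* MC68020-style control: advance three pointers by one element each
   (4 bytes for a float) and count a separate counter down to zero. */
void add_count_down(float *a, const float *b, const float *c, size_t n)
{
    assert(a && b && c);
    while (n != 0) {
        *a = *b + *c;   /* load B(I), add C(I), store A(I) */
        a++;            /* the three addql #4 instructions */
        b++;
        c++;
        n--;            /* subql #1, then test against zero */
    }
}
```

Both functions compute the same result; they differ only in how the loop is counted, which is the difference the two listings show.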
This aptly shows how advancing compiler optimization capabilities quickly made the “nifty” features of the CISC architectures rather useless. Even on the CISC processor, the post-optimization code used the simple forms of the instructions because they produce the fastest execution time.
Note that these code sequences were generated on an MC68020. An MC68060 should be able to eliminate the three addql instructions by using post-increment, saving three instructions. Add a little loop unrolling, and you have some very tight code. Of course, the MC68060 was never a broadly deployed workstation processor, so we never really got a chance to take it for a test drive.
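As a rough C-level picture of what post-increment addressing plus loop unrolling would buy, consider the sketch below. The function name `add_unrolled` and the unroll factor of four are assumptions chosen for illustration; no compiler output is being reproduced here. Each `*p++` folds the pointer advance into the memory access, which is the job post-increment addressing does in hardware.

```c
#include <assert.h>
#include <stddef.h>

/* Element-wise a = b + c, unrolled by four with post-increment accesses. */
void add_unrolled(float *a, const float *b, const float *c, size_t n)
{
    assert(a && b && c);

    /* Main body: four elements per trip, one counter update per trip. */
    while (n >= 4) {
        *a++ = *b++ + *c++;
        *a++ = *b++ + *c++;
        *a++ = *b++ + *c++;
        *a++ = *b++ + *c++;
        n -= 4;
    }

    /* Cleanup: the zero to three elements left over. */
    while (n != 0) {
        *a++ = *b++ + *c++;
        n--;
    }
}
```

The unrolled body pays the decrement-test-branch overhead once per four elements instead of once per element; the cleanup loop is the price for lengths that are not a multiple of four.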
This section shows the results of compiling on the Convex C-Series of parallel/vector supercomputers. In addition to their normal registers, vector computers have vector registers that contain up to 256 64-bit elements. These processors can perform operations on any subset of these registers with a single instruction.
It is hard to classify these vector supercomputers as either RISC or CISC. They have simple, lean instruction sets and, hence, are RISC-like. However, they have instructions that implement loops, and so they are somewhat CISC-like.
The Convex C-240 has scalar registers (s2), vector registers (v2), and address registers (a3). Each vector register has 128 elements. The vector length register controls how many of the elements of each vector register are processed by vector instructions. If vector length is above 128, the entire register is processed.
The code to implement our loop is as follows:
L4: mov.ws 2,vl ; Set the Vector length to N
    ld.w 0(a5),v0 ; Load B into Vector Register
    ld.w 0(a2),v1 ; Load C into Vector Register
    add.s v1,v0,v2 ; Add the vector registers
    st.w v2,0(a3) ; Store results into A
    add.w #-128,s2 ; Decrement "N"
    add.w #512,a2 ; Advance address for C
    add.w #512,a3 ; Advance address for A
    add.w #512,a5 ; Advance address for B
    lt.w #0,s2 ; Check to see if "N" is < 0
    jbrs.t L4
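The structure of this vector loop, often called strip mining, can be sketched in scalar C. The sketch is an illustration under stated assumptions rather than a model of the Convex hardware: the name `add_strip_mined` and the constant `VECTOR_REG_LEN` are hypothetical, the elements are assumed to be 4-byte single-precision values (which matches the 512-byte address step for 128 elements), and the inner `for` loop stands in for what the machine does with one load, load, add, store sequence.

```c
#include <assert.h>
#include <stddef.h>

#define VECTOR_REG_LEN 128  /* elements per vector register, per the text */

/* Element-wise a = b + c, processed in strips of at most 128 elements.
   Returns the number of strips, i.e. trips through the outer loop. */
size_t add_strip_mined(float *a, const float *b, const float *c, size_t n)
{
    assert(a && b && c);
    size_t strips = 0;

    while (n > 0) {
        /* Vector length: a full register, or whatever remains of N. */
        size_t vl = n < VECTOR_REG_LEN ? n : VECTOR_REG_LEN;

        /* Stand-in for the vector load/load/add/store instructions. */
        for (size_t i = 0; i < vl; i++)
            a[i] = b[i] + c[i];

        /* Advance the three addresses and decrement the remaining count. */
        a += vl;
        b += vl;
        c += vl;
        n -= vl;
        strips++;
    }
    return strips;
}
```

For a 300-element array the outer loop runs three times, with strip lengths of 128, 128, and 44. The sketch advances the pointers by `vl` rather than a fixed 128 so that it never forms an address beyond the end of the arrays.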