Now you should know enough about C62x assembly to implement the inner product algorithm to compute $y = \sum_{n} a_n x_n$.
(Inner product): Write the complete inner product assembly program to compute $y$, where $a_n$ and $x_n$ take the following values (in hexadecimal):
a[] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, a }
x[] = { f, e, d, c, b, a, 9, 8, 7, 6 }
The $a_n$ and $x_n$ values must be stored in memory, and the inner product must be computed by reading the memory contents.
Intentionally left blank.
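To check the result of an assembly solution, a reference value is useful. The following is a minimal sketch in C, not the assembly program the exercise asks for. It assumes the listed values are hexadecimal 16-bit numbers; the names `a_vals`, `x_vals`, and `inner_product` are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values from the exercise, read as hexadecimal (assumption). */
static const int16_t a_vals[10] = {0x1, 0x2, 0x3, 0x4, 0x5,
                                   0x6, 0x7, 0x8, 0x9, 0xa};
static const int16_t x_vals[10] = {0xf, 0xe, 0xd, 0xc, 0xb,
                                   0xa, 0x9, 0x8, 0x7, 0x6};

/* Reference inner product: y = sum over n of a[n] * x[n].
 * The 16-bit inputs are widened to 32 bits before accumulating. */
int32_t inner_product(const int16_t *a, const int16_t *x, size_t n)
{
    assert(a != NULL && x != NULL);
    int32_t y = 0;
    for (size_t i = 0; i < n; i++) {
        y += (int32_t)a[i] * (int32_t)x[i];
    }
    return y;
}
```

Calling `inner_product(a_vals, x_vals, 10)` with these inputs yields 495 (0x1EF), which is the value the assembly program should leave in its result location if the same data are used.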
When an instruction is executed, it goes through several steps: fetching, decoding, and execution. If these steps are done one at a time for each instruction, the CPU resources are not fully utilized. To increase throughput, CPUs are designed to be pipelined, meaning that these steps are carried out at the same time for different instructions: while one instruction is executing, the next one is being decoded and a third is being fetched.
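The throughput gain can be estimated with a short calculation. The sketch below is an idealized model in C, assuming every stage takes one cycle and the pipeline never stalls; the function names are hypothetical. It is not a model of the actual C6x pipeline, whose phases are described next.

```c
#include <assert.h>

/* Cycles needed when each instruction finishes all of its stages
 * before the next instruction starts (no pipelining). */
long sequential_cycles(long n_instr, long n_stages)
{
    assert(n_instr >= 0 && n_stages > 0);
    return n_instr * n_stages;
}

/* Cycles needed in an ideal pipeline: the first instruction takes
 * n_stages cycles, then one further instruction completes per cycle. */
long pipelined_cycles(long n_instr, long n_stages)
{
    assert(n_instr >= 0 && n_stages > 0);
    if (n_instr == 0) {
        return 0;
    }
    return n_stages + (n_instr - 1);
}
```

With the three steps named above (fetch, decode, execute), 100 instructions take 300 cycles when run one at a time and 102 cycles in the idealized pipeline.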
On the C6x processor, the instruction fetch consists of 4 phases: generate fetch address (F1), send address to memory (F2), wait for data (F3), and read opcode from memory (F4). Decoding consists of 2 phases: dispatching to functional units (D1) and decoding (D2). The execution step may consist of up to 6 phases (E1 to E6) depending on the instruction. For example, the multiply (MPY) instructions have 1 delay, resulting in 2 execution phases. Similarly, load (LDx) and branch (B) instructions have 4 and 5 delays respectively.
When the outcome of an instruction is used by the next instruction, an appropriate number of NOPs (no operation or delay) must be added after multiply (one NOP), load (four NOPs, or NOP 4), and branch (five NOPs, or NOP 5) instructions in order to allow the pipeline to operate properly. Otherwise, the following instructions are executed by the pipeline before the outcome of the current instruction is available, generating undesired results. The following code is an example of pipelined code with NOPs inserted:
    1          MVK   40,A2
    2  loop:   LDH   *A5++,A0
    3          LDH   *A6++,A1
    4          NOP   4
    5          MPY   A0,A1,A3
    6          NOP
    7          ADD   A3,A4,A4
    8          SUB   A2,1,A2
    9  [A2]    B     loop
    10         NOP   5
    11         STH   A4,*A7
In line 4, we need 4 NOPs because A1 is loaded by the LDH instruction in line 3 with 4 delays. After 4 delays, the value of A1 is available to be used in the MPY A0,A1,A3 in line 5. In the same way, the single NOP in line 6 covers the one delay of the MPY in line 5, so that A3 is valid when the ADD in line 7 reads it. Similarly, we need 5 delays after the [A2] B loop instruction in line 9 to prevent the execution of STH A4,*A7 before branching occurs.
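The cost of these NOPs can be tallied directly. The following C sketch counts cycles for the listing above under two assumptions that are not stated in the text: every instruction occupies one cycle, NOP n occupies n cycles, and there are no memory stalls. The function and constant names are hypothetical.

```c
#include <assert.h>

/* Delays quoted in the text for each instruction type. */
enum { LOAD_DELAY = 4, MPY_DELAY = 1, BRANCH_DELAY = 5 };

/* Cycles for one pass through the loop body (lines 2 to 10). */
int loop_body_cycles(void)
{
    int cycles = 0;
    cycles += 2;             /* two LDH instructions          */
    cycles += LOAD_DELAY;    /* NOP 4 after the second load   */
    cycles += 1;             /* MPY                           */
    cycles += MPY_DELAY;     /* single NOP after the multiply */
    cycles += 3;             /* ADD, SUB, and [A2] B          */
    cycles += BRANCH_DELAY;  /* NOP 5 after the branch        */
    return cycles;
}

/* Total cycles: MVK, then the loop body repeated, then STH. */
long total_cycles(int iterations)
{
    assert(iterations > 0);
    return 1 + (long)iterations * loop_body_cycles() + 1;
}
```

Under these assumptions each iteration takes 16 cycles, 10 of which are NOPs, and the 40 iterations set up by MVK 40,A2 take 642 cycles in total. This is the overhead that parallel execution, described next, aims to reduce.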
In the C6x Very Long Instruction Word (VLIW) architecture, several instructions are fetched and processed simultaneously. The group of instructions fetched together is referred to as a Fetch Packet (FP). The Fetch Packet allows the C6x to fetch eight instructions simultaneously from on-chip memory. Among the 8 instructions fetched at the same time, several can be executed at the same time if they do not use the same CPU resources. Because the CPU has 8 separate functional units, a maximum of 8 instructions can be executed in parallel, although the combinations of parallel instructions are limited because they must not conflict with each other in using CPU resources. In an assembly listing, parallel instructions are indicated by double pipe symbols (||). When writing assembly code, by designing code to maximize parallel execution of instructions (through proper functional unit assignments,