# T16 Advanced Processors: VLIW Processors

School of Electrical and Computer Engineering  
Cornell University  

revision: 2015-11-30-13-42

## 1 Motivating VLIW Processors

## 2 PARCv1 VLIW Processor

## 3 VLIW Compilation Techniques

<table>
<thead>
<tr>
<th>Section</th>
<th>Title</th>
<th>Page</th>
</tr>
</thead>
<tbody>
<tr>
<td>3.1</td>
<td>Loop Unrolling</td>
<td>8</td>
</tr>
<tr>
<td>3.2</td>
<td>Software Pipelining</td>
<td>12</td>
</tr>
<tr>
<td>3.3</td>
<td>Loop Unrolling and Software Pipelining</td>
<td>14</td>
</tr>
<tr>
<td>3.4</td>
<td>Other Compiler Techniques</td>
<td>16</td>
</tr>
</tbody>
</table>
1. Motivating VLIW Processors

We will use a simple vector-scalar multiply (VSMUL) kernel to explore both VLIW and SIMD processors, so we first need to establish the baseline performance of this kernel on our canonical PARCv1 in-order single-issue processor and PARCv1 out-of-order superscalar processor.

```
for ( int i = 0; i < n; i++ )
    dest[i] = src[i] * coeff;
```

For all problems we will assume \( n \) is 64.

```
loop:
    lw  r1, 0(r2)
    mul r3, r1, r4
    sw  r3, 0(r5)
    addiu r2, r2, 4
    addiu r5, r5, 4
    addiu r7, r7, -1
    bne r7, r0, loop
```

Calculate the performance of this loop on the canonical PARCv1 in-order single-issue processor with a four-cycle iterative multiplier.
Calculate performance of VSMUL on the canonical PARCv1 OOO quad-issue processor with register renaming, OOO memory disambiguation, perfect branch prediction, and speculative execution.

```
lw  r1, 0(r2)
mul r3, r1, r4
sw  r3, 0(r5)
addiu r2, r2, 4
addiu r5, r5, 4
addiu r7, r7, -1
bne r7, r0, loop

opA
lw  r1, 0(r2)
mul r3, r1, r4
sw  r3, 0(r5)
addiu r2, r2, 4
addiu r5, r5, 4
addiu r7, r7, -1
bne r7, r0, loop

opA
```
Superscalar Control Logic Scaling

- Each issued instruction must somehow check against $W \times L$ other instructions, i.e., hardware scaling $\propto W \times (W \times L)$

- For in-order issue
  - $L$ is related to pipeline latencies
  - Check is done in the I stage (potentially using a scoreboard)

- For out-of-order issue
  - $L$ is related to pipeline latencies and time spent waiting in the IQ/ROB
  - Check is done by broadcasting tags to waiting instructions at completion
  - As $W$ increases, we need more instructions in-flight (i.e., waiting in IQ/ROB) to find enough ILP to keep functional units busy
  - Out-of-order control logic scales faster than $W^2$ (more like $W^3$)
**VLIW = Very Long Instruction Word**

**Key Idea:** Replace a traditional sequential ISA with a new ISA that enables the compiler to encode instruction-level parallelism (ILP) directly in the hardware/software interface.

- Multiple “sub-instructions” packed into one long instruction
- Each “slot” in a VLIW instruction for a specific functional unit
- Sub-instructions within a long instruction *must* be independent
2. PARCv1 VLIW Processor

- Compiler is responsible for avoiding all hazards!
  - Must use nops to avoid RAW hazards
  - Must use branch delay slots to avoid control hazards
  - Must use “slots” in VLIW instruction to avoid structural hazards
  - Must use extra architectural registers to avoid WAW/WAR name hazards

2. PARCv1 VLIW Processor

- PARCv1 VLIW ISA
  - Sub-instructions in VLIW instruction must be independent
  - 4-cycle Y-pipe, 1-cycle X-pipe, 2-cycle L-pipe, 2-cycle S-pipe
  - No hazard checking, assume ISA enables setting bypass muxing
  - Branch delay slot with two VLIW instructions
  - Early commit point in D
  - Stores go into memory system in S1

\[
\begin{align*}
\text{PARCv1 VLIW Instruction} & \quad \text{Y-pipe} \quad \text{X-pipe} \quad \text{L-pipe} \quad \text{S-pipe} \\
\multicolumn{4}{c}{\text{mul } r1, r2, r3 \quad \text{addu } r4, r1, r5 \quad \text{lw } r6, 0(r7) \quad \text{sw } r6, 0(r8)}
\end{align*}
\]
### 2. PARCv1 VLIW Processor

**loop:**

- `lw`  `r1, 0(r2)`
- `mul` `r3, r1, r4`
- `sw`  `r3, 0(r5)`
- `addiu` `r2, r2, 4`
- `addiu` `r5, r5, 4`
- `addiu` `r7, r7, -1`
- `bne`  `r7, r0, loop`

<table>
<thead>
<tr>
<th></th>
<th>Y-pipe</th>
<th>X-pipe</th>
<th>L-pipe</th>
<th>S-pipe</th>
</tr>
</thead>
<tbody>
<tr>
<td>lw</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>mul</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>sw</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>addiu</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>addiu</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>bne</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
3. VLIW Compilation Techniques

We will explore several compiler techniques that are critical for achieving high-performance on VLIW processors (note that some of these techniques can help improve performance on traditional processors too!):

- Loop unrolling
- Software pipelining
- Loop unrolling + software pipelining
- Other techniques: trace scheduling, instruction encoding

3.1. Loop Unrolling

```c
int j = 0;
for ( int i = 0; i < n; i += 4 ) {
    dest[i+0] = src[i+0] * coeff;
    dest[i+1] = src[i+1] * coeff;
    dest[i+2] = src[i+2] * coeff;
    dest[i+3] = src[i+3] * coeff;
    j += 4;
}
for ( ; j < n; j++ )
    dest[j] = src[j] * coeff;
```

- Loop unrolling amortizes loop overhead
- Loop unrolling requires many arch regs for sw renaming; increases static code size
- Need to carefully handle cases where n is not divisible by 4; add extra “fix-up code” after unrolled loop

loop:

```assembly
lw r1, 0(r2)
lw rA, 4(r2)
lw rB, 8(r2)
lw rC, c(r2)
mul r3, r1, r4
mul rD, rA, r4
mul rE, rB, r4
mul rF, rC, r4
sw r3, 0(r5)
sw rD, 4(r5)
sw rE, 8(r5)
sw rF, c(r5)
addiu r2, r2, 16
addiu r5, r5, 16
addiu r7, r7, -4
bne r7, r0, loop
```
3. VLIW Compilation Techniques

3.1. Loop Unrolling

```assembly
loop:
lw    r1, 0(r2)
lw    rA, 4(r2)
lw    rB, 8(r2)
lw    rC, c(r2)
mul   r3, r1, r4
mul   rD, rA, r4
mul   rE, rB, r4
mul   rF, rC, r4
sw    r3, 0(r5)
sw    rD, 4(r5)
sw    rE, 8(r5)
sw    rF, c(r5)
addiu r2, r2, 16
addiu r5, r5, 16
addiu r7, r7, -4
bne   r7, r0, loop
```

<table>
<thead>
<tr>
<th></th>
<th>Y-pipe</th>
<th>X-pipe</th>
<th>L-pipe</th>
<th>S-pipe</th>
</tr>
</thead>
<tbody>
<tr>
<td>lw</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>lw</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>lw</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>lw</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>lw</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>lw</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>lw</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>mul</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>mul</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>mul</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>mul</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>mul</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>sw</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>sw</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>sw</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>sw</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>sw</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>addiu</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>addiu</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>addiu</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>bne</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Unroll by factor of eight?

<table>
<thead>
<tr>
<th>Y-pipe</th>
<th>X-pipe</th>
<th>L-pipe</th>
<th>S-pipe</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
Calculate performance of VSMUL on the canonical PARCv1 OOO quad-issue processor with register renaming, OOO memory disambiguation, perfect branch prediction, and speculative execution. Assume we unroll the loop by a factor of four.

<table>
<thead>
<tr>
<th>lw</th>
<th>r1, 0(r2)</th>
</tr>
</thead>
<tbody>
<tr>
<td>lw</td>
<td>rA, 4(r2)</td>
</tr>
<tr>
<td>lw</td>
<td>rB, 8(r2)</td>
</tr>
<tr>
<td>lw</td>
<td>rC, c(r2)</td>
</tr>
<tr>
<td>mul</td>
<td>r3, r1, r4</td>
</tr>
<tr>
<td>mul</td>
<td>rD, rA, r4</td>
</tr>
<tr>
<td>mul</td>
<td>rE, rB, r4</td>
</tr>
<tr>
<td>mul</td>
<td>rF, rC, r4</td>
</tr>
<tr>
<td>sw</td>
<td>r3, 0(r5)</td>
</tr>
<tr>
<td>sw</td>
<td>rD, 4(r5)</td>
</tr>
<tr>
<td>sw</td>
<td>rE, 8(r5)</td>
</tr>
<tr>
<td>sw</td>
<td>rF, c(r5)</td>
</tr>
<tr>
<td>addiu</td>
<td>r2, r2, 16</td>
</tr>
<tr>
<td>addiu</td>
<td>r5, r5, 16</td>
</tr>
<tr>
<td>addiu</td>
<td>r7, r7, -4</td>
</tr>
<tr>
<td>bne</td>
<td>r7, r0, loop</td>
</tr>
</tbody>
</table>
3.2. Software Pipelining

Take instructions from multiple iterations to create new loop that can run at higher peak throughput.

- Start with original loop and focus on core computation
- Create “software” pipeline diagram
- Create prologue to fill the pipeline
- Create main loop body
- Create epilogue to drain the pipeline
3. VLIW Compilation Techniques

3.2. Software Pipelining

lw r1, 0(r2)     mul r3, r1, r4
addiu r2, r2, 4 lw r1, 0(r2)     addiu r2, r2, 4

loop:
sw r3, 0(r5)     mul r3, r1, r4
lw r1, 0(r2)     addiu r2, r2, 4
addiu r5, r5, 4 addiu r7, r7, -1
bne r7, r0, loop

sw r3, 0(r5)     addiu r5, r5, 4
mul r3, r1, r4   sw r3, 0(r5)

• Software Pipelining vs Loop Unrolling
  – Produces more compact code
  – Uses less registers
  – Can better handle irregularly sized input arrays
  – Quickly get up to peak throughput, one epilogue/prologue per loop
  – Software pipelining does not reduce loop overhead
3.3. Loop Unrolling and Software Pipelining

- Use loop unrolling to amortize loop overhead
- Use software pipelining to quickly reach full throughput (and also reduce code size, register pressure)
<table>
<thead>
<tr>
<th>Y-pipe</th>
<th>X-pipe</th>
<th>L-pipe</th>
<th>S-pipe</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
3.4. Other Compiler Techniques

- Problem: What if there are no loops?
  - Branches limit basic block size in control-flow intensive irregular code
  - Only one branch per VLIW instruction
  - Difficult to find ILP in individual basic blocks

- Trace scheduling
  - Use predication to create larger basic blocks
  - Use profiling feedback or compiler heuristics to find common branch paths
  - Pick trace of basic blocks along common path
  - Schedule whole trace at once
  - Add “fix-up code” to cope with branches jumping out of trace

VLIW Instruction Encoding

- Problem: VLIW encodings require many NOPs which waste space
- Compressed format in memory, expand on instruction cache refill
- Provide a single-op VLIW instruction
- Create variable sized “instruction bundles”

Memory Latency Register (MLR)

- Problem: Loads have variable latency
- Compiler schedules code for maximum load-use distance
- Compiler sets MLR to latency that matches code schedule
- Hardware ensures that loads take exactly MLR cycles
  - Hardware buffers loads that return early
  - Hardware stalls processor if loads return late