Overview#
There are two indicators of CPU performance: response time and throughput.
This article will focus on the following points:
- Pipeline technology
- Dealing with the three hazards encountered in pipeline technology
- Out-of-order execution
- Improving throughput with superscalar technology
Pipeline Technology#
The execution process of a CPU instruction can be divided into three simple steps:
- Instruction fetch
- Instruction decode
- Instruction execution
Since fetching an instruction requires a clock signal to increment the PC register, completing an instruction takes at least one clock cycle.
In the original single-cycle design, every instruction completes in exactly one clock cycle. However, instructions vary in complexity while the clock cycle is fixed, so the clock cycle must be as long as the execution time of the slowest instruction.
The drawback of this design is obvious: it wastes a lot of time and resources.
Whether an instruction is complex or simple, we must wait a full clock cycle before running the next one. During that time, many CPU resources, such as the ALU, may sit idle for long stretches, wasting them.
Therefore, the CPU needs a technology that can minimize the waste of time and hardware resources when running instructions.
Modern CPUs often use pipeline technology to solve this problem.
In simple terms, a pipeline lets multiple instructions overlap in execution. We divide the handling of an instruction into several steps, and a dedicated part of the pipeline handles each step. An instruction occupies only one step at any given moment, so when the first instruction moves from step 1 to step 2, the hardware for step 1 is freed and the second instruction can enter it, without waiting for the first instruction to complete.
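To make the overlap concrete, here is how three instructions would move through the simple three-stage pipeline above, one stage per clock cycle:

|               | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 |
|---------------|---------|---------|---------|---------|---------|
| Instruction 1 | Fetch   | Decode  | Execute |         |         |
| Instruction 2 |         | Fetch   | Decode  | Execute |         |
| Instruction 3 |         |         | Fetch   | Decode  | Execute |

Once the pipeline is full, one instruction completes every cycle, even though each individual instruction still takes three cycles.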
To pipeline instructions, we first need to define the number of pipeline stages, which is the number of steps just mentioned.
Different CPUs have different numbers of stages. Some are more fine-grained and have dozens of stages, while others have only a few stages.
Taking LEGv8, a teaching subset of the ARMv8 instruction set, as an example, instruction execution is divided into five steps:
- Fetch the instruction from instruction memory
- Decode the instruction and read the register operands
- Execute the instruction
- Read/write memory
- Write back to registers
Each pipeline stage takes one clock cycle, so an instruction ultimately needs as many clock cycles as there are pipeline stages.
The clock cycle no longer has to be as long as an entire instruction; it only has to be as long as the slowest pipeline stage.
At the same time, the execution time of a single instruction has not been optimized by pipeline technology. Pipelining improves performance by increasing the throughput of instructions.
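A minimal back-of-the-envelope sketch in Go of this throughput gain; the five-stage pipeline, the 200 ps stage time, and the instruction count are made-up numbers for illustration:

```go
package main

import "fmt"

func main() {
	const (
		numInstructions = 1000
		numStages       = 5
		stageTimePs     = 200 // hypothetical: the slowest stage takes 200 ps
	)

	// Single-cycle design: the clock must cover all stages back to back,
	// so every instruction takes numStages * stageTimePs.
	singleCycle := numInstructions * numStages * stageTimePs

	// Pipelined design: the first instruction takes numStages cycles to
	// drain; after that, one instruction completes every cycle.
	pipelined := (numStages + numInstructions - 1) * stageTimePs

	fmt.Printf("single-cycle: %d ps\n", singleCycle) // 1000000 ps
	fmt.Printf("pipelined:    %d ps\n", pipelined)   // 200800 ps, roughly 5x the throughput
}
```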
In addition, the number of pipeline stages cannot be too high because increasing the depth of the pipeline has its costs.
The output of each pipeline stage must be saved in additional pipeline registers. Although these registers are fast to read and write, once the number of stages grows large, say to thirty or forty, the accumulated overhead increases both response time and power consumption.
The Three Hazards in Pipelining#
A pipeline can encounter a situation where the next instruction cannot be executed in the next clock cycle. This situation is called a hazard.
Although hazards pose a crisis, they also bring opportunities. If we solve the problems encountered in pipeline technology well, we can further improve the throughput of the CPU.
There are three types of hazards in pipeline technology: structural hazards, data hazards, and control hazards.
Structural Hazards#
The essence of a structural hazard is a resource competition problem caused by insufficient hardware resources.
In a pipeline, if two instructions in different stages both require the same hardware resource, and there is only one copy of this resource, a structural hazard will occur, preventing the instructions from running smoothly.
Taking the five steps mentioned earlier as an example, both step 1 (instruction fetch) and step 4 (memory read/write) need to access memory through its address decoder. Since there is only one decoder, fetching the address of the next instruction conflicts with fetching the address of data in memory.
There are two solutions to this problem.
One is to divide the memory into two parts, one for data and one for instructions, each with its own decoder. The disadvantage is obvious: since the memory is divided in half, dynamic allocation is not possible, resulting in a loss of flexibility.
Another solution is to divide the CPU's internal cache into separate instruction and data caches. This can solve the resource conflict problem in structural hazards without affecting the dynamic allocation of memory.
The former design is called the Harvard architecture, while the latter is what modern CPUs based on the von Neumann architecture adopt; it is sometimes described as a modified Harvard architecture.
Data Hazards#
Data hazards occur when one step must wait for another step to complete (i.e., there is a data dependency), resulting in a pipeline stall.
For example, consider the following code:
```go
a := 1
b := 2
a = a + 1
b = a + 1
```
The final value of variable b depends on the result of variable a after the "add 1" operation. The order of these operations must be maintained.
The simplest solution is pipeline bubbling. After an instruction is decoded, the hardware checks whether it depends on data that an in-flight instruction has not yet produced. If so, it inserts NOP operations (instructions that do nothing) to stall the pipeline, and the dependent instruction proceeds once the dependency is resolved. (Because the pipeline is driven by the clock signal, it cannot actually stop; the NOPs fill the slots instead.) This solution carries a relatively large performance cost.
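Here is a minimal sketch in Go of how a controller might insert bubbles. The `Instr` type and the two-bubble gap are illustrative assumptions (a classic five-stage pipeline whose register file is written in the first half of write-back and read in the second half of decode):

```go
package main

import "fmt"

// Instr is a toy three-register instruction: Dest = op(Srcs...).
type Instr struct {
	Name string
	Dest string
	Srcs []string
}

// insertBubbles inserts NOPs so that any instruction reading a register sits
// at least `gap` slots behind the instruction that writes it. For the assumed
// five-stage pipeline without forwarding, gap = 2.
func insertBubbles(prog []Instr, gap int) []Instr {
	var out []Instr
	for _, ins := range prog {
		needed := 0
		n := len(out)
		// Check the last `gap` issued slots for a producer of one of our sources.
		for i := 1; i <= gap && n-i >= 0; i++ {
			prev := out[n-i]
			if prev.Dest == "" {
				continue // NOPs write nothing
			}
			for _, s := range ins.Srcs {
				if prev.Dest == s && gap-i+1 > needed {
					needed = gap - i + 1
				}
			}
		}
		for ; needed > 0; needed-- {
			out = append(out, Instr{Name: "nop"})
		}
		out = append(out, ins)
	}
	return out
}

func main() {
	prog := []Instr{
		{Name: "add $t1, $t2, $t3", Dest: "$t1", Srcs: []string{"$t2", "$t3"}},
		{Name: "sub $t4, $t1, $t2", Dest: "$t4", Srcs: []string{"$t1", "$t2"}},
	}
	// Prints the producer, two NOPs, then the dependent instruction.
	for _, ins := range insertBubbles(prog, 2) {
		fmt.Println(ins.Name)
	}
}
```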
A more advanced strategy is operand forwarding, also known as operand bypassing.
Let's take the following two instructions as an example:
```asm
add $t0, $s2, $s1
add $s2, $s1, $t0
```
The second instruction depends on the value in $t0. Under the pipeline-bubbling strategy above, we would have to wait for the first instruction's result to be written back to $t0 before executing the second. In fact, that wait is unnecessary: the result of $s2 + $s1 can be passed directly to the second instruction as its $t0 input, which requires only some additional circuitry. This technique is operand forwarding, and the extra circuitry is called a bypass.
By using operand forwarding, the number of NOP operations can be effectively reduced, further improving pipeline efficiency.
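In terms of the bubbling sketch above, adding an execute-to-execute bypass for ALU results amounts to shrinking the required gap from two slots to zero, so `insertBubbles(prog, 0)` would emit no NOPs for that pair of instructions. This is a simplification: a load-use dependency would still need a bubble, since the loaded value only becomes available after the memory-access stage, and real forwarding networks vary by microarchitecture.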
Control Hazards#
In high-level programs, for loops and if statements compile down to comparison instructions (such as cmp) and jump instructions (such as jmp) in assembly. Whether execution continues with the next instruction in the program counter (PC) register or jumps to the address given by jmp depends on the outcome of the cmp instruction. When the pipeline must fetch the next instruction before that outcome is known, we have a control hazard.
To prevent pipeline stalls, we need to predict the result of the cmp instruction before its result is available. This is called branch prediction.
There are many types of branch prediction techniques. The simplest one is called static branch prediction. The CPU assumes that no jump will occur and continues to execute sequentially. If the prediction is wrong, the instructions executed after the jump need to be discarded, which incurs a certain performance cost.
A more advanced technique is dynamic prediction. One common scheme is one-level branch prediction, also known as the 1-bit predictor. Essentially, it uses a single bit to record the branch's most recent outcome and uses that value as the prediction for the next occurrence: if the branch was taken this time, predict taken next time; otherwise, predict not taken. There is also the 2-bit saturating counter, which introduces a small state machine so that the prediction only flips after two consecutive mispredictions. For more detail on branch prediction, see the Wikipedia article on branch predictors.
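As a concrete illustration, here is a minimal 2-bit saturating counter in Go; the state encoding and the loop-branch outcome sequence are made up for the example:

```go
package main

import "fmt"

// TwoBit is a 2-bit saturating counter.
// States: 0 = strongly not taken, 1 = weakly not taken,
//         2 = weakly taken,       3 = strongly taken.
type TwoBit struct{ state uint8 }

// Predict returns true if the branch is predicted taken.
func (p *TwoBit) Predict() bool { return p.state >= 2 }

// Update moves the counter toward the actual outcome, saturating at 0 and 3.
func (p *TwoBit) Update(taken bool) {
	if taken && p.state < 3 {
		p.state++
	} else if !taken && p.state > 0 {
		p.state--
	}
}

func main() {
	p := &TwoBit{}
	// A loop branch: taken several times, not taken once at loop exit,
	// then taken again when the loop re-enters.
	outcomes := []bool{true, true, true, true, false, true, true}
	correct := 0
	for _, actual := range outcomes {
		if p.Predict() == actual {
			correct++
		}
		p.Update(actual)
	}
	fmt.Printf("%d/%d predictions correct\n", correct, len(outcomes))
	// The single loop-exit misprediction only moves the counter from
	// "strongly taken" to "weakly taken", so the next loop iteration is
	// still predicted correctly, unlike with a 1-bit predictor.
}
```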
Out-of-Order Execution#
Operand forwarding from the data hazards section removes many unnecessary NOP operations, but in some cases NOPs are unavoidable. To further improve throughput, the CPU can fill those otherwise wasted slots with later instructions that have no pending dependencies and can execute immediately. This technique is called out-of-order execution.
The process works as follows.
In the reservation station, an instruction waits for its dependent data to arrive before being sent to the functional unit (FU) for execution. This stage is called out-of-order execution.
An FU is essentially an ALU, but different FUs can be specialized for different operations, and the out-of-order execution stage tries to keep as many of them busy as possible.
After execution, the instruction needs to be placed in the reorder buffer in the original order.
Furthermore, the result of the instruction calculation is first submitted to the store buffer and then to the CPU's cache or memory.
The memory-access and write-back stages must proceed in program order to prevent data inconsistency across cores. Each CPU core has its own cache and registers; if writes reached memory in an inconsistent order, unexpected errors could occur.
From the outside, this process still appears sequential, but internally, each FU is not idle, further increasing the CPU's throughput.
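The following Go sketch mimics this behavior under strong simplifying assumptions (unlimited functional units, invented latencies): instructions begin executing as soon as their source operands are ready, but retire strictly in program order, as a reorder buffer would enforce:

```go
package main

import "fmt"

// Instr is a toy instruction with a result register, source registers,
// and an execution latency in cycles (all values invented for the example).
type Instr struct {
	Name    string
	Dest    string
	Srcs    []string
	Latency int
}

func main() {
	// Registers (and memory) that are ready from the start.
	ready := map[string]int{"r4": 0, "r5": 0, "mem": 0}

	prog := []Instr{
		{"load r1, [mem]", "r1", []string{"mem"}, 4},       // slow memory access
		{"add  r2, r1, r1", "r2", []string{"r1"}, 1},       // depends on the load
		{"mul  r3, r4, r5", "r3", []string{"r4", "r5"}, 3}, // independent
	}

	// Out-of-order issue: each instruction starts as soon as its sources
	// are ready (unlimited functional units assumed for simplicity).
	finish := make([]int, len(prog))
	for i, ins := range prog {
		start := 0
		for _, s := range ins.Srcs {
			if ready[s] > start {
				start = ready[s]
			}
		}
		finish[i] = start + ins.Latency
		ready[ins.Dest] = finish[i]
	}

	// In-order commit via a reorder buffer: instruction i cannot retire
	// before any instruction that precedes it in program order.
	retire := 0
	for i, ins := range prog {
		if finish[i] > retire {
			retire = finish[i]
		}
		fmt.Printf("%-16s finishes at cycle %d, retires at cycle %d\n",
			ins.Name, finish[i], retire)
	}
}
```

Note how the independent mul finishes executing at cycle 3, inside the shadow of the slow load, yet it cannot retire until the instructions before it have retired.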
Superscalar Technology#
Even if all the CPU performance improvement techniques mentioned above are used, the instructions per clock (IPC) will not exceed 1 because the CPU can only fetch one instruction in one clock cycle.
Since the execution stage can already proceed in parallel through out-of-order execution, the fetch and decode stages can be parallelized as well: multiple instructions are fetched from memory at once and sent to multiple instruction decoders.
This technique is called multiple issue, and a CPU that uses it is called superscalar.
Modern CPUs adopt this technique, which allows their IPC to exceed 1 and, on favorable workloads, reach 2 or more.
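A rough sketch of the ideal arithmetic, using a hypothetical 4-wide machine; real IPC is lower because of dependencies, branch mispredictions, and cache misses:

```go
package main

import "fmt"

func main() {
	const (
		numInstructions = 1000
		width           = 4 // hypothetical issue width
		depth           = 5 // pipeline depth
	)
	// Ideal multiple issue: `width` instructions enter the pipeline per
	// cycle, plus the initial cycles needed to fill the pipeline.
	cycles := (numInstructions+width-1)/width + depth - 1
	fmt.Printf("cycles: %d, ideal IPC: %.2f\n",
		cycles, float64(numInstructions)/float64(cycles)) // 254 cycles, IPC 3.94
}
```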
Conclusion#
The purpose of this article is to introduce common techniques for improving CPU performance.
The article started with pipeline technology, then discussed the three pipeline hazards and their remedies: split caches, pipeline bubbling, operand forwarding, and branch prediction. It then covered out-of-order execution and superscalar technology, which exploit parallelism to further raise CPU throughput.
Mastering these concepts gives you a big-picture understanding of how modern CPUs improve performance.