Pipelining Risc Architecture Related Hazards Computer Science Essay

Published: November 9, 2015 Words: 2964

Pipelining, in general is used to speed up the process of work by breaking one large task into a number of different sub tasks, each using different resources. Pipelining is found everywhere from a Car producing industry to the design of most advanced computers. Pipelining does not reduce the time taken to do the work; it increases the effective throughput of the system without any changes to the output. It results in the effective utilization of system resources and is one of the most useful.

INTRODUCTION

The cost of a processor depends on the area of the Silicon. The cost of the processor is directly proportional to the size of the silicon. Hence, lowering the cost is a priority. One of the methods used to reduce the size of the processor is by time multiplexing the resources meaning use the same resource in a general way instead of building different resources with specific functions. This is called a Micro coded processor.

Depending on the instruction, the Micro coded processor takes a large cycle time to process the information decreasing the speed of the processor.

IDEAL PIPELINE

In an Idealized pipeline,

All objects should go through the same stages and all the stages.

There is no sharing of resources between any two stages.

Propagation delay through all the pipeline stages is equal.

Scheduling of an operation going through the pipeline should not affect the ongoing operations.

Figure1. Pipelined Control of RISC Architecture

The clock period is reduced by dividing the execution of a single instruction into multiple parts. The clock period is the maximum of the propagation delay of one of the stages.

Tclock > Max {TFetch, TDecode, TExecute, TMemory, TWriteback}

The clock cycles per instruction (CPI) will increase unless instructions are pipelined. So, a Control structure is required to pipeline the instructions.

Table1.Transactions versus Time Diagram for pipelining

Time

T0

T1

T2

T3

T4

T5

T6

T7

Instruction1

IF1

ID1

EX1

MA1

WB1

Instruction2

IF2

ID2

EX2

MA2

WB2

Instruction3

IF3

ID3

EX3

MA3

WB3

Instruction4

IF4

ID4

EX4

MA4

WB4

IF- Instruction Fetch, ID - Instruction Decode, EX - Execute, MA- Memory Access, WB- Write Back

Although pipelining does not increase the speed of the execution of the instruction, it increases the throughput of the system. It takes five clock cycles for Instruction 1 to execute, but, the output for the instructions following the first is obtained after each clock cycle instead of five clock cycles.

At any point after a certain point of time, all the resources are utilized effectively meaning that the pipeline is full and the pipeline stages do not have multiple instructions at the same time.

Unfortunately, in a microprocessor executing instructions, instructions depend on earlier instruction as in the case of Branch or a Control Instruction causing various hazards. Also the dividing the data path into different equal parts is not practical as memory is slower than the other stages.

NON IDEAL PIPELINE

In a non ideal pipeline, instructions are dependent of each other in a pipeline causing Hazards. There are three main types of hazards

Structural Hazard

Data Hazard

Control Hazard

STRUCTURAL HAZARD

Structural hazard occurs when two instructions need to use the same resource at the same point of time.

Approaches to resolve Structural Hazards

Schedule - Scheduling the instructions, explicitly avoiding the instructions causing structural hazard into the pipeline.

Delay - Stall the instruction in the pipeline until the resource is free to be used. Hardware includes Control Logic to stall the instruction until the resource is free

Additional Resources - Add extra hardware to the design so that each instruction can access independent resources at the same time.

Figure2. Two Cycle Memory

An example for a Structural Hazard can be a two cycle memory, where a single memory is accessed in two clock cycles and there is a chance that two instructions might need to use the memory at the same time in the pipeline. Only one instruction can be accessed during M0 and M1.

Table2. Structural Hazard at time T6

Time

T0

T1

T2

T3

T4

T5

T6

T7

T8

ADD

IF

ID

EX

M0

M1

WB

ADD

IF

ID

EX

M0

M1

WB

LOAD

IF

ID

EX

M0

M1

WB

LOAD

IF

ID

EX

M0

M1

WB

Also, structural hazard does not happen with add and other instructions because it does not access the resource at the same time, unlike load operation.

DATA HAZARDS

Data hazard occurs when one instruction depends on data from another instruction in the pipeline still in the pipeline.

Approaches to resolve Data Hazards

Schedule - Scheduling the instructions around it, explicitly avoiding the instructions which depend on the data of instruction in the pipeline.

Delay - Stall the instruction until the instruction on which the data is dependent has finished execution. Hardware includes Control Logic to freeze the earlier stages until the preceding instruction has finished execution.

Bypass- Add extra hardware to the data path, which allows values to be sent to be sent to an earlier stage before preceding instruction, has left the pipeline.

Speculate - catch the instruction executed with wrong data and execute the instruction again with correct data.

Example-

Instruction 1: R3 = R0 + 1 // Add 1 to the register R0 and save the result in R3

Instruction 2: R5 = R3 + 7 // Add 7 to the register R3 and save the result in R5

Since the value of R3 is not evaluated (still in the pipeline), while instruction 2 fetches for R3, the resulting value is stale.

One way to solve this would be to use a feedback as shown.

Figure3. Feedback to prevent Data Hazard

Later stages provide dependence information to previous stages which can stall or kill the instruction. This works only if the later stage sends feedback information to all the previous stages and so on but no feed forward.

Another way to solve would be to stall the second instruction till the first instruction is completed and the data is written back into memory. The second instruction should be delayed while the first instruction continues down the pipeline.

Figure4. Stalling the instructions to prevent Data Hazard

This delaying (freezing) can be done by inserting a multiplexer on the Instruction Register side, which inserts no operation (nop) instructions so that the first instruction can clear out of the pipe while the second instruction waits for the first one to finish execution. All the later instructions are stalled along with the second instruction.

Table3. Prevention of Data hazard by stalling

Time

T0

T1

T2

T3

T4

T5

T6

T7

T8

INS1

IF1

ID1

EX1

MA1

WB1

INS2

IF2

ID2

nop

nop

nop

EX2

MA2

WB2

INS3

IF3

nop

nop

nop

ID3

EX3

MA3

STALL CONTROL LOGIC

To come up with the Control Logic, detect if an instruction in the decode stage is reading a value from one of the registers used by the instructions in the later stages of the pipeline. Firstly, check the destination for the operand, the register identifier (Ws) with the two source operand register identifiers (Rs and Rt). Compare these registers in the Control Logic and if they match, stall everything earlier and insert No Operation (nop) down the pipeline.

Figure5. Control Logic to stall the instructions

Cstall - Control to stall the instruction

This works but introduces a lot of nop in the pipeline and not every instruction writes to the register. A "store" instruction does not write into the register, which means the Control Logic stalls the pipeline unnecessarily. By, removing the nop when store instruction is used, the performance of the pipeline can be improved. A write enabler is wired to detect these kinds of instructions.

Similarly, not every instruction reads both input operands. An immediate instruction reads only one of the source operands and the other value comes from the immediate bits which are in the instruction coding. A read enable signal is introduced for these kinds of instructions.

Also Jump instructions like Jump and Link or Jump and Link Register has an implicit destination target. The destination is not encoded in the Rd field of the instruction. A multiplexer as shown with a hardcoded value of 31 is used to handle Jump and Link and Jump and Link Register instructions.

Instruction

Function

Source(s)

Destination

ALU

Arithmetic

Functions

Rs, Rt

Rd

ALUI

ALU Functions

(Immediate)

Rs

Rt

LW

Load

Rs

Rt

SW

Store

Rs, Rt

BZ

Branch

Rs

J

PC <= (PC) + immediate

JAL

R31 <= (PC),

PC<=(PC)+ Imm

31

JR

PC<=(Rs)

Rs

JALR

R31 <= (PC),

PC <=(Rs)

Rs

31

An instruction (Cdest) writes the destination or not is determined by Source Operand identifier (Ws) and Write Enable (We).

Cdest

Re1 is true when the instruction takes in atleast the first source operand as register.

ALU, ALUI, LW, SW, JR, JALR - TRUE

J, JAL - FALSE

Re2 is true when the instruction takes the second operand as register.

ALU, SW - TRUE

A comparison between the source register identifier in the decode stage compared with the destination register identifier in the execute stage and other stages (represented by E , M and W) for both source operands.

Stall =

{[(RsD = WsE).WeE + (RsD =WsM).WeM + (RsD =WsW).WeW). Re1D] +

[((RtD =WsE).WeE + (RtD =WsM).WeM + (RtD =WsW).WeW) . Re2D]}

HAZARDS DUE TO LOAD AND STORE

There are a few more complications to the stall equation.

For a load instruction, the Load value is not ready till the write back stage, so there need to be additional stall signals for Load instruction. Load and Store are a bit more complicated because; there might be a data dependency on the Data memory itself.

Example

M [(R1) +7] <= R2)

R4 <= M [(R3) +5]

This might result in a Data Hazard depending on the values of R1 and R3.

If (R1 + 7) = (R3 + 5), the Load instruction needs to pick up the data value of the Store Instruction.

This is avoided in this RISC architecture because writing to memory is very fast and can be read in the next cycle. But in case of realistic memory systems, a closer look is needed.

BYPASS

The stall logic was good enough in terms of correctness, but in terms of performance, it wastes a lot of clock cycles by stalling. Hardware data path allows values to be sent to an earlier stage before preceding instruction has left the pipeline.

In example 1, control logic similar to the stall signal is used as a Select line for the multiplexer ASrc. If there is a Data hazard, the value from the multiplexer is taken instead of the data from the register identifier. By bypassing, the second instruction c n be executed simultaneously.

Time

T0

T1

T2

T3

T4

T5

T6

T7

T8

INS1

IF1

ID1

EX1

MA1

WB1

INS2

IF2

ID2

EX2

MA2

WB2

INS3

IF3

ID3

EX3

MA3

WB3

The load result doesn't until the output of data memory i.e. when it is too late. Hence load instruction along with the dependent instruction still needs to be stalled.

Example3. R1 <= Mem[R0 + 10]

R4 <=R1 + 17

Similarly, Jump and Link instructions do not work ,in the above architecture.

Example4. JAL 500

R4 <=R31 + 17

After by passing, in Example 1,

Stall = [(RsD =WsM).WeM + (RsD =WsW).WeW) . Re1D] +

[((RtD =WsE).WeE + (RtD =WsM).WeM + (RtD =WsW).WeW) . Re2D]}

ASrc = (RsD = WsE).WeE

Only, ALU and ALUI can benefit from this bypassing

Stall = [(RsD = WsE).We- Stall +(RsD =WsM).WeM + (RsD =WsW).WeW) . Re1D] +

[((RtD =WsE).WeE + (RtD =WsM).WeM + (RtD =WsW).WeW) . Re2D]}

ASrc = (RsD = WsE).We-Bypass.

FULLY BYPASSED DATAPATH

A fully data path allows for getting data from all the destination stages which takes care of JAL and other type of instructions. But this still requires a Stall signal for instructions dependent because the value of the load instruction till the end of all stages. The stall equation reduces to

Stall = (rsD=wsE). (opcode=LWE).(wsE != 0 ).re1D

+ (rtD=wsE). (opcodeE=LWE).(wsE !=0 ).re2D

CONTROL HAZARDS

A control hazard takes place when instruction to be executed depends on a control decision made by an earlier instruction. This involves the change in the value of Program Counter in a different manner dependent on the instruction.

Different instructions need different values to determine the value of the next program counter.

1. JUMP - The opcode to make sure it is a jump instruction, the Offset value of jump and the value of the Program counter is required to determine the next value.

2. JUMP REGISTER - The opcode to determine the instruction and the value in the register to jump to the location directly.

3. CONDITIONAL JUMP- The opcode to determine the conditional jump instruction, the value of the Program Counter, the register which gives the condition whether or not to jump and the offset value to add to the Program Counter.

4. OTHER INSTRUCTIONS - The opcode and the value of the Program Counter to which 4 is added (If the instruction is 4 bytes long).

The opcode doesn't get decoded till the decode stage. Register do not get fetched till Instruction fetch or decode stage. For conditional Jump , there is also a condition check. A basic control hazard would be execute instructionand fall through the next instruction. Assuming there are no Branch Delays

The first instruction goes through the pipeline followed by the second instruction. First goes into the fetch stage. But the problem here is, second instruction needs to be stalled at the fetch stage, because the function second instruction is unknown. For instance, first instruction may a branch instruction, or may be an ALU instruction. If it is a jump, the next address of the instruction is not yet determined.

The common term in the common through all these different instruction is they all need to decode the opcode before determining the function, But since this is not determined till the Decode stage of the Pipeline.

Time

T0

T1

T2

T3

T4

T5

T6

T7

Ins 1

IF1

ID1

EX1

MA1

WB1

Ins2

nop

IF2

ID2

EX2

MA2

WB2

Ins3

nop

IF3

ID3

EX3

MA3

Time

T0

T1

T2

T3

T4

T5

T6

T7

T8

IF

I1

nop

I2

nop

I3

ID

I1

nop

I2

nop

I3

EX

I1

nop

I2

nop

I3

MA

I1

nop

I2

nop

I3

WB

I1

nop

I2

nop

I3

This means the machine is running at half the desired performance.

SPECULATE

To mitigate the Above Control Hazard, one of the things that can be done is Speculate. In this , assume that the next instructionis not going to be a Branch or a Jump instruction and calculate the next value of program counter.

Example I1 096 ADD

I2 100 JMP 304

I3 104 ADD

I4 304 ADD

The adder adds 4 to the Program Counter i.e. 096 , 100 , 104 etc. If there is a jump, kill the instructions in the pipeline. The ISrc will be a nop instruction when there is a Jump instruction and the new value of the Program Counter is calculated.

I2 is a jump instruction , kill the instruction I3 and load the value 304 into the program counter. The execution continues as usual.

Time

T0

T1

T2

T3

T4

T5

T6

T7

IF

I1

I2

I3

I4

ID

I1

I2

nop

I4

EX

I1

I2

nop

I4

MA

I1

I2

nop

I4

WB

I1

I2

nop

I4

PIPELINING BRANCH INSTRUCTIONS

Example I1 096 ADD

I2 100 BEQZ R1 +200

I3 104 ADD

I4 304 ADD

There needs to be a comparison to determine if there should be a branch or not . Also since there is an instruction fetched and another in currently being decoded while the comparison is made, instead of one kill cycle , there needs to be two kill cycles. There are two multiplexers in the design - one at Decode and Execute to kill the instructions in the pipeline if there needs to be a branch.

Now the stall signal and the kill signal makes a lot of difference .So the stall signal needs to be effectively nullified. If the stall is given priority, there is no way to kill the previous instruction without stalling and the instruction in the decode stage is going to be invalid. So , Kill needs to take precedence.

Now , if the branch is happening, there should not be any stall and if there no branch , the instruction should be stalled.

New stall equation becomes,

Stall = ( ((rsD =wsE).weE + (rsD =wsM).weM + (rsD =wsW).weW).re1D + ((rtD =wsE).weE + (rtD =wsM).weM + (rtD=wsW).weW).re2D).!((opcodeE=BEQZ).z+(opcodeE=BNEZ).!z)

Time

T0

T1

T2

T3

T4

T5

T6

T7

IF

I1

I2

I3

I4

I5

ID

I1

I2

I3

nop

I5

EX

I1

I2

nop

nop

I5

MA

I1

I2

nop

nop

I5

WB

I1

I2

nop

nop

The CPI will increase to three in case of a Branch instruction. To decrease this , a comparator can be added in the decode stage to reduce one nop in the pipeline.

For an instruction, it needs to decode as a Branch Instruction and knowledge if it a Branch Zero or Branch not Zero in the architecture. A zero detector in the register files output. This reduces the CPI from three to two when miss speculated but at the expense of lowering the Clock Frequency ( Increase in Clock Cycle time)

Another approach would be to expose the drawbacks to the Software , by changing the Instruction Architecture by changing the semantics of the Jump or Branch Instruction that the next instruction is executed regardless of the Jump or Branch.

Much more Advanced technique would be using Branch Prediction which dramatically reduces Branch penalty.

OTHER TYPES OF HAZARDS

1.EXCEPTIONS

2.INTERRUPTS