In the branch delay slot, we edit the return address so that when function1 returns, it resumes execution at resume rather than nominalreturn, thereby avoiding having to execute another branch instruction. When the branch behaves as predicted, the instruction in the branch delay slot is simply executed as it would normally be with a delayed branch. When the branch is incorrectly predicted, the instruction in the branch delay slot is simply turned into a no-op. Examples of such branches are Cancel-if-taken and Cancel-if-not-taken branches. The nop instruction in the branch delay slot is executed. Then execution continues with the first instruction of the subroutine at 0x00400100. Control has been passed to the subroutine, and the return address is in $ra.
Source files: src/Processor/pipestages.cc, src/Processor/tagcvt.cc, src/Processor/active.cc, src/Processor/stallq.cc
Headers: incl/Processor/state.h, incl/Processor/instance.h, incl/Processor/instruction.h, incl/Processor/mainsim.h, incl/Processor/decode.h, incl/Processor/tagcvt.h, incl/Processor/active.h, incl/Processor/stallq.h
Since RSIM currently does not model an instruction cache, the instruction fetch and decode pipeline stages are merged. This stage starts with the function decode_cycle, called from maindecode.
The function decode_cycle starts out by looking in the processor stall queue, which consists of instructions that were decoded in a previous cycle but could not be added to the processor active list, either because of insufficient renaming registers or insufficient active list size. The processor will stop decoding new instructions by setting the processor field stall_the_rest after the first stall of this sort, so the stall queue should have at most one element. If there is an instruction in the stall queue, check_dependencies is called for it (described below). If this function succeeds, the instruction is removed from the processor stall queue. Otherwise, the processor continues to stall instruction decoding.
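The stall-queue handling described above can be sketched as follows. This is only an illustrative model, not RSIM's code: the Processor class, the retry_stalled helper, and the callback signature are hypothetical names introduced for this example.

```python
from collections import deque


class Processor:
    """Hypothetical stand-in for the processor state used in this sketch."""

    def __init__(self):
        self.stall_queue = deque()    # expected to hold at most one instance
        self.stall_the_rest = False


def retry_stalled(proc, check_dependencies):
    """Retry the stalled instance, if any.

    Returns True if decoding of new instructions may proceed this cycle.
    check_dependencies is passed in as a callback (an assumption of this
    sketch) and must return True on success.
    """
    if not proc.stall_queue:
        return True                   # nothing stalled from an earlier cycle
    inst = proc.stall_queue[0]
    if check_dependencies(inst, proc):
        proc.stall_queue.popleft()    # resources obtained: leave the queue
        return True
    return False                      # still stalled: decode nothing new
```

Passing the dependence check in as a callback keeps the sketch self-contained; in the simulator the check is the check_dependencies function described later in this section.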
After processing the stall queue, the processor will decode the instructions for the current cycle. If the program counter is valid for the application instruction region, the processor will read the instruction at that program counter, and convert the static instr data structure to a dynamic instance data structure through the function decode_instruction. The instance is the fundamental dynamic form of the instruction that is passed among the various functions in RSIM. If the program counter is not valid for the application, the processor checks to see if the processor is in privileged mode. If so, and if the program counter points to a valid instruction in the trap-table, the processor reads an instruction from the trap-table instead. If the processor is not in privileged mode, or the PC is not valid in the trap-table, the processor generates a single invalid instruction that will cause an illegal PC exception. Such a PC can arise through either an illegal branch or jump, or through speculation (in which case the invalid instruction will be flushed before it causes a trap).
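The choice of where the next instruction comes from reduces to a three-way decision. A minimal sketch, assuming the three conditions are already available as booleans (the function name and the returned labels are hypothetical):

```python
def select_fetch_source(pc_valid_in_app, privileged, pc_valid_in_trap_table):
    """Pick the instruction source for the current PC.

    Returns "application", "trap_table", or "invalid". The "invalid" case
    stands for the single invalid instruction that raises an illegal PC
    exception unless it is flushed first.
    """
    if pc_valid_in_app:
        return "application"
    if privileged and pc_valid_in_trap_table:
        return "trap_table"
    return "invalid"
```

Note that a PC inside the trap-table is still invalid when the processor is not in privileged mode.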
The decode_instruction function sets a variety of fields in the instance data structure. First, the various fields associated with the memory unit are cleared, and some fields associated with instruction registers and results are cleared. The relevant statistics fields are also initialized.
Then, the tag field of the instance is set to hold the value of the processor instruction counter. The tag field is the unique instruction id of the instance; currently, this field is set to be unique for each processor throughout the course of a simulation. Then, the win_num field of the instance is set. This represents the processor's register window pointer (cwp or current window pointer) at the time of decoding this instruction.
decode_instruction then sets the functional unit type and initializes dependence fields for this instance. Additionally, the stall_the_rest field of the processor is cleared; since a new instruction is being decoded, it is now up to the progress of this instruction to determine whether or not the processor will stall.
At this point, the instance must determine its logical source registers and the physical registers to which they are mapped. In the case of integer registers (which may be windowed), the function convert_to_logical is called to convert from a window number and architectural register number to an integer register identifier that identifies the logical register number used to index into the register map table (which does not account for register windows). If an invalid source register number is specified, the instruction will be marked with an illegal instruction trap.
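To illustrate the kind of translation involved, the sketch below flattens a (window, register) pair into one logical index. The layout is an assumption made for this example (8 globals, 16 registers owned by each window, and the outs of window w aliasing the ins of window w + 1); the actual numbering used by convert_to_logical may differ, and to_logical_sketch is a hypothetical name.

```python
NUM_GLOBALS = 8        # r0..r7 are shared by all windows
REGS_PER_WINDOW = 16   # 8 ins + 8 locals owned by each window


def to_logical_sketch(win_num, arch_reg, num_windows=8):
    """Map (window, architectural register) to a flat logical index.

    Returns None for an invalid register number, which the caller would
    turn into an illegal instruction trap.
    """
    if not 0 <= arch_reg < 32:
        return None
    if arch_reg < NUM_GLOBALS:
        return arch_reg                      # globals are not windowed
    if arch_reg < 16:                        # outs r8..r15
        offset = (win_num + 1) * REGS_PER_WINDOW + (arch_reg - 8)
    elif arch_reg < 24:                      # locals r16..r23
        offset = win_num * REGS_PER_WINDOW + 8 + (arch_reg - 16)
    else:                                    # ins r24..r31
        offset = win_num * REGS_PER_WINDOW + (arch_reg - 24)
    # Wrap around so the last window's outs alias the first window's ins.
    return NUM_GLOBALS + offset % (num_windows * REGS_PER_WINDOW)
```

Under this layout, an out register of one window and the matching in register of the next window resolve to the same logical register, which is what lets the map table ignore register windows.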
At this point, the instance must handle the case where it is an instruction that will change the processor's register window pointer (such as SAVE or RESTORE). The processor provides two fields (CANSAVE and CANRESTORE) that identify the number of windowing operations that can be allowed to proceed [23]. If the processor cannot handle the current windowing operation, this instance must be marked with a register window trap, which will later be processed by the appropriate trap handler. Otherwise, the instance will change its win_num to reflect the new register window number.
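The window check can be sketched as below. The direction in which SAVE and RESTORE move the window pointer, the default of 8 windows, and the "window_trap" label are assumptions of this example, and decode_window_op is a hypothetical name. As in the text, only the instance's win_num changes at this stage.

```python
def decode_window_op(op, win_num, cansave, canrestore, num_windows=8):
    """Return (new_win_num, trap) for one decoded instruction.

    trap is None when the operation may proceed, or "window_trap" when
    CANSAVE / CANRESTORE do not allow it.
    """
    if op == "SAVE":
        if cansave == 0:
            return win_num, "window_trap"
        return (win_num + 1) % num_windows, None   # assumed direction
    if op == "RESTORE":
        if canrestore == 0:
            return win_num, "window_trap"
        return (win_num - 1) % num_windows, None   # assumed direction
    return win_num, None                           # not a windowing op
```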
In a release-consistent system, the processor will now detect MEMBAR operations and note the imposed ordering constraints. These constraints will be used by the memory unit.
The instance will now determine its logical destination register numbers, which will later be used in the renaming stage. If the previous instruction was a delayed branch, it would have set the processor's copymappernext field (as described below). If the copymappernext field is set, then this instruction is the delay slot of the previous delayed branch and must try to allocate a shadow mapper. The branchdep field of the instance is set to indicate this.
Now the processor PC and NPC are stored with each created instance. We store program counters with each instruction not to imitate the actual behavior of a system, but rather as a simulator abstraction. If the instance is a branch instruction, the function decode_branch_instruction is called to predict or set the new program counter values; otherwise, the PC is updated to the NPC, and the NPC is incremented. decode_branch_instruction may also set the branchdep field of the instance (for predicted branches that may annul the delay slot), the copymappernext field of the processor (for predicted, delayed branches), or the unpredbranch field of the processor (for unpredicted branches).
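The PC/NPC pair is what makes the delay slot fall out naturally. A simplified sketch, assuming PCs count in units of one instruction and ignoring prediction and annulment (next_pcs and its parameters are hypothetical names):

```python
def next_pcs(pc, npc, is_taken_delayed_branch=False, target=None, step=1):
    """Return the (PC, NPC) pair that follows the current instruction.

    For an ordinary instruction the PC takes the old NPC and the NPC is
    incremented. For a taken delayed branch the PC still takes the old NPC
    (the delay slot) while the NPC becomes the branch target.
    """
    if is_taken_delayed_branch:
        return npc, target
    return npc, npc + step
```

For example, a taken delayed branch decoded with (PC, NPC) = (10, 11) and target 50 yields (11, 50), so the delay slot at 11 is decoded next; the delay slot then yields (50, 51).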
If the instance is predicted as a taken branch, then the processor will temporarily set the stall_the_rest field to prevent any further instructions from being decoded this cycle, as we currently assume that the processor cannot decode instructions from different regions of the address space in the same cycle.
After this point, control returns to decode_cycle. This function now adds the decoded instruction to the tag converter, a structure used to convert from the tag of the instance into an instance data structure pointer. This structure is used internally for communication among the modules of the simulator.
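Conceptually, the tag converter is a lookup table from tags to instances. The sketch below uses a dictionary for clarity; the class, its method names, and the choice of container are assumptions of this example and say nothing about how tagcvt.cc is actually implemented.

```python
class TagConverterSketch:
    """Hypothetical tag-to-instance lookup table."""

    def __init__(self):
        self._by_tag = {}

    def add(self, tag, instance):
        self._by_tag[tag] = instance          # register a decoded instance

    def lookup(self, tag):
        return self._by_tag.get(tag)          # None if the tag is unknown

    def remove(self, tag):
        self._by_tag.pop(tag, None)           # e.g. once the instance retires
```

Modules can then pass small integer tags to each other and resolve them to instances only when needed.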
Now the check_dependencies function is called for the dynamic instruction. If RSIM was invoked with the "-q" option and there are too many unissued instructions to allow this one into the issue window, this function will stall further decoding and return. If RSIM was invoked with the "-X" option for static scheduling and even one prior instruction is still waiting to issue (to the ALU, FPU, or address generation unit), further decoding is stopped and this function returns. Otherwise, this function will attempt to provide renaming registers for each of the destination registers of this instruction, stalling if there are none available. As each register is remapped in this fashion, the old mapping is added to the active list (so that the appropriate register will be freed when this instruction graduates), again stalling if the active list has filled up. It is only after this point that a windowing instruction actually changes the register window pointer of the processor, updating the CANSAVE and CANRESTORE fields appropriately. Note that single-precision floating point registers (referred to as REG_FPHALF) are mapped and renamed according to double-precision boundaries to account for the register-pairing present in the SPARC architecture [23]. As a result, single-precision floating point codes are likely to experience significantly poorer performance than double-precision codes, actually experiencing the negative effects of anti-dependences and output-dependences which are otherwise resolved by register renaming.
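The renaming step can be sketched as follows. The data structures (a dictionary map table, a list used as a free list, a list of old mappings as the active list) and the name rename_destinations are assumptions of this example. Unlike the description above, the sketch checks all resources before changing any state, a simplification so that a stall leaves nothing half-renamed.

```python
def rename_destinations(dests, map_table, free_list, active_list,
                        active_capacity):
    """Give each logical destination register a fresh physical register.

    Returns True on success, or False if decoding must stall because the
    free list or the active list cannot accommodate the instruction.
    """
    if len(free_list) < len(dests):
        return False                               # no renaming registers
    if len(active_list) + len(dests) > active_capacity:
        return False                               # active list is full
    for logical in dests:
        new_phys = free_list.pop()
        # Remember the old mapping so its register can be freed at graduation.
        active_list.append((logical, map_table[logical]))
        map_table[logical] = new_phys
    return True
```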
If a resource was not available at any point above, check_dependencies will set stall_the_rest and return an error code, allowing the instance to be added to the stall queue. Although the simulator assumes that there are enough renaming registers for the specified active-list size by default, check_dependencies also includes code to stall if the instruction could not obtain its desired renaming registers.
After the instance has received its renaming registers and active list space, check_dependencies continues with further processing. If the instruction requires a shadow mapper (has branchdep set to 2, as described above), the processor tries to allocate a shadow mapper by calling AddBranchQ. If a shadow mapper is available, the branchdep field is cleared. Otherwise, the stall_the_rest field of the processor is set and the instance is added to the queue of instructions waiting for shadow mappers. If the processor had its unpredbranch field set, the stall_the_rest field is set, either at the branch itself (on an annulling branch), or at the delay slot (for a non-annulling delayed branch).
The instance now checks for outstanding register dependences. The instance checks the busy bit of each source register (for single-precision floating-point operations, this includes the destination register as well). For each busy bit that is set, the instruction is put on a distributed stall queue for the appropriate register. If any busy bit is set, the truedep field is set to 1. If the busy bits of rs2 or rscc are set, the addrdep field is set to 1 (this field is used to allow memory operations to generate their addresses while the source registers for their value might still be outstanding).
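The busy-bit check can be sketched as below. The representation (a dictionary of busy bits keyed by physical register, and one list per register as the distributed stall queue) and the name check_register_deps are assumptions of this example; the rule for addrdep follows the text above.

```python
def check_register_deps(tag, sources, busy, stall_queues):
    """Compute (truedep, addrdep) for one instance.

    sources maps a source role ("rs1", "rs2", "rscc", ...) to a physical
    register number. busy maps physical registers to their busy bit.
    The instance's tag is appended to the stall queue of every busy source.
    """
    truedep = 0
    addrdep = 0
    for role, phys in sources.items():
        if busy.get(phys, False):
            stall_queues.setdefault(phys, []).append(tag)
            truedep = 1
            if role in ("rs2", "rscc"):
                addrdep = 1
    return truedep, addrdep
```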
If the instruction is a memory operation, it is now dispatched to the memory unit, if there is space for it. If there is no space, either the operation is attached to a queue of instructions waiting for the memory unit (if the processor has dynamic scheduling and "-q" was not used to invoke RSIM), or the processor is stalled until space is available (if the processor has static scheduling, or has dynamic scheduling with the "-q" option to RSIM).
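The cases above reduce to a small decision function. A minimal sketch, with hypothetical names and return labels:

```python
def dispatch_memory_op(has_space, dynamic_scheduling, q_option):
    """Decide what happens to a memory operation at decode.

    Returns "dispatch" when the memory unit has room, "queue" when the
    operation waits in the memory-unit queue, or "stall" when the whole
    processor stalls until space is available.
    """
    if has_space:
        return "dispatch"
    if dynamic_scheduling and not q_option:
        return "queue"
    return "stall"
```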
If the instruction has no true dependences, the SendToFU function is called to allow this instruction to issue in the next stage.
decode_cycle continues looping until it decodes all the instructions it can (and is allowed to by the architectural specifications) in a given cycle.
On the MIPS architecture, jump and branch instructions have a 'delay slot'. This means that the instruction after the jump or branch instruction is executed before control transfers to the jump or branch target.
In addition, there is a group of 'branch likely' conditional branch instructions in which the instruction in the delay slot is executed only if the branch is taken.
The MIPS processors execute the jump or branch instruction and the delay slot instruction as an indivisible unit. If an exception occurs as a result of executing the delay slot instruction, the branch or jump instruction is not executed, and the exception appears to have been caused by the jump or branch instruction.
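The two rules above can be captured in a short sketch. The function names are hypothetical, and a fixed 4-byte instruction size is assumed:

```python
def delay_slot_executes(branch_likely, taken):
    """Ordinary branches always run the delay slot; 'branch likely'
    branches run it only when the branch is taken."""
    return taken if branch_likely else True


def reported_exception_pc(faulting_pc, in_delay_slot, insn_size=4):
    """An exception raised by a delay slot instruction is attributed to the
    preceding jump or branch, one instruction earlier."""
    return faulting_pc - insn_size if in_delay_slot else faulting_pc
```

For example, if a branch at 0x00400100 has a faulting delay slot instruction at 0x00400104, the exception is reported at 0x00400100, which is why a debugger sees the branch and its delay slot as one unit.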
This behavior of the MIPS processors affects both the TotalView instruction step command and TotalView breakpoints.
The TotalView instruction step command will step both the jump or branch instruction and the delay slot instruction as if they were a single instruction.
If a breakpoint is placed on a delay slot instruction, execution will stop at the jump or branch preceding the delay slot instruction, and TotalView will not know that it is at a breakpoint. At this point, attempting to continue the thread that hit the breakpoint without first removing the breakpoint will cause the thread to hit the breakpoint again without executing any instructions. Before continuing the thread, you must remove the breakpoint. If you need to reestablish the breakpoint, you might then use the instruction step command to execute just the delay slot instruction and the branch.
A breakpoint placed on a delay slot instruction of a branch likely instruction will be hit only if the branch is going to be taken.