辅导案例-CS520
CS520: Instruction‐level Parallelism Homework 3 – Spring 2020 RULES 1. You are to work individually on the homework. Cooperation is not allowed. 2. Points are assigned to each question to reflect the amount of work involved or its level of difficulty. Your homework score will be calculated as: score = ଵ௫௨ ௧௦ ൈ . HOMEWORK PROBLEMS 1. (20 Points) For the following instructions: PC1: load F3, 0 (R0) // Result: F3 = 100 PC2: load F1, 45 (R1) // Result: F1 = 200 PC3: subd F2, F1, F3 // F2 = F1 – F3 PC4: adddi F3, F4, #1 // F3 = F4 + 1 PC5: multd F3, F4, F5 // F3 = F4 x F5 Use the diagram for your final answer. Show the state of the pipeline (Register File, Reservation Stations, and Reorder Buffer including head and tail pointers) given that all of the following conditions hold: Instructions prior to the ones shown above have retired. all instructions have been dispatched, and the ADDD instruction has issued to a FU but has not completed (it is executing in a FU), and the first load instruction has completed and retired, and the second load instruction has completed but has not retired. Note that R* are integer registers while F* are floating‐point registers. Show the state of the pipeline by filling in the contents of the Register File, Reservation Stations, and Reorder Buffer as needed. Don’t forget to complete the arrows for the ROB head and tail pointers. You can ignore the value fields except when it contains tags. Use the following tables as a template for your answer. Initial Register Map/Alias Table and Register File: In RF? Tag Value R0 yes 0x00F0A000 R1 yes 0x00F0B000 F0 yes 10 F1 yes 20 F2 yes 30 F3 yes 40 F4 yes 50 F5 yes 60 Register Map/Alias Table and Register File: In RF? Tag Value R0 R1 F0 F1 F2 F3 F4 F5 Reservation Stations for floating‐point units: Tag src1 ready src1 tag/value src2 ready src2 tag/value Reorder Buffer: Entry Dest Result Exception Completed PC ROB1 head/tail ROB2 ROB3 ROB4 ROB5 ROB6 ROB7 ROB8 2. (20 Points) Branch Prediction in Superscalar Architecture. (a) With superscalar and dynamic scheduling, it is possible to execute and complete several instructions at once. Assume that we have a 4‐wide superscalar architecture with 25‐stage pipeline, with instruction fetch as the first stage, and commit as the last stage. Assume an architecture in which branch misprediction is handled like an exception (i.e., at the retire/commit stage). Also assume that the processor is able to fill the pipeline if there is no branch misprediction. Suppose that one out of every five instructions is a conditional branch. If the branch misprediction rate is 10%, is the pipeline full before it needs to be flushed (due to misprediction handling)? If not, what is the required misprediction rate needed so that the pipeline is full when a branch misprediction is handled? (b) For the pipeline in part (a), we need to generate one branch prediction approximately every cycle. Consider a Gshare branch predictor. When a branch prediction has to be made, the actual outcome of the previous branches is not known yet, and thus have not updated the history register. This makes the history register out‐of‐date and may not be used for current branch prediction accurately. A solution to this is to update the history register early, using branch prediction results instead of the actual outcome of the branch. Discuss the advantages and disadvantages of this scheme. 3. (30 Points) For the following instructions, give the timing diagram: (1) set F1, avalue (2) addi F2, F1, 1 (3) multdi F3, F2, x // F3 = F2 * x (4) load F2, &c (5) addi F1, F2, 2 (6) addi F2, F3, 1 (7) add F1, F1, F2 Suppose that the pipeline has one‐stage fetch (IF), one‐stage decode and dispatch (ID), some execute stages (explained later), one‐stage write back (WB), and there is no support for precise exception (i.e. no retirement/commit stage). Suppose that all instructions take 2 cycles in the execute stage, except the multd which take 4 cycles in the execute stage (EX). Suppose that memory instructions (load/store) take 2 cycles in an adder unit (EX) to calculate target addresses and 2 extra cycles in the memory unit (MEM). Suppose that we have a 2‐way superscalar processor as follows. In each cycle, the processor can fetch up to 2 instructions and decode up to 2 instructions. It can issue and execute more than 2 instructions if the functional units are available. Assume there are unlimited reservation station entries. There is a full forwarding/bypassing network. There is no register renaming, hence WAR and WAW dependences are not eliminated. To handle WAW hazard between two instructions, delay the write‐back of the latter instruction until the earlier one has written back its result. To handle WAR hazard between two instructions, delay the write‐back of the latter instruction until the earlier one has read all its operands (i.e. has passed the ID stage). (a) Add assumption of unlimited number of functional units. Instructions 1 2 3 4 5 6 7 8 9 10 11 12 13 14 set F1, avalue IF addi F2, F1, 1 IF multdi F3, F2, x load F2, &c addi F1, F2, 2 addi F2, F3, 1 add F1, F1, F2 15 16 17 18 19 20 21 22 23 24 25 26 27 28 (b) Suppose we only have 1 adder unit that performs set/addi/add, and 1 multiplier unit that performs multd. These units are not pipelined, i.e. there can be only one instruction in a functional unit at any given time. Also, there is no extra adder unit for memory instructions (only one adder). Instructions 1 2 3 4 5 6 7 8 9 10 11 12 13 14 set F1, avalue IF addi F2, F1, 1 IF multdi F3, F2, x load F2, &c addi F1, F2, 2 addi F2, F3, 1 add F1, F1, F2 15 16 17 18 19 20 21 22 23 24 25 26 27 28 (c) Suppose we only have 1 adder unit that performs set/addi/add, and 1 multiplier unit that performs multd. Keep the multiplier unit unpipelined, but now suppose that the adder unit is pipelined, i.e. there can be at most 2 instructions in the adder at any given time, 1 instruction in EX1 stage, and another in EX2 stage. An exception to this is when there is true dependence, where the first instruction produces result only after EX2 stage, while the second instruction needs the produced value before it can enter the functional unit. Instructions 1 2 3 4 5 6 7 8 9 10 11 12 13 14 set F1, avalue IF addi F2, F1, 1 IF multdi F3, F2, x load F2, &c addi F1, F2, 2 addi F2, F3, 1 add F1, F1, F2 15 16 17 18 19 20 21 22 23 24 25 26 27 28 (d) Suppose we have unlimited number of functional units. However, assume that WAW is solved at dispatch stage (ID). Hence, to handle WAW hazard between two instructions, delay the execution of the latter instruction until the earlier one has written back its result. Instructions 1 2 3 4 5 6 7 8 9 10 11 12 13 14 set F1, avalue IF addi F2, F1, 1 IF multdi F3, F2, x load F2, &c addi F1, F2, 2 addi F2, F3, 1 add F1, F1, F2 15 16 17 18 19 20 21 22 23 24 25 26 27 28 4. (30 Points) You are given the following instructions, the same as in the previous problem. (1) set F1, avalue (2) addi F2, F1, 1 (3) multdi F3, F2, x // F3 = F2 * x (4) load F2, &c (5) addi F1, F2, 2 (6) addi F2, F3, 1 (7) add F1, F1, F2 Suppose that the pipeline has one‐stage fetch (IF), one‐stage decode‐rename‐dispatch (ID), some execute stages (explained later), one‐stage write back (WB), and one‐stage retirement stage (RE). Suppose that all instructions take 2 cycles in the execute stage, except the multd which takes 4 cycles in the execute stage. Suppose that we have a 2‐way superscalar processor as follows. In each cycle, the processor can fetch up to 2 instructions and decode up to 2 instructions. It can issue and execute more than 2 instructions if the functional units are available, but can commit up to 2 instructions per cycle. Assume there are unlimited reservation station entries. There is a full forwarding/bypassing network. There is Tomasulo+ROB style register renaming, hence all WAR and WAW dependences are eliminated as long as there are free Reorder Buffer (ROB) entries available. If an instruction needs to be dispatched but there is no free ROB entry available, the instruction cannot be dispatched, and the dispatch is retried when an entry becomes available. Suppose that all instructions take 2 cycles to execute except the multd which takes 4 cycles to execute (i.e., requires 4 EX stages). Suppose that memory instructions (load/store) take 2 cycles in an adder unit (EX) to calculate target addresses and 2 extra cycles in the memory unit (MEM). Show the pipeline timing diagram for a 2‐way superscalar processor. Thus, each cycle, it can fetch 2 instructions and decode 2 instructions. It can issue and execute more than 2 instructions if the functional units are available, but can commit up to 2 instructions. Assume unlimited reservation station entries and unlimited functional units. (a) Suppose now we support register renaming using reorder buffer. Suppose we have 7 reorder buffer entries (ROB1, ROB2, ...). Assume that all instructions have been dispatched (thus, they all occupy the reorder buffers). Rename all the instruction destination registers to tags, and update all consumers appropriately. Use reorder buffer entries as the tags, e.g.: load R1, &a add R3, R1, R2 becomes load ROB1, &a add ROB2, ROB1, R2 Show the renamed code that completely eliminate WAW and WAR dependencies. (b) Assume that we have unlimited number of ROB entries and that no instruction causes an exception. Show the timing diagram. Instructions 1 2 3 4 5 6 7 8 9 10 11 12 13 14 set F1, avalue IF addi F2, F1, 1 IF multdi F3, F2, x load F2, &c addi F1, F2, 2 addi F2, F3, 1 add F1, F1, F2 15 16 17 18 19 20 21 22 23 24 25 26 27 28 (c) Now assume that we only have 3 reorder buffer entries. Show the timing diagram. Instructions 1 2 3 4 5 6 7 8 9 10 11 12 13 14 set F1, avalue IF addi F2, F1, 1 IF multdi F3, F2, x load F2, &c addi F1, F2, 2 addi F2, F3, 1 add F1, F1, F2 15 16 17 18 19 20 21 22 23 24 25 26 27 28 (d) Based on your answer in part (c), summarize the impact of long latency instructions and reorder buffer size to processor performance. Discuss also the case where an L2 cache miss may take hundreds of cycles to complete.