辅导案例-CMPEN 431 OOO

欢迎使用51辅导,51作业君孵化低价透明的学长辅导平台,服务保持优质,平均费用压低50%以上! 51fudao.top
CMPEN 431 OOO Superscalar.1 Sampson Fall 2019 PSU
CMPEN 431
Computer Architecture
Fall 2019
Jack Sampson( www.cse.psu.edu/~sampson )
[Slides adapted from work by Mary Jane Irwin, in turn adapted from Computer
Organization and Design, Revised 4th Edition,
Patterson & Hennessy, © 2011, Morgan Kaufmann & 5th Edition,
Patterson & Hennessy, © 2014, MK
With additional thanks/credits to Amir Roth, Milo Martin, CIS/UPenn]
Dynamically Scheduled SuperScalar (OOO) Processors, Ch.4F
CMPEN 431 OOO Superscalar.2 Sampson Fall 2019 PSU
Review: Multiple Instruction Issue Possibilities
❑ Fetch and issue more than one instruction in a cycle
1. Statically-scheduled (in-order)
Very Long Instruction Word (VLIW) e.g., TransMeta (4-wide)
- Compiler figures out what can be done in parallel, so the hardware
can be dumb and low power
- Compiler must group parallel instr’s, requires new binaries
SuperScalar e.g., Pentium (2-wide), ARM CortexA8 (2-wide)
- Hardware figures out what can be done in parallel
- Executes unmodified sequential programs
Explicitly Parallel Instruction Computing (EPIC) e.g., Intel
Itanium (6-wide)
- A compromise: compiler does some, hardware does the rest
2. Dynamically-scheduled (out-of-order) SuperScalar
Hardware dynamically determines what can be done in parallel
(can extract much more ILP with OOO processing)
E.g., Intel Pentium Pro/II/III (3-wide), Core i7 (4 cores, 4-wide,
SMT2), IBM Power5 (5-wide), Power8 (12 cores, 8-wide, SMT8)
CMPEN 431 OOO Superscalar.3 Sampson Fall 2019 PSU
The Impact of Data Dependence on Scheduling
❑ RAW
When more than one applies, RAW dominates:
add $t0,$t1,$t2
addi $t0,$t0,1
Must be respected: no way to avoid sequential execution
❑ WAR/WAW on registers
Two different things can happen when using the same name
depending on instruction ordering
Can be eliminated by register renaming (saw this with VLIW/SSA)
❑ WAR/WAW on memory
Can’t (practically) rename memory and don’t know if there is an
actual dependency until the effective address is known (in Exec)
Need to use something other than register renaming
CMPEN 431 OOO Superscalar.4 Sampson Fall 2019 PSU
Control Dependence and Instruction Scheduling
❑ Using branch prediction we may end up executing
instructions that should not have been executed (i.e., the
prediction is incorrect), thereby violating the control
dependencies
But, as long as we don’t change the visible machine state, it is still
okay (we just used some energy doing work that has to be thrown
away)
❑ The key is having a way to execute beyond (several)
predicted branches without changing the visible machine
state until you know for sure that the branch prediction
was correct
CMPEN 431 OOO Superscalar.5 Sampson Fall 2019 PSU
Exception Dependence
❑ We also have to provide for precise interrupts, i.e., those
synchronous to program (instruction) execution, to support
virtual memory (TLB and/or page faults) and deal with
undefined instructions, arithmetic overflow, etc.
❑ We also have to preserve exception (interrupt) behavior 
any changes in instruction execution order must not
change the order in which exceptions are raised, or cause
new exceptions to be raised
Example:
beq $t0,$t1,L1
lw $t2,0($s1)
L1:
Can there be a problem with executing lw before beq?
CMPEN 431 OOO Superscalar.6 Sampson Fall 2019 PSU
Dynamic Scheduling in HW
❑ Key question: Can implementable hardware detect and
orchestrate, at runtime, the necessary dependences in
order to produce an instruction execution schedule that
obeys all three of {data, control, exception} logical ordering
constraints expressed in the original program binary and
encoded in ISA semantics… but that better reflects the
naturally occurring instruction level parallelism of the
original program dependence graph?
❑ Short answer: YES! (But it requires some new HW
mechanisms)
❑ Long answer: The next several dozen slides ☺
CMPEN 431 OOO Superscalar.7 Sampson Fall 2019 PSU
Dynamic OOO Datapaths: Historical Precursors
❑ Scoreboarding – CDC 6600 (Thornton) first pub. in 1964
Used centralized hazard detection logic (scoreboard) to support
OOO execution. Instr’s were stalled when their FU was busy, for
RAW dependencies, and for WAW and WAR dependencies
❑ Tomasulo – IBM 360/91 (Tomasulo) first pub. in 1967
Used distributed hazard detection logic (reservation stations
feeding each FU) to support OOO execution with register
renaming that eliminated WAW and WAR dependencies;
distributed results from FUs to reservation stations on a Common
Data Bus (potential bottleneck)
Writes results to register file and memory when instr’s completes –
possibly out-of-order – so could not support precise interrupts or
speculative execution (e.g., branch speculation)
http://www.ecs.umass.edu/ece/koren/architecture/Tomasulo1/tomasulo.htm
CMPEN 431 OOO Superscalar.8 Sampson Fall 2019 PSU
Dynamic OOO Datapaths in Microprocessors
❑ HPS – (Hwu, Patt, Shebanow) first publication in 1985
Used a register alias table and distributed node alias tables that
fed each FUs (essentially reservation stations) to support OOO
execution with register renaming; distributed results from FUs to
reservation stations on multiple distribution buses (one per FU)
Supported precise interrupts and speculative execution with a
checkpoint repair mechanism
❑ RUU – (Sohi) first publication in 1987
Uses a centralized Register Update Unit (RUU) that 1) receives
new instr’s from decode, 2) renames registers, 3) monitors the
(single) result bus to resolve dependencies, 4) determines when
instr’s are ready to issue (send for execution), and 5) holds
completed instr’s until they can commit
Supports precise interrupts and speculative execution with in-
order commit out of the RUU
Basis of SimpleScalar’s datapath architecture
CMPEN 431 OOO Superscalar.9 Sampson Fall 2019 PSU
Basic OOO Instruction Flow Overview
1. Fetch (in program order): Fetch multiple sequential
instructions in parallel from the IM (I$)
2. Decode, Rename, & Dispatch (in program order):
In parallel, decode all of the instr’s just fetched, rename the
architected registers (ArchitectedRegFile (ARF)) with rename
registers (PhysicalRegFile (PRF)), and schedule renamed instr’s
for execution by dispatching them to the IQ (Instruction Queue)
and the ROB (ReOrder Buffer) (combined in the RUU in
SimpleScalar)
Loads and stores are dispatched as two (micro)instr’s – one to the
IQ to compute the addr and one to LSQ (LoadStoreQueue) for the
memory operation
CMPEN 431 OOO Superscalar.10 Sampson Fall 2019 PSU
Basic OOO Instruction Flow Overview, Con’t
3. Issue (Out Of Order - OOO): When an instr in the IQ has
all of its source data and the FU (Functional Unit) it needs
is free, it is issued for execution
4. Writeback (Out Of Order - OOO): When the dst value has
been computed it is written back to the PRF, the IQ, ROB
and LSQ are updated – the instr completes execution
Stores DO NOT write to cache at this stage, and the ARF is NOT
updated
CMPEN 431 OOO Superscalar.11 Sampson Fall 2019 PSU
Basic OOO Instruction Flow Overview, Con’t
5. Commit (in program order): Only commit the instr’s
result data to the state locations (i.e., update DM (D$),
ARF) when it is the oldest completed instr in the ROB
Exceptions delayed until this point
Branch misprediction cleanup potentially delayed to this point –
can be performed more aggressively at WB
No matter how wide commit is, cannot skip over uncompleted
instructions
CMPEN 431 OOO Superscalar.12 Sampson Fall 2019 PSU
Out-of-Order Pipeline
F
e
tc
h
D
e
c
o
d
e
R
e
n
a
m
e
D
is
p
a
tc
h
C
o
m
m
it
Buffers of renamed instructions (IQ,ROB,LSQ)
Is
s
u
e
R
e
n
a
m
e
-R
e
g
-R
e
a
d
E
x
e
c
u
te
W
ri
te
b
a
c
k
In-order front end
Out-of-order execution
In-order
commit
Entirely new
In-order pipeline
stages
Similar to, but not the
same as “Writeback”
Dataflow based execution engine
operating on PHYSICAL register
state, mapping to/from
ARCHITECTURAL state
CMPEN 431 OOO Superscalar.13 Sampson Fall 2019 PSU
Basic OoO Instruction Flow Overview
1. Fetch
2. Decode
3. Rename
4. Dispatch
5. Issue
6. Writeback
7. Commit
In Original Program Order
In Original Program Order
In Dataflow Order
Isolation via buffering
Isolation via buffering
N
o
n
-s
p
e
c
u
la
ti
v
e
C
o
n
tr
o
l
Speculative
Control
Semi-Speculative
Control
CMPEN 431 OOO Superscalar.15 Sampson Fall 2019 PSU
Our Code Example
lp(0): lw $t0,0($s1) #cache miss, assume 3 cycle stall
addu $t0,$t0,$s2
sw $t0,0($s1)
sub $t0,$s1,$s2 #provides WAW hazard
addi $s1,$s1,-4
bne $s1,$0,lp #predict taken (and is)
lp(1): lw $t0,0($s1) #cache hit (from here on)
addu $t0,$t0,$s2
sw $t0,0($s1)
sub $t0,$s1,$s2
addi $s1,$s1,-4
bne $s1,$0,lp
lp(3): ...
RAW WAR WAW
CMPEN 431 OOO Superscalar.16 Sampson Fall 2019 PSU
Code Dependency Observations
❑ Lots of both true and false
dependencies
❑ sub instr independent of
other instr’s (has no true
dependencies)
So can execute in parallel
with another instr
Are there others?
❑ Registers re-used
Just as in static SS, the
register names get in the
way
How can the hardware get
around this?
lp(0): lw $t0,0($s1)
addu $t0,$t0,$s2
sw $t0,0($s1)
sub $t0,$s1,$s2
addi $s1,$s1,-4
bne $s1,$0,lp
lp(1): lw $t0,0($s1)
addu $t0,$t0,$s2
sw $t0,0($s1)
sub $t0,$s1,$s2
addi $s1,$s1,-4
bne $s1,$0,lp
lp(3): ...
CMPEN 431 OOO Superscalar.17 Sampson Fall 2019 PSU
Register Renaming
❑ Can use register renaming to eliminate (WAW, WAR)
(register) data dependencies – conceptually write each
register once
+ Removes false dependences (WAW and WAR)
+ Leaves true dependences (RAW) intact
❑ “Architected” vs “Physical” registers
Architected (ISA) register names: $t0,$s1,$s1,$s2, etc
Physical register names: p1,p2,p3,p4,p5,p6,p7
❑ Need two hardware structures to enable renaming
A Map Table showing the architected register that the physical
register is currently “impersonating”
A Free List of physical registers not currently in use
CMPEN 431 OOO Superscalar.18 Sampson Fall 2019 PSU
Renaming Example: Initial State
$s1 p1
$s2 p2
$t0 p3
Map Table Free List
p4
p5
p6
p7
p8
lw $t0,0($s1)
addu $t0,$t0,$s2
sw $t0,0($s1)
sub $t0,$s1,$s2
addi $s1,$s1,-4
bne $s1,$0,lp
CMPEN 431 OOO Superscalar.19 Sampson Fall 2019 PSU
Renaming Example: lw Renaming
$s1 p1
$s2 p2
$t0 p3
Map Table Free List
p4
p5
p6
p7
p8
lw $t0,0($s1)
addu $t0,$t0,$s2
sw $t0,0($s1)
sub $t0,$s1,$s2
addi $s1,$s1,-4
bne $s1,$0,lp
Over-written Reg
lw p4,0(p1) [p3]
p4
CMPEN 431 OOO Superscalar.21 Sampson Fall 2019 PSU
Which Register to Free at Commit ?
❑ The over-written (physical) register can be freed at
Commit (i.e., added back to the Free List)
It can be safely freed at commit because any instruction using that
register must have already committed because both rename and
commit are in-order and any NEWER instructions would use this
instruction’s destination register (or an even newer name) instead
Destination register CANNOT be freed at commit because
subsequent dependent instructions may still exist in the pipeline
❑ We also need to keep track of the over-written (physical)
register so that it can be restored in the Map Table on a
recovery from mis-predicted branches and recovery from
exceptions
CMPEN 431 OOO Superscalar.22 Sampson Fall 2019 PSU
Renaming Example: addu Renaming
$s1 p1
$s2 p2
$t0
Map Table Free List
p4
p5
p6
p7
p8
lw $t0,0($s1)
addu $t0,$t0,$s2
sw $t0,0($s1)
sub $t0,$s1,$s2
addi $s1,$s1,-4
bne $s1,$0,lp
Over-written Reg
lw p4,0(p1) [p3]
addu p5,p4,p2 [p4]
5
CMPEN 431 OOO Superscalar.24 Sampson Fall 2019 PSU
Renaming Example: sw Renaming
$s1 p1
$s2 p2
$t0
Map Table Free List
p5
p6
p7
p8
lw $t0,0($s1)
addu $t0,$t0,$s2
sw $t0,0($s1)
sub $t0,$s1,$s2
addi $s1,$s1,-4
bne $s1,$0,lp
Over-written Reg
lw p4,0(p1) [p3]
addu p5,p4,p2 [p4]
sw p5,0(p1)
CMPEN 431 OOO Superscalar.26 Sampson Fall 2019 PSU
Renaming Example: sub Renaming
$s1 p1
$s2 p2
$t0
Map Table Free List
p5
p6
p7
p8
lw $t0,0($s1)
addu $t0,$t0,$s2
sw $t0,0($s1)
sub $t0,$s1,$s2
addi $s1,$s1,-4
bne $s1,$0,lp
Over-written Reg
lw p4,0(p1) [p3]
addu p5,p4,p2 [p4]
sw p5,0(p1)
sub p6,p1,p2 [p5]
6
CMPEN 431 OOO Superscalar.28 Sampson Fall 2019 PSU
Renaming Example: addi Renaming
$s1 p1
$s2 p2
$t0
Map Table Free List
p6
p7
p8
lw $t0,0($s1)
addu $t0,$t0,$s2
sw $t0,0($s1)
sub $t0,$s1,$s2
addi $s1,$s1,-4
bne $s1,$0,lp
Over-written Reg
lw p4,0(p1) [p3]
addu p5,p4,p2 [p4]
sw p5,0(p1)
sub p6,p1,p2 [p5]
addi p7,p1,-4 [p1]
7
CMPEN 431 OOO Superscalar.29 Sampson Fall 2019 PSU
Renaming Example: bne Renaming
$s1 p1
$s2 p2
$t0
Map Table Free List
p6
p7
p8
lw $t0,0($s1)
addu $t0,$t0,$s2
sw $t0,0($s1)
sub $t0,$s1,$s2
addi $s1,$s1,-4
bne $s1,$0,lp
Over-written Reg
lw p4,0(p1) [p3]
addu p5,p4,p2 [p4]
sw p5,0(p1)
sub p6,p1,p2 [p5]
addi p7,p1,-4 [p1]
p7
bne p7,p0,lp
CMPEN 431 OOO Superscalar.31 Sampson Fall 2019 PSU
Our Code Example After Renaming
lp(0): lw p4,0(p1) #[p3];cache miss, 3 cycle stall
addu p5,p4,p2 #[p4]
sw p5,0(p1)
sub p6,p1,p2 #[p5]
addi p7,p1,-4 #[p1]
bne p7,p0,lp #predict taken (and is)
lp(1): lw p8,0(p7) #[p6];cache hit
addu p9,p8,p2 #[p8]
sw p9,0(p7)
sub p10,p7,p2 #[p9]
addi p11,p7,-4 #[p7]
bne p11,p0,lp
lp(3): ...
RAW WAR - none WAW - none
❑ As promised, renaming eliminated false data dependencies
(WAW, WAR) and left true data dependencies (RAW) intact
CMPEN 431 OOO Superscalar.32 Sampson Fall 2019 PSU
Out-of-Order Pipeline Progress
F
e
tc
h
D
e
c
o
d
e
R
e
n
a
m
e
D
is
p
a
tc
h
C
o
m
m
it
Buffers of renamed instructions (IQ,ROB,LSQ)
Is
s
u
e
R
e
n
a
m
e
-R
e
g
-R
e
a
d
E
x
e
c
u
te
W
ri
te
b
a
c
k
In-order front end
Out-of-order execution
In-order
commit
❑ Have completed Fetch, Decode, Rename (in program order)
and are ready to Dispatch
Instr’s now have unique register names, so
can now put into OOO execution structures
CMPEN 431 OOO Superscalar.33 Sampson Fall 2019 PSU
Dispatch
❑ Renamed instructions are placed into OoO hardware data
structures
1. Issue Queue (IQ) (SimpleScalar’s RUU)
Central piece of scheduling logic holding un-executed instr’s
Accessible as both a RAM and a CAM (Content Addr Memory)
2. Re-order buffer (ROB) (SimpleScalar’s RUU)
Holds all instructions (in order) until Commit time
Keeps track of the over-written register so they can be returned to
the free list and to support recovery from mispredicted branches
and exceptions
3. Load-Store Queue (LSQ)
Loads and stores dispatched in two parts - one going to the IQ for
effective address calculation and the other to the LSQ for loads
and stores going to the DM
Stores not sent to DM until Commit time, what about loads ?
CMPEN 431 OOO Superscalar.35 Sampson Fall 2019 PSU
Issue Queue (IQ)
❑ Holds un-executed instructions
Instruction op and instruction “age”
❑ Tracks status of source inputs (ready, not ready)
Physical (renamed) source register names + a ready bit for
each source operand
- AND the ready bits to tell if the instruction is ready to issue (send
for execution)
❑ Physical (renamed) destination register
Instr Src1 R Src2 R Dst
Ready?
Age
CMPEN 431 OOO Superscalar.36 Sampson Fall 2019 PSU
Aside: Content Addressable Memories (CAMs)
❑ Storage hardware that is addressed by its content. Typical
applications include ROB source tag field comparison
logic, cache tags, and (highly-associative) TLBs
Match
Field
Data
Field
Hit Match Data
Search Data
Hardware that compares the Search
Data to the Match Field entries for each
word in the CAM in parallel !
On a match the Hit bit is set and the
Data Field for that entry is output to
Match Data on read or the Match Data
is written into the Data Field on write
If no match occurs, the Hit bit is reset
CAMs can be designed to
accommodate multiple hits
❑ A storage structure can have ports of both types (RAM & CAM)
CMPEN 431 OOO Superscalar.37 Sampson Fall 2019 PSU
Dispatch Steps
❑ Allocate IQ (and ROB) slot
Full? Stall
Not full? Find an empty slot in the IQ
❑ Read ready bits of inputs (source registers) from a Ready
Table
Ready Table: 1-bit per physical register indicating whether or not
that physical register value has been produced
❑ Clear ready bit of output (destination register) in Ready
Table
Instruction has not produced value yet
❑ Write instruction data in the allocated IQ slot
❑ Recall that lw and sw go into both the IQ (for computing
the effective address) and the LSQ (which interfaces with
the DM)
CMPEN 431 OOO Superscalar.38 Sampson Fall 2019 PSU
Dispatch Example, lw Dispatch
Instr Src1 R Src2 R Dst Age
Issue Queue
p0 y
p1 y
p2 y
p3 y
p4 y
p5 y
p6 y
p7 y
p8 y
Ready Table
lp(0): lw p4,0(p1) #[p3]
addu p5,p4,p2 #[p4]
sw p5,0(p1)
sub p6,p1,p2 #[p5]
addi p7,p1,-4 #[p1]
bne p7,p0,lp #
p9 y
lw 0+p1 p4 0y y n
CMPEN 431 OOO Superscalar.40 Sampson Fall 2019 PSU
Dispatch Example, addu Dispatch
Instr Src1 R Src2 R Dst Age
Issue Queue
p0 y
p1 y
p2 y
p3 y
p4 n
p5 y
p6 y
p7 y
p8 y
Ready Table
lp(0): lw p4,0(p1) #[p3]
addu p5,p4,p2 #[p4]
sw p5,0(p1)
sub p6,p1,p2 #[p5]
addi p7,p1,-4 #[p1]
bne p7,p0,lp #
p9 y
lw 0+p1 p4 0y y
naddu p4 n p2 y p5 1
CMPEN 431 OOO Superscalar.42 Sampson Fall 2019 PSU
Dispatch Example, sw Dispatch
Instr Src1 R Src2 R Dst Age
Issue Queue
p0 y
p1 y
p2 y
p3 y
p4 n
p5 n
p6 y
p7 y
p8 y
Ready Table
lp(0): lw p4,0(p1) #[p3]
addu p5,p4,p2 #[p4]
sw p5,0(p1)
sub p6,p1,p2 #[p5]
addi p7,p1,-4 #[p1]
bne p7,p0,lp #
p9 y
lw 0+p1 p4 0y y
addu p4 n p2 y p5 1
sw n 0+p1 y 2p5
CMPEN 431 OOO Superscalar.44 Sampson Fall 2019 PSU
Dispatch Example, sub Dispatch
Instr Src1 R Src2 R Dst Age
Issue Queue
p0 y
p1 y
p2 y
p3 y
p4 n
p5 n
p6 y
p7 y
p8 y
Ready Table
lp(0): lw p4,0(p1) #[p3]
addu p5,p4,p2 #[p4]
sw p5,0(p1)
sub p6,p1,p2 #[p5]
addi p7,p1,-4 #[p1]
bne p7,p0,lp #
p9 y
lw 0+p1 p4 0y y
n
addu p4 n p2 y p5 1
sw n 0+p1 y 2p5
sub y p2 y p6 3p1
CMPEN 431 OOO Superscalar.46 Sampson Fall 2019 PSU
Dispatch Example, addi Dispatch
Instr Src1 R Src2 R Dst Age
Issue Queue
p0 y
p1 y
p2 y
p3 y
p4 n
p5 n
p6 n
p7 y
p8 y
Ready Table
lp(0): lw p4,0(p1) #[p3]
addu p5,p4,p2 #[p4]
sw p5,0(p1)
sub p6,p1,p2 #[p5]
addi p7,p1,-4 #[p1]
bne p7,p0,lp #
p9 y
lw 0+p1 p4 0y y
n
addu p4 n p2 y p5 1
sw n 0+p1 y 2p5
sub y p2 y p6 3p1
y -4 y p7 4p1addi
CMPEN 431 OOO Superscalar.47 Sampson Fall 2019 PSU
Dispatch Example, bne Dispatch
Instr Src1 R Src2 R Dst Age
Issue Queue
p0 y
p1 y
p2 y
p3 y
p4 n
p5 n
p6 n
p7 n
p8 y
Ready Table
lp(0): lw p4,0(p1) #[p3]
addu p5,p4,p2 #[p4]
sw p5,0(p1)
sub p6,p1,p2 #[p5]
addi p7,p1,-4 #[p1]
bne p7,p0,lp #
p9 y
lw 0+p1 p4 0y y
addu p4 n p2 y p5 1
sw n 0+p1 y 2p5
sub y p2 y p6 3p1
y -4 y p7 4p1addi
n p0 y 5p7bne
CMPEN 431 OOO Superscalar.48 Sampson Fall 2019 PSU
Out-of-Order Pipeline Progress
F
e
tc
h
D
e
c
o
d
e
R
e
n
a
m
e
D
is
p
a
tc
h
C
o
m
m
it
Buffers of renamed instructions (IQ,ROB,LSQ)
Is
s
u
e
R
e
n
a
m
e
-R
e
g
-R
e
a
d
E
x
e
c
u
te
W
ri
te
b
a
c
k
Instr’s now have unique (physical)
register names
In-order front end
Out-of-order execution
In-order
commit
Dispatched instr’s now in OoO IQ
CMPEN 431 OOO Superscalar.49 Sampson Fall 2019 PSU
Out-of-Order Execution Pipeline Stages
❑ Execution (out-of-order) stages
Issue
Rename-Reg-Read
Execute
Rename-Writeback
❑ Issue
1. Select ready instructions
Send (Issue) them for execution
2. Wakeup dependent instructions
in the IQ
❑ OoO execution pipeline has necessary forwarding
hardware and multiple FU’s of different types (some of
them with multiple pipeline stages)
❑ Remember, read and writeback are from/to the physical
(rename) RF
CMPEN 431 OOO Superscalar.51 Sampson Fall 2019 PSU
Issue = Select + Wakeup
❑ Select N oldest, ready instr’s to send for execution
(checking for structural hazards (e.g., FUs))
Assume lw has already been issued to memory
and it’s 3 cycle cache miss is still pending
sub and addi are the two oldest ready instr’s
Ready!
Ready!
Instr Src1 R Src2 R Dst Age
Issue Queue (IQ)
lw 0+p1 p4 0y y
addu p4 n p2 y p5 1
sw n 0+p1 y 2p5
sub y p2 y p6 3p1
y -4 y p7 4p1addi
n p0 y 5p7bne
p0 y
p1 y
p2 y
p3 y
p4 n
p5 n
p6 n
p7 n
p8 y
Ready Table
p9 y
Issued
CMPEN 431 OOO Superscalar.52 Sampson Fall 2019 PSU
Issue = Select + Wakeup
❑ Wakeup dependent instr’s
CAM search for dst addr in Src1 and Src2 and
set ready bit (R) on match
Update Ready Table for Dispatch of future instr’s
Instr Src1 R Src2 R Dst Age
IQ
lw 0+p1 p4 0y y
addu p4 n p2 y p5 1
sw n 0+p1 y 2p5
sub y p2 y p6 3p1
y -4 y p7 4p1addi
n p0 y 5p7bne
p0 y
p1 y
p2 y
p3 y
p4 n
p5 n
p6 n
p7 n
p8 y
Ready Table
p9 y
Assoc Search
for p6 and p7
Assoc Search
for p6 and p7
y
y
y
Issued
Ready!
Ready!
CMPEN 431 OOO Superscalar.54 Sampson Fall 2019 PSU
Next Issue = Select + Wakeup
❑ Select and Wakeup done in one cycle, sub and addi
have been issued for execution (and removed from IQ)
❑ lw has just completed and p4 is now ready
❑ So, which instr’s will be issued next ?
Ready!
Instr Src1 R Src2 R Dst Age
addu p4 n p2 y p5 1
sw n 0+p1 y 2p5
y p0 y 5p7bne
p0 y
p1 y
p2 y
p3 y
p4 n
p5 n
p6 y
p7 y
p8 y
Ready Table
p9 y
y
yReady!
Assoc Search
for p4
Assoc Search
for p45 5
IQ
y
y
CMPEN 431 OOO Superscalar.55 Sampson Fall 2019 PSU
Aside: (Rename) Register Read
❑ When do instructions read the physical register file?
Obviously cannot be done at Decode (not renamed yet)
1. Option #1: after Issue (Select), right before Execute
Read physical (renamed) register
Or get value via forwarding (based on physical register name)
Pentium 4, MIPS R10k
❑ Physical register file may be large
Could be a multi-cycle read
2. Option #2: as part of Dispatch, keep the data values (if
known) in the IQ (along with the Pregaddr for the Issue
(Wakeup) associative search)
Means bigger IQ entries (+32b or 64b per source value)
Pentium Pro, Core 2, Core i7i7 (implemented as FU Reservation
Stations (rather than as a centralized IQ))
CMPEN 431 OOO Superscalar.56 Sampson Fall 2019 PSU
Out-of-order Pipeline – The Detailed View
F
e
tc
h
D
e
c
o
d
e
R
e
n
a
m
e
/D
is
p
a
tc
h


C
o
m
m
it
Issue Queue (IQ)
Is
s
u
e
/P
R
e
g
R
e
a
d
W
ri
te
B
a
c
k
In-order
front end
Out-of-order execution
In-order
commit
BTB Ready Table
ReOrder Buffer (ROB)
E
x
e
c
u
te
E
x
e
c
u
te
Head
Ready
Instr’s
PRegFile
Tail
ARegFile
ITLB
I$
BHT
Map Table
Free List
D$
DTLB
Load Store Queue (LSQ)
Empty
Slot
CMPEN 431 OOO Superscalar.57 Sampson Fall 2019 PSU
Re-Order Buffer (ROB)
❑ All instructions Commit in order
At commit write the physical register value to the ISA register
and free the overwritten physical register (add it back to the Free
List), for store instr’s write the data in the LSQ to the D$ (more
on this soon), and free the LSQ and ROB entries for reuse
❑ Two other purposes
To support recovery from branch misprediction and to support
precise (synchronous) interrupts
- Flush the ROB, IQ, and LSQ, restore Map Table and Ready Table
to before misprediction/interrupt, and free the physical registers
(update Free List) (wasted time, wasted power – why accurate
branch prediction is sooo important for OOO datapaths (not as bad
for interrupts since they are relatively infrequent)) and …
- On mispredicted branch at ROB head, update BHT, BTB, restart
the pipeline at the branch (with the correct prediction this time)
- On interrupt of instruction at ROB head, service the interrupt,
restart the pipeline at the interrupting instr
CMPEN 431 OOO Superscalar.58 Sampson Fall 2019 PSU
Re-Order Buffer (ROB) Data
❑ ROB entry has to keep track of all of the info needed for
Commit & Recover
Physical (renamed) dst register addr and its architectural (ISA)
equivalent (so can update ARegFile on completion)
Overwritten physical register name (for release and recovery)
Instruction address (PC) and type (in particular store, branch)
A way of determining when the instr completes execution
Exception (interrupt) and branch outcome information
❑ On Dispatch: insert at tail
Full? Stall
❑ Commit: remove from head
Instr at head not completed? No instr to commit this cycle
Multiple instr’s at head completed … commit multiple instr’s this
cycle (if have hardware to support it)
http://www.ecs.umass.edu/ece/koren/architecture/ROB/rob_simulator.htm
CMPEN 431 OOO Superscalar.59 Sampson Fall 2019 PSU
Speculation in OoO Machines
❑ Speculation allows execution of future instr’s that (may) depend
on the speculated instruction
Speculate on the outcome of a conditional branch (branch prediction)
just don’t commit until the branch outcome is known
Speculate that a store (for which address is unknown) that precedes
a load does not refer to the same address, allowing the younger load
to be executed before the older store (load speculation) not
committing the load until the speculation has cleared
❑ Must have hardware mechanisms for
Checking to see if the guess was correct
Recovering from incorrect speculation – only commit out of the ROB
when are sure speculation is correct
❑ Ignore and/or buffer exceptions created by speculatively
executed instructions until it is clear that they should really
occur (i.e., not allowed to change the machine state until
commit time)
CMPEN 431 OOO Superscalar.60 Sampson Fall 2019 PSU
Commit
Instr PReg AReg C?PC
ROB
lwXXX0
addu p5
sw
sub p6
p7addi
bne
ReReg
p3
p4
p5
p1
XXX1
XXX3
XXX2
XXX4
XXX5
p4 $t0
$t0
$t0
$s1
S/B
B
y
y
y
Head
Tail
❑ Commit: instr takes on its architected state
In-order, so only when the instr is finished (C?) and at the
Head of the ROB
Copy the data from the physical dst register (PReg) to the
ISA (architected) dst register (AReg)
Free the overwritten physical register (ReReg)
S
CMPEN 431 OOO Superscalar.61 Sampson Fall 2019 PSU
Freeing the Overwritten ReReg
lp(0): lw p4,0(p1) #[p3]
addu p5,p4,p2 #[p4]
sw p5,0(p1)
sub p6,p1,p2 #[p5]
addi p7,p1,-4 #[p1]
bne p7,p0,lp #
lp(1): lw p8,0(p7) #[p6]
addu p9,p8,p2 #[p8]
sw p9,0(p7)
sub p10,p7,p2 #[p9]
addi p11,p7,-4 #[p7]
bne p11,p0,lp
❑ When lw commits put
p3 back on the free list
When does p4 go back
on the free list?
❑ What if first bne is found to be mis-predicted when it
gets to the head of the ROB? Need to restore the map
table to the after the first addi state and restart at bne
$s1 p7 → p11
$s2 p2
$t0 p6 → p8 → p9 → p10
Map Table (from 1st to 2nd bne)
CMPEN 431 OOO Superscalar.62 Sampson Fall 2019 PSU
What if first bne is mispredicted ?
❑State at misprediction
ROB contains 1st bne
and all of the instr’s
after it that have been
dispatched
Map Table
Free List
❑Cleaned up state
Flush ROB, IQ, LSQ
Restore
Map Table
Update Free List
Restore Ready Table
Fix BHT, BTB
Restart dispatch at
bne(0)
$s1 p1
$s2 p2
$t0 p10
p11
p12, p13, p14, p15 …
$s1 p1
$s2 p2
$t0 p6
p7
bne(1) addi(1) sub(1) sw(1) addu(1) lw(1) bne(0)
p8, p9, p10, p11 …
HeadTail
CMPEN 431 OOO Superscalar.63 Sampson Fall 2019 PSU
Load Store Queue (LSQ)
❑ Loads and stores are dispatched to the IQ, to the LSQ
(the interface to the DM), and to the ROB
When ready, loads and stores are issued (for effective address
calculation) and their IQ entries are released
When the effective address or store source has been calculated,
it is compared to find the matching EAddr / Src entries in the LSQ
Instr Src R EAddr R Dst Age
LSQ
lw (1) 0+p1 p4 0y y
sw (1) 0+p1p5 2n y
lw (2) 0+p7 p8 6y n
sw (2) 0+p7p9 8n n
Assoc Search
for 0+p7
Ready!
Issued
y
y
Assoc Search for p5
yReady!
CMPEN 431 OOO Superscalar.64 Sampson Fall 2019 PSU
Memory Location Data Dependencies
❑ RAW, WAR and WAW memory data dependencies
Memory storage conflicts are less frequent since memory locations
are not used (and reused) in the same way that registers are
❑ Stores are committed to the DM from the LSQ in program
order at commit time (when they are at the head of the
ROB); since stores commit in order there are no WAW
hazards. There are also no WAR hazards since there are
also no older loads (they have already been committed).
sw $t0,0($s1)
lw $t1,0($s1)
sw $t0,0($s1)
sw $t1,0($s1)
lw $t0,0($s1)
sw $t1,0($s1)
RAW, true dependence (cannot
reorder, what to do?)
WAR, anti-dependence (write commit
in order fixes, lw will have already
been committed)
WAW, memory output dependence
(write commit in order fixes)
CMPEN 431 OOO Superscalar.65 Sampson Fall 2019 PSU
Loads from Memory
❑ When an issued load instr completes execution, the load
data is written to the PRegFile, the Ready Table is updated,
and the load source register addr is compared
(associatively) to see if it matches the Src addr’s of instr’s in
the IQ and the Src addr’s in the LSQ; the load’s LSQ entry
is released
❑ Note that the oldest load is “issued” for execution out of the
LSQ to the DM
If there is a EAddr match with another (younger) load, that younger
load may not need to be executed since the current load may load
in the data the younger load needs
However, it there is an intervening store between the issuing load
and the younger load all with the same effective address then the
store has the data the younger load needs (store->load forwarding)
CMPEN 431 OOO Superscalar.66 Sampson Fall 2019 PSU
Load Bypassing
❑ For better performance younger loads can bypass (be
issued before) older loads and stores in the LSQ under
certain conditions
Loads bypassing stores - Ready loads can bypass previous (older)
stores as long as their effective addresses are known and different
(so there is no RAW hazard)
sw $t0,0($s1)
lw $t1,0(???)
sw $t0,0($s1)
lw $t1,0($s2)
lw $t0,0(???)
lw $t1,0($s2)
lw $t1,0($s2)
lw $t0,0(???)
lw $t1,0($s2)
sw $t0,0($s1)
lw $t1,0(???)
sw $t0,0($s1)
Loads bypassing loads - Ready (EAddr has been calculated) loads
in the LSQ can bypass previous (older) unready loads
- What if they are to the same EAddr? Who cares, no harm done.
CMPEN 431 OOO Superscalar.67 Sampson Fall 2019 PSU
Load Bypassing with Load Forwarding
❑ Load forwarding – when a load’s data is supplied directly
from an older store in the LSQ
The most recent older matching LSQ store data value is supplied
to the load (beware! there could be more than one matching
store)
0
1
2
3
total order load byp load fwd
S
p
e
e
d
u
p
Low
HM
High
❑ Load bypassing gives 19%
speedup improvement (for
a 4-way OOO datapath)
❑ Load forwarding gives an
additional 4% speedup
improvement
From Johnson, 1992
CMPEN 431 OOO Superscalar.68 Sampson Fall 2019 PSU
Stores to Memory
❑ Stores are held in the LSQ until the store is ready to
commit (in program order - when the store is at the head
of the ROB); on Commit the LSQ and ROB entries are
released
❑ In addition to the associative search for matching EAddr’s,
when a PReg becomes ready that address is compared
(associatively) with the LSQ’s Src field (stores’ data value
PReg addresses)
If there is also an effective addr match (EAddr) in the LSQ with a
load and the stores data is ready, then the store can provide the
load’s dst data if the store is the most recent store older than the
load (again, store->load forwarding)
CMPEN 431 OOO Superscalar.69 Sampson Fall 2019 PSU
OoO Scheduling Scope (Exposing More ILP)
❑ Scheduling scope = OOO window size
Larger = better
1. Constrained by the number of physical registers (PRegFile)
- ROB roughly limited by the number of physical registers
- Big register file = expensive (area) and slow
2. Constrained by size of Issue Queue
- Limits number of un-executed instructions
- CAMs = can’t make too big (power + area)
3. Constrained by size of Load+Store Queue
- Limits number of loads/stores
- CAMs = can’t make too big (power + area)
❑ Usefulness of large window: limited by branch prediction
95% branch mis-prediction rate: 1 in 20 branches, or 1 in 100
instr’s
CMPEN 431 OOO Superscalar.70 Sampson Fall 2019 PSU
ILP in a “Perfect” Dynamic SS Datapath
❑ The perfect dynamic SS datapath has
An infinite number of rename registers that eliminates all WAR,
WAW data hazards
Infinite IQ, LSQ, and ROB (so never full)
No (fetch, decode, dispatch, issue, FU, buses, ports) limit on the
number of instr’s that can begin execution simultaneously (as long
as RAW (true) data hazards are not present)
Perfect branch prediction
Perfect caches
Loads can be moved before
stores as long as there are
no RAW data hazards
All FU’s have a 1 cycle
latency
55
63
18
75
119
150
0
40
80
120
160
gc
c
es
pr
es
so l
i
fp
pp
p
do
du
c
to
m
ca
tv
IP
C
From H&P, 2003
CMPEN 431 OOO Superscalar.71 Sampson Fall 2019 PSU
Effect of IQ size on ILP
0
40
80
120
160
In
fin
ite 2K 51
2
12
8 32 8 4
IP
C
gcc
espresso
li
fpppp
doduc
tomcatv
❑ Instruction window (IQ) – the set of instructions that are
examined simultaneously for execution
From H&P, 2003
CMPEN 431 OOO Superscalar.72 Sampson Fall 2019 PSU
Effect of Finite Rename Registers
0
20
40
60
In
fin
ite 25
6
12
8 64 32
N
on
e
IP
C
gcc
espresso
li
fpppp
doduc
tomcatv
❑ On a processor with an IQ of 2K, a maximum 64-way
issue capability, and a tournament branch predictor with
8K entries
From H&P, 2003
CMPEN 431 OOO Superscalar.73 Sampson Fall 2019 PSU
Effect of Realistic Branch Prediction on ILP
❑ On a processor with an IQ of 2K and maximum 64-way
issue capability
0
20
40
60
Pe
rf
ec
t
To
ur
na
m
en
t
St
an
da
rd
2
-b
it
St
at
ic
N
on
e
IP
C
gcc
espresso
li
fpppp
doduc
tomcatv
From H&P, 2003
CMPEN 431 OOO Superscalar.74 Sampson Fall 2019 PSU
Summary: Dynamic (OoO) Scheduling
❑ Dynamic scheduling
Totally in the hardware; compiler can help (e.g., loop unrolling)
❑ Fetch many instr’s into instruction window
Use branch prediction to speculate past (multiple) branches
Flush pipeline queues on branch misprediction
❑ Rename to avoid false dependencies
❑ Execute instructions as soon as possible
Register dependencies are known
Handling memory dependencies more tricky
❑ Commit instr’s in order
Anything strange happens before commit, just flush pipeline
queues
❑ Current machines: 100+ instruction scheduling window
CMPEN 431 OOO Superscalar.75 Sampson Fall 2019 PSU
Out Of Order: Top 5 Things to Know
1. Register renaming
How to perform it and how to recover it
2. Issue/Select
Wakeup: CAM
Choose N oldest ready instructions
3. Stores
Write at commit
Forward to loads via LSQ
4. Loads
Possibility for load bypassing and load forwarding
5. Commit
Precise state maintained in the ROB
How/when physical registers are freed
CMPEN 431 OOO Superscalar.76 Sampson Fall 2019 PSU
Power Costs of OoO Execution
❑ Complexity of dynamic scheduling and recovering from
mis-speculation requires more power
❑ Multiple simpler cores may be better (power-wise)
Power*Delay product may be a better measure
Microprocessor Year Clock Rate Pipeline
Stages
Issue
width
Out-of-order/
Speculation
Cores Power
i486 1989 25MHz 5 1 No 1 5W
Pentium 1993 66MHz 5 2 No 1 10W
Pentium Pro 1997 200MHz 10 3 Yes 1 29W
P4 Willamette 2001 2000MHz 22 3 Yes 1 75W
P4 Prescott 2004 3600MHz 31 3 Yes 1 103W
Core 2006 2930MHz 14 4 Yes 2 75W
Nehalem 2010 3300MHz 14 4 Yes 4 87W
Ivy Bridge 2012 3400MHz 14 4 Yes 8 77W
CMPEN 431 OOO Superscalar.77 Sampson Fall 2019 PSU
An Example: Intel’s OoO Processors
❑ Intel’s Tick-Tock technology/processor model
A Tick processor is the “current” design fabbed at a new
technology node (feature size)
A Tock processor is a new microprocessor architecture design
fabbed at the current technology node
45nm tech
node
32nm tech
node
22nm tech
node
14nm tech
node
Nehalem West
mere
Sandy
Bridge
Ivy
Bridge
Has
well
Broad
well
Sky
lake
Tock Tick Tock Tick Tock Tick Tock
4Q 2008 1Q
2010
1Q
2011
3Q
2011
2Q
2013
4Q
2014
3Q
2015
❑ Skylake is the fifth Tock since Intel instituted its Tick-
Tock model https://en.wikipedia.org/wiki/Intel_Tick-Tock
CMPEN 431 OOO Superscalar.78 Sampson Fall 2019 PSU
Some Typical “Scope” Queue Sizes
Nehalem Sandy Bridge Haswell
Instr Decode Queue 28 per thread /
2 threads
28 per thread /
2 threads
56 total for
2 threads
ROB 128 uops 168 uops 192 uops
Res Station (IQ) 36 uops 56 uops 60 uops
Integer Rename RF 160 registers 168 registers
FP Rename RF 144 registers 168 registers
Load Buffers 48 entries 64 entries 72 entries
Store Buffers 32 entries 36 entries 42 entries
❑ All x86 architectures so x86 CISC instructions are
decoded into (several) RISC microinstructions (uops)
❑ All three machines are SMT (2 threads) – stay tuned
CMPEN 431 OOO Superscalar.79 Sampson Fall 2019 PSU
❑ 4-wide fetch/decode
❑ 2-way SMT
❑ Decoders convert
x86 to 4 uops / clock
❑ Instr Decode Queue
holds uops from 2
threads – dynamically
partitioned
❑ (Red is what is
changed over Sandy Bridge)
CMPEN 431 OOO Superscalar.80 Sampson Fall 2019 PSU
❑ ROB holds
uops from 2
threads –
dynamically
partitioned
❑ IQ work done
by the unified
Reservation
Station
❑ 8 execution
ports (only 6
on Sandy
Bridge)
CMPEN 431 OOO Superscalar.81 Sampson Fall 2019 PSU
Haswell’s Cache Architecture
❑ All caches have 64B blocks; L1s and L2 private, L3 shared
Metric Nehalem Sandy Bridge Haswell
L1 I$ 32KiB, 4-way 32KiB, 8-way 32KiB, 8-way
L1 D$ 32KiB, 8-way 32KiB, 8-way 32KiB, 8-way
Ld-to-use 4 cycles 4 cycles 4 cycles
Ld bdwdth 16B/cycle 32B/cycle (banked) 64B/cycle
St bdwdth 16B/cycle 16B/cycle 32B/cycle
UL2 256KiB, 8-way 256KiB, 8-way 256KiB, 8-way
Ld-to-use 10 cycles 11 cycles 11 cycles
Bdwdth L1 32B/cycle 32B/cycle 64B/cycle
L1 iTLB 128, 4-way 128, 4-way 128, 4-way
L1 dTLB 64, 4-way 64, 4-way 64, 4-way
L2 uTLB 512, 4-way 512, 4-way 1024, 8-way
CMPEN 431 OOO Superscalar.82 Sampson Fall 2019 PSU
Cortex A8 versus Intel i7
Processor ARM A8 Intel Core i7 920
Market Personal Mobile Device Server, cloud
Thermal design power 2 Watts 130 Watts
Clock rate 1 GHz 2.66 GHz
Cores/Chip 1 4
Floating point? No Yes
Multiple issue? Yes Yes
Peak instructions/clock cycle 2 4
Pipeline stages 14 14
Pipeline schedule Static in-order Dynamic out-of-order
with speculation
Branch prediction 2-level 2-level
1st level caches/core 32 KiB I, 32 KiB D 32 KiB I, 32 KiB D
2nd level caches/core 128-1024 KiB 256 KiB
3rd level caches (shared) - 2- 8 MiB
CMPEN 431 OOO Superscalar.83 Sampson Fall 2019 PSU
Core i7 Pipeline
❑ 4-wide fetch/decode
❑ 2-way SMT
❑ Register alias table
– Map Table
❑ Retirement register
file – ARF
❑ IQ work done by the
unified Reservation
Station
❑ Branch
misprediction costs
17 cycles
CMPEN 431 OOO Superscalar.84 Sampson Fall 2019 PSU
Core i7 Performance
CMPEN 431 OOO Superscalar.85 Sampson Fall 2019 PSU
ARM Cortex A8 Performance (from 4.E)
❑ Ideal CPI is 0.5. For the median case (gcc), 80% of the
stalls are due to pipeline hazards, 20% to memory stalls
CMPEN 431 OOO Superscalar.86 Sampson Fall 2019 PSU
Core i7 Branch Speculation Performance
CMPEN 431 OOO Superscalar.87 Sampson Fall 2019 PSU
Review: Multithreaded Implementations
❑ MT trades (single-thread) latency for throughput
Sharing the datapath degrades the latency of individual threads,
but improves the aggregate latency of both threads
And it improves utilization of the datapath hardware
❑ Main questions: thread scheduling policy and pipeline
partitioning
When to switch from one thread to another?
How exactly do threads share the pipelined datapath itself?
❑ Choices depends on what kind of latencies you want to
tolerate and how much single thread performance you are
willing to sacrifice
Coarse-grain multithreading (CGMT)
Fine-grain multithreading (FGMT)
Simultaneous multithreading (SMT)
CMPEN 431 OOO Superscalar.88 Sampson Fall 2019 PSU
Vertical and Horizontal Under-Utilization
❑ FGMT reduces vertical under-utilization
Loss of all slots in an issue cycle
❑ Does not help with horizontal under-utilization
Loss of some slots in an issue cycle (in a static SS)
SMTStatic SS
ti
m
e
FGMT
CMPEN 431 OOO Superscalar.89 Sampson Fall 2019 PSU
Simultaneous MultiThreading (SMT)
❑ What can issue instr’s from multiple threads in one cycle?
Same thing that issues instr’s from multiple parts of same
program…
…out-of-order execution !!
❑ Simultaneous multithreading (SMT): OoO + FGMT
Aka (by Intel) “hyper-threading”
Once instr’s are renamed, issuer doesn’t care which thread they
come from (well, for non-loads at least)
Some examples
- IBM Power5: 4-way, 2 threads; IBM Power7: 4-way, 4 threads
- Intel Pentium4: 3-way, 2 threads; Intel Core i7: 4-way, 2 threads
- AMD Bulldozer: 4-way, 2 threads
- Alpha 21464: 8-way issue, 4 threads (canceled)
- Notice a pattern? #threads (T) * 2 = #issue width (N)
CMPEN 431 OOO Superscalar.90 Sampson Fall 2019 PSU
SMT Resource Partitioning
❑ Each thread must have its own persistent hard state
structures
Per-thread PC (thread scheduler)
Map Table
ARegFile
❑ No-state (combinational) structures (e.g., ALU) can be
dynamically shared
❑ As with FGMT, TLBs, caches, bpred tables (BHT,BTB) are
already dynamically partitioned (persistent soft state) so
can be shared
Some structures, e.g., TLBs, will need thread ids
Some ordered “soft” state structures (e.g., RAS) will have to be
replicated
CMPEN 431 OOO Superscalar.91 Sampson Fall 2019 PSU
SMT Out-of-order Pipeline
F
e
tc
h
D
e
c
o
d
e
R
e
n
a
m
e
/D
is
p
a
tc
h


C
o
m
m
it
IQ
Is
s
u
e
/P
R
e
g
R
e
a
d
W
ri
te
B
a
c
k
Out-of-order execution
In-order
commit
BTB Ready Table
ROB
E
x
e
c
u
te
E
x
e
c
u
te
Head
Ready
Instr’s
PRegFile
Tail
ARegFile
ITLB
I$
BHT Map Table
D$
Free List
DTLB
LSQ
thread scheduler
ARegFile
Map Table
In-order
CMPEN 431 OOO Superscalar.92 Sampson Fall 2019 PSU
SMT Resource Partitioning, con’t
Transient state structures will need to be partitioned
❑ Execution pipeline latches shared as with FGMT
❑ Free List, PRegFile, Ready Table, and IQ entries can be
partitioned (shared) at the fine grain (entry) level
Physically unordered and so fine-grain sharing is possible
Probably want a bigger PRegFile and IQ
- # physical registers = (#threads * #arch-regs) + #in-flight instr’s
- # Map Table entries = (#threads * #arch-regs)
❑ How are physically ordered structures (ROB, LSQ)
shared?
– Fine-grain sharing (as with IQ) would entangle commit (and
squash on branch misprediction, interrupts)
Allowing threads to commit independently is important, so …
CMPEN 431 OOO Superscalar.93 Sampson Fall 2019 PSU
Static vs Dynamic ROB & LSQ Partitioning
❑ Static partitioning (basically one per thread)
T equal-sized contiguous partitions in the ROB and LSQ
- T is the number of threads
- Essentially equivalent to having a ROB and LSQ for each thread
Could have sub-optimal utilization (fragmentation) as some ROBs
could fill up while others are almost empty
But no starvation (as in dynamic partitioning)
❑ Dynamic partitioning
#partitions > #T, available partitions assigned on need basis
Better utilization
Possible starvation (one thread grabs most/all the partitions, so
other threads are “starved”)
Couple with a fetch policy that gives a preference to threads with
fewest in-flight instr’s
❑ Both need a larger ROB and LSQ
CMPEN 431 OOO Superscalar.94 Sampson Fall 2019 PSU
Multithreading Speed-Ups on the Core i7
❑ Speed-up on PARSEC
1.31 avg
❑ Energy efficiency
improvements
1.07 avg
CMPEN 431 OOO Superscalar.95 Sampson Fall 2019 PSU
Multithreading vs Multicore
❑ If you wanted to run multiple threads would you build a
A multicore: multiple separate pipelines?
A multithreaded processor: a single larger pipeline?
❑ Both will get you throughput on multiple threads
A multicore core will be simpler, possibly faster clock
- Multicore is mainly a TLP (thread-level parallelism) engine
SMT will get you better performance (IPC) on a single thread
- SMT is basically an ILP engine that converts TLP to ILP
❑ Do both
Intel’s Sandy (Ivy) Bridge and Haswell, IBM’s Power7 & 8
4 to 8 OOO 4-way cores each of which supports 2 to 4 threads
(SMT)
Private L1 and (normally) L2 caches, shared L3 cache
3+ GHz clock rate
CMPEN 431 OOO Superscalar.96 Sampson Fall 2019 PSU
Evolution of Pipelined, SS Processors
Year Clock
Rate
# Pipe
Stages
Issue
Width
OOO? Cores
/Chip
Power
Intel 486 1989 25 MHz 5 1 No 1 5 W
Intel Pentium 1993 66 MHz 5 2 No 1 10 W
Intel Pentium
Pro
1997 200 MHz 10 3 Yes 1 29 W
Intel Pentium
4 Willamette
2001 2000 MHz 22 3 Yes 1 75 W
Intel Pentium
4 Prescott
2004 3600 MHz 31 3 Yes 1 (2) 103 W
Intel Core 2006 2930 MHz 14 4 Yes 2 75 W
Intel Core i7 2008 2930 MHz 14 4 Yes 4 (2) 95 W
Sun USPARC
III
2003 1950 MHz 14 4 No 1 90 W
Sun T1
(Niagara)
2005 1200 MHz 6 1 No 8 70 W
CMPEN 431 OOO Superscalar.103 Sampson Fall 2019 PSU
SimpleScalar Structure
❑ sim-outorder: supports out-of-order execution (with
in-order commit) with a Register Update Unit (RUU)
Uses a RUU for register renaming and to hold the results of
pending instructions (our IQ). The RUU also retires (i.e.,
commits) completed instructions (so our ROB) in program
order to the RegFile
Uses a LSQ for store instructions not ready to commit and
load instructions waiting for access to the D$
Loads are satisfied by either the memory or by an earlier store
value residing in the LSQ if their addresses match
- Loads are issued to the memory system only when addresses of
all previous (older) loads and stores are known
CMPEN 431 OOO Superscalar.104 Sampson Fall 2019 PSU
SimpleScalar Pipeline Stage Functions
F
e
tc
h
m
u
lt
ip
le
i
n
s
tr
’s
D
e
c
o
d
e
/R
e
n
a
m
e
a
n
d

D
is
p
a
tc
h
i
n
s
tr
’s
W
a
it
f
o
r
s
o
u
rc
e
o
p
e
ra
n
d
s

to
b
e
R
e
a
d
y
a
n
d
F
U
f
re
e
,
s
c
h
e
d
u
le
R
e
s
u
lt
B
u
s
a
n
d

e
x
e
c
u
te
i
n
s
tr
’s
C
o
p
y
R
e
s
u
lt
B
u
s
d
a
ta
t
o

m
a
tc
h
in
g
w
a
it
in
g
s
o
u
rc
e
s
W
ri
te
d
s
t
c
o
n
te
n
ts
t
o

R
e
g
F
ile
o
r
D
a
ta
M
e
m
o
ry
FETCH
DECODE,
RENAME &
DISPATCH
ISSUE &
EXECUTE
WRITE
BACK
RESULT
COMMIT
In Order In OrderOut of OrderIn Order
ruu_fetch()
ruu_dispatch()
ruu_issue()
lsq_refresh()
ruu_writeback()
ruu_commit()
CMPEN 431 OOO Superscalar.105 Sampson Fall 2019 PSU
SimpleScalar Pipeline
❑ ruu_fetch(): fetches instr’s from one I$ line, puts them
in the fetch queue, probes the cache line predictor to
determine the next I$ line to access in the next cycle
- fetch:ifqsize: fetch width (default is 4)
- fetch:speed: ratio of the front end speed to the execution core
( times as many instructions fetched as decoded per cycle)
- fetch:mplat: branch misprediction latency (default is 3)
❑ ruu_dispatch(): decodes instr’s in the fetch queue,
puts them in the dispatch (scheduler) queue, enters and
links instr’s into the RUU and the LSQ, splits memory
access instructions into two separate instr’s (one to
compute the effective addr and one to access the
memory), notes branch mispredictions
- decode:width: decode width (default is 4)
CMPEN 431 OOO Superscalar.106 Sampson Fall 2019 PSU
SimpleScalar Pipeline, con’t
❑ ruu_issue()and lsq_refresh(): locates and marks
the instr’s ready to be executed by tracking register and
memory dependencies, ready loads are issued to D$
unless there are earlier stores in LSQ with unresolved
addr’s, forwards store values with matching addr to ready
loads
- issue:width: maximum issue width (default is 4)
- ruu:size: RUU capacity in instr’s (default is 16, min is 2)
- lsq:size: LSQ capacity in instr’s (default is 8, min is 2)
and handles instr’s execution – collects all the ready
instr’s from the scheduler queue (up to the issue width),
check on FU availability, checks on access port
availability, schedules writeback events based on FU
latency (hardcoded in fu_config[])
- res:ialu | imult | memport | fpalu | fpmult: number of FU’s
(default is 4 | 1 | 2 | 4 | 1)
CMPEN 431 OOO Superscalar.107 Sampson Fall 2019 PSU
SimpleScalar Pipeline, con’t
❑ ruu_writeback(): determines completed instr’s,
does data forwarding to dependent waiting instr’s,
detects branch misprediction and on misprediction rolls
the machine state back to the checkpoint and discards
erroneously issued instructions
❑ ruu_commit(): in-order commits results for instr’s
(values copied from RUU to RegFile or LSQ to D$),
RUU/LSQ entries for committed instr’s freed; keeps
retiring instructions at the head of RUU that are ready to
commit until the head instr is one that is not ready
51作业君

Email:51zuoyejun

@gmail.com

添加客服微信: abby12468