CMPEN 431 Chapter 6B.1 Sampson, PSU, 2019
CMPEN 431
Computer Architecture
Fall 2019
Chapter 6B: Introduction to
Message Passing Multiprocessors
[Adapted from Computer Organization and Design, 5th Edition,
Patterson & Hennessy, © 2014, MK and Mary Jane Irwin]
CMPEN 431 Chapter 6B.2 Sampson, PSU, 2019
Review: Shared Memory Multiprocessors (SMP)
❑ Q1 – Single address space shared by all cores
❑ Q2 – Cores coordinate/communicate through shared
variables in memory (via loads and stores)
Use of shared data must be coordinated via synchronization
primitives (locks) that allow access to the data to only one core
at a time (a minimal lock sketch follows below)
❑ SMPs come in two styles
Uniform memory access (UMA) multiprocessors
Nonuniform memory access (NUMA) multiprocessors
[Diagram: multiple cores, each with a private cache, connected through an interconnection network to shared memory and I/O]
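Purely as an illustration (not part of the original slides), a minimal pthreads sketch of this style of coordination: two threads add partial results into one shared variable, and a lock ensures only one core updates it at a time.

/* Illustrative sketch only: coordination through a shared variable plus a lock */
#include <pthread.h>
#include <stdio.h>

static long shared_sum = 0;                               /* shared variable in memory */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  /* synchronization primitive */

static void *worker(void *arg)
{
    long local = (long)arg;              /* stand-in for a per-thread partial result */
    pthread_mutex_lock(&lock);           /* only one core at a time ...              */
    shared_sum += local;                 /* ... may update the shared data           */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    for (long i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, (void *)(i + 1));
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    printf("shared_sum = %ld\n", shared_sum);   /* always prints 3 */
    return 0;
}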
CMPEN 431 Chapter 6B.3 Sampson, PSU, 2019
Message Passing Multiprocessors (MPP)
❑ Each core has its own private address space
❑ Q1 – Cores share data by explicitly sending and
receiving information (message passing)
❑ Q2 – Coordination is built into message passing
primitives (message send and message receive)
[Diagram: multiple cores, each with its own private cache and private memory, connected through an interconnection network]
CMPEN 431 Chapter 6B.4 Sampson, PSU, 2019
Communication in Network Connected Multi’s
❑ Implicit communication via loads and stores (SMP)
hardware architects have to provide coherent caches and
process (thread) synchronization primitives (like ll and sc; a
spin-lock sketch built on such a primitive follows this list)
lower communication overhead
harder to overlap computation with communication
more efficient to fetch remote data by address when it is needed
than to send it ahead of time in case it might be needed
❑ Explicit communication via sends and receives (MPP)
simplest solution for hardware architects
higher communication overhead
easier to overlap computation with communication
easier for the programmer to optimize communication
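As a side note (not from the slides), the sketch below shows what such primitives buy you: a minimal spin lock built on C11's atomic test-and-set, which compilers typically lower to an ll/sc pair (or an equivalent atomic instruction) on the target ISA.

/* Illustrative sketch only: a spin lock built from an atomic test-and-set */
#include <stdatomic.h>

typedef struct { atomic_flag locked; } spinlock_t;   /* init with { ATOMIC_FLAG_INIT } */

static void spin_lock(spinlock_t *l)
{
    /* indivisible read-modify-write; on many ISAs this compiles to an ll/sc loop */
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
        ;                                            /* spin until the flag was clear */
}

static void spin_unlock(spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}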
CMPEN 431 Chapter 6B.5 Sampson, PSU, 2019
SMP/MPP Example – Intel Xeon Phi Coprocessor
❑ Intel’s Many Integrated Core architecture (MIC, pronounced “Mike”)
Up to 8 coprocessors (72 cores each) per host server, SMP within a
coprocessor, MPP between coprocessors
❑ Three generations: Knights Ferry (prototype), Knights Corner
(x100 series), Knights Landing (x200 series)
Knights Landing in 14nm FinFETs, second quarter of 2015 or later
72 Atom (Silvermont) cores, static 2-way superscalar, 4 threads per
core (FGMT) so 288 threads, 1.238GHz (1.33GHz Turbo mode)
Each core has two 512-bit vector units and supports AVX-512F
SIMD instructions
On-chip interconnect: mesh NoC (?)
Up to 384GB of DDR4 DRAM and 16GB of stacked 3D MCDRAM
3+ TeraFLOPS per coprocessor
TDP of 300W (estimated 15W per core, so how ??)
❑ Programming tools: OpenMP, OpenCL, Cilk, and specialized
versions of Intel's Fortran, C++ and scientific libraries
CMPEN 431 Chapter 6B.6 Sampson, PSU, 2019
Xeon Phi Knight’s Landing
http://www.anandtech.com/show/8217/intels-knights-landing-coprocessor-detailed
CMPEN 431 Chapter 6B.7 Sampson, PSU, 2019
Summing 100,000 Numbers on 100 Core MPP
sum = 0;
for (i = 0; i < 1000; i = i + 1)
  sum = sum + Al[i];            /* sum local array subset */
❑ Start by distributing 1000 elements of vector A to each of
the local memories and summing each subset in parallel
❑ The cores then coordinate in adding together the sub
sums (Cn is the number (id) of the core, send(x,y) sends
value y to core x, and receive() receives a value); an MPI
rendering of this reduction is sketched below the code
half = 100;
limit = 100;
repeat
  half = (half+1)/2;            /* dividing line between senders and receivers */
  if (Cn >= half && Cn < limit) send(Cn - half, sum);
  if (Cn < (limit/2)) sum = sum + receive();
  limit = half;                 /* upper limit of senders */
until (half == 1);              /* final sum in C0’s sum */
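Purely for illustration, the same tree reduction could be written against MPI, with Cn becoming the MPI rank and send/receive becoming MPI_Send/MPI_Recv; tree_sum below is a hypothetical helper, and each rank's local partial sum is just a stand-in value.

#include <mpi.h>
#include <stdio.h>

/* Hypothetical helper: same tree reduction as the pseudocode above */
static double tree_sum(double sum)
{
    int rank, limit, half;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* rank  plays the role of Cn       */
    MPI_Comm_size(MPI_COMM_WORLD, &limit);  /* limit starts at the core count   */
    half = limit;
    do {
        half = (half + 1) / 2;              /* dividing line between halves     */
        if (rank >= half && rank < limit)   /* upper half sends its sum down    */
            MPI_Send(&sum, 1, MPI_DOUBLE, rank - half, 0, MPI_COMM_WORLD);
        if (rank < limit / 2) {             /* lower half receives and adds     */
            double other;
            MPI_Recv(&other, 1, MPI_DOUBLE, rank + half, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            sum = sum + other;
        }
        limit = half;                       /* upper limit of senders           */
    } while (half != 1);                    /* final sum ends up on rank 0      */
    return sum;
}

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double sum = rank;                      /* stand-in for the local partial sum */
    sum = tree_sum(sum);

    if (rank == 0) printf("final sum = %f\n", sum);
    MPI_Finalize();
    return 0;
}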
CMPEN 431 Chapter 6B.9 Sampson, PSU, 2019
An Example with 10 Cores
[Diagram: reduction tree for 10 cores C0–C9, each starting with its own local sum.
half = 5, limit = 10: C5–C9 send their sums to C0–C4, which receive and add
half = 3, limit = 5: C3–C4 send to C0–C1 (C2 keeps its sum)
half = 2, limit = 3: C2 sends to C0
half = 1, limit = 2: C1 sends to C0, which now holds the final sum]
❑ Key is how long it takes to
send packets back and forth
across the network
Software protocol stack
(CmpEn 362)
Hardware interconnection
network and traffic load
CMPEN 431 Chapter 6B.10 Sampson, PSU, 2019
Pros and Cons of Message Passing
❑ Message sending and receiving is much slower than
addition, for example
❑ But message passing multiprocessors are much easier
for hardware architects to design
Don’t have to worry about cache coherency for example
❑ The advantage for programmers is that communication is
explicit, so there are fewer “performance surprises” than
with the implicit communication in cache-coherent SMPs
Message passing standard MPI-2.2 (www.mpi-forum.org)
❑ However, it’s harder to port a sequential program to a
message passing multiprocessor since every
communication must be identified in advance
With cache-coherent shared memory the hardware figures out
what data needs to be communicated
CMPEN 431 Chapter 6B.11 Sampson, PSU, 2019
Aside: Quick Summary of MPI
❑ The MPI Standard describes
point-to-point message-passing
collective communications
group and communicator concepts
process topologies
environmental management
process creation and management
one-sided communications
extended collective operations
external interfaces
I/O functions
a profiling interface
❑ Language bindings for C, C++ and Fortran are defined
http://www.mpi-forum.org/docs/docs.html
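As a taste of the collective operations listed above, a hedged minimal example (not from the slides): every rank contributes a local partial sum and MPI_Reduce combines them into a single total on rank 0; local_sum here is just a stand-in value.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local_sum = rank + 1.0;      /* stand-in for a real partial sum        */
    double total = 0.0;

    /* Collective communication: all ranks participate, result lands on rank 0 */
    MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("total = %f\n", total);

    MPI_Finalize();
    return 0;
}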
CMPEN 431 Chapter 6B.12 Sampson, PSU, 2019
Concurrency and Parallelism
❑ Programs are designed to be sequential or concurrent
Sequential – only one activity, behaving in the “usual” way
Concurrent – multiple, simultaneous activities, designed as
independent operations or as cooperating threads or processes
- The various parts of a concurrent program need not execute
simultaneously, or in a particular sequence, but they do need to
coordinate their activities by exchanging information in some way
❑ A key challenge is to build parallel (concurrent) programs
that have high performance on multiprocessors as the
number of cores increases – programs that scale
Problems that arise
- Scheduling threads on cores close to the memory space where their
data primarily resides
- Load balancing threads on cores and dealing with thermal hot-spots
- Time for synchronization of threads
- Overhead for communication of threads
CMPEN 431 Chapter 6B.15 Sampson, PSU, 2019
Encountering Amdahl’s Law
❑ Speedup due to enhancement E is
Speedup w/ E = Exec time w/o E / Exec time w/ E
❑ Suppose that enhancement E accelerates a fraction F
(F <1) of the task by a factor S (S>1) and the remainder
of the task is unaffected
ExTime w/ E = ExTime w/o E × ((1-F) + F/S)
Speedup w/ E = 1 / ((1-F) + F/S)
CMPEN 431 Chapter 6B.17 Sampson, PSU, 2019
Example 1: Amdahl’s Law
❑ Consider an enhancement which runs 20 times faster
but which is only usable 25% of the time.
Speedup w/ E = 1/(.75 + .25/20) = 1.31
❑ What if it’s usable only 15% of the time?
Speedup w/ E = 1/(.85 + .15/20) = 1.17
❑ Amdahl’s Law tells us that to achieve linear speedup
with 100 cores (so 100 times faster), none of the
original computation can be scalar!
❑ To get a speedup of 90 from 100 cores, the
percentage of the original program that could be scalar
would have to be 0.1% or less
Speedup w/ E = 1/(.001 + .999/100) = 90.99
Speedup w/ E = 1 / ((1-F) + F/S)
CMPEN 431 Chapter 6B.19 Sampson, PSU, 2019
Example 2: Amdahl’s Law
❑ Consider summing 10 scalar variables and two 10 by 10
matrices (matrix sum) on 10 cores – 10 sequential adds plus
100 parallelizable adds, so F = 100/110 ≈ 0.909
Speedup w/ E = 1/(.091 + .909/10) = 1/0.1819 = 5.5
❑ What if there are 100 cores?
Speedup w/ E = 1/(.091 + .909/100) = 1/0.10009 = 10.0
❑ What if the matrices are 100 by 100 (or 10,010 adds in
total) on 10 cores?
Speedup w/ E = 1/(.001 + .999/10) = 1/0.1009 = 9.9
❑ What if there are 100 cores?
Speedup w/ E = 1/(.001 + .999/100) = 1/0.01099 = 91
Speedup w/ E = 1 / ((1-F) + F/S)
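The same numbers can be checked mechanically; the small helper below (illustrative only, not part of the slides) just evaluates Speedup w/ E = 1 / ((1-F) + F/S) for the four cases above.

/* Quick check of the Amdahl's Law numbers above (illustrative only) */
#include <stdio.h>

static double speedup(double F, double S)   /* F = fraction enhanced, S = speedup factor */
{
    return 1.0 / ((1.0 - F) + F / S);
}

int main(void)
{
    printf("10x10   on  10 cores: %.1f\n", speedup(100.0/110.0,    10));  /* ~5.5  */
    printf("10x10   on 100 cores: %.1f\n", speedup(100.0/110.0,   100));  /* ~10.0 */
    printf("100x100 on  10 cores: %.1f\n", speedup(10000.0/10010.0, 10)); /* ~9.9  */
    printf("100x100 on 100 cores: %.1f\n", speedup(10000.0/10010.0,100)); /* ~91   */
    return 0;
}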
CMPEN 431 Chapter 6B.20 Sampson, PSU, 2019
Multiprocessor Scaling
❑ Getting good speedup on a multiprocessor while keeping
the problem size fixed is harder than getting good speedup
by increasing the size of the problem
Strong scaling – when good speedup is achieved on a
multiprocessor without increasing the size of the problem
Weak scaling – when good speedup is achieved on a
multiprocessor by increasing the size of the problem
proportionally to the increase in the number of cores and the
total size of memory
❑ But Amdahl was an optimist – you probably will need
extra time to patch together parts of the computation that
were done in parallel
CMPEN 431 Chapter 6B.21 Sampson, PSU, 2019
Multiprocessor Benchmarks
Benchmark          Scaling         Reprogram?          Description
LINPACK            Weak            Yes                 Dense matrix linear algebra
                                                       (http://www.top500.org/project/linpack/)
SPECrate           Weak            No                  Parallel SPEC programs for job-level
                                                       parallelism
SPLASH 2           Strong          No                  Independent job parallelism (both kernels
                                                       and applications, from high-performance
                                                       computing)
NAS Parallel       Weak            Yes (C or Fortran)  Five kernels, mostly from computational
                                                       fluid dynamics
PARSEC             Weak            No                  Multithreaded programs that use Pthreads
                                                       and OpenMP. Nine applications and 3 kernels
                                                       – 8 with data parallelism, 3 with pipelined
                                                       parallelism
Berkeley Patterns  Strong or Weak  Yes                 13 design patterns implemented by
                                                       frameworks or kernels
CMPEN 431 Chapter 6B.22 Sampson, PSU, 2019
DGEMM (Double precision GEneral MM) Example
❑ DGEMM: A BLAS (Basic Linear Algebra Subprograms)
routine; part of LINPACK used for performance
measurements
C = C + A * B
void dgemm (int n, double* A, double* B, double* C)
{
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j)
    {
      double cij = C[i+j*n];          /* cij = C[i][j] */
      for (int k = 0; k < n; ++k)
        cij += A[i+k*n] * B[k+j*n];   /* cij += A[i][k] * B[k][j] */
      C[i+j*n] = cij;                 /* C[i][j] = cij */
    }
}
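For concreteness, a hypothetical caller (not in the slides): the matrices are stored column-major in flat arrays, so element (i,j) lives at index i + j*n; this assumes the dgemm above is defined in the same file.

/* Illustrative usage of the dgemm above on a tiny 2x2 case */
#include <stdio.h>

void dgemm(int n, double* A, double* B, double* C);   /* defined above */

int main(void)
{
    enum { N = 2 };
    double A[N*N] = {1, 2,    /* column 0 of A */
                     3, 4};   /* column 1 of A */
    double B[N*N] = {5, 6,
                     7, 8};
    double C[N*N] = {0};

    dgemm(N, A, B, C);        /* C = C + A*B */

    for (int i = 0; i < N; ++i)
        printf("%6.1f %6.1f\n", C[i + 0*N], C[i + 1*N]);
    /* Expected:   23.0   31.0
                   34.0   46.0  */
    return 0;
}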
CMPEN 431 Chapter 6B.23 Sampson, PSU, 2019
Multithreaded, Blocked OpenMP DGEMM
❑ The OpenMP #pragma makes the outermost for loop
operate in parallel (do_block is elided on the slide; a sketch
follows the code below)
void dgemm (int n, double* A, double* B, double* C)
{
#pragma omp parallel for
  for ( int sj = 0; sj < n; sj += BLOCKSIZE )
    for ( int si = 0; si < n; si += BLOCKSIZE )
      for ( int sk = 0; sk < n; sk += BLOCKSIZE )
        do_block(n, si, sj, sk, A, B, C);
}
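The do_block the code calls is the cache-blocked inner routine along the lines of the blocked DGEMM discussed earlier in P&H; a sketch, assuming BLOCKSIZE divides n evenly:

/* Sketch of the elided do_block: multiplies one BLOCKSIZE x BLOCKSIZE block */
#define BLOCKSIZE 32

static void do_block(int n, int si, int sj, int sk,
                     double *A, double *B, double *C)
{
  for (int i = si; i < si + BLOCKSIZE; ++i)
    for (int j = sj; j < sj + BLOCKSIZE; ++j)
    {
      double cij = C[i + j*n];              /* cij = C[i][j] */
      for (int k = sk; k < sk + BLOCKSIZE; ++k)
        cij += A[i + k*n] * B[k + j*n];     /* cij += A[i][k] * B[k][j] */
      C[i + j*n] = cij;                     /* C[i][j] = cij */
    }
}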
CMPEN 431 Chapter 6B.24 Sampson, PSU, 2019
DGEMM Scaling: Thread Count, Matrix Size
CMPEN 431 Chapter 6B.25 Sampson, PSU, 2019
Multiprocessor Basics
                                          # of Cores
Communication model   Message passing     8 to 2048+
                      SMP: NUMA           8 to 256+
                           UMA            2 to 32
Physical connection   Network             8 to 256+
                      Bus                 2 to 8
❑ Q1 – How do they share data?
A single physical address space shared by all cores or message
passing
❑ Q2 – How do they coordinate?
• Through atomic operations on shared variables in memory (via loads
and stores) or via message passing
❑ Q3 – How scalable is the architecture? How many cores?
CMPEN 431 Chapter 6B.29 Sampson, PSU, 2019
Multiple types of parallelism: DGEMM Using AVX
❑ Taking advantage of subword parallelism (AVX),
instruction level parallelism (compiler loop unrolling), and
caches (matrix blocking)
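The AVX version of the inner loop is not reproduced on this slide; below is a sketch in the spirit of the subword-parallel DGEMM from Chapter 3 of P&H, assuming n is a multiple of 4 and the (column-major) arrays are 32-byte aligned.

/* Sketch only: C = C + A*B, four doubles at a time with AVX intrinsics */
#include <x86intrin.h>

void dgemm_avx (int n, double* A, double* B, double* C)
{
  for ( int i = 0; i < n; i += 4 )            /* four rows of C at once         */
    for ( int j = 0; j < n; j++ ) {
      __m256d c0 = _mm256_load_pd(C+i+j*n);   /* c0 = C[i..i+3][j]              */
      for ( int k = 0; k < n; k++ )           /* multiply-add down column j     */
        c0 = _mm256_add_pd(c0,
               _mm256_mul_pd(_mm256_load_pd(A+i+k*n),
                             _mm256_broadcast_sd(B+k+j*n)));
      _mm256_store_pd(C+i+j*n, c0);           /* C[i..i+3][j] = c0              */
    }
}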
CMPEN 431 Chapter 6B.34 Sampson, PSU, 2019
Speedup Versus Parallelism
❑ For x86 processors
❑ Assumptions
MIMD: 2 cores per chip added every two years
SIMD: SIMD widths (registers and ALUs) will double every four years
CMPEN 431 Chapter 6B.36 Sampson, PSU, 2019
GPU Architectures
❑ GPUs are highly multithreaded (data-parallel)
Use thread switching to hide memory latency
- Less reliance on multi-level caches
Graphics memory is wide and high-bandwidth
❑ Trend is toward general purpose GPUs
Heterogeneous CPU/GPU systems
CPU for sequential code, GPU for parallel code
❑ Programming languages/APIs
DirectX, OpenGL
C for Graphics (Cg), High Level Shader Language (HLSL)
NVIDIA’s Compute Unified Device Architecture (CUDA)
CMPEN 431 Chapter 6B.37 Sampson, PSU, 2019
NVIDIA Tesla
[Diagram: NVIDIA Tesla GPU with 14 Streaming Multiprocessors, each containing 8 Streaming Processors]
CMPEN 431 Chapter 6B.38 Sampson, PSU, 2019
Apple’s A6 Processor (2 CPUs, 3 GPUs)
CMPEN 431 Chapter 6B.39 Sampson, PSU, 2019
Intel’s Broadwell-L
❑ “Tick” node at 14nm FinFET
82mm2 die
4.5W TDP
❑ Haswell’s “tick” partner
IPC about 5% higher than Haswell
Larger ROB
Larger BP tables
Larger L2 TLB (1K to 1.5K entries)
New L2 TLB for large pages (16 entries for 1GB pages)
From MPR, Sep 2014
CMPEN 431 Chapter 6B.40 Sampson, PSU, 2019
Graphics Performance
From MPR, Sep 2014
CMPEN 431 Chapter 6B.41 Sampson, PSU, 2019
Intel’s Gen8 GPU Architecture
From MPR, Sep 2014
CMPEN 431 Chapter 6B.46 Sampson, PSU, 2019
Loosely Coupled Clusters
❑ Multiple off-the-shelf computers (each with its own private
address space and OS) connected via a local area
network (LAN) functioning as a “single” multiprocessor
Search engines, Web servers, email servers, databases, …
The current trend is toward grid or cloud computing, using wide-
or global-area networks, perhaps borrowing from other
participants, or selling a service as a subset of an existing
corporate network
❑ Having N OS copies limits the memory space available to applications
❑ Improved system availability and expandability
easy to replace a machine without bringing down the whole
system
allows rapid, incremental expandability
❑ Economy-of-scale advantages with respect to costs
CMPEN 431 Chapter 6B.48 Sampson, PSU, 2019
Top500 Architecture Styles
❑ Uniprocessor – one core only
On the Top500 list, a vector processor
❑ SIMD – one control core, many compute units
❑ SMP – Shared Memory multiProcessors
❑ MPP – Message Passing multiProcessors
❑ Constellation – a collection of network connected SMPs
❑ Clusters – collection of independent, network connected
PCs (commodity processors and memory), either
Separate (not coherent) address spaces, communicate via
message passing, commodity network
Single (coherent) address space, communicate via direct inter-
node memory access, custom inter-node network
CMPEN 431 Chapter 6B.49 Sampson, PSU, 2019
Top500 Architecture Styles, 1993-2009
❑ Uniprocessors and SIMDs disappeared while Clusters
and Constellations grew from <3% to 80%. Now it’s 99%
Clusters and MPPs.
[Chart: number of Top500 systems by architecture style (Cluster, Constellation, SIMD, MPP, SMP, Uniprocessor) for each year from 1993 to 2009; y-axis 0 to 500 systems. Nov 2009 data, http://www.top500.org/]
CMPEN 431 Chapter 6B.50 Sampson, PSU, 2019
From the Top500 Lists and Earlier …
❑ First Megaflop (10^6) system – CDC 7600, 1971
in today’s vocabulary, a superscalar processor
❑ First Gigaflop (10^9) system – Cray-2, 1986
a vector processor, 4 CPUs, 2 GB memory
❑ First Teraflop (10^12) system – Intel ASCI Red, built for Sandia
National Laboratories, Albuquerque, 1996
7,264 Pentium Pro processors (200 MHz), later increased to 9,632
Pentium II OverDrive processors (333 MHz)
❑ First Petaflop (10^15) system – IBM Roadrunner, built for Los Alamos
National Laboratory, 2008
12,960 IBM Cell + 6,480 AMD Opteron processors, all multicore
❑ First Exaflop (10^18) system – 2020?
Oak Ridge National Laboratory performed a calculation at 1.8×10^18 operations
per second (which is not the same as 1.8×10^18 flops) on the Summit OLCF-4 supercomputer
❑ Top this year – https://www.top500.org/lists/2019/06/
http://en.wikipedia.org/wiki/Cray-2
http://en.wikipedia.org/wiki/CDC_7600
http://en.wikipedia.org/wiki/IBM_Roadrunner