EECS 419: TAKE HOME TEST 3 3 December 2019
Due: December 11, 2019
Please do all required problems 1-9. You must also do one of problems 10 or 11.
Deliver your answers in hardcopy form (with source code, results and a README note
delivered by email in a zip file).
Work your responses to the test problems individually.
Problem 1. – The object of this problem is to explore numerical calculations for the continuum
computation model on mesh or grid type multiple processors. You should model your problem in
C/C++. Assume that you have a square array of points, 16×16, and that the value of the electric
potential function on the boundary is zero on the top row, 20 on the bottom, and 10 along all other
boundary points.
1.1 Initialize the potential function to 0 on all interior points. Calculate the Poisson solution
for the values of all interior points by replacing each interior point with the average of the
values of its neighboring points.
Compute the new values for all interior points before updating any interior points. Run this
simulation for five iterations and show the answers you obtain at the end.
Note: The values on the boundary are fixed and do not change during the computation.
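The intended computation in 1.1 can be sketched in C/C++ as follows (a minimal illustration assuming the boundary values above, with the corner points counted as part of the top and bottom rows; the variable names and output format are illustrative, not a required form of submission):

```cpp
#include <cstdio>

int main() {
    const int N = 16;
    double u[N][N] = {};                     // interior starts at 0

    // Fixed boundary: top row 0, bottom row 20, left/right columns 10.
    for (int j = 0; j < N; ++j) { u[0][j] = 0.0; u[N - 1][j] = 20.0; }
    for (int i = 1; i < N - 1; ++i) { u[i][0] = 10.0; u[i][N - 1] = 10.0; }

    double next[N][N];
    for (int iter = 0; iter < 5; ++iter) {
        // Compute all new interior values first ...
        for (int i = 1; i < N - 1; ++i)
            for (int j = 1; j < N - 1; ++j)
                next[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] +
                                     u[i][j - 1] + u[i][j + 1]);
        // ... and only then copy them back, so no point is updated early.
        for (int i = 1; i < N - 1; ++i)
            for (int j = 1; j < N - 1; ++j)
                u[i][j] = next[i][j];
    }

    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) printf("%7.3f", u[i][j]);
        printf("\n");
    }
    return 0;
}
```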
1.2 Repeat the process in the previous problem part 1.1, except update a point as soon as
you have computed its new value, and use the new value when you reach a neighboring point.
You should scan the interior points row by row from top to bottom and from left to right
within rows.
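For part 1.2 only the sweep changes: the new value is written straight back into the array, so later points in the same row-by-row scan already see it. A minimal sketch of that sweep, under the same assumptions as above:

```cpp
// Gauss-Seidel-style in-place sweep over the same 16x16 array:
// rows scanned top to bottom, columns left to right within each row.
void sweep_in_place(double u[16][16]) {
    for (int i = 1; i < 15; ++i)
        for (int j = 1; j < 15; ++j)
            u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] +
                              u[i][j - 1] + u[i][j + 1]);
}
```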
1.3 The second process seems to converge faster. Give an intuitive explanation of why this
might be the case.
1.4 How do your findings relate to the interconnection structure of a processor for solving
this problem?
Problem 2. – The purpose of this problem is to show the effect of information propagation within
a calculation. Use the Poisson problem (discussed in Problem 1 previously) and write a computer
program in C/C++ that iterates until no point value changes by more than 0.1 percent. Let this be
the initial state of the problem for the following problem parts below, 2.1, 2.2 and 2.3.
2.1 Increase the boundary point at the upper left corner to a new value of 20. Perform five
iterations of the Poisson solver and observe the values obtained.
2.2 Now restore the mesh to the initial state for problem 2.1. Change the problem so that, in
effect, the upper left corner is rotated to the bottom right corner. To do this, scan the rows
from right to left instead of left to right and scan from bottom to top instead of top to bottom.
Perform five iterations of the Poisson solver and observe the values obtained.
2.3 Both problem parts 2.1 and 2.2 eventually converge to the same solution because the
initial data are the same and the physical process modeled is the same. However, the results
obtained from problem parts 2.1 and 2.2 are different after five iterations of the Poisson
solver. Explain why they are different. Which of the two problems has faster convergence?
Why?
Problem 3. – Domain decomposition is a well-known technique for solving parallel numerical
problems on large computational grids, i.e. those with over 10⁶ grid points. The idea is illustrated in Figure
3.1. The original grid is partitioned into small domains, or subgrids, and the numerical problem
is then solved in parallel in each domain on a uniprocessor architecture. Care should be taken, though,
to communicate the results of processors located in neighboring domains as the computation proceeds.
As the number of processors, and hence of domains, is usually rather small, say on the order of 100
or fewer, this technique provides modest speedup because it suffers from communication bottlenecks at
the processor domain interfaces. However, the cost of resources is reasonable provided the number
of processors is kept relatively small.
Figure 3.1: Domain decomposition (the grid partitioned into four domains handled by processors P1-P4). Figure 3.2: Mesh-of-meshes architecture.
It is proposed to modify the domain decomposition by a parallel mesh-of-meshes architec-
ture style as shown in Figure 3.2 for a 2× 2 domain decomposition. Every grid point of each
domain will contain an embedded virtual processor, however, only the top left corner domain,
Domain(0,0), will contain a corresponding mesh of physical processors. The computational
model is simple. The processor mesh computes in parallel Domain(0,0) then it switches onto
Domain(0,1), Domain(0,2) and so on, sweeping through the domains horizontally and vertically,
step by step, until it reaches the bottom right corner domain.
In general, let N×N be the original grid matrix, which is partitioned into Q×Q domains such
that every domain has M×M grid nodes, where N = M×Q. Thus every full iteration through the
entire grid consists of Q² domain sweeps, one through each of the Q² domains.
To achieve this, every physical processor node should have sufficient local storage to hold
the intermediate results of the corresponding nodes when going from domain to domain. However,
to allow communication between nodes in neighboring domains, we will use a wrap-around or
torus structure for the physical processor mesh, shown in Fig. 3.3.
In what follows we assume that we will use the above Q×Q architecture of Figs. 3.2 and 3.3 to solve
the Poisson potential equation on an N×N grid. The context-switching time from domain to domain
may be a few cycles, say 5 (instruction) cycles. The grid has been initialized and the initial values of every
node are stored in the corresponding processors. Specifically, the initial potential value of the top
grid nodes is 100, whereas all others are initialized to zero.
Figure 3.3: Mesh domain architecture (physical processors P0-P15 arranged as a 4×4 mesh with wrap-around/torus connections).
3.1 Formalize the computational model in the form of an algorithmic procedure for the Pois-
son potential equation assuming we use the standard iterative solver.
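A sequential C/C++ sketch of the domain sweep order in 3.1 is given below. The in-place (Gauss-Seidel style) update and the mirrored treatment of the non-top edges are assumptions, not part of the problem statement; the 5-cycle context switch appears only as a comment in this cost-free simulation:

```cpp
#include <vector>

// One full grid iteration on the mesh-of-meshes model: the single M x M
// physical processor mesh is time-multiplexed over the Q x Q domains,
// sweeping them left to right and top to bottom.
void full_iteration(std::vector<std::vector<double>>& grid, int N, int Q) {
    const int M = N / Q;                         // domain edge, N = M * Q
    for (int dq = 0; dq < Q; ++dq)               // domain row
        for (int dp = 0; dp < Q; ++dp) {         // domain column
            // context switch to Domain(dq, dp): ~5 instruction cycles
            for (int i = dq * M; i < (dq + 1) * M; ++i)
                for (int j = dp * M; j < (dp + 1) * M; ++j) {
                    if (i == 0) continue;        // top boundary stays at 100
                    double up    = grid[i - 1][j];
                    double down  = (i < N - 1) ? grid[i + 1][j] : grid[i][j];
                    double left  = (j > 0)     ? grid[i][j - 1] : grid[i][j];
                    double right = (j < N - 1) ? grid[i][j + 1] : grid[i][j];
                    grid[i][j] = 0.25 * (up + down + left + right);
                }
        }
}
```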
3.2 Estimate the computation time for your solution assuming that the problem converges
after 1000 iterations.
3.3 Estimate the speedup of the proposed approach (if any) in comparison to the standard
domain decomposition technique in Figure 3.1. To be fair you should use the same number
of processors Q2 and same grid size N×N for the domain decomposition in Figure 3.1.
3.4 You should write a C/C++ model for your Poisson solver and simulate your solution to
find the convergence values of all nodes in the mesh assuming that the top boundary initial
values do not change. Assume that N = 10⁴ and Q = 10³.
3.5 Get some additional results for different grid sizes N×N. However, only a fraction of
your results will be needed in your answers.
Problem 4. – Consider a 5-dimensional hypercube system, Hyper-5. Assume that you are to
calculate all partial sums of 32 items, i.e. the sum of the first i items for every i up to 32.
4.1 Construct a program for Hyper-5 that performs this operation in a time that grows as
O(logN) if the number of processors is N. Assume that every processor node in the computer
executes the same program, although the program can be slightly different from node to node
since the processors in the Hyper-5 are independent.
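One classical way to obtain all partial sums on a hypercube is a dimension-by-dimension exchange, simulated sequentially below; this is a sketch of the idea only, not the per-node message-passing program that 4.1 asks for, and the item values are arbitrary test data:

```cpp
#include <cstdio>

int main() {
    const int DIM = 5, N = 1 << DIM;            // Hyper-5: 32 nodes, one item each
    int x[N], prefix[N], total[N];
    for (int p = 0; p < N; ++p) { x[p] = p + 1; prefix[p] = total[p] = x[p]; }

    for (int d = 0; d < DIM; ++d) {
        int recv[N];
        // every node exchanges its running subcube total with its
        // neighbor across dimension d ...
        for (int p = 0; p < N; ++p) recv[p] = total[p ^ (1 << d)];
        for (int p = 0; p < N; ++p) {
            // ... and folds it into the prefix only if the neighbor's id is lower
            if ((p >> d) & 1) prefix[p] += recv[p];
            total[p] += recv[p];
        }
    }
    for (int p = 0; p < N; ++p)
        printf("node %2d holds the sum of items 1..%2d = %d\n", p, p + 1, prefix[p]);
    return 0;
}
```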
4.2 Show explicitly the instructions that send and receive data between processors. Invent
instructions as you need them and describe what the instructions do. Include some type
of synchronization instruction that forces a processor to remain idle until a neighboring
processor sends a message or datum that enables computation to continue.
4.3 Which communication steps, if any, in your answer require communication with pro-
cessors that are not among the five processors directly connected to a given processor? How
do you propose to implement such communication in software (assume that the hardware
itself does not provide remote communication as a basic instruction)?
Problem 5. Consider a homogeneous multicore machine containing two single-threaded processor
cores and four shared memory modules, M1, M2, M3, M4. The cores and memories are connected
via a 2×4 crossbar switch. A program to be executed on the multicore system is represented
by the data dependency graph of the following figure, involving the instructions I1, I2, ..., I12. We
assume an instruction execution time of 1 cycle, a crossbar transit time of 2 cycles, and a memory
reference time of 2 cycles. We also assume that instruction execution and memory references are
consolidated, i.e. not pipelined. The memory access requirements for each instruction are given
in the table. We do consider memory conflicts and crossbar conflicts.
There is a host processor which holds initially the instruction program. The host is responsible
for loading this program on the multicore system. Based on the data dependencies, the host will
load instruction parts to each core for parallel execution. Because of this loading, the cores do not
go through the instruction fetch phase. Assume the decoding time is subsumed by the loading.
Then, for every instruction, each core needs to access memory through the crossbar, following the
access patterns in the table, and bring the data back to the core through the crossbar. At the end
of the computation, the results are in the registers of each core.
5.1 Make a suitable allocation mapping of instructions onto the processors of the multicore
system that requires the shortest possible computation time.
5.2 Design an algorithm that will systematically solve this problem in this environment, i.e.
the 2-processor, 2×4 crossbar and 4-memory homogeneous multicore system, assuming
an acyclic program dependency graph.
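A common starting point for such an algorithm is greedy list scheduling over a topological order of the dependency graph. The sketch below runs on a small illustrative DAG (not the 12-instruction graph of the figure), charges every instruction a flat 1 + 2 + 2 + 2 cycles (execute, crossbar, memory, crossbar back), and ignores conflicts; it is a sketch of the systematic procedure, not the required answer:

```cpp
#include <vector>
#include <algorithm>
#include <cstdio>

int main() {
    const int n = 4;                                   // illustrative 4-node DAG
    std::vector<std::vector<int>> succ = {{2, 3}, {3}, {}, {}};
    std::vector<int> indeg(n, 0), finish(n, 0);
    for (int u = 0; u < n; ++u) for (int v : succ[u]) ++indeg[v];

    const int COST = 1 + 2 + 2 + 2;                    // exec + xbar + mem + xbar
    int core_free[2] = {0, 0};

    std::vector<int> ready;
    for (int u = 0; u < n; ++u) if (indeg[u] == 0) ready.push_back(u);

    while (!ready.empty()) {
        int u = ready.front(); ready.erase(ready.begin());
        int c = (core_free[0] <= core_free[1]) ? 0 : 1; // earliest-free core
        int deps = 0;                                   // all predecessors must finish
        for (int v = 0; v < n; ++v)
            for (int w : succ[v]) if (w == u) deps = std::max(deps, finish[v]);
        int start = std::max(core_free[c], deps);
        finish[u] = start + COST;
        core_free[c] = finish[u];
        printf("I%d on core %d: cycles %d..%d\n", u + 1, c + 1, start, finish[u]);
        for (int v : succ[u]) if (--indeg[v] == 0) ready.push_back(v);
    }
    return 0;
}
```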
5.3 Suppose that the single-threaded 2-core system is replaced by a 2-threaded single processor
core, keeping the four memory modules and replacing the crossbar by a bus. However, the
instruction execution time is now 2 cycles, the bus transport time averages 1 cycle and the
memory reference time remains 2 cycles. Consider that the data flow program graph consists
of two program threads, i.e.
T1 : I1, I3, I4, I5, I6, I7, I8, I10 and T2 : I2, I9, I11, I12
Redo the above mapping problem 5.1 for the 2-threaded single-core system assuming an
additional 1-cycle context-switching penalty. Compare the results to the single-threaded
2-core system of problem 5.1.
Figure: Multicore architecture (host system with Core 1 and Core 2 connected through the crossbar to memory modules M1-M4); Data dependency graph over instructions I1-I12; Memory access patterns table (for each instruction I1-I12, an X marks the memory modules M1-M4 it references).
Problem 6. – A multiprocessor node must sometimes send a message to more than one other
processor, a task referred to as "broadcasting". Suppose that a node P0 in an n-dimensional
hypercube system has to broadcast a message to all 2ⁿ − 1 other processors. The broadcasting is subject
to the constraints that the message can be forwarded (retransmitted) by a node only to a neigh-
boring node, and each node can transmit only one message at a time. Assume that each message
transmission between adjacent nodes requires only one time unit. In a two-dimensional system,
for example, P0 could broadcast a message MESS as follows. At time t = 0, P0 sends MESS to P1.
At t = 1, P0 sends MESS to P2 and P1 sends MESS to P3, thus completing the broadcast in 2 time
units.
6.1 Construct a general broadcasting algorithm for the n-dimensional case that allows a mes-
sage to reach all nodes in n time units. Specify clearly the algorithm used by each node to
determine the neighboring nodes to which it should forward an incoming message.
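The standard dimension-order (binomial-tree) broadcast, which reproduces the two-dimensional example in the problem statement, can be simulated as follows (a sketch on an illustrative 3-cube):

```cpp
#include <cstdio>

int main() {
    const int n = 3;                           // illustrative 3-cube, 8 nodes
    bool has_msg[1 << n] = {};
    has_msg[0] = true;                         // P0 holds MESS at t = 0

    // At step t, every node whose id is below 2^t forwards the message
    // across dimension t; each node transmits at most once per step.
    for (int t = 0; t < n; ++t)
        for (int p = 0; p < (1 << t); ++p)
            if (has_msg[p]) {
                int q = p ^ (1 << t);          // neighbor across dimension t
                has_msg[q] = true;
                printf("t=%d: P%d sends MESS to P%d\n", t, p, q);
            }
    return 0;
}
```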
6.2 Apply your algorithm to the design of the 8×8 heat-flow program (Laplace equation)
to run on a 64-processor hypercube computer that simulates a mesh computer. Specify com-
pletely the function you use to map the problem's gridpoints onto the nodes of the hypercube.
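One common mapping, offered only as one possible answer, concatenates the 3-bit reflected Gray codes of the row and column indices, so that mesh neighbors differ in exactly one hypercube address bit:

```cpp
// Gridpoint (row, col) of the 8 x 8 mesh -> node id in the 6-cube.
unsigned gray(unsigned k) { return k ^ (k >> 1); }       // reflected Gray code

unsigned node_of(unsigned row, unsigned col) {           // row, col in 0..7
    return (gray(row) << 3) | gray(col);
}
```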
6.3 Simulate the 16×16 heat-flow problem, i.e. build a C/C++ program to simulate the oper-
ation of a net of 2×2 processor nodes, assuming a reasonable communication cost. Tabulate
your results. Comment on how the problem would scale to a 256×256 heat-flow problem.
Problem 7. – Consider the following arithmetic expression:
F = (A0 + A1 + ··· + A7) × (B0 + B1 + ··· + B7)
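For reference, a balanced pairwise (tree) reduction of eight values takes log2 8 = 3 levels; the sequential sketch below mirrors that tree, each level corresponding to one parallel time step when enough processors are available (applying it to the A's and to the B's and multiplying the two results is one reading of the tree for F):

```cpp
// Balanced-tree sum of v[0..7]; each 'stride' pass is one tree level.
double tree_sum(double v[8]) {
    for (int stride = 1; stride < 8; stride *= 2)
        for (int i = 0; i + stride < 8; i += 2 * stride)
            v[i] += v[i + stride];
    return v[0];
}
```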
7.1 Show a minimal height tree implementation of F assuming unlimited processors. Com-
pute the speedup S1 and the efficiency E1.
7.2 Find a minimal height tree implementation of F assuming 4 (four) processors available
each being able to execute the addition or multiplication operation in one time step. Compute
the speedup and efficiency, S2 and E2, respectively.
7.3 Suppose you want to implement F using a 4× 4 mesh connected network. a) Describe
a procedure that implements F in minimal time on the mesh network, also find the speedup
and efficiency of your parallel mesh implementation. b) If your boundary processors are
connected by end-around or wrap-around connections (toroidal connections), show how your
solution improves in terms of the speedup and efficiency.
7.4 Suppose you want to implement F using 8 (eight) processors connected into an 8-node
shuffle network using 4 switching elements and feedback paths (similar to the network shown in
class). Describe an algorithm that implements F in minimal time using this shuffle network.
Assume that the switches can shuffle a data set in one time step, during the same time the
processors perform an operation on another data set. Compute the speedup and efficiency,
S3 and E3, respectively.
7.5 Suppose that we want to implement F using a 4-dimensional hypercube network. As-
sume that the 16 variables Ai and Bi, i= 0,1,...,7 have already been loaded to the 16 nodes of
the cube. Describe an algorithm that will implement F in minimal time using this hypercube
network. Compute S4 and E4.
7.6 Discuss the pros and cons of the above 5 implementations.
Problem 8. – The purpose of this problem is to explore some of the properties of the perfect
shuffle scheme.
8.1 Consider a processor that has the perfect shuffle and pair-wise exchange connections
shown in the figure below. For an eight-processor system, show that the permutation that
cyclically shifts the input vector by three positions is realized by some setting of the ex-
change modules. Draw the network unrolled in time to show the setting that realizes this
permutation.
8.2 Repeat the previous question to show that a cyclic shift of two positions is realizable.
8.3 Prove that a shuffle-exchange network can realize any cyclic shift in log₂N iterations for an
N-processor system when N is a power of 2.
Perfect shuffle network. Tall boxes depict modules that can exchange their inputs.
Shaded modules are reached from Processor 3 via some sequence of shuffle-exchange steps.
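For experimenting with parts 8.1-8.3, the two primitive data movements can be simulated as below. The exchange settings (swap masks) are left at zero as placeholders to be chosen by hand or by exhaustive search; the code is only an aid for drawing the unrolled network, not the required proof:

```cpp
#include <vector>
#include <utility>
#include <cstdio>

// Perfect shuffle on N = 2^k values: the value at position i moves to the
// position given by a one-bit left rotation of i.
std::vector<int> shuffle(const std::vector<int>& v, int k) {
    int N = 1 << k;
    std::vector<int> out(N);
    for (int i = 0; i < N; ++i)
        out[((i << 1) | (i >> (k - 1))) & (N - 1)] = v[i];
    return out;
}

// Exchange stage: module m swaps positions 2m and 2m+1 iff bit m of swap_mask is set.
std::vector<int> exchange(std::vector<int> v, unsigned swap_mask) {
    for (int m = 0; m < (int)v.size() / 2; ++m)
        if ((swap_mask >> m) & 1) std::swap(v[2 * m], v[2 * m + 1]);
    return v;
}

int main() {
    std::vector<int> v = {0, 1, 2, 3, 4, 5, 6, 7};     // 8-processor input vector
    v = exchange(shuffle(v, 3), /*swap_mask=*/0);      // stage 1 (settings to be chosen)
    v = exchange(shuffle(v, 3), 0);                    // stage 2
    v = exchange(shuffle(v, 3), 0);                    // stage 3
    for (int x : v) printf("%d ", x);                  // with all masks 0: identity
    printf("\n");
    return 0;
}
```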
Problem 9. – Consider the use of a four-PE array processor (PE means processing element or
processor) to multiply two 3×3 matrices. The interconnection structure of the four PEs is shown
in the figure below. Wraparound connections appear in all rows and columns of the array. You need
to map the matrix elements initially onto the processors. All three multiplications
needed for each output element cij must be performed in the same PE in order to accumulate the
sum of products. Of course, you are allowed to shift the matrix elements around if needed.
Figure: A 2×2 array of processing elements (PE 1, PE 2, PE 3, PE 4) with wraparound connections, computing A × B = C.
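A plain serial 3×3 multiplication can serve as the reference that any PE mapping must reproduce; the sketch below is only a correctness check for parts 9.1-9.4 and says nothing about the mapping itself:

```cpp
// Serial reference: C = A x B for 3x3 matrices, used only to check the
// products each PE accumulates for every output element cij.
void matmul3(const double A[3][3], const double B[3][3], double C[3][3]) {
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j) {
            C[i][j] = 0.0;
            for (int k = 0; k < 3; ++k)
                C[i][j] += A[i][k] * B[k][j];
        }
}
```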
9.1 Show the initial mapping of the A and B matrix elements to the processors before the
first multiplication is carried out. (You may have to wrap around the matrix.)
9.2 What are the initial multiplications to be carried out in each processor (there may be
more than one multiplication in each processor) without any data shifting?
9.3 Parallel shifts are carried out in the horizontal and vertical directions. Show the mapping
of the A and B matrix elements before the second group of multiplications can be carried
out.
9.4 What are the multiplications to be carried out in each processor without any further
shifting? (Don't bother with summing with the previous terms. Summation operations in
dot-product operations are embedded in the multiply hardware automatically.)
9.5 Suppose the two matrices have already been allocated as in problem part 9.1. Assume
each processor can perform one multiplication per unit time or one shift in a single direction
per unit time for each number. (If you shift two numbers, it takes two units of time.) Deter-
mine the time units needed to complete parts 9.1, 9.2, 9.3, 9.4, respectively. Minimizing the
total time delay is the design goal.
Problem 10. Construct a shuffle network to compute a second-degree polynomial in one variable:
Q(x) = Ax² + Bx + C. Now consider a second-degree polynomial in two variables:
P(x,y) = A2x²y² + A1x²y + A0x² + B2xy² + B1xy + B0x + C2y² + C1y + C0    (1)
It is desired to evaluate this polynomial on the basis of the shuffle construction for polynomials
in one variable. To do this, you may apply the following transformation:
P(x,y) = Q2(x)y² + Q1(x)y + Q0(x)    (2)
You may assume that you have four processors available, so that each processor can execute
three additions and multiplications concurrently. Masking of these operations is possible, and
the masks should be shuffled through the computation network. Broadcasting of x and y powers is
possible from the last processor. The four processors are interconnected by a) shuffle connections;
b) linear shift connections (to perform inter-processor additions).
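A scalar reference for the transformation in (2) is sketched below; it only checks the algebra (each Qi(x) = Aix² + Bix + Ci, then Horner's rule in y) and is not the shuffle/exchange network asked for in 10.1. The coefficient values in main are arbitrary test data:

```cpp
#include <cstdio>

double q(double a, double b, double c, double x) {      // a*x^2 + b*x + c by Horner
    return (a * x + b) * x + c;
}

double P(double x, double y,
         double A2, double A1, double A0,
         double B2, double B1, double B0,
         double C2, double C1, double C0) {
    double Q2 = q(A2, B2, C2, x);                       // coefficient of y^2
    double Q1 = q(A1, B1, C1, x);                       // coefficient of y
    double Q0 = q(A0, B0, C0, x);                       // constant term in y
    return (Q2 * y + Q1) * y + Q0;                      // Horner in y
}

int main() {
    printf("P(2,3) = %g\n", P(2.0, 3.0, 1, 2, 3, 4, 5, 6, 7, 8, 9));  // expect 282
    return 0;
}
```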
10.1 Construct a shuffle/exchange network to evaluate the second degree polynomial P(x,y).
10.2 Summarize all the steps taken to solve above question.
Problem 11. – Consider the following polynomial in two variables.
P(x,y) = Ax³ + Bx²y + Cxy² + Dy³
It is desired to evaluate this polynomial on the basis of the shuffle construction for polynomials
in one variable. You may assume that you have up to eight processors available and each
processor can execute additions and multiplications. The variables x and y are initially shifted into
the processors together with the coefficients A, B, C, D. The processors are interconnected by a)
shuffle connections; b) linear shift connections. Appropriate values are to be shuffled through the
network as arguments of addition or multiplication operations performed by the processors.
11.1 Construct a shuffle/exchange network to evaluate the polynomial P(x,y).
11.2 Summarize all the steps taken to solve the above question.
11.3 Calculate the number of computation time units required to evaluate the polynomial
P(x,y). Assume that it takes one time unit for an addition or multiplication within a proces-
sor, and one time unit for communication or broadcasting between two processors.
11.4 What is the minimum number of processors, and hence the simplest shuffle network,
needed to solve this problem?