
COMP5426 Parallel and Distributed Computing Final Exam Semester 1 2021

Page 1 of 4
Question 1 (50 Marks)
(This section contains 15 short-answer questions.)
1) Give three methods for increasing the speed of a computer system. [3]

2) Classify MIMD computers based on memory organization and communication
methods. [3]

3) Describe at least FOUR characteristics of shared-memory multiprocessors that
distinguish them from distributed-memory multicomputers on the basis of hardware
architecture, data sharing and communication/synchronization. [4]

4) Give a definition of cloud computing and briefly describe its main characteristics. [4]
5) What are the major performance metrics for characterizing an interconnection
network, and what are the main purposes of using these metrics? [4]
6) What, in general terms, is the execution time of the standard all-to-all reduction
operation (without broadcast) on a two-dimensional torus with a total of p compute
nodes? [4]
7) What is a non-blocking network? Give two examples. [3]
8) Parallel paths between nodes A and B are distinct paths connecting A and B that
have no common nodes other than A and B. What is the maximum number of possible
parallel paths between any two nodes in a d-dimensional hypercube? [4]
9) Will it take longer to perform a one-to-all broadcast operation on a one-
dimensional array of size p than on a two-dimensional mesh of size √p × √p if the
per-hop time is ignored? You must give reasons to justify your answer. [4]
10) Prove that if the total overhead T_o = Θ(p) for a given problem size W, then the parallel
execution time T_P will continue to decrease as the number of processing elements p is increased and asymptotically
approach a constant value. [4]
11) What is a race condition? Explain race conditions with two examples. [3]
12) In the context of a GPU program using CUDA, what are (i) threads; (ii) blocks; and
(iii) grids? [3]

13) In CUDA programming, how many different kinds of memories are in a GPU? [2]

14) Can functions annotated with __global__ be executed on the host? [2]

15) What does coalesced / un-coalesced memory access mean? [3]





Question 2 (10 Marks)
Suppose that MPI_COMM_WORLD consists of three processes with rank 0, 1, and 2,
respectively. Suppose the following code is executed:

int x, y, z;
switch (my_rank) {
case 0:
    x = 1; y = 4; z = 2;
    MPI_Bcast(&x, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Send(&y, 1, MPI_INT, 2, 2, MPI_COMM_WORLD);
    MPI_Recv(&z, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, &status);
    MPI_Allreduce(&y, &x, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);
    break;
case 1:
    x = 3; y = 8; z = 6;
    MPI_Bcast(&y, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Send(&z, 1, MPI_INT, 2, 1, MPI_COMM_WORLD);
    MPI_Send(&x, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
    MPI_Allreduce(&z, &y, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);
    break;
case 2:
    x = 6; y = 7; z = 8;
    MPI_Bcast(&x, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Recv(&y, 1, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
    MPI_Recv(&z, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, &status);
    MPI_Allreduce(&x, &y, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);
    break;
}
What are the values of x, y, and z on each process after the code has been executed?
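To trace the message flow, the program's effect can be simulated in plain Python (a sketch, not real MPI: each rank's variables are held in a dict, and the collectives are modelled directly from their semantics; the step ordering follows the matching of sends and receives by source and tag in the listing).

```python
# Plain-Python trace of the three-process MPI program above (not real MPI).
def run():
    # initial values on ranks 0, 1 and 2
    v = {0: dict(x=1, y=4, z=2),
         1: dict(x=3, y=8, z=6),
         2: dict(x=6, y=7, z=8)}

    # MPI_Bcast with root 0: rank 0 supplies x; rank 1 receives into y,
    # rank 2 receives into x
    root_val = v[0]['x']
    v[1]['y'] = root_val
    v[2]['x'] = root_val

    # point-to-point messages, matched by (source, tag)
    v[2]['y'] = v[0]['y']   # rank 0 -> rank 2, tag 2
    v[2]['z'] = v[1]['z']   # rank 1 -> rank 2, tag 1
    v[0]['z'] = v[1]['x']   # rank 1 -> rank 0, tag 1

    # MPI_Allreduce(MAX): send buffers are y (rank 0), z (rank 1), x (rank 2);
    # the maximum is written to x on rank 0 and to y on ranks 1 and 2
    m = max(v[0]['y'], v[1]['z'], v[2]['x'])
    v[0]['x'] = m
    v[1]['y'] = m
    v[2]['y'] = m
    return v

if __name__ == '__main__':
    print(run())
```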


QUESTION 3 (10 Marks)
A counting semaphore is a synchronization mechanism widely used in practice. It has an
integer variable used for signaling among threads. Only three atomic operations may be
performed on a semaphore:
i. sem_Init: The integer variable is initially assigned a zero or positive value.
ii. sem_Wait: The integer value is decremented by 1; if the value is smaller than zero,
the calling thread will be blocked.
iii. sem_Signal: The integer value is incremented by 1; if the value is smaller than or
equal to zero, unblock one of the waiting threads.
You are asked to implement a simple counting semaphore using pthread mutex and
condition variable.
A counting semaphore type may be defined as
typedef struct {
    int count;             /* the integer variable */
    pthread_mutex_t s_m;   /* mutex */
    pthread_cond_t s_cv;   /* condition variable */
} mylib_sem_t;
You must implement the following THREE functions:
mylib_sem_init(mylib_sem_t *s, int value)/*sem_Init as discussed above*/
mylib_sem_wait(mylib_sem_t *s) /*sem_Wait operation as discussed above*/
mylib_sem_signal(mylib_sem_t *s) /* sem_Signal as discussed above */
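One possible solution is sketched below (assumptions: void return types, and the count is decremented only when a resource is available rather than going negative — this gives the same observable blocking behaviour as the specification while staying robust against spurious wakeups of `pthread_cond_wait`).

```c
/* Sketch of a counting semaphore built on a pthread mutex and
 * condition variable, following the mylib_sem_t layout above. */
#include <pthread.h>

typedef struct {
    int count;              /* the integer variable */
    pthread_mutex_t s_m;    /* mutex protecting count */
    pthread_cond_t  s_cv;   /* waiting threads block here */
} mylib_sem_t;

void mylib_sem_init(mylib_sem_t *s, int value)
{
    s->count = value;       /* zero or positive initial value */
    pthread_mutex_init(&s->s_m, NULL);
    pthread_cond_init(&s->s_cv, NULL);
}

void mylib_sem_wait(mylib_sem_t *s)
{
    pthread_mutex_lock(&s->s_m);
    while (s->count <= 0)               /* nothing available: block;  */
        pthread_cond_wait(&s->s_cv,     /* loop guards against        */
                          &s->s_m);     /* spurious wakeups           */
    s->count--;
    pthread_mutex_unlock(&s->s_m);
}

void mylib_sem_signal(mylib_sem_t *s)
{
    pthread_mutex_lock(&s->s_m);
    s->count++;
    pthread_cond_signal(&s->s_cv);      /* unblock one waiting thread */
    pthread_mutex_unlock(&s->s_m);
}
```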
Question 4 (10 Marks)
Consider a CUDA kernel routine reduceOp() below. This routine performs a parallel
reduction operation, i.e., parallel summation of an array of integers with n elements. The
routine does not make use of shared memory in each SM (or Stream Multiprocessor) and
the reduction is done in-place, which means that the values in global memory are replaced
by partial sums at each step.

__global__ void reduceOp(int *g_idata, int *g_odata, unsigned int n)
{
    // set thread ID
    unsigned int tid = threadIdx.x;
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // convert global data pointer to the local pointer of this block
    int *idata = g_idata + blockIdx.x * blockDim.x;

    // boundary check
    if (idx >= n) return;

    // in-place reduction in global memory
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        int index = 2 * stride * tid;
        if (index < blockDim.x) {
            idata[index] += idata[index + stride];
        }

        // synchronize within threadblock
        __syncthreads();
    }

    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = idata[0];
}
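As an aid to analyzing the loop, a small Python sketch (the block size 16 is an illustrative assumption) enumerates which thread IDs satisfy the condition `index < blockDim.x` at each stride, which is useful when reasoning about warp divergence and the memory locations touched at each step.

```python
# Sketch: which thread IDs do work at each stride of the kernel above.
# The condition mirrors `index = 2 * stride * tid; index < blockDim.x`.
def active_threads(block_dim):
    steps = []
    stride = 1
    while stride < block_dim:
        active = [tid for tid in range(block_dim)
                  if 2 * stride * tid < block_dim]
        steps.append((stride, active))
        stride *= 2
    return steps

for stride, tids in active_threads(16):
    print(stride, tids)
```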

You need to answer the following questions after analyzing the routine.
1) Describe how the parallel algorithm adopted by this routine works.
2) Discuss if there are any shortcomings in this algorithm (in addition to the use of
shared memory).
3) Based on your discussion, modify the routine to make it work more efficiently,
still performing the reduction in-place in global memory without making use of
shared memory in each SM.
4) Justify your modifications, i.e., discuss why your modifications can enhance the
performance of the original routine.








Question 5 (20 Marks)
The relational operator “.>” performs element-wise comparisons between two arrays of equal
length; the result is a logical array of the same length indicating the locations where
the relation “>” (i.e., greater than) is true. Given two integer arrays A and B, let
c(A, B) denote the total number of locations where the relation A .> B is true.
(Note that in general c(A, B) ≠ c(B, A).)
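A minimal sketch of the operator and the count of true locations (the names `elementwise_gt` and `count_gt` are illustrative, not part of the question):

```python
# Sketch of the ".>" operator and the count of locations where it is true.
def elementwise_gt(a, b):
    """Element-wise a .> b as a logical (0/1) array of the same length."""
    assert len(a) == len(b)
    return [int(x > y) for x, y in zip(a, b)]

def count_gt(a, b):
    """Total number of locations where a .> b is true."""
    return sum(elementwise_gt(a, b))
```

Note that the count is not symmetric: swapping the arguments generally gives a different value, since ties contribute to neither direction.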
1) Given a set of n arrays, each being of length m, you are asked to design a parallel
algorithm to find the array pairs (A, B) which have the largest value of c(A, B) in a
distributed-memory system of p processing elements organized as a one-dimensional
ring, for p < n. (Note there may be multiple such array pairs.)
In your design you must seriously consider how to balance the workloads across the
processing elements and minimize the communication cost.
You need to discuss how both data and computation are partitioned, how many
computation-and-communication steps are needed, and in each step what computation
is performed and how data are sent/received by each processing element. You may
draw figures and/or write pseudocode to simplify the description of your parallel
algorithm.
2) Calculate the speedup, efficiency and isoefficiency function for your designed parallel
algorithm. Show your work.






END OF EXAMINATION

