
Max Points: 100

Part-A: CUDA Matrix Operations

The purpose of this exercise is for you to learn how to write programs using the CUDA programming interface, how to run such programs on an NVIDIA graphics processor, and how to think about the factors that govern the performance of programs running within the CUDA environment. For this assignment, you will modify two provided CUDA kernels. The first is part of a program that performs vector addition; the second is part of a program that performs matrix multiplication. The provided vector addition program does not coalesce memory accesses; you will modify it so that its accesses are coalesced. You will modify the matrix multiplication program to investigate its performance and how it can be optimized by changing how the task is parallelized. You will create an improved version of the matrix multiplication program, empirically measure its running time, and analyze the results of your timing tests.

You will turn in all code you have written and used for this assignment, a Makefile that will compile your code, and a standard lab report documenting your experiments and results. We need to be able to compile your code by executing "make".

	${NVCC} $< -c -o $@ $(OPTIONS)

vecadd00 : vecadd.cu vecaddKernel.h vecaddKernel00.o timer.o
	${NVCC} $< vecaddKernel00.o -o $@ $(LIB) timer.o $(OPTIONS)

You can use these commands in the batch file:

cd path/to/PartA
make clean
make
./vecadd00 arguments
./matmul00 arguments

Problem-1: Vector add and coalescing memory access

You are provided with vecadd, a microbenchmark for measuring the effectiveness of memory coalescing. The provided kernel is a non-coalescing version; your job is to create a coalescing version and measure the difference in performance between the two.
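The sketch below illustrates the two access patterns the benchmark compares. It is not the provided kernel: the function names, the signature, and the meaning of N (the number of values each thread adds) are assumptions for illustration, and the real interface is whatever vecaddKernel.h declares.

__global__ void AddVectorsChunked(const float *A, const float *B, float *C, int N)
{
    // Non-coalesced pattern (the spirit of vecaddKernel00): each thread owns a
    // contiguous chunk of N elements, so on every loop iteration neighbouring
    // threads read addresses that are N floats apart.
    int blockStart  = blockIdx.x * blockDim.x * N;
    int threadStart = blockStart + threadIdx.x * N;
    for (int i = threadStart; i < threadStart + N; ++i)
        C[i] = A[i] + B[i];
}

__global__ void AddVectorsCoalesced(const float *A, const float *B, float *C, int N)
{
    // Coalesced pattern (what vecaddKernel01 should do): threads step through the
    // array with a stride equal to the total thread count, so on every iteration
    // consecutive threads in a warp touch consecutive addresses.
    int stride = gridDim.x * blockDim.x;
    int start  = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = 0; i < N; ++i) {
        int idx = start + i * stride;
        C[idx] = A[idx] + B[idx];
    }
}

Both kernels add the same gridDim.x * blockDim.x * N elements; only the order in which each thread visits them differs, which is exactly what the timing comparison in Q1 and Q2 measures.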

Q1 5 points

Here is a set of files provided in PartA.zip for this problem:

• Makefile

• vecadd.cu

• vecaddKernel.h

• vecaddKernel00.cu

• timer.h

• timer.cu

Compile and run the provided program as vecadd00 (using the kernel vecaddKernel00) and collect data on the time the program takes with the following number of values per thread: 500, 1000, 2000.
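For example, assuming vecadd00 takes the number of values per thread as its single command-line argument (an assumption; check vecadd.cu for the exact usage), the three timing runs would be:

./vecadd00 500
./vecadd00 1000
./vecadd00 2000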

Include a short comment in your Makefile describing what your various programs do, and fully describe and analyze the experiments you performed in your lab report.

Q2 10 points

In vecadd01 you will use your new kernel, vecAddKernel01.cu, to perform vector addition using coalesced memory reads. Change the Makefile to compile a program called vecadd01 using your kernel rather than the provided kernel. Modify the Makefile as appropriate so the clean and tar targets will deal with any files you have added.
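A minimal sketch of the extra Makefile rules, assuming the new kernel file is saved as vecaddKernel01.cu (adjust the names to whatever file you actually create) and reusing the variables from the provided Makefile excerpt above; recipe lines must start with a tab:

vecaddKernel01.o : vecaddKernel01.cu
	${NVCC} $< -c -o $@ $(OPTIONS)

vecadd01 : vecadd.cu vecaddKernel.h vecaddKernel01.o timer.o
	${NVCC} $< vecaddKernel01.o -o $@ $(LIB) timer.o $(OPTIONS)

Remember to add vecadd01 and vecaddKernel01.o to the clean (and, if present, tar) targets as well.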

Test your new program and compare the time it takes to perform the same work as the original. Note and analyze your results, and document your observations in the report.

Problem-2: Shared CUDA Matrix Multiply

For the next part of this assignment, you will use matmultKernel00 as the basis for another kernel with which you will investigate the performance of matrix multiplication using a GPU and the CUDA programming environment. Your new kernel should be called matmultKernel01.cu. Your code files must include internal documentation on the nature of each program, and your README file must include notes on the nature of the experiment. You may have intermediate stages of this new kernel, e.g. without and with unrolling, but we will grade you on your final improved version of the code and its documentation.


Here is a set of provided files in PartA.zip for this problem:

• Makefile

• matmult.cu

• matmultKernel.h

• matmultKernel00.cu

• timer.h

• timer.cu

In both questions of this problem (Q3 and Q4 below), you should investigate how each of the following factors influences the performance of matrix multiplication in CUDA:

1. The size of matrices to be multiplied

2. The size of the block computed by each thread block

3. Any other performance enhancing modifications you find

Q3 8 points

First you should time the initial code (with the provided kernel matmultKernel00) using square matrices of each of the following sizes: 256, 512, 1024.

When run with a single parameter, the provided code multiplies that parameter by FOOTPRINT_SIZE (set to 16 in matmult00) and creates square matrices of the resulting size. This was done to avoid nasty padding issues: you always have data blocks perfectly fitting the grid.
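Under that convention, the Q3 sizes correspond to the following invocations (assuming the single parameter is the number of FOOTPRINT_SIZE-wide blocks per dimension, as described above):

./matmul00 16    (16 x 16 = 256, i.e. 256 x 256 matrices)
./matmul00 32    (512 x 512 matrices)
./matmul00 64    (1024 x 1024 matrices)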

Q4 12 points

In your new kernel each thread should compute four values of the resulting C block rather than one, so FOOTPRINT_SIZE becomes 32. (Notice that this is taken care of by the Makefile.) You will time the execution of this new program with matrices of the sizes listed above to document how your changes affect the performance of the program. Provide the speedup that your new kernel achieves over the provided kernel.

To get good performance, you will need to be sure that the new program coalesces its reads and writes from global memory. You will also need to unroll any loops that you might be inclined to insert into your code.
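The sketch below shows one way such a kernel can be organized. It is a hedged illustration, not the provided matmultKernel00 interface: it assumes square n x n row-major matrices passed as plain float pointers, with n a multiple of FOOTPRINT_SIZE, a 16 x 16 thread block, and a 32 x 32 tile of C per block, so each thread produces four values of C. Your real kernel must match the declarations in matmultKernel.h.

#define FOOTPRINT_SIZE 32   /* C tile computed by each thread block (assumed) */
#define BLOCK_SIZE     16   /* thread block is BLOCK_SIZE x BLOCK_SIZE (assumed) */

__global__ void MatMulKernel01(const float *A, const float *B, float *C, int n)
{
    // Top-left corner of the 32x32 tile of C this block is responsible for.
    int tileRow = blockIdx.y * FOOTPRINT_SIZE;
    int tileCol = blockIdx.x * FOOTPRINT_SIZE;
    int ty = threadIdx.y, tx = threadIdx.x;

    // Each thread accumulates four values of C, spaced BLOCK_SIZE apart so that
    // threads with consecutive tx always touch consecutive global addresses.
    float c00 = 0.0f, c01 = 0.0f, c10 = 0.0f, c11 = 0.0f;

    __shared__ float As[FOOTPRINT_SIZE][FOOTPRINT_SIZE];
    __shared__ float Bs[FOOTPRINT_SIZE][FOOTPRINT_SIZE];

    // Walk across the shared dimension one 32-wide tile at a time.
    for (int m = 0; m < n; m += FOOTPRINT_SIZE) {
        // Each thread loads four elements of A and four of B (coalesced loads).
        As[ty][tx]                           = A[(tileRow + ty) * n + m + tx];
        As[ty][tx + BLOCK_SIZE]              = A[(tileRow + ty) * n + m + tx + BLOCK_SIZE];
        As[ty + BLOCK_SIZE][tx]              = A[(tileRow + ty + BLOCK_SIZE) * n + m + tx];
        As[ty + BLOCK_SIZE][tx + BLOCK_SIZE] = A[(tileRow + ty + BLOCK_SIZE) * n + m + tx + BLOCK_SIZE];

        Bs[ty][tx]                           = B[(m + ty) * n + tileCol + tx];
        Bs[ty][tx + BLOCK_SIZE]              = B[(m + ty) * n + tileCol + tx + BLOCK_SIZE];
        Bs[ty + BLOCK_SIZE][tx]              = B[(m + ty + BLOCK_SIZE) * n + tileCol + tx];
        Bs[ty + BLOCK_SIZE][tx + BLOCK_SIZE] = B[(m + ty + BLOCK_SIZE) * n + tileCol + tx + BLOCK_SIZE];

        __syncthreads();

        // Inner products over the 32-wide tile; the pragma asks the compiler to
        // unroll the loop, as the assignment suggests.
        #pragma unroll
        for (int k = 0; k < FOOTPRINT_SIZE; ++k) {
            c00 += As[ty][k]              * Bs[k][tx];
            c01 += As[ty][k]              * Bs[k][tx + BLOCK_SIZE];
            c10 += As[ty + BLOCK_SIZE][k] * Bs[k][tx];
            c11 += As[ty + BLOCK_SIZE][k] * Bs[k][tx + BLOCK_SIZE];
        }

        __syncthreads();
    }

    // Coalesced writes of the four results.
    C[(tileRow + ty) * n + tileCol + tx]                           = c00;
    C[(tileRow + ty) * n + tileCol + tx + BLOCK_SIZE]              = c01;
    C[(tileRow + ty + BLOCK_SIZE) * n + tileCol + tx]              = c10;
    C[(tileRow + ty + BLOCK_SIZE) * n + tileCol + tx + BLOCK_SIZE] = c11;
}

With this layout a 16 x 16 block of 256 threads covers one 32 x 32 tile of C, so the kernel would be launched with a dim3(n/FOOTPRINT_SIZE, n/FOOTPRINT_SIZE) grid and a dim3(BLOCK_SIZE, BLOCK_SIZE) block; every global load and store is coalesced because consecutive tx values map to consecutive addresses.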

Q5 5 points

Can you formulate any rules of thumb for obtaining good performance when using CUDA? Answer this question in your conclusions.

 

 
