Separable 2D Convolution
The problem, worth 20 points, is due June 10, 2020, via BBLearn. You may work in a team of up
to two people. One submission per group will suffice. Please submit original work.
Convolution filtering is used in a wide range of image processing tasks, including smoothing, shar-
pening, and edge detection. This assignment asks you to implement a 2D separable convolution
filter on the GPU.
A convolution filter is a scalar product of the filter weights with the input pixels within a window
surrounding each of the corresponding output pixels. More specifically, given a vector s and a
convolution kernel k of size n, the convolution r(i) of the ith element of the vector is given by
r(i) =

n
s(i− n)k(n).
The elements at the boundaries, that is, elements that are “outside” the vector s are treated as if they
had the value zero. Convolution can be easily extended into two dimensions by adding indices for
the second dimension. For example, given a 2D matrix s and a convolution filter k of dimension
n×m, where n is the width and m is the height of the filter, the convolution of element (i, j) in s
is given by ∑
n

m
s(i− n, j −m)k(n,m).
The above operation should be quite familiar to you, having discussed 1D convolution in lecture.
A 2D convolution filter requires n ×m multiplications for each output element. Separable filters
are a special type of filter that can be expressed as the composition of two 1D filters, one on the
rows of the matrix and one on the columns of the matrix. So, a separable filter can be divided into
two consecutive 1D convolution operations on the data, row-wise and column-wise, requiring only
n+m multiplications for each output element.
The program provided to you accepts the number of rows and columns of the input matrix as
command-line parameters. It creates a random Gaussian kernel of specified width as well as an
input matrix consisting of random floating-point values. A CPU implementation of separable con-
volution generates a reference solution which is compared with your GPU program’s output.
1
Edit the compute on device() function within the file separable convolution.cu to
complete the functionality of 2D separable convolution on the GPU. To achieve this functionality,
you may add multiple kernels to the separable convolution kernel.cu file.
This assignment will be graded on the following parameters:
• (10 points) You may just use GPU global memory to get the program working correctly
without any performance-related optimizations.
• (10 points) Improve performance of your code using the GPU’s memory hierarchy — for
example, shared memory, constant memory — efficiently to maximize the speedup over the
serial version.
Submit the source files needed to build and run your code as a single zip file via BBLearn. Provide
a short report describing how you designed your GPU code, using code or pseudocode if that helps
the discussion, and the amount of speedup obtained over the serial version for input-matrix sizes
of 2048 × 2048, 4096 × 4096, and 8192 × 8192 elements. Also, characterize the sensitivity of
the GPU kernels to thread-block size in terms of the execution time. Ignore the overhead due to
CPU/GPU communication when reporting the performance.
2  Email:51zuoyejun

@gmail.com