辅导案例-EEE6207-Assignment 20

欢迎使用51辅导，51作业君孵化低价透明的学长辅导平台，服务保持优质，平均费用压低50%以上！ 51fudao.top

EEE6207 Assignment 2019/20 – Semester 1
This assignment concerns using multiple threads to speed-up the multiplication of two n × n
matrices.
1 Background: Algorithmic performance
It seems intuitively reasonable that the runtime of any algorithm should scale as a function of the
‘size’ n of the problem. Consider the case where the runtime T (n) of an algorithm is given by:
T (n) = αn3 + βn2 + γn+ δ
which can often be derived simply by ‘counting’ the numbers of operations in a block of code.
Usually, the details of the constants depend on the processor, etc. and such a detailed analysis
is rarely justified. For this reason, computer scientists usually only consider the worst-case analysis
of when n tends asymptotically to infinity. To take the example above, as n → ∞, the term in n3
eventually dominates all other terms. Similarly, the value of α tends to be processor-specific and is
ignored. This leads to the so-called ‘big oh’ analysis – the example algorithm discussed above would
thus be said to beO(n3), or an “n-cubed algorithm”. (Notice that this is a rough-and-ready asymptotic
analysis that usually proves to be adequate in practice; for small values of n, it is entirely feasible that
an O(n3) algorithm runs faster than an O(n2) algorithm if the values of the respective constants are
favourable.)
2 Matrix multiplication
If we consider the elementary case of the product of two 2× 2 matrices,A×B = C:[
a11 a12
a21 a22
]
×
[
b11 b12
b21 b22
]
=
[
c11 c12
c21 c22
]
we can observe that element c11 is given by the (inner or dot) product of the row vector
[
a11 a12
]
and
the column vector
[
b11 b21
]T . Similarly, the other elements in the product matrix C are formed by
vector products between rows and columns inA andB. (If you are not convinced of this, try writing
out the individual terms!)
Considering the algorithmic complexity of multiplying two n × n matrices, forming a single ele-
ment in the product matrix (e.g. c11) requires n multiply-accumulate operations. To generate every
element ofC requires n2 such vector product operations meaning that the algorithmic complexity of
the whole matrix multiplication is O(n3). If, however, it were possible to perform each of the vector
products in parallel on, say, n2 processors then the matrix multiplication will beO(n) – a significantly
smaller rate of growth than an O(n3) algorithm and an attractive speed-up.
3 The assignment
Nowadays, it is near-universal for the processors in even the cheapest computers to contain multiple
cores – each core is effectively a stand-alone processor that shares main memory with all the other
cores. Mapping a computational thread onto each core can thus produce significant speed ups for a
whole range of algorithms. This assignment will consider amulti-threaded implementation of square
matrix multiplication in which each vector product is carried out by a single thread. (In practice, most
operating systems internally map one thread to one core – the details of how this is done and any
c© P.I.Rockett, 2018 1
complications that result are outside the scope of this assignment. You can, of course, create more
threads than there are cores, and leave it to the OS to schedule them.)
You should write a program to multiply two (arbitrary) n×nmatrices using multiple threads and
measure its runtime performance1. Starting with a 2×2matrix (n = 2) and 4 (= n2) threads, measure
how the runtimes scale when you increase the value of n and hence the number of threads. (In order
to obtain reliable timings, you may need to repeat the same matrix multiplication multiple times and
then average.)
• In principle, it should be possible to achieve O(n) performance by using n2 threads although
it is highly unlikely will achieve this – what do you observe and why? Can you explain the
deviation from O(n) performance you observe?
• How does the performance vary with n and the numbers of statically allocated threads used for
calculating the vector products?
• Is it possible to achieve approximately the sameperformance by reusing a smaller pool of threads
(< n2) where each thread carries out a sequence of two or more of the vector multiplications?
Can you explain your results?
• How does this all relate to the number of physical cores in your machine? And the cache sizes?
In this assignment, you should consider only the so-called static thread allocation model where
you create a sufficiently large pool of threads before carrying out the matrix multiplications. This is an
entirely appropriate approach if, for example, you are going tomultiply a large number of identically-
sized matrices, one after the other. Time only the durations of the matrix multiplications. Do NOT
include the time it takes to create/destroy threads. Your solution should consider only the times
it takes to perform the matrix multiplications, not the operating-system dependent overheads of
thread creation and destruction.
Depending on how you organise your program, you may experience race hazards, that is, situ-
ations where one thread occasionally completes its work before another thread, resulting in the an-
swer sometimes being calculated from incomplete results. You need to think carefully about whether
you need to employ appropriate synchronisation mechanisms. (Unfortunately, race hazards are a
major problem in concurrent programming and have been the root cause of a number of software
disasters!)
The data structures you use to store the matrices are for you to decide, but I suggest anything
more complicated than 2D arrays, such as specialist matrix classes, may cause you problems. FWIW,
multiplying two matrices stored in 2D arrays is a first-year C programming exercise. Similarly, for
languages that give you the option, passing function parameters by reference rather than by value
would be sensible.
Sufficient background information for this assignment is given in the printed notes available on
the module’s MOLE pages.
4 What is not included
• Although sophisticated concurrency platforms are available – probably the most well-known be-
ingOpenMP– you shouldnot use such software in this assignment. Hand code all the allocation
of vector products to threads.
1You should not make the element values zero in case your compiler ‘optimises away’ what it may think are pointless
0× 0 operations; random values will be fine. You should, of course, check that your program produces the right answer!
c© P.I.Rockett, 2018 2
• Youmay come across divide-and-conquer approaches tomatrixmultiplication, such as Strassen’s
method; this ingenious algorithm reduces the complexity of matrix multiplication from O(n3)
to O(n2.81) on a single processor. For the purposes of this assignment, however, you should
ignore such clever approaches, and consider only mapping vector products to threads.
• You should not do anything (explicit) to alter theway the operating system schedules the threads.
Just rely on the default scheduling mechanisms.
5 Assessment
• The choice of development tools and indeed operating system is yours! Many languages and
environments (C, C++, Java, etc.) support multi-threading. Python is more problematic: com-
mon C-based implementations of Python use a global interpreter lock (GIL) – see https://en.
wikipedia.org/wiki/Global_interpreter_lock – which means that such versions of Python
are unable to utilise multiple cores and so are unsuitable for this assignment. I suggest you use
Python only if you are a Python expert and know how to negotiate these issues.
• The report, which should be up to 3 pages in length, should discuss the performance of your
algorithm; it should include any relevant graphs.
• You should outline how your program works (although detailed documentation of every line
of source code is not required).
• In addition, you should provide sufficient evidence that your program actually works as spe-
cified. You may be required to physically demonstrate your program working.
• All source code should be presented as an appendix to your report; the appendices will not
count towards the 3-page limit.
• The report will contribute up to 12.5% of the module mark.
• Marks will be awarded for identifying the key factors that determine the speed-up from using
multiple threads, and the appropriateness of the methods/approach employed.
The report should be submitted by Friday, 17 January 2020 to the module’s MOLE page.
c© P.I.Rockett, 2018 3
Marking Scheme
1. Introduction&demonstration of the base case of performingmultiplicationusing a single thread [10%].
2. Implementation&demonstration of distributing thematrixmultiplication overN2 threads, con-
sidering a large enough range of N to draw sensible conclusions. [15%]
3. Analysis & discussion of the results obtained in (2) above. [15%]
4. Implementation & demonstration of distributing the matrix multiplication over fewer than N2
threads, considering a large enough range of N to draw sensible conclusions. [15%]
5. Analysis & discussion of the results obtained in (4) above. [15%]
6. Overall analysis/discussion of results taking into consideration the characteristics of the pro-
cessor. [15%]
7. Drawing appropriate conclusions. [15%]
c© P.I.Rockett, 2018 4