辅导案例-EEE6207-Assignment 20
EEE6207 Assignment 2019/20 – Semester 1 This assignment concerns using multiple threads to speed-up the multiplication of two n × n matrices. 1 Background: Algorithmic performance It seems intuitively reasonable that the runtime of any algorithm should scale as a function of the ‘size’ n of the problem. Consider the case where the runtime T (n) of an algorithm is given by: T (n) = αn3 + βn2 + γn+ δ which can often be derived simply by ‘counting’ the numbers of operations in a block of code. Usually, the details of the constants depend on the processor, etc. and such a detailed analysis is rarely justified. For this reason, computer scientists usually only consider the worst-case analysis of when n tends asymptotically to infinity. To take the example above, as n → ∞, the term in n3 eventually dominates all other terms. Similarly, the value of α tends to be processor-specific and is ignored. This leads to the so-called ‘big oh’ analysis – the example algorithm discussed above would thus be said to beO(n3), or an “n-cubed algorithm”. (Notice that this is a rough-and-ready asymptotic analysis that usually proves to be adequate in practice; for small values of n, it is entirely feasible that an O(n3) algorithm runs faster than an O(n2) algorithm if the values of the respective constants are favourable.) 2 Matrix multiplication If we consider the elementary case of the product of two 2× 2 matrices,A×B = C:[ a11 a12 a21 a22 ] × [ b11 b12 b21 b22 ] = [ c11 c12 c21 c22 ] we can observe that element c11 is given by the (inner or dot) product of the row vector [ a11 a12 ] and the column vector [ b11 b21 ]T . Similarly, the other elements in the product matrix C are formed by vector products between rows and columns inA andB. (If you are not convinced of this, try writing out the individual terms!) Considering the algorithmic complexity of multiplying two n × n matrices, forming a single ele- ment in the product matrix (e.g. c11) requires n multiply-accumulate operations. To generate every element ofC requires n2 such vector product operations meaning that the algorithmic complexity of the whole matrix multiplication is O(n3). If, however, it were possible to perform each of the vector products in parallel on, say, n2 processors then the matrix multiplication will beO(n) – a significantly smaller rate of growth than an O(n3) algorithm and an attractive speed-up. 3 The assignment Nowadays, it is near-universal for the processors in even the cheapest computers to contain multiple cores – each core is effectively a stand-alone processor that shares main memory with all the other cores. Mapping a computational thread onto each core can thus produce significant speed ups for a whole range of algorithms. This assignment will consider amulti-threaded implementation of square matrix multiplication in which each vector product is carried out by a single thread. (In practice, most operating systems internally map one thread to one core – the details of how this is done and any c© P.I.Rockett, 2018 1 complications that result are outside the scope of this assignment. You can, of course, create more threads than there are cores, and leave it to the OS to schedule them.) You should write a program to multiply two (arbitrary) n×nmatrices using multiple threads and measure its runtime performance1. Starting with a 2×2matrix (n = 2) and 4 (= n2) threads, measure how the runtimes scale when you increase the value of n and hence the number of threads. (In order to obtain reliable timings, you may need to repeat the same matrix multiplication multiple times and then average.) • In principle, it should be possible to achieve O(n) performance by using n2 threads although it is highly unlikely will achieve this – what do you observe and why? Can you explain the deviation from O(n) performance you observe? • How does the performance vary with n and the numbers of statically allocated threads used for calculating the vector products? • Is it possible to achieve approximately the sameperformance by reusing a smaller pool of threads (< n2) where each thread carries out a sequence of two or more of the vector multiplications? Can you explain your results? • How does this all relate to the number of physical cores in your machine? And the cache sizes? In this assignment, you should consider only the so-called static thread allocation model where you create a sufficiently large pool of threads before carrying out the matrix multiplications. This is an entirely appropriate approach if, for example, you are going tomultiply a large number of identically- sized matrices, one after the other. Time only the durations of the matrix multiplications. Do NOT include the time it takes to create/destroy threads. Your solution should consider only the times it takes to perform the matrix multiplications, not the operating-system dependent overheads of thread creation and destruction. Depending on how you organise your program, you may experience race hazards, that is, situ- ations where one thread occasionally completes its work before another thread, resulting in the an- swer sometimes being calculated from incomplete results. You need to think carefully about whether you need to employ appropriate synchronisation mechanisms. (Unfortunately, race hazards are a major problem in concurrent programming and have been the root cause of a number of software disasters!) The data structures you use to store the matrices are for you to decide, but I suggest anything more complicated than 2D arrays, such as specialist matrix classes, may cause you problems. FWIW, multiplying two matrices stored in 2D arrays is a first-year C programming exercise. Similarly, for languages that give you the option, passing function parameters by reference rather than by value would be sensible. Sufficient background information for this assignment is given in the printed notes available on the module’s MOLE pages. 4 What is not included • Although sophisticated concurrency platforms are available – probably the most well-known be- ingOpenMP– you shouldnot use such software in this assignment. Hand code all the allocation of vector products to threads. 1You should not make the element values zero in case your compiler ‘optimises away’ what it may think are pointless 0× 0 operations; random values will be fine. You should, of course, check that your program produces the right answer! c© P.I.Rockett, 2018 2 • Youmay come across divide-and-conquer approaches tomatrixmultiplication, such as Strassen’s method; this ingenious algorithm reduces the complexity of matrix multiplication from O(n3) to O(n2.81) on a single processor. For the purposes of this assignment, however, you should ignore such clever approaches, and consider only mapping vector products to threads. • You should not do anything (explicit) to alter theway the operating system schedules the threads. Just rely on the default scheduling mechanisms. 5 Assessment • The choice of development tools and indeed operating system is yours! Many languages and environments (C, C++, Java, etc.) support multi-threading. Python is more problematic: com- mon C-based implementations of Python use a global interpreter lock (GIL) – see https://en. wikipedia.org/wiki/Global_interpreter_lock – which means that such versions of Python are unable to utilise multiple cores and so are unsuitable for this assignment. I suggest you use Python only if you are a Python expert and know how to negotiate these issues. • The report, which should be up to 3 pages in length, should discuss the performance of your algorithm; it should include any relevant graphs. • You should outline how your program works (although detailed documentation of every line of source code is not required). • In addition, you should provide sufficient evidence that your program actually works as spe- cified. You may be required to physically demonstrate your program working. • All source code should be presented as an appendix to your report; the appendices will not count towards the 3-page limit. • The report will contribute up to 12.5% of the module mark. • Marks will be awarded for identifying the key factors that determine the speed-up from using multiple threads, and the appropriateness of the methods/approach employed. The report should be submitted by Friday, 17 January 2020 to the module’s MOLE page. c© P.I.Rockett, 2018 3 Marking Scheme 1. Introduction&demonstration of the base case of performingmultiplicationusing a single thread [10%]. 2. Implementation&demonstration of distributing thematrixmultiplication overN2 threads, con- sidering a large enough range of N to draw sensible conclusions. [15%] 3. Analysis & discussion of the results obtained in (2) above. [15%] 4. Implementation & demonstration of distributing the matrix multiplication over fewer than N2 threads, considering a large enough range of N to draw sensible conclusions. [15%] 5. Analysis & discussion of the results obtained in (4) above. [15%] 6. Overall analysis/discussion of results taking into consideration the characteristics of the pro- cessor. [15%] 7. Drawing appropriate conclusions. [15%] c© P.I.Rockett, 2018 4