The University of Manitoba
Computer Science Department
Course: COMP3370 Computer Organization
Term Work: Assignment 4

Date: April 18, 2022
Instructor: Dr. Eskicioglu
Pages: Page 1 of 4
Due: 11:59pm; April 26, 2022

Instructions

There are 7 questions on this Assignment for a total of 50 points. Read academic misconduct information discussed in ROASS document. Note that assignments are to be done independently unless otherwise explicitly stated and that inclusion of materials from online sites is strictly forbidden.

Show all your work clearly in your answers to all questions for full marks.

No handwritten submissions will be accepted. Using a simple text editor will not produce the desired output. Use a word processor of your choice for your answers, then convert it to a single PDF file before uploading. Make sure your name and student number are on each page of your file.

You must hand in your assignment electronically via UMLearn using the Assignments tab under the Assessments drop-down menu. A folder for Assignment 4 has been created there. Recall that you must agree to the online honesty document before the submission folder becomes visible to you.

You may upload multiple times, if so desired and encouraged, but only the final upload will be saved and will be visible to us. So, make sure that you're submitting the correct copy every time.

Start NOW! Don't leave it to the last minute...

1. (4 points) Consider the following loop.

    LOOP: LDUR X10, [X1, #0]
          LDUR X11, [X1, #8]
          ADD  X12, X10, X11
          SUBI X1, X1, #16
          CBNZ X12, LOOP

Assume (i) that perfect branch prediction is used (no stalls due to control hazards); (ii) that there are no delay slots; (iii) that the pipeline has full forwarding support; and (iv) that branches are resolved in the EX (as opposed to the ID) stage.

(a) Show a pipeline execution diagram for the first two iterations of this loop.

(b) Mark pipeline stages that do not perform useful work. How often while the pipeline is full do we have a cycle in which all five pipeline stages are doing useful work? (Begin with the cycle during which the SUBI is in the IF stage and end with the cycle during which the CBNZ is in the IF stage.)

2. (8 points) Consider a program with the following cache behaviors.

    Data Reads per 1000 Instructions     250
    Data Writes per 1000 Instructions    100
    Instruction Cache Miss Rate          0.30%
    Data Cache Miss Rate                 2%
    Block Size (bytes)                   64

(a) Suppose a CPU with a write-through, write-allocate cache achieves a CPI of 2. What are the read and write bandwidths (measured in bytes per cycle) between RAM and the cache? (Assume each miss generates a request for one block.)

(b) For a write-back, write-allocate cache, assuming 30% of replaced data cache blocks are dirty, what are the read and write bandwidths needed for a CPI of 2?

(c) Do additional calculations to (separately) demonstrate the changes in the bandwidth if we
    - double the data cache (DC) miss rate, and
    - reduce the instruction cache (IC) miss rate to half.

3. (5 points) Consider the following instruction sequence, running on a 5-stage pipeline datapath:

    ADD  X5, X2, X1
    LDUR X3, [X5, #4]
    LDUR X2, [X2, #0]
    ORR  X3, X5, X3
    STUR X3, [X5, #0]

(a) If there is no forwarding or hazard detection, insert NOPs to ensure correct execution.

(b) Now, change and/or rearrange the code to minimize the number of NOPs needed. You can assume register X7 can be used to hold temporary values in your modified code.

(c) If the processor has forwarding, but we forgot to implement the hazard detection unit, what happens when the original code executes?

(d) If there is forwarding, for the first seven cycles during the execution of this code, specify which signals are asserted in each cycle by the hazard detection and forwarding units in the figure below:

    [Figure not included]

(e) If there is no forwarding, what new input and output signals do we need for the hazard detection unit in the above figure? Using this instruction sequence as an example, explain why each signal is needed.

4. (8 points) Although a cache is named, by convention, according to the amount of data it holds (e.g., a 4 KiB cache can hold 4 KiB of data), caches also require SRAM to store metadata such as tags and valid bits. In the following questions, you will examine how a cache's configuration affects the total amount of SRAM needed to implement it as well as the performance of the cache. Assume that the caches are byte addressable, and that addresses and words are 64 bits.

(a) Calculate the total number of bits required to implement a 32 KiB cache with 2-word blocks.

(b) Calculate the total number of bits required to implement a 64 KiB cache with 16-word blocks. How much bigger is this cache than the 32 KiB cache described in the previous question? Why can the amount of data be increased by only increasing the block size?

(c) Explain why the above 64 KiB cache, despite its larger data size, might provide slower performance than the first cache.

(d) Generate a series of read requests that have a lower miss rate on a 32 KiB 2-way set associative cache than on the cache described above.

5. (9 points) The effects of different cache designs vary substantially. Given the following sequence of word addresses:

    0x03, 0xb4, 0x2b, 0x02, 0xbe, 0x58, 0xbf, 0x0e, 0x1f, 0xb5, 0xbf, 0xba, 0x2e, 0xce

(a) Using the style of the following diagram, sketch the organization of a 3-way set associative cache with 2-word blocks and a total size of 48 words.

    [Figure not included]

(b) Trace the behavior of the above cache. Assume a true LRU replacement policy. For each reference, identify the binary word address, the tag, the index, the offset, whether the reference is a hit or a miss, and which tags are in each way of the cache after the reference has been handled.

(c) Trace the behavior of a fully associative cache with 1-word blocks and a total size of 8 words. Assume a true LRU replacement policy. For each reference, identify the binary word address, the tag, the index, the offset, whether the reference is a hit or a miss, and the contents of the cache after each reference has been handled.

6. (9 points) Caches are important to providing a high-performance memory hierarchy to processors. Below is a list of 64-bit memory address references, given as word addresses.

    0x03, 0xb4, 0x2b, 0x02, 0xbf, 0x58, 0xbe, 0x0e, 0xb5, 0x2c, 0xba, 0xfd

Answer each of the following, assuming that the cache is initially empty:

(a) For each of these references, identify the binary word address, the tag, and the index given a direct-mapped cache with 16 one-word blocks. Also list whether each reference is a hit or a miss.

(b) For each of these references, identify the binary word address, the tag, the index, and the offset given a direct-mapped cache with two-word blocks and a total size of eight blocks. Also list whether each reference is a hit or a miss.

(c) You're asked to optimize a cache design for a certain set of references. There are three direct-mapped cache designs, all with a total of 8 words of data: C1 has 1-word blocks, C2 has 2-word blocks, and C3 has 4-word blocks.

7. (7 points) A number of benchmarking sites are maintained by individuals, groups and organizations besides SPEC. Among these is a site that maintains a substantial collection of benchmarks at http://www.roylongbottom.org.uk/. While many of the benchmarks there are for Windows-based machines, a subset is also maintained for assessing the performance of processors running on Linux.

You will use one of the programs in this benchmark suite, linpack, as the basis for an exercise in the practice of benchmarking. This will involve you getting the code, doing the benchmark runs and reporting some basic results. Specifically, you are to run the linpack_64 and linpack_64_NoOpt benchmarks using a machine in the Linux Lab. Don't forget to do multiple runs and report averages when collecting your data. You should use the Linux command

    time ./linpack_64 n

to run the compiler-optimized benchmark and the Linux command

    time ./linpack_64_NoOpt n

to run the compiler non-optimized benchmark.

To get the benchmark code, locate the string "Benchmarks Compiled for Linux" on the left side of the page and click on "See Details" beneath it. Once on the resulting page, click on "Classic Benchmarks" in the table at the top of the page. This will scroll you down to the relevant section of the page, and near the bottom of that section you will find a link titled classic_benchmarks.tar.gz which you can use to download the needed files. If you are unfamiliar with installations on a *nix machine, you can simply gunzip this file to extract the tar file and then use the tar command to extract the benchmark files. (The online manual pages for these commands are available using the man command as in, for example, man gunzip.) Do this in a sub-directory in your home directory. You only need to keep the executable named linpack64 from the directory named bin64. The rest can be discarded if you like or to save disk space. There will be a README file in what you download that you should read to find out more about the benchmark code. The source code is available as well if you should be interested.

You are to report the unrolled double precision MFLOPS (Millions of Floating Point Operations Per Second) score for each run (optimized and non-optimized) along with the user CPU time (reported by the time command) in both a table and a simple descriptive graph/chart. (You can use Excel charts for this, a useful skill to have by the way.) Note that each run reports both to the screen and into a file that you can rename and save for easy reference. Briefly summarize what you see in the results, including averages and standard deviations for the runs in each of the optimized and non-optimized sets.

Do a little background reading on compiler optimization flags for the gcc compiler so you can provide at least a general explanation of the difference in MFLOPS ratings for the optimized and non-optimized runs. Find out what unrolling/unravelling means (we will see this again later in the course) and provide a simple example along with a brief explanation of why it is probably done. (You can look for unroll in the source code to see what is happening if you like.)

Do this question early and at a time when the load on the machine you are using for testing is low. You should use the Linux uptime command to discover and report the load on the machine when you are making your runs. You are free to make good use of the UMLearn discussion group to get a better understanding of what is needed without sharing any results.

Obviously, there will be no single correct answer to this question since different students will receive somewhat different results due to loading conditions on the test machine selected in the Linux lab. It is more about how you approach the problem to report valid data and how you interpret the data that you collect. Certainly, do describe any important features of your benchmarking in your answer to the question. Since you will not have written any code, there is no need to hand in the benchmark code or the instructions on its use.
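
The benchmark question asks for averages and standard deviations over several runs. As a minimal sketch of that arithmetic, assuming the scores have been copied by hand into a Python list, the standard library's statistics module can compute both. The helper name summarize_runs and the numbers in the list are hypothetical placeholders, not measurements.

```python
import statistics


def summarize_runs(values):
    """Return (mean, sample standard deviation) for a list of measurements."""
    if len(values) < 2:
        raise ValueError("need at least two runs to compute a standard deviation")
    # statistics.stdev is the sample (n - 1) standard deviation.
    return statistics.mean(values), statistics.stdev(values)


# Placeholder values for illustration only; these are NOT real benchmark results.
example_mflops = [100.0, 110.0, 120.0]
mean, stdev = summarize_runs(example_mflops)
print(f"mean = {mean:.1f} MFLOPS, stdev = {stdev:.1f} MFLOPS")
```

The sample standard deviation (dividing by n - 1) is used here because a handful of runs is a sample of the machine's behavior; statistics.pstdev would give the population form instead.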
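
The cache-tracing questions above ask for the tag, index, and offset of each word address. The following is a minimal sketch of how those fields can be separated, assuming a cache whose number of sets and words per block are both powers of two. The function name split_word_address and the example address 0x5a are hypothetical and are not taken from the assignment's reference lists.

```python
def split_word_address(addr, num_sets, block_words):
    """Split a word address into (tag, index, offset).

    num_sets    -- number of sets (number of blocks for a direct-mapped cache,
                   1 for a fully associative cache)
    block_words -- words per block
    Both are assumed to be powers of two so the fields map onto bit ranges.
    """
    for n in (num_sets, block_words):
        if n < 1 or n & (n - 1):
            raise ValueError("num_sets and block_words must be powers of two")
    offset = addr % block_words        # low bits: word within the block
    block_number = addr // block_words
    index = block_number % num_sets    # middle bits: which set
    tag = block_number // num_sets     # remaining high bits
    return tag, index, offset


# Hypothetical address, direct-mapped cache with 8 two-word blocks.
addr = 0x5A
tag, index, offset = split_word_address(addr, num_sets=8, block_words=2)
print(f"{addr:08b} -> tag={tag:04b} index={index:03b} offset={offset:01b}")
```

With num_sets=1 the index is always 0 and the tag is the whole block number, which corresponds to the fully associative case.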