1 ECE565/CSCI504 Computer Architecture HW2 Assignment, Due on March 3 Thursday 1. (20 pts) Suppose we have the following cache: Set# Tag L3:W1 L3:W0...

1ECE565/CSCI504 Computer ArchitectureHW2 Assignment, Due on March 3 Thursday1. (20 pts) Suppose we have the following cache:Set#TagL3:W1L3:W0TagL2:W1L2:W0TagL1:W1L1:W0TagL0:W1L0:W000011A1A00001K1K00111M1M011101B1B00001N1N01110P1P0231110C1C01101D1D041110E1E01010Q1Q01011R1R00111F1F0560100H1H01010S1S00111T1T00101J1J070001G1G00101U1U00011V1V01001Y1Y0The columns labeled Li:Wj contain the jth data word for line Li (e.g., L1:W0 contains the 0th word in the 1st line). The cache policies are:Replacement: randomWrite allocate: bypass-cache.Write update: write-through.(a) [4 pts] What is the address of the following data words found in the cache above?Data wordAddressB1C0N0Y1(b) [7 pts] Suppose the addresses in the following trace are accessed in order, possibly changing the cache on each access. The R/W column indicates whether the access was a read or a write access. For each access, determine (1) which set in the cache would be accessed to look for the addressed data, (2) does the cache access for this address access result in a hit or a miss? (3) if there is a miss, what type of miss is it? Assume that if the addresses do not appear in the cache above, the addresses have not been accessed before the start of this trace. Be sure your answer is consistent with the cache’s policies, listed above.R/WAddressSet #Hit/Miss?ConflictMiss?CapacityMiss?CompulsoryMiss?Read11100110Write11110011Read11110010Write11110011Read000000102(c) [4 points] What is the size of the physical (main) memory in words?(d) [5 points] Suppose the cache access time is 10ns and main memory access time is 75ns/word. What is the maximum number of misses out of 1000 total accesses that this cache memory system can have before it exceeds an effective access time of 30ns?2. (12 pts) Cache Performance: For the following cache memory system,ParameterValueNumber of cache lines2048Line size16 words (i.e., 64 bytes)Cache access time2 ns/lineMain memory access time10 ns/wordMain memory address space1GB (i.e., 256Mwords)Cache hit rate95%(a) What is the effective access time of a cache memory system?(b) In case the cache is direct-mapped cache, label the fields of the memory address below used to access the cache and indicate the size of each field (in number of bits). Assume that memory is byte-addressed.______________: ____bits______________: ____bits______________: ____bits(c) In case the cache is 2-way set-associative cache, label the fields of the memory address below used to access the cache and indicate the size of each field (in number of bits). Assume that memory is byte-addressed.______________: ____bits______________: ____bits______________: ____bits(d) In case the cache is fully-associative cache, label the fields of the memory address below used to access the cache and indicate the size of each field (in number of bits). Assume that memory is byte-addressed.______________: ____bits______________: ____bits______________: ____bits33. (9 pts) Cache Design in terms of simple performance consideration.A pipelined processor with a separate instruction and data cache has five stages, a cycle time of 30 ns, and can start a new instruction on every cycle when there are no hazards.It is used with a copy-back (write-back) data cache with a line size of one word, T_cache =30 ns, and T_main = 80 ns. The hit rate in the cache is 90%. In this cache, a missed word is not passed to the processor until the entire line is received from main memory. Ignore write-backs of dirty pages.25% of all instructions are performing LOADs and STORES (i.e., data reads and writes). For this problem, assume all other instructions cause no hazards and the LOAD and STORE instructions only stall because of memory access time.How many stalls occur when a memory access instruction misses in the cache?______________________________________________________________________ stallsWhat is the average instruction throughput (in IPS) of this processor? (show work)________________________________________________________________________________________________________________________________________________ I/secIf T_cache for data cache is increased to 31 ns, what is the NEW instruction throughput (in IPS)?________________________________________________________________________________________________________________________________________________ I/sec44. (12 pts). Matrix Multiplication Cache Performance; R10 running sum; R12 A element value; R3 A element address; R4 A row base (fixed in inner loop); R5 A column offset (variable in inner loop); R16 B element value; R7 B element address; R8 B row base (variable in inner loop); R9 B column offset (fixed in inner loop)main: addi R4, R0, 400 ; set to last columnOuter2: addi R4, R4, -40 ; decrement A's row baseaddi R9, R0, 40 ; set to last element of rowOuter1: addi R9, R9, -4 ; decrement B's column offsetsub R10, R10, R10 ; clear sumaddi R5, R0, 40 ; set to last element of rowaddi R8, R0, 400 ; set to last columnInner: addi R5, R5, -4 ; decrement A's column offsetaddi R8, R8, -40 ; decrement B's row baseadd R3, R4, R5 ; form A's element addressadd R7, R8, R9 ; form B's element addresslw R12, A_base(R3) ; load A's elementlw R16, B_base(R7) ; load B's elementmul R12, R12, R16 ; compute productadd R10, R10, R12 ; sum productsbnez R5, Inner ; loop across all elementsadd R3, R4, R9 ; compute result addresssw C_base(R3), R10 ; store resultbnez R9, Outer1 ; loop across all B's columnsbnez R4, Outer2 ; loop across all A's rowsConsider the 10 by 10 matrix multiplication algorithm used in the MIPS code above. Two 10 by 10 matrices A and B are multiplied leaving the result in C. The system running the application employs a data cache which is initially empty. The data cache has the following properties:cache organizationthe number of sets .......... 1The number of lines/set ..... 1024The number of words/line .... 4T_cache ..................... 20 nST_main ...................... 100 nS per linewrite update policy ......... copy-backwrite allocation policy ..... insert in cachereplacement policy .......... LRU5(a) For this application, what is the expected number of misses of each type?compulsory: _________________________________________________________capacity: ___________________________________________________________conflict: ____________________________________________________________(b) What is the hit rate of the cache? (show work)_________________________________________________________________________ %(c) What is the average data access time (T_effective) in nS? (show work)________________________________________________________________________ nS(d) If the cycle time of a pipelined system is 20 nS, how many stalls areintroduced for each miss? (show work)___________________________________________________________________________5. (20 pts) Caches and Performance: A pipelined processor with a separate instruction and data cache has five stages: instruction fetch, decode, execute, memory (data load or store), and write back to register file. Its cycle time is 20 ns and it can start a new instruction on every cycle when there are no hazards. The word size of the processor is 32 bits.The instruction cache is fully associative, with a total capacity of 2KB (2048 bytes) and a line size of 2 words. The instruction cache access time is 20 ns.The data cache is 2-way set associative with a line size of 4 words and a total capacity of 2KB (2048 bytes). The data cache access time is 20 ns.Main memory access time is 80 ns/word. In both caches, a missed word is not passed to the processor until the entire line is received from main memory.Parts a, b and c are each independent of each other.(a) (9 pts) The processor is executing a sequential code block containing 400 nonbranching instructions and each instruction is 1 word long. The instructions are stored contiguously in memory. The instruction cache is empty at the start of the code execution. For this part (a) of the problem, assume the hit rate of the data cache during the execution of this code is 100%. Also, assume no hazards occur in the code fragment; stalls only occur due to memory access time.(a.1) When an instruction is not found in the instruction cache how many stall cycles occur (per instruction)?_____________________________________ stall cycles.6(a.2) What is the instruction cache's hit rate while executing this code block?____________________________________ % hit rate.(a.3) What is the average instruction throughput of this processor executing this code block? Express your answer in cycles per instruction._____________________________________ CPI.(b) (3 pts) How many sets does the data cache contain?_____________________________________ sets.(c) (8 pts) Assume we have another fragment of code which is preloaded into the instruction cache so that there are no instruction cache misses. Assume the 2-way set associative data cache has a hit rate of 90%. Suppose we design another version of this machine in which the data cache is replaced with a fully associative data cache of the same size (2KB), having the same line size, with a hit rate of 95%? Assume that in both cases the data cache is initially empty and the clock rate remains the same.What is the instruction throughput in cycles/instruction for the original machine (with the 2-way set associative cache) and for the new machine (with the fully associative cache)?______________________________CPI (machine with 2-way set-associative cache)______________________________CPI (machine with fully associative cache)What is the speedup (in percent) of the machine with the fully associative cache over the one with the set associative cache?______________________________ % speedup6. (10 pts) Consider three computers A, B, and C run on three programs 1, 2, and 3 with the following times:Computer AComputer BComputer CProgram 150100200Program 210010010Program 39003001000a) For each of the three computers, compute the arithmetic mean and geometric mean.b) Normalize the times in the above chart relative to Computer A, and then compute the normalized arithmetic mean and the normalized geometric mean for Computer B and Computer C.77. (10 pts) Imagine a machine that has three instruction types, A, B, C, and imagine two different test programs. The instruction mix for these two programs is:Program 1Program 2A20%40%B20%40%C60%20%Imagine further that instructions A and B currently require 3 clock cycles each, instruction C currently requires 5 clock cycles, and the machine’s clock cycle time is 2 ns. Suppose that you have come up with a design change that will allow instruction C to execute in 3 clock cycles and another, much more complex, design change that will allow instruction C to operate in 2 clock cycles.a) How much faster in terms of speedup will each of the two programs run if you choose the simple design change (i.e., instruction C goes from 5 clock cycles to 3 clock cycles)?b) How much faster in terms of speedup will each of the two programs run if you choose the complex design change (i.e., instruction C goes from 5 clock cycles to 2 clock cycles)?8. (20 pts) Your company has a benchmark that is considered representative of your typical applications. An embedded processor under consideration to support your task does not have a floating-point unit and must emulate each floating-point instruction by a sequence of integer instructions. This processor is rated at 120 MIPS on the benchmark. A third-party vendor offers a compatible coprocessor to boost performance. That coprocessor executes each floating point instruction in hardware (i.e., no emulation is necessary). The processor/coprocessor combination rates 80 MIPS on the same benchmark. The following symbols are used to answer parts (a)-(e) of this exercise:I – Number of integer instructions executed on the benchmarkF – Number of floating-point instructions executed on the benchmark.Y – Number of integer instructions to emulate on floating-point instruction.W – Time to execute the benchmark on the processor alone.B – Time to execute the benchmark on the processor/coprocessor combination.a) Write an equation for the MIPS rating of each configuration using the symbols above.b) For the configuration without the coprocessor, we measure that F = 8 X 106, Y = 50, and W = 4 seconds. Find I.c) What is the value of B?d) What is the MFLOPS rating of the system with the coprocessor?e) Your colleague wants to purchase the coprocessor even though the MIPS rating for the configuration using the coprocessor is less than that of the processor alone. Is your colleague’s evaluation correct? Defend your answer.89. (5 pts) Which improvement gives a greater reduction in execution time: one that is used 30 percent of the time but improves performance by a factor of 2 when used, one that is used 80 percent of the time but only improves performance by a factor of 1.3 when used?10. (10 pts) A computer architect is designing the memory system for the next version of a processor. If the current version of the processor spends 50 percent of its time processing memory references, by how much must the architect speed up the memory system to achieve an overall speedup of 1.4? A speed up of 1.8?11. (15 pts) Suppose that when a certain processor is run on an intense mathematical benchmark, 50% of the instructions are floating-point instruction, including 10% which compute a square root (i.e. 40% of the instructions are floating-point instructions that do not computer square root). Assume that the floating-point square root instruction has an average CPI of 10, the other floating-point instructions (not including the square root instruction) have an average CPI of 3, and the average CPI of non-floating-point instructions is 1.5.a) What is the overall CPI when the processor is run on the benchmark?b) What is the average CPI for all floating-point instructions (including the floating-point square root instruction)?c) Suppose you have a choice between speeding up all floating-point instructions by a factor of three versus speeding up the floating-point square root instruction by a factor of 100. What is the overall speedup resulting from each possible choice? Assuming that the two choices require equal effort, which is the better choice?12. (24 pts) A program executes the following mixture of instructions:Fractions of instructions executedInstruction TypeExecution Time(clock cycles)0.3Loads and stores100.4Integer ALU10.2FP ALU400.1Branch2a) What is the average CPI for this program?b) What is the average MIPS rate for this program? (Assume clock period is 1ns.)c) You decide to design a new version of this processor with a faster floating-point ALU to improve the processor's performance when executing this program. What speedup would you expect to see if you could reduce the number of cycles required to execute all floating-point instructions by 5 clock cycles?d) What is the maximum number of clock cycles each floating point operation can take to obtain an overall speedup for this program of 7 percent?e) What is the theoretical limit to the speedup you could obtain by improving the time required to execute floating-point instructions?f) To obtain the biggest performance improvement for the amount of effort expended in redesigning this processor, on which one of the instruction types (load/store, floating point, integer ALU, or branch) should you concentrate your effort? Quantitatively justify your answer.