Coming Challenges in Microarchitecture and Architecture
RONNY RONEN, SENIOR MEMBER, IEEE, AVI MENDELSON, MEMBER, IEEE, KONRAD LAI,
SHIH-LIEN LU, MEMBER, IEEE, FRED POLLACK, AND JOHN P. SHEN, FELLOW, IEEE
Invited Paper
In the past several decades, the world of computers and
especially that of microprocessors has witnessed phenomenal
advances. Computers have exhibited ever-increasing performance
and decreasing costs, making them more affordable and, in turn,
accelerating additional software and hardware development
that fueled this process even more. The technology that enabled
this exponential growth is a combination of advancements in
process technology, microarchitecture, architecture, and design
and development tools. While the pace of this progress has been
quite impressive over the last two decades, it has become harder
and harder to keep up this pace. New process technology requires
more expensive megafabs and new performance levels require
larger die, higher power consumption, and enormous design and
validation effort. Furthermore, as CMOS technology continues
to advance, microprocessor design is exposed to a new set of
challenges. In the near future, microarchitecture has to consider
and explicitly manage the limits of semiconductor technology, such
as wire delays, power dissipation, and soft errors. In this paper,
we describe the role of microarchitecture in the computer world,
present the challenges ahead of us, and highlight areas where
microarchitecture can help address these challenges.
Keywords—Design tradeoffs, microarchitecture, microarchitecture
trends, microprocessor, performance improvements, power issues,
technology scaling.
I. INTRODUCTION
Microprocessors have gone through significant changes
during the last three decades; however, the basic computational
model has not changed much. A program consists of
instructions and data. The instructions are encoded in a specific
instruction set architecture (ISA). The computational
Manuscript received January 1, 2000; revised October 1, 2000.
R. Ronen and A. Mendelson are with the Microprocessor Research Laboratories,
Intel Corporation, Haifa 31015, Israel.
K. Lai and S.-L. Lu are with the Microprocessor Research Laboratories,
Intel Corporation, Hillsboro, OR 97124 USA.
F. Pollack and J. P. Shen are with the Microprocessor Research Laboratories,
Intel Corporation, Santa Clara, CA 95052 USA
Publisher Item Identifier S 0018-9219(01)02069-2.
model is still a single instruction stream, sequential execution
model, operating on the architecture states (memory and
registers). It is the job of the microarchitecture, the logic, and
the circuits to carry out this instruction stream in the "best"
way. "Best" depends on intended usage—servers, desktop,
and mobile—usually categorized as market segments. For
example, servers are designed to achieve the highest performance
possible while mobile systems are optimized for best
performance for a given power. Each market segment has different
features and constraints.
A. Fundamental Attributes
The key metrics for characterizing a microprocessor include:
performance, power, cost (die area), and complexity.
Performance is measured in terms of the time it takes
to complete a given task. Performance depends on many
parameters such as the microprocessor itself, the specific
workload, system configuration, compiler optimizations,
operating systems, and more. A concise characterization of
microprocessor performance was formulated by a number
of researchers in the 1980s; it has come to be known as the
"iron law" of central processing unit performance and is
shown below:

Performance = 1 / Execution Time = (IPC × Frequency) / Instruction Count

where IPC is the average number of instructions completed per cycle, Frequency is the number of clock cycles per second, and Instruction Count is the total number of instructions executed. Performance can be improved by
increasing IPC and/or frequency or by decreasing instruction
count. In practice, IPC varies depending on the environment—
the application, the system configuration, and more.
Instruction count depends on the ISA and the compiler
used. For a given executable program, where the instruction
stream is invariant, the relative performance depends only on IPC × Frequency. Performance here is measured in million instructions per second (MIPS).
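To make the relation concrete, the following minimal Python sketch evaluates the iron law for an invented workload; the instruction count, IPC, and frequency values are purely illustrative.

    # Minimal illustration of the "iron law" of CPU performance.
    # All numbers are invented for illustration.

    def execution_time(instruction_count, ipc, frequency_hz):
        """Seconds to run a program: total cycles needed / cycles per second."""
        cycles = instruction_count / ipc
        return cycles / frequency_hz

    def mips(ipc, frequency_hz):
        """Millions of instructions completed per second."""
        return ipc * frequency_hz / 1e6

    # Hypothetical workload: 10^9 instructions on a 1-GHz core sustaining IPC = 1.5.
    t = execution_time(1e9, ipc=1.5, frequency_hz=1e9)
    print(f"execution time = {t:.2f} s, rate = {mips(1.5, 1e9):.0f} MIPS")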
Commonly used benchmark suites have been defined to
quantify performance. Different benchmarks target different
market segments, such as SPEC [1] and SysMark [2]. A
benchmark suite consists of several applications. The time
it takes to complete this suite on a certain system reflects the
system performance.
Power is energy consumption per unit time, in watts.
Higher performance requires more power. However, power
is constrained due to the following.
• Power Density and Thermal: The power dissipated by the chip per unit area is measured in watts/cm². Increases in power density cause more heat to be generated. In order to keep transistors within their operating temperature range, the heat generated has to be dissipated from the source in a cost-effective manner. Power density may soon limit performance growth due to thermal dissipation constraints.
• Power Delivery: Power must be delivered to a very
large scale integration (VLSI) component at a prescribed
voltage and with sufficient amperage for the
component to run. A very precise voltage regulator/transformer controls current supplies that can vary within
nanoseconds. As the current increases, the cost and
complexity of these voltage regulators/transformers
increase as well.
• Battery Life: Batteries are designed to support a certain number of watt-hours. The higher the power, the shorter the time that a battery can operate.
Until recently, power efficiency was a concern only in battery
powered systems like notebooks and cell phones. Recently,
increased microprocessor complexity and frequency
have caused power consumption to grow to the level where
power has become a first-order issue. Today, each market
segment has its own power requirements and limits, making
power limitation a factor in any new microarchitecture. Maximum power consumption is increased with the microprocessor operating voltage (Vcc) and frequency as follows:

Power = C × Vcc² × Frequency

where C is the effective load capacitance of all devices and wires on the microprocessor. Within some voltage range, frequency may go up with supply voltage (roughly, Frequency ∝ Vcc). This is a good way to gain performance, but power is also increased (proportional to Vcc³). Another important power-related metric is energy efficiency. Energy efficiency is reflected by the performance/power ratio and is measured in MIPS/watt.
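A small sketch of these relations follows; the capacitance, voltage, and IPC values are assumptions chosen only to make the arithmetic visible, not measurements of any real part.

    # Dynamic power P = C * Vcc^2 * Frequency and the MIPS/watt efficiency metric.
    # Capacitance, voltage, and IPC values below are illustrative assumptions.

    def dynamic_power(c_eff_farads, vcc_volts, freq_hz):
        return c_eff_farads * vcc_volts ** 2 * freq_hz

    def mips_per_watt(ipc, freq_hz, power_watts):
        return (ipc * freq_hz / 1e6) / power_watts

    base = dynamic_power(c_eff_farads=40e-9, vcc_volts=1.5, freq_hz=1e9)   # 90 W
    print(f"baseline: {base:.0f} W, {mips_per_watt(1.5, 1e9, base):.1f} MIPS/W")

    # Raising Vcc ~10% to reach ~10% higher frequency: power grows roughly as Vcc^3.
    boosted = dynamic_power(40e-9, 1.5 * 1.1, 1e9 * 1.1)
    print(f"voltage-scaled: {boosted:.0f} W (~{boosted / base:.2f}x power for ~1.1x frequency)")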
Cost is primarily determined by the physical size of the
manufactured silicon die. Larger area means higher (even
more than linear) manufacturing cost. Bigger die area usually
implies higher power consumption and may potentially
imply lower frequency due to longer wires. Manufacturing yield also has a direct impact on the cost of each microprocessor.
Complexity reflects the effort required to design, validate,
and manufacture a microprocessor. Complexity is affected by
the number of devices on the silicon die and the level of aggressiveness
in the performance, power and die area targets.
Complexity is discussed only implicitly in this paper.
B. Enabling Technologies
The microprocessor revolution owes its phenomenal
growth to a combination of enabling technologies: process
technology, circuit and logic techniques, microarchitecture,
architecture (ISA), and compilers.
Process technology is the fuel that has moved the entire
VLSI industry and the key to its growth. A new process generation
is released every two to three years. A process generation
is usually identified by the length of a metal-oxide-semiconductor gate, measured in micrometers (10⁻⁶ m, denoted as μm). The most advanced process technology today (year 2000) is 0.18 μm [3].
Every new process generation brings significant improvements
in all relevant vectors. Ideally, process technology
scales by a factor of 0.7 all physical dimensions of devices
(transistors) and wires (interconnects) including those vertical
to the surface and all voltages pertaining to the devices
[4]. With such scaling, typical improvement figures are the
following:
• 1.4-1.5 times faster transistors;
• two times smaller transistors;
• 1.35 times lower operating voltage;
• three times lower switching power.
Theoretically, with the above figures, one would expect potential
improvements such as the following.
• Ideal Shrink: Use the same number of transistors to
gain 1.5 times performance, two times smaller die, and
two times less power.
• Ideal New Generation: Use two times the number of
transistors to gain three times performance with no increase
in die size and power.
In both ideal scenarios, there is a three times gain in MIPS/watt and no change in power density (watts/cm²).
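The arithmetic behind these two ideal scenarios can be checked directly; the sketch below simply multiplies out the per-generation factors quoted above (1.5 times faster transistors, two times density, three times lower switching power).

    # Ideal process-scaling arithmetic using the factors quoted in the text.
    SPEED, DENSITY, SWITCH_POWER = 1.5, 2.0, 1.0 / 3.0

    # Ideal shrink: same transistor count, each switching 1.5x as often
    # at 1/3 the switching energy -> 0.5x total power on a 0.5x die.
    shrink_perf = SPEED
    shrink_power = SPEED * SWITCH_POWER
    print(f"ideal shrink: {shrink_perf}x perf, {shrink_power:.1f}x power, "
          f"{shrink_perf / shrink_power:.0f}x MIPS/W")

    # Ideal new generation: 2x transistors on the same die area.
    newgen_perf = SPEED * DENSITY
    newgen_power = SPEED * DENSITY * SWITCH_POWER
    print(f"ideal new generation: {newgen_perf}x perf, {newgen_power:.0f}x power, "
          f"{newgen_perf / newgen_power:.0f}x MIPS/W")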
In practice, it takes more than just process technology
to achieve such performance improvements and usually
at much higher costs. However, process technology is the
single most important technology that drives the microprocessor
industry. Growing 1000 times in frequency (from 1 MHz to 1 GHz) and in integration (from 10k to 10M devices) in 25 years would not have been possible without process technology improvements.
Innovative circuit implementations can provide better performance
or lower power. New logic families provide new
methods to realize logic functions more effectively.
Microarchitecture attempts to increase both IPC and
frequency. A simple frequency boost applied to an existing
microarchitecture can potentially reduce IPC and thus
does not achieve the expected performance increase. For
Fig. 1. Impact of different pipeline stalls on the execution flow.
example, memory access latency does not scale with microprocessor
frequency. Microarchitecture techniques such
as caches, branch prediction, and out-of-order execution can
increase IPC. Other microarchitecture ideas, most notably
pipelining, help to increase frequency beyond the increase
provided by process technology.
Modern architecture (ISA) and good optimizing compilers
can reduce the number of dynamic instructions executed
for a given program. Furthermore, given knowledge of
the underlying microarchitecture, compilers can produce optimized code that leads to higher IPC.
This paper deals with the challenges facing architecture
and microarchitecture aspects of microprocessor design. A
brief tutorial/background on traditional microarchitecture is
given in Section II, focusing on frequency and IPC tradeoffs.
Section III describes the past and current trends in microarchitecture
and explains the limits of the current approaches
and the new challenges. Section IV suggests potential microarchitectural
solutions to these challenges.
II. MICROARCHITECTURE AT A GLANCE
Microprocessor performance depends on its frequency and
IPC. Higher frequency is achieved with process, circuit, and
microarchitectural improvements. New process technology
reduces gate delay time, thus cycle time, by 1.5 times. Microarchitecture
affects frequency by reducing the amount of
work done in each clock cycle, thus allowing shortening of
the clock cycle.
Microarchitects tend to divide the microprocessor's functionality
into three major components [5].
• Instruction Supply: Fetching instructions, decoding
them, and preparing them for execution;
• Execution: Choosing instructions for execution, performing
actual computation, and writing results;
• Data Supply: Fetching data from the memory hierarchy
into the execution core.
A rudimentary microprocessor would process a complete
instruction before starting a new one. Modern microprocessors
use pipelining. Pipelining breaks the processing of
an instruction into a sequence of operations, called stages.
For example, in Fig. 1, a basic four-stage pipeline breaks
the instruction processing into fetch, decode, execute, and
write-back stages. A new instruction enters a stage as soon
as the previous one completes that stage. A pipelined microprocessor with n pipeline stages can overlap the processing of n instructions in the pipeline and, ideally, can deliver n times the performance of a nonpipelined one.
Pipelining is a very effective technique. There is a clear
trend of increasing the number of pipe stages and reducing
the amount of work per stage. Some microprocessors (e.g.,
Pentium Pro microprocessor [6]) have more than ten pipeline
stages. Employing many pipe stages is sometimes termed
deep pipelining or super pipelining.
Unfortunately, the number of pipeline stages cannot increase
indefinitely.
• There is a certain clocking overhead associated with
each pipe stage (setup and hold time, clock skew). As
cycle time becomes shorter, further increase in pipeline
length can actually decrease performance [7].
• Dependencies among instructions can require stalling
certain pipe stages and result in wasted cycles, causing
performance to scale less than linearly with the number
of pipe stages.
For a given partition of pipeline stages, the frequency of the
microprocessor is dictated by the latency of the slowest pipe
stage. More expensive logic and circuit optimizations help
to accelerate the speed of the logic within the slower pipe
stage, thus reducing the cycle time and increasing frequency
without increasing the number of pipe stages.
It is not always possible to achieve linear performance increase
with deeper pipelines. First, scaling frequency linearly
with the number of stages requires good balancing of the
overall work among the stages, which is difficult to achieve.
Second, with deeper pipes, the number of wasted cycles,
termed pipe stalls, grows. The main reasons for stalls are resource
contention, data dependencies, memory delays, and
control dependencies.
• Resource contention causes pipeline stall when an instruction
needs a resource (e.g., execution unit) that is
currently being used by another instruction in the same
cycle.
• Data dependency occurs when the result of one instruction
is needed as a source operand by another instruction.
The dependent instruction has to wait (stall)
until all its sources are available.
Table 1. Out-Of-Order Execution Example
• Memory delays are caused by memory related data
dependencies, sometimes termed load-to-use delays.
Accessing memory can take from a few cycles to hundreds of cycles, possibly requiring stalling the pipe
until the data arrives.
• Control dependency stalls occur when the control
flow of the program changes. A branch instruction
changes the address from which the next instruction
is fetched. The pipe may stall and instructions are not
fetched until the new fetch address is known.
Fig. 1 shows the impact of different pipeline stalls on the
execution flow within the pipeline.
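A rough way to see how stalls erode the ideal n-times pipeline speedup is to fold them into cycles per instruction (CPI); the sketch below does this for a hypothetical five-stage pipeline with assumed stall rates.

    # Effect of pipeline stalls on achieved speedup (illustrative numbers only).

    def pipelined_speedup(n_stages, stall_cycles_per_instruction):
        """Speedup over a non-pipelined machine needing n_stages cycles per instruction."""
        cpi = 1.0 + stall_cycles_per_instruction     # ideal pipelined CPI is 1
        return n_stages / cpi

    for stalls in (0.0, 0.3, 1.0):
        print(f"5-stage pipe, {stalls:.1f} stall cycles/instr -> "
              f"{pipelined_speedup(5, stalls):.2f}x speedup")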
In a 1-GHz microprocessor, accessing main memory can
take about 100 cycles. Such accesses may stall a pipelined
microprocessor for many cycles and seriously impact the
overall performance. To reduce memory stalls at a reasonable
cost, modern microprocessors take advantage of the locality
of references in the program and use a hierarchy of memory
components. A small, fast, and expensive (in $/bit) memory
called a cache is located on-die and holds frequently used
data. A somewhat bigger, but slower and cheaper cache may
be located between the microprocessor and the system bus,
which connects the microprocessor to the main memory. The
main memory is yet slower, but bigger and inexpensive.
Initially, caches were small and off-die; but over time,
they became bigger and were integrated on chip with the
microprocessor. Most advanced microprocessors today employ
two levels of caches on chip. The first level is 32-128
kB—it typically takes two to three cycles to access and typically
catches about 95% of all accesses. The second level is
256 kB to over 1 MB—it typically takes six to ten cycles to
access and catches over 50% of the misses of the first level.
As mentioned, off-chip memory accesses may take about 100 cycles.
Note that a cache miss that eventually has to go to the
main memory can take about the same amount of time as
executing 100 arithmetic and logic unit (ALU) instructions,
so the structure of memory hierarchy has a major impact on
performance. Much work has been done in improving cache
performance. Caches are made bigger and heuristics are used
to make sure the cache contains those portions of memory
that are most likely to be used [8], [9].
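Using the ballpark figures above, a simplified average-access-time calculation shows why the hierarchy pays off; the hit rates and latencies below are the approximate values from the text, not measurements.

    # Simplified average memory access time for a two-level cache hierarchy.

    def avg_access_cycles(l1_hit, l1_cycles, l2_hit_of_misses, l2_cycles, mem_cycles):
        l1_miss = 1.0 - l1_hit
        return (l1_hit * l1_cycles
                + l1_miss * l2_hit_of_misses * l2_cycles
                + l1_miss * (1.0 - l2_hit_of_misses) * mem_cycles)

    # ~95% L1 hits at 3 cycles, L2 catches ~50% of L1 misses at 8 cycles, memory ~100 cycles.
    print(f"average access time ~ {avg_access_cycles(0.95, 3, 0.5, 8, 100):.2f} cycles")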
Change in the control flow can cause a stall. The length
of the stall is proportional to the length of the pipe. In
a super-pipelined machine, this stall can be quite long.
Modern microprocessors partially eliminate these stalls by
employing a technique called branch prediction. When a
branch is fetched, the microprocessor speculates the direction
(taken/not taken) and the target address where a branch
will go and starts speculatively executing from the predicted
address. Branch prediction uses both static and runtime
information to make its predictions. Branch predictors today
are very sophisticated. They use an assortment of per-branch
(local) and all-branches (global) history information and can
correctly predict over 95% of all conditional branches [10],
[11]. The prediction is verified when the predicted branch
reaches the execution stage and if found wrong, the pipe is
flushed and instructions are fetched from the correct target,
resulting in some performance loss. Note that when a wrong
prediction is made, useless work is done on processing
instructions from the wrong path.
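As one concrete, deliberately simple example of runtime prediction, the sketch below models a table of 2-bit saturating counters; real predictors combine local and global history as described above, and the table size and branch address here are arbitrary.

    # A table of 2-bit saturating counters, indexed by branch address (toy model).

    class TwoBitPredictor:
        def __init__(self, entries=256):
            self.counters = [2] * entries          # 0,1 = predict not taken; 2,3 = predict taken

        def _index(self, pc):
            return pc % len(self.counters)

        def predict(self, pc):
            return self.counters[self._index(pc)] >= 2

        def update(self, pc, taken):
            i = self._index(pc)
            self.counters[i] = min(3, self.counters[i] + 1) if taken else max(0, self.counters[i] - 1)

    bp, correct = TwoBitPredictor(), 0
    outcomes = [True] * 99 + [False]               # a loop branch: taken 99 times, then exits
    for outcome in outcomes:
        correct += (bp.predict(0x400123) == outcome)
        bp.update(0x400123, outcome)
    print(f"correct predictions on a 100-iteration loop branch: {correct}/100")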
The next step in performance enhancement beyond
pipelining calls for executing several instructions in parallel.
Instead of "scalar" execution, where in each cycle only one
instruction can be resident in each pipe stage, superscalar
execution is used, where two or more instructions can
be at the same pipe stage in the same cycle. Superscalar
designs require significant replication of resources in order
to support the fetching, decoding, execution, and writing
back of multiple instructions in every cycle. Theoretically,
an n-way superscalar pipelined microprocessor can improve performance by a factor of n over a standard scalar pipelined microprocessor. In practice, the speedup is
much smaller. Interinstruction dependencies and resource
contentions can stall the superscalar pipeline.
The microprocessors described so far execute instructions
in-order. That is, instructions are executed in the program
order. In an in-order processing, if an instruction cannot continue,
the entire machine stalls. For example, a cache miss
delays all following instructions even if they do not need the
results of the stalled load instruction. A major breakthrough
in boosting IPC is the introduction of out-of-order execution,
where instruction execution order depends on data flow, not
on the program order. That is, an instruction can execute if its
operands are available, even if previous instructions are still
waiting. Note that instructions are still fetched in order. The
effect of superscalar and out-of-order processing is shown in
an example in Table 1 where two memory words mem1 and
mem3 are copied into two other memory locations mem2 and
mem4.
Out-of-order processing hides some stalls. For example,
while waiting for a cache miss, the microprocessor can
execute newer instructions as long as they are independent
of the load instructions. A superscalar out-of-order
microprocessor can achieve higher IPC than a superscalar
in-order microprocessor. Out-of-order execution involves
dependency analysis and instruction scheduling. Therefore,
it takes a longer time (more pipe stages) to process an
Fig. 2. Processor frequencies over years. (Source: V. De, Intel, ISLPED, Aug. 1999.)
instruction in an out-of-order microprocessor. With a deeper
pipe, an out-of-order microprocessor suffers more from
branch mispredictions. Needless to say, an out-of-order
microprocessor, especially a wide-issue one, is much more
complex and power hungry than an in-order microprocessor
[12].
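The copy example of Table 1 can be mimicked with a toy dataflow model: each instruction issues as soon as its sources are ready, so the two independent load/store chains overlap. The instruction list and latencies below are assumptions for illustration, and resource limits are ignored.

    # Dataflow-driven issue: an instruction starts when its source operands are ready.

    instructions = [                       # (text, destination, sources, latency)
        ("load  r1 <- mem1", "r1", [],     3),
        ("store mem2 <- r1", None, ["r1"], 1),
        ("load  r3 <- mem3", "r3", [],     3),
        ("store mem4 <- r3", None, ["r3"], 1),
    ]

    ready_at = {}                          # cycle at which each register becomes available
    for text, dst, srcs, latency in instructions:
        start = max([ready_at.get(s, 0) for s in srcs], default=0)
        finish = start + latency
        if dst:
            ready_at[dst] = finish
        print(f"{text}: issues at cycle {start}, completes at cycle {finish}")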
Historically, there were two schools of thought on how to
achieve higher performance. The "Speed Demons" school
focused on increasing frequency. The "Brainiacs" focused
on increasing IPC [13], [14]. Historically, DEC Alpha [15]
was an example of the superiority of "Speed Demons" over
the "Brainiacs." Over the years, it has become clear that high
performance must be achieved by progressing in both vectors
(see Fig. 4).
To complete the picture, we revisit the issues of performance
and power. A microprocessor consumes a certain amount of energy, Ei, in processing an instruction. This amount increases with the complexity of the microprocessor. For example, an out-of-order microprocessor consumes more energy per instruction than an in-order microprocessor. When speculation is employed, some processed instructions are later discarded. The ratio of useful to total number of processed instructions is U. The total IPC including speculated instructions is therefore IPC/U. Given these observations, a number of conclusions can be drawn. The energy per second, hence power, is proportional to the number of processed instructions per second and the amount of energy consumed per instruction, that is, (IPC/U) × Frequency × Ei. The energy efficiency, measured in MIPS/watt, is therefore proportional to U/Ei. This value deteriorates as speculation increases and
complexity grows.
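The following back-of-the-envelope calculation illustrates the U/Ei relation with invented numbers: the more aggressive design is faster but processes more (partly wasted) instructions at a higher energy cost each.

    # MIPS/watt with speculation waste: efficiency scales as useful_fraction / energy_per_instr.

    def mips_per_watt(useful_ipc, freq_hz, useful_fraction, joules_per_instruction):
        processed_per_sec = (useful_ipc / useful_fraction) * freq_hz
        power_watts = processed_per_sec * joules_per_instruction
        return (useful_ipc * freq_hz / 1e6) / power_watts

    in_order     = mips_per_watt(1.0, 1e9, useful_fraction=1.0, joules_per_instruction=2e-9)
    out_of_order = mips_per_watt(1.6, 1e9, useful_fraction=0.8, joules_per_instruction=4e-9)
    print(f"in-order    : {in_order:.0f} MIPS/W")
    print(f"out-of-order: {out_of_order:.0f} MIPS/W (faster, but less energy efficient)")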
One main goal of microarchitecture research is to design a
microprocessor that can accomplish a group of tasks (applications)
in the shortest amount of time while using minimum
amount of power and incurring the least amount of cost. The
design process involves evaluating many parameters and balancing
these three targets optimally with given process and
circuit technology.
III. MICROPROCESSORS—CURRENT TRENDS AND
CHALLENGES
In the past 25 years, chip density and the associated computer industry have grown at an exponential rate. This phenomenon
is known as "Moore's Law" and characterizes almost
every aspect of this industry, such as transistor density,
die area, microprocessor frequency, and power consumption.
This trend was possible due to the improvements in fabrication
process technology and microprocessor microarchitecture.
This section focuses on the architectural and the microarchitectural
improvements over the years and elaborates
on some of the current challenges the microprocessor industry
is facing.
A. Improving Performance
As stated earlier, performance can be improved by increasing IPC and/or frequency or by decreasing the instruction count. Several architecture directions have been taken to improve performance. Reduced instruction set computer (RISC) architecture seeks to increase both frequency and IPC via pipelining and use of cache memories at the expense of increased instruction count. Complex instruction set computer (CISC) microprocessors employ a RISC-like internal representation to achieve higher frequency while maintaining lower instruction count. Recently, the very long instruction word (VLIW) [16] concept was revived with Explicitly Parallel Instruction Computing (EPIC) [17]. EPIC uses the compiler to schedule instructions statically. Exploiting parallelism statically can enable simpler control logic and help EPIC to achieve higher IPC and higher frequency.
1) Improving Frequency via Pipelining: Process technology and microarchitecture innovations enable doubling the frequency every process generation. Fig. 2
presents the contribution of both: as the process improves,
the frequency increases and the average amount of work
done in pipeline stages decreases. For example, the number
of gate delays per pipe stage was reduced by about three
Fig. 3. Frequency and performance improvements—synthetic model. (Source: E. Grochowski, Intel, 1997.)
times over a period of ten years. Reducing the stage length
is achieved by improving design techniques and increasing
the number of stages in the pipe. While in-order microprocessors
used four to five pipe stages, modern out-of-order
microprocessors can use over ten pipe stages. With frequencies
higher than 1 GHz, we can expect over 20 pipeline
stages.
Improvement in frequency does not always improve
performance. Fig. 3 measures the impact of increasing the
number of pipeline stages on performance using a synthetic
model of an in-order superscalar machine. Performance
scales less than frequency (e.g., going from 6 to 12 stages
yields only a 1.75 times speedup, from 6 to 23 yields only 2.2
times). Performance improves less than linearly due to cache
misses and branch mispredictions. There are two interesting
singular points in the graph that deserve special attention.
The first (at pipeline depth of 13 stages) reflects the point
where the cycle time becomes so short that two cycles are
needed to reach the first level cache. The second (at pipeline
depth of 24 stages) reflects the point where the cycle time
becomes extremely short so that two cycles are needed
to complete even a simple ALU operation. Increasing the
latency of basic operations introduces more pipeline stalls
and impacts performance significantly. Please note that these
trends are true for any pipeline design though the specific
data points may vary depending on the architecture and the
process. In order to keep the pace of performance growth,
one of the main challenges is to increase the frequency
without negatively impacting the IPC. The next sections
discuss some IPC related issues.
2) Instruction Supply Challenges: The instruction
supply is responsible for feeding the pipeline with useful
instructions. The rate of instructions entering the pipeline
depends on the fetch bandwidth and the fraction of useful
instructions in that stream. The fetch rate depends on the
effectiveness of the memory subsystem and is discussed
later along with data supply issues. The number of useful
instructions in the instruction stream depends on the ISA and
the handling of branches. Useless instructions result from: 1) a control flow change within a block of fetched instructions, leaving the rest of the cache block unused; and 2) branch mispredictions, which bring instructions from the wrong path that are later discarded. On average, a branch occurs every four
to five instructions. Hence, appropriate fetch bandwidth and
accurate branch prediction are crucial.
Once instructions are fetched into the machine they are
decoded. RISC architectures, using fixed length instructions,
can easily decode instructions in parallel. Parallel decoding is
a major challenge for CISC architectures, such as IA32, that
use variable length instructions. Some implementations [18]
use speculative decoders to decode from several potential instruction
addresses and later discard the wrong ones; others
[19] store additional information in the instruction cache to
ease decoding. Some IA32 implementations (e.g., the Pentium
II microprocessor) translate the IA32 instructions into
an internal representation (micro-operations), allowing the
internal part of the microprocessor to work on simple instructions
at high frequency, similar to RISC microprocessors.
3) Efficient Execution: The front-end stages of the
pipeline prepare the instructions in either an instruction
Fig. 4. Landscape of microprocessor families.
window [20] or reservation stations [21]. The execution core
schedules and executes these instructions. Modern microprocessors
use multiple execution units to increase parallelism.
Performance gain is limited by the amount of parallelism found in the instruction window. The parallelism in today's machines is limited by the data dependencies in the program and by memory delays and resource contention stalls.
Studies show that in theory, high levels of parallelism are
achievable [22]. In practice, however, this parallelism is not
realized, even when the number of execution units is abundant.
More parallelism requires higher fetch bandwidth, a
larger instruction window, and a wider dependency tracker
and instruction scheduler. Enlarging such structures involves
polynomial complexity increase for less than a linear performance
gain (e.g., scheduling complexity is on the order of the square of the scheduling window size [23]). VLIW architectures
[16] such as IA64 EPIC [17] avoid some of this complexity
by using the compiler to schedule instructions.
Accurate branch prediction is critical for deep pipelines in
reducing misprediction penalty. Branch predictors have become
larger and more sophisticated. The Pentium microprocessor
[18] uses 256 entries of 2-bit predictors (the predictor
and the target arrays consume 15 kB) that achieve 85%
correct prediction rate. The Pentium III microprocessor [24]
uses 512 entries of two-level local branch predictor (consuming
30 kB) and yields 90% prediction rate. The Alpha
21 264 [25] uses a hybrid multilevel selector predictor with
5120 entries (consuming 80 kB) and achieves 94% accuracy.
As pipelines become deeper and fetch bandwidth becomes
wider, microprocessors will have to predict multiple
branches in each cycle and use bigger multilevel branch
prediction structures similar to caches.
B. Accelerating Data Supply
All modern microprocessors employ memory hierarchy.
The growing gap between the frequency of the microprocessor, which doubles every two to three years, and the main memory access time, which improves by only 7% per year, imposes a major challenge. The latency of today's main memory is 100 ns, which approximately equals 100 microprocessor
cycles. The efficiency of the memory hierarchy is highly dependent
on the software and varies widely for different applications.
The size of cache memories increases according to
Moore's Law. The main reason for bigger caches is to
support a bigger working set. New applications such as
multimedia and communication applications use larger data
structures, hence bigger working sets, than traditional applications.
Also, the use of multiprocessing and multithreading
in modern operating systems such as Windows NT and Linux
causes frequent switches among applications. This results in
further growth of the active working set.
Increasing the cache memory size increases its access
time. Fast microprocessors, such as the Alpha or the Pentium
III microprocessors, integrate two levels of caches
on the microprocessor die to get improved average access
time to the memory hierarchy. Embedded microprocessors
integrate bigger, but slower dynamic random access memory
(DRAM) on the die. DRAM on die involves higher latency,
manufacturing difficulty, and software complexity and
is, therefore, not attractive for use in current generation
general-purpose microprocessors. Prefetching is a different
technique to reduce access time to memory. Prefetching
anticipates the data or instructions the program will access
in the near future and brings them to the cache ahead
of time. Prefetching can be implemented as a hardware
mechanism or can be instrumented with software. Many
microprocessors use a simple hardware prefetching [26]
mechanism to bring ahead "one instruction cache line" into
the cache. This mechanism is very efficient for manipulating
instruction streams, but less effective in manipulating data
due to cache pollution. A different approach uses ISA
extensions; e.g., the Pentium III microprocessor prefetch
instruction hints to the hardware, to prefetch a cache line. To
implement prefetching, the microarchitecture has to support
a "nonblocking" access to the cache memory hierarchy.
C. Frequency Versus IPC
SPEC rating is a standard measure of performance based
on total execution time of a SPEC benchmark suite. Fig. 4
plots the "landscape" of microprocessors based on their
performance. The horizontal axis is the megahertz rating of
a microprocessor's frequency. The vertical axis is the ratio
of SpecINT/MHz, which roughly corresponds to the IPC
assuming instruction count remains constant. The different
curves represent different levels of performance with increasing
performance as we move toward curves in the upper
right corner. All points on the same curve represent the same
performance level, i.e., SPEC rating. Performance can be
increased by either increasing the megahertz rating (moving
toward the right) or by increasing the SpecINT/MHz ratio
(moving toward the top) or by increasing both. For a given
family of microprocessors with the same ISA (and, hence,
the same instruction count), the SpecINT/MHz ratio is
effectively the measure of their relative IPC.
For example, let us examine the curve that represents the
Intel IA32 family of microprocessors. The first point in the
curve represents the Intel386 microprocessor. The next point
represents the Intel486 microprocessor. The main improvement
between these two microprocessors is due to the improvement
of IPC. This is obtained through the pipelined design
of the Intel486 microprocessor and the introduction of
the L1 cache.
The Pentium microprocessor was the first superscalar
machine Intel introduced; it also featured branch prediction
and split caches. Performance gain came from parallelism
as well as reduced stalls. Subsequent proliferation of the
Pentium microprocessor involved frequency boosts and
relatively small microarchitectural modifications, leaving
the IPC almost unchanged.
The next level of performance appears with the Pentium
Pro microprocessor. This microprocessor, followed later by
the Pentium II and Pentium III microprocessors, is a deeply
pipelined superscalar out-of-order microprocessor, which simultaneously
improved both frequency and IPC.
Other families of microprocessors show similar trends.
New microarchitecture can boost both IPC and frequency
while new process technology typically boosts frequency
only. In some cases (see the Alpha 21064), higher frequency
may even reduce IPC, but overall performance is still
increased.
D. Power Scaling
Each new microprocessor generation introduces higher
frequency and higher power consumption. With three times
less energy per gate switch and 1.5 times frequency increase,
simple shrinking of a microprocessor to a new process
can reduce power consumption close to two times. A new
microarchitecture, on the other hand, increases work per
Fig. 5. Maximum power consumption. (Source: S. Borkar [4].)
Fig. 6. Power density evolution. (Source: S. Borkar [4].)
instruction and wastes energy on wrongly speculated instructions,
hence reducing energy efficiency and increasing
the total power consumption of the microprocessor.
Fig. 5 supports the above observation. When a new generation
of microprocessors is introduced, it consumes twice
the power of the previous generation using the same process.
After a while, the microprocessor moves to a new process
and improves both power and frequency. Extrapolation of
the power dissipation numbers suggests that microprocessors
may consume 2000 W in the near future! Up until now, power
density was handled using packaging solution to dissipate
heat. Fig. 6 indicates that power density will soon become a
major problem. Packaging alone will not be able to address
power density in a cost-effective manner. Note that chips in 0.6-μm technology used to have power density similar to a (cooking) hot plate (10 W/cm²). If the trend continues, soon
we may experience a chip with the power density of a nuclear
power plant or even a rocket nozzle. Furthermore, local
hot spots have significantly higher power density than the average,
making the situation even worse. Potential solutions to
the power density problems are addressed in Section IV.
E. Application Specific Enhancements
The energy crisis encourages the development of application-
specific enhancements (ASEs), also termed
focus-MIPS, which aim to achieve better performance and
better MIPS for specific classes of applications. Intel's
MMX technology and streaming single instruction multiple
data (SIMD) extensions are good examples. These SIMD
extensions allow a single instruction to perform the same
operation on two, four, or eight data elements at the same
time, potentially improving performance and reducing
power appropriately. Other microprocessor manufacturers
have introduced similar extensions—AMD 3DNOW! [27],
Sun VIS [28], Motorola Altivec [29] technology, and more.
F. Coming Challenges
Until recently, microarchitects and designers used the
increased transistor budget and shorter gate delay to focus
mainly on developing faster, bigger, and hotter microprocessors.
Lower cost and lower power microprocessors were
made available by waterfalling—a process shrink that takes
yesterday's hot and expensive microprocessor and makes
it today's low-power and inexpensive microprocessor. This
will change. Power consumed by future high-performing
microprocessors will reach levels that cannot suit mobile
computers even after a process shrink.
Single thread performance does not scale with frequency
and area. Memory latencies and bandwidth do not scale
as well either, slowing down memory-bound applications.
Deeper pipes allow higher frequency at the cost of longer
stalls due to branch misprediction and load-to-use delays. A
design challenge is to minimize stalls while still increasing
instruction-level parallelism (ILP).
Controlling power becomes a major challenge. Process
technology reduces the switching power of a single device,
but higher frequency and increased device density increase
power dissipation at a faster rate. Power density is reaching
levels close to nuclear reactors. Power is becoming the
limiting factor in microprocessor design. Microarchitecture
should help reduce power.
ASE may be a good power efficient design alternative
but at a cost. Developing ASEs requires a lot of work in
defining new architectures—writing new development tools
and porting existing applications to the new architecture.
With frequencies in excess of 1 GHz, even moving at the
speed of light, signals can only travel a short distance. In the
future, it may take several cycles for a signal to travel across
the chip. Wire delays exhibit an inherent conflict of size and
time: bigger structures that may improve performance (e.g.,
caches and branch predictors) may take longer to access. Incorporating
large structures that provide higher ILP without
sacrificing frequency will be one of the main challenges.
IV. FUTURE DIRECTIONS
In this section, we present potential directions and solutions
to address these challenges. Many solutions often require
difficult tradeoffs. In particular, we can exploit more
performance, but at the cost of power and area. While numerous
techniques are proposed, the applicability of a certain
technique to a certain market segment has to be examined according
to the intended usage of the microprocessor.
A. Single Thread Performance Improvements
As stated earlier, performance is dictated by frequency
and IPC. Process technology is likely to continue to enable
frequency scaling close to 1.5 times per generation. Scaling
frequency beyond process advancement requires reduction
in the number of logic levels per pipe stage. This number
is already low and reducing it further is a complex task
with questionable gain. Certain operations, usually those
with a very short dependency path, do benefit from higher
frequency. Other operations scale more easily by running
several of them in parallel rather than by trying to pipeline
them or run them faster, even at the cost of duplicating the
logic. To address this, there may be different clock domains
on the same microprocessor. Simple execution units (e.g.,
adders) may run at a higher frequency, enabling very fast
results forwarding to dependent instructions, thus cutting
the program critical path and improving performance. Other
operations such as decoding multiple RISC instructions are
more easily implemented using slower parallel decoders
than faster pipelined ones.
Many techniques have the potential to increase IPC. Techniques
are described according to the function they influence—
instruction supply, execution, and data supply.
1) Instruction Supply: Instruction supply can be improved
by reducing the number of stalls (no instructions are
fetched) and by increasing the fetch bandwidth (the number
of instructions fetched per cycle). The major sources
for stalls are instruction cache misses and mispredicted
branches. Instruction cache stalls can be addressed by
general memory hierarchy optimization methods. Some
relevant techniques are mentioned under data supply later.
Mispredicted branches create long stalls. Even at 5% misprediction
rate, shaving an additional percent off may improve
performance of a deeply pipelined microprocessor by
4%! More sophisticated adaptive or hybrid branch predictors
have been proposed [30].
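The sensitivity claim can be sanity-checked with a simple CPI model; the branch frequency and flush penalty below are assumed values for a deep pipeline, not figures from the text.

    # How much does cutting the misprediction rate from 5% to 4% help? (illustrative model)

    def cpi(base_cpi, branch_fraction, mispredict_rate, flush_penalty_cycles):
        return base_cpi + branch_fraction * mispredict_rate * flush_penalty_cycles

    before = cpi(1.0, branch_fraction=0.2, mispredict_rate=0.05, flush_penalty_cycles=20)
    after  = cpi(1.0, branch_fraction=0.2, mispredict_rate=0.04, flush_penalty_cycles=20)
    print(f"CPI {before:.2f} -> {after:.2f}: about {100 * (before / after - 1):.1f}% faster")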
Accepting that some branches will be mispredicted, several
mechanisms have been proposed to reduce the misprediction
penalty. One direction is to reduce the length of the
pipeline by maintaining a decoded instruction cache. Especially
for the IA-32 architecture, eliminating the decoding
stage following a misprediction can reduce stall penalty significantly.
A similar, but potentially more cost-effective approach
suggests holding a decoded version of the alternate
program path only.
A more ambitious idea, termed eager execution, suggests
executing both the taken and the not taken paths of difficult
to predict branches [31]. When the branch is resolved,
the wrong path is flushed. Eager execution should use
a confidence mechanism to determine the likelihood of
a branch to be mispredicted [32]. Eager execution may
boost performance, but it is a very inefficient technique.
Both paths are executed and consume power with only
one doing useful work. Moreover, the ability to perform
both paths without performance penalty requires additional
resources—increasing cost and complexity.
Another goal is to increase the number of fetched instructions
per cycle. To do that in a sustainable manner, one needs
to fetch more than one contiguous basic block at a time. The
initial attempt is to try to perform multiple branch predictions
and multiple block prefetching. These techniques increase
fetch bandwidth but involve multiporting and complex
steering logic to arrange instructions coming from several
different cache lines. Trace caches, block caches, and
extended basic block caches (XBCs) were proposed as better
alternatives that provide both high bandwidth and low latencies
[33]-[36]. These cache-like structures combine instructions
from various basic blocks and hold them together. Optionally,
the instructions can be stored in decoded form to
reduce latency. Traces are combined based on runtime behavior.
Traces take time to build but, once used, exhibit very
fast fetching and execution on every usage.
Several studies are extending the concept of trace caches
and annotate or reorder instructions within the traces (scheduled
trace cache). This enables faster dependency tracking
and instruction scheduling, gaining higher execution bandwidth
and lower latency at the cost of more complicated trace
building and less flexibility in accessing portions of the trace
[37].
Instruction fusion reduces the amount of work by treating
several (e.g., two) instructions as a single combined instruction,
best thought of as carpooling. The longer the two instructions can travel together in the pipeline, the fewer resources are needed and the less power is consumed. In particular, in cases
where we can build an execution unit that can execute dependent
fused instructions together, we can reduce the program
critical path and gain performance [38].
2) Execution: Increased instruction-level parallelism
starts with a wider machine. More execution units, a larger
out-of-order instruction window, and the ability to process
more dependency tracking and instruction scheduling per
cycle are required, but are not enough. With out-of-order
execution, the main limiting factor is not lack of resources;
it is data dependency imposed by the data flow among
instructions.
For years, data flow was perceived as a theoretical barrier
that cannot be broken. Recently, several techniques, collectively
termed beyond data flow, showed that this barrier
can be broken, resulting in more parallelism and increased
performance. Some of these techniques use super speculation:
they predict results, addresses, or relations among instructions
to cut dependencies.
Value prediction [39]-[41] is probably the best example in
this domain. Value prediction tries to predict the result of an
incoming instruction based on previously computed results
and, optionally, the program path (control flow) leading to
that instruction. When a prediction is made, the instructions
that depend on that result can be dispatched without waiting
for the actual computation of the predicted result. The actual
computation is done for verification only. Of course, when
the prediction is wrong, already executed dependent instructions
must be reexecuted. Since mispredictions cost both performance
and power, the rate of misprediction has to be minimized.
While one would assume that the probability of predicting
a value out of 2^32 or 2^64 potential values is rather low, studies show that about 40% or more of the results can be correctly
predicted. The simplest predictor is a last-value predictor,
which predicts that the last result of an instruction will also
be the result when the same instruction is executed again.
More sophisticated predictors that identify patterns of values
achieve even higher prediction rates. A 40% correct prediction
rate does not mean that the other 60% are wrong. Confidence
mechanisms are used to make sure we predict only those instructions
that are likely to be predicted correctly, reducing
the number of mispredictions to 1%-2% and less.
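A last-value predictor with a confidence counter can be sketched in a few lines; the threshold and the example value stream below are arbitrary, and real predictors also track value patterns and the control-flow context.

    # Last-value prediction gated by a small saturating confidence counter (toy model).

    class LastValuePredictor:
        def __init__(self, threshold=2):
            self.last = {}           # instruction address -> last observed result
            self.conf = {}           # instruction address -> confidence counter
            self.threshold = threshold

        def predict(self, pc):
            """Return a predicted result, or None if confidence is too low."""
            return self.last.get(pc) if self.conf.get(pc, 0) >= self.threshold else None

        def update(self, pc, actual):
            self.conf[pc] = min(3, self.conf.get(pc, 0) + 1) if self.last.get(pc) == actual else 0
            self.last[pc] = actual

    vp = LastValuePredictor()
    for actual in [7, 7, 7, 7, 9]:   # an instruction that usually produces the same value
        print(f"predicted {vp.predict(0x1234)}, actual {actual}")
        vp.update(0x1234, actual)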
Similarly, memory-address prediction predicts the
memory address needed by a load instruction in case the
load address cannot be computed since, for example, it
contains a register that is not known yet. When the predictor
provides an address with reasonable confidence, the load can
be issued immediately, potentially cutting the load-to-use
delay. Predicting store addresses can also improve performance.
Several address predictors were developed to predict
various common address patterns [42].
Instruction reuse is a family of nonspeculative techniques
that try to trade off computation with table lookup. The
simplest (and oldest) form of this technique is value cache
[43], where long latency operations (e.g., divide) are cached
along with their operands and results. Future occurrences
are looked up in the cache for match before execution. The
more interesting form of instruction reuse [44] tries to examine
a sequence of incoming instructions along with their
combined input data to produce the combined result without
actually computing it. The technique is nonspeculative and,
thus, can provide both performance and power reduction,
but it does require a large amount of storage.
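In its simplest form, the value cache is just memoization keyed by operation and operands, as in the sketch below; the division example and the table organization are illustrative assumptions.

    # A minimal "value cache": reuse the result of a long-latency operation
    # when the same operation is seen again with the same operands.

    value_cache = {}

    def cached_divide(a, b):
        key = ("div", a, b)
        if key in value_cache:
            return value_cache[key], True       # reused: no execution needed
        result = a // b                         # the "expensive" computation
        value_cache[key] = result
        return result, False

    for a, b in [(1000, 7), (1000, 7), (81, 9)]:
        result, reused = cached_divide(a, b)
        print(f"{a} / {b} = {result} ({'reused' if reused else 'computed'})")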
Memory addresses are computed at execution time; thus, dependencies among memory accesses are not simple to determine.
In particular, the microprocessor cannot determine
whether a certain load accesses the same address as a previous
store. If the addresses are different, the load can be
advanced ahead of the store. If they are the same, the store
can forward the result directly to the load without requiring
the data to be written first to memory. Microprocessors today
tend to take the conservative approach in which all previous
store addresses have to be known before the load can be issued.
Memory disambiguation techniques [45] have been devised
to predict, with high confidence, whether a load collides
with a previous store to allow advancing or value forwarding.
Memory disambiguation can be taken a step forward. If the
exact load-store pair can be predicted, the load can bypass the
store completely, taking its data directly from the producer
of the value to be stored. Furthermore, in such a case, the
consumers of the load, that is, the instructions that use the
loaded value can also take their operands directly from the
producer, bypassing both the store and loads and boosting
performance significantly [46]. A generalized version of this
concept, termed unified renaming, uses a value identity detector to determine whether two future results will actually
be the same [47]. If so, all references to the later result are
converted to reference the former result, thus collapsing the
program critical path. Loads following stores are covered by
this technique since the load result is identical to the source
of the store result.
Register tracking can also be used to avoid dependencies.
For example, in the IA-32 architecture, the stack pointer is
frequently changed by push/pop operations that add/subtract
a constant from the stack pointer value. Every push/pop depends
on all previous push/pops. The register tracking mechanism
computes the stack pointer values at decode, making
them all independent of one another [48].
3) Data Supply: Probably the biggest performance challenge
today is the speed mismatch between the microprocessor
and memory. If outstanding cache misses generated by
a single thread, task, or program can be grouped and executed
in an overlapped manner, overall performance is improved.
The term memory-level parallelism (MLP) [49] refers to the
number of outstanding cache misses that can be generated
and executed in an overlapping manner. Many microarchitectural
techniques have been devised to increase MLP.
Many of the techniques mentioned in the execution
section above increase MLP. Value prediction, memory
address prediction, and memory disambiguation enable
speculative advancing of loads beyond what a conservative
ordering would allow, potentially resulting in overlapping
loads. Load-store forwarding and unified renaming eliminate
dependencies among memory operations and increase
memory parallelism. Surely, more resources such as a larger
instruction window, more load/store execution units, and
more memory ports will increase MLP.
A different technique to increase MLP is data prefetching.
Data prefetching attempts to guess which portions of
memory will be needed and to bring them into the microprocessor
ahead of time. In floating-point and multimedia
applications, the memory access pattern consists of several
simple contiguous data streams. Special hardware tracks
memory accesses looking for such patterns [50]. When
detected, an access to a line so patterned triggers the fetch
of the next cache line into an ordinary cache or a prefetch
buffer. More sophisticated techniques (e.g., context-based
prefetching [51]) can also improve other types of applications.
Prefetching algorithms try to consider the specific
memory-paging behavior [synchronous DRAM (SDRAM),
rambus DRAM (RDRAM)] to optimize performance and
power. Prefetching costs extra hardware, may overload the
buses, and may thrash the cache. Thus, prefetch requests
should be issued only when there is enough confidence that
they will be used.
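The "one cache line ahead" hardware scheme can be modeled very simply: on an access to line N, line N+1 is fetched if it is not already present. The line size and the sequential access stream below are assumptions for illustration.

    # Next-line prefetching on a sequential access stream (toy model).

    LINE_BYTES = 64

    class NextLinePrefetcher:
        def __init__(self):
            self.present = set()                  # cache lines already fetched
            self.prefetches = 0

        def access(self, address):
            line = address // LINE_BYTES
            hit = line in self.present
            self.present.add(line)
            if line + 1 not in self.present:      # fetch the next line ahead of time
                self.present.add(line + 1)
                self.prefetches += 1
            return hit

    pf = NextLinePrefetcher()
    hits = sum(pf.access(addr) for addr in range(0, 64 * 16, 64))   # 16 sequential lines
    print(f"hits: {hits}/16, prefetches issued: {pf.prefetches}")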
Embedded DRAM and similar process/circuit techniques
attempt to increase the amount of memory on die. If
enough on-die memory is available, it may be used not
only as a cache, but also as a faster memory. Operating
system/application software may adapt known methods
from networked/microprocessor/nonuniform memory access
systems to such architectures. Big memory arrays, to
a certain degree, are also considered more energy efficient
than complex logic. Since memories are more energy
efficient than logic, DRAM on die may become an attractive
feature in the future.
Of course, much effort is being put into improving traditional
caches. With the increase in microprocessor frequency,
more levels of memories will be integrated on the die; on the
other hand, the size of the L1 cache will be reduced to keep up
with the ultrahigh frequency. Such memory organization may
cause a coherency problem within the chip. This problem
may be resolved using similar methods that resolve the coherency
problem in a multiprocessor (MP) system today. In
addition, researchers will continue to look for better replacement
algorithms and additional structures to hold streaming
data so as to eliminate cache thrashing.
4) Pushing ILP/MLP Further: Previous sections discuss
ways to increase parallelism in a single thread. The next level
of parallelism is achieved by running several threads in parallel,
thus achieving increased parallelism at a coarser grain
level. This section briefly covers several ideas that aim at
speeding up single thread performance by splitting it into
several threads. Improving parallelism by running several independent
threads simultaneously is addressed at the end of
this section.
Selective preprocessing of instructions was suggested to
reduce branch misprediction and cache miss stalls. The idea
is to use a separate "processor" or thread to run ahead of
the actual core processor and execute only those instructions
that are needed to resolve the difficult-to-predict branches
or compute the addresses of loads which are likely to cause
cache misses. Early execution of such loads acts very much
like data prefetch. Precomputed correct control information
is passed to the real thread, which can reduce the number of
branch mispredictions.
Dynamic multithreading [52] attempts to speculatively execute
instructions following a call or loop-end along with the
called procedure or the loop body (thread speculation). Dependencies
among the executed instruction blocks may be
circumvented by value prediction. With sufficient resources,
useful work can be done speculatively. Even if some of the
work is found erroneous and has to be redone, prefetching
side effects may help performance.
The multiscalar technique [53] extends the concept of
speculative threads beyond dynamic multithreading. The
classic multiscalar idea calls for the help of the user and/or
the compiler to define potential threads and to provide
dependency information. A thread is ideally a self-contained
piece of work with as few inputs and outputs as possible.
A separate mechanism verifies the user/compiler provided
dependencies (e.g., memory conflicts).
B. Process Technology Challenges
Besides pushing the performance envelope of a
single-thread instruction stream, microarchitects are
faced with some process technology related challenges.
1) Power Challenge: So far we have mainly considered
the active power and ignored the leakage power. Leakage
power is generated by current running in gates and wires
even when there is no activity. There are three leakage
sources: 1) the subthreshold drain-source leakage; 2)
junction leakage; and 3) gate-substrate leakage. To gain
performance, a lower threshold voltage (Vt) is used. Lower Vt, however, increases leakage power (subthreshold). Also,
higher power densities increase device temperature, which
in turn increases leakage power (junction). As technology
scales, oxide thickness is also scaled which causes more
leakage power (gate). As such, controlling power density is
becoming even more crucial in future deep submicron technologies.
When adding complexity to exploit performance,
one must understand and consider the power impact. In
addition, one needs to target the reduction of standby power
dissipation. There are three general approaches to power
reduction: 1) increase energy efficiency; 2) conserve energy
when a module is not in use; and 3) recycle and reuse.
Following are several possible suggested solutions.
Trading frequency for IPC is generally perceived as
energy efficient. Higher IPC with lower frequency and a
lower voltage can achieve the same performance but with
less power. This is not always accurate. Usually, higher IPC
requires additional complexity that involves additional work
to be done on each instruction (e.g., dependency tracking).
Moreover, many instructions are speculatively processed
and later discarded or reexecuted (e.g., wrong data was
predicted). Some performance-related techniques (e.g., load
bypassing) do reduce the work performed per instruction
and thus are more energy friendly. Better predictors with
confidence-level evaluation can be combined to control
speculation and avoid the waste of energy.
It is common sense to shut down circuits when they are not
in use to conserve energy. Shutting down logic and circuits
can be applied to part or the whole microprocessor. However,
the sharp current swings associated with frequent gating
cause " " noise. This noise is also referred to as the simultaneous
switching noise and is characterized by the equation:
, where is the voltage fluctuation,
is the inductance of the wire supplying the current, is the
current change, and is the rise or fall time of gates. One potential
area of research is to determine the right time to shut
down blocks to avoid sharp current swings. Prediction algorithms
based on program dynamic behavior may be useful.
Dynamic voltage/frequency scaling [54] can reduce
energy consumption in microprocessors without impacting
the peak performance. This approach varies the microprocessor
voltage or frequency under software control to
meet dynamically varying performance requirements. This
method conserves energy when maximum throughput is
not required. Intel's SpeedStep technology [55] and Transmeta's
LongRun technology [56], [57] both use dynamic
voltage/frequency scaling.
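The energy argument behind voltage/frequency scaling can be illustrated with the dynamic-power relation used earlier; the task size, voltages, and frequencies below are invented, and leakage is ignored.

    # Running a fixed task slower at lower voltage: more time, less energy (dynamic power only).

    def run_task(cycles, freq_hz, vcc, c_eff=40e-9):
        power = c_eff * vcc ** 2 * freq_hz        # P = C * Vcc^2 * f
        seconds = cycles / freq_hz
        return seconds, power * seconds           # time, energy in joules

    for label, f, v in [("full speed ", 1.0e9, 1.5), ("scaled down", 0.6e9, 1.1)]:
        t, e = run_task(2e9, f, v)
        print(f"{label}: {t:.1f} s, {e:.0f} J")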
Some microarchitecture ideas can actually reduce the
number of instructions [58] processed, thus saving energy.
Instruction reuse, mentioned earlier, saves previously executed
results in a table and can trade repeated execution
with table lookup, potentially increasing performance and
saving energy.
2) Wire Delays: Wiring is becoming a bottleneck for frequency
improvement. Wires do not scale like gates. Wire delay depends on the wire resistance, which is inversely proportional to the cross section of the wire, and on the coupling capacitance between adjacent wires, which grows as the wire spacing shrinks. If all dimensions of a wire are scaled (width, thickness, and length), wire delay remains fixed (in nanoseconds) for the scaled length [59]. Incr