Saturday, May 31, 2008

EXAMPLES OF CPU'S

One could argue that the obsolete and discontinued models no longer have any practical significance. This is true to some extent; but the old processors form part of the “family tree”, and there are still legacies from their architectures in our modern CPU’s, because the development has been evolutionary. Each new processor extended and built “on top of” an existing architecture.


The evolutionary development spirals ever outwards.

There is therefore value (one way or another) in knowing about the development from one generation of CPU’s to the next. If nothing else, it may give us a feeling for what we can expect from the future.

16 bits – the 8086, 8088 and 80286

The first PC’s were 16-bit machines. This meant that they could basically only work with text. They were tied to DOS, and could normally only manage one program at a time.

But the original 8086 processor was still “too good” to be used in standard office PC’s. The Intel 8088 discount model was therefore introduced, in which the bus between the CPU and RAM was halved in width (to 8 bits), making production of the motherboard much cheaper. 8088 machines typically had 256 KB, 512 KB or 1 MB of RAM. But that was adequate for the programs at the time.

The Intel 80286 (from 1984) was the first step towards faster and more powerful CPU's. The 286 was much more efficient; it simply performed much more work per clock tick than the 8086/8088 did. A new feature was also protected mode – a new way of working which made the processor much more efficient than real mode, which the 8086/8088 processor forced programs to work in:


  • Access to all system memory – even beyond the 1 MB limit which applied to real mode.
  • Access to multitasking, which means that the operating system can run several programs at the same time.
  • The possibility of virtual memory, which means that the hard disk can be used to emulate extra RAM, when necessary, via a swap file.
  • 32 bit access to RAM and 32 bit drivers for I/O devices (these arrived with the later 32-bit processors).

Protected mode paved the way for the change from DOS to Windows, which only came in the 1990's.

  • An Intel 8086, the first 16-bit processor. Top: the incredibly popular 8-bit processor, the Zilog Z80, which the 8086 and its successors outcompeted.

    32 bits – the 80386 and 486

    The Intel 80386 was the first 32-bit CPU. The 386 has 32-bit registers and a 32-bit data bus, both internally and externally. But for a traditional DOS based PC, it didn't bring about any great revolution. A good 286 ran nearly as fast as the first 386's – under DOS anyway, since DOS doesn't exploit the 32-bit architecture.

    The 80386SX became the most popular chip – a discount edition of the 386DX. The SX had a 16-bit external data bus (as opposed to the DX’s 32-bit bus), and that made it possible to build cheap PC’s.


    The fourth generation

    The fourth generation of Intel’s CPU’s was called the 80486. It featured a better implementation of the x86 instructions – which executed faster, in a more RISC-like manner. The 486 was also the first CPU with built-in L1 cache. The result was that the 486 worked roughly twice as fast as its predecessor – for the same clock frequency.

    With the 80486 we gained a built-in FPU. Then Intel did a marketing trick of the type we would be better off without. In order to be able to market a cheap edition of the 486, they hit on the idea of disabling the FPU function in some of the chips. These were then sold under the name, 80486SX. It was ridiculous – the processors had a built-in FPU; it had just been switched off in order to be able to segment the market.


  • Two 486’s from two different manufacturers.

    But the 486 was a good processor, and it had a long life under DOS, Windows 3.11 and Windows 95. New editions were released with higher clock frequencies, as they hit on the idea of doubling the internal clock frequency in relation to the external (see the discussion later in the guide). These double-clocked processors were given the name, 80486DX2.

    A very popular model in this series had an external clock frequency of 33 MHz (in relation to RAM), while working at 66 MHz internally. This principle (double-clocking) has been employed in one way or another in all later generations of CPU's. AMD, IBM, Texas Instruments and Cyrix also produced a number of 80486 compatible CPU's.

    Pentium

    In 1993 came the big change to a new architecture. Intel’s Pentium was the first fifth-generation CPU. As with the earlier jumps to the next generation, the first versions weren’t especially fast. This was particularly true of the very first Pentium 60 MHz, which ran on 5 volts. They got burning hot – people said you could fry an egg on them. But the Pentium quickly benefited from new process technology, and by using clock doubling, the clock frequencies soon skyrocketed.

    Basically, the major innovation was a superscalar architecture. This meant that the Pentium could process several instructions at the same time (using several pipelines). At the same time, the RAM bus width was increased from 32 to 64 bits.


  • The Pentium processor could be viewed as two 80486’s built into one chip.

    Throughout the 1990's, AMD gained attention with its K5 and K6 processors, which were basically cheap (and fairly poor) copies of the Pentium. It wasn't until the K6-2 (which included the very successful 3DNow! extensions) that AMD showed the signs of independence which have since led to excellent processors like the AthlonXP.

    One of the earlier AMD processors. Today you’d hesitate to trust it to run a coffee machine…

    In 1997, the Pentium MMX followed (with the model name P55), introducing the MMX instructions already mentioned. At the same time, the L1 cache was doubled and the clock frequency was raised.

    The Pentium MMX. On the left, the die can be seen in the middle.

    Pentium II with new cache

    After the Pentium came the Pentium II. But Intel had already launched the Pentium Pro in 1995, which was the first CPU in the 6th generation. The Pentium Pro was primarily used in servers, but its architecture was re-used in the popular Pentium II, Celeron and Pentium III models, during 1997-2001.

    The Pentium II initially represented a technological step backwards. The Pentium Pro used an integrated L2 cache. That was very advanced at the time, but Intel chose to place the cache outside the actual Pentium II chip, to make production cheaper.


  • L2 cache running at half CPU speed in the Pentium II.

    The Level 2 cache was placed beside the CPU on a circuit board, an SEC module. The module was installed in a long Slot 1 socket on the motherboard. The figure shows the module with a cooling element attached. The CPU is sitting in the middle (under the fan). The L2 cache is in two chips, one on each side of the processor.

  • Pentium II processor module mounted on its edge in the motherboard’s Slot 1 socket (1997-1998).

    The disadvantage of this system was that the L2 cache became markedly slower than it would have been if it was integrated into the CPU. The L2 cache typically ran at half the CPU's clock frequency. AMD used the same system in their first Athlons. For these, the socket was called Slot A.

    At some point, Intel decided to launch a discount edition of the Pentium II – the Celeron processor. In the early versions, the L2 cache was simply scrapped from the module. That led to quite poor performance, but provided an opportunity for overclocking.

    Overclocking means pushing a CPU to work at a higher frequency than it is designed to work at. It was a very popular sport, especially early on, and the results were good.

  • One of the first AMD Athlon processors, mounted in a Slot A socket. See the large cooling element.

    One of the problems of overclocking a Pentium II was that the cache chips couldn’t keep up with the high speeds. Since these Celerons didn’t have any L2 cache, they could be seriously overclocked (with the right cooling).


  • Extreme CPU cooling using a complete refrigerator built into the PC cabinet. With equipment like this, CPU's can be pushed up to very high clock frequencies (see Kryotech.com and Asetek.com).

    Intel later decided to integrate the L2 cache into the processor. That happened in new versions of the Celeron in 1998 and new versions of the Pentium III in 1999. The socket design was also changed so that the processors could be mounted directly on the motherboard, in a socket called Socket 370. Similarly, AMD introduced their Socket A.

    Pentium 4 – long in the pipe

    The Pentium III was really just (yet) another edition of the Pentium II, which again was a new version of the Pentium Pro. All three processors built upon the same core architecture (Intel P6).

    It wasn’t until the Pentium 4 came along that we got a completely new processor from Intel. The core (P7) had a completely different design:

  • The L1 cache contained decoded instructions.
  • The pipeline had been doubled to 20 stages (in later versions increased to 31 stages).
  • The integer calculation units (ALU’s) had been double-clocked so that they can perform two micro operations per clock tick.
  • Furthermore, the memory bus, which connects the RAM to the north bridge, had been quad-pumped, so that it transfers four data packets per clock tick. That is equivalent to 4 x 100 MHz and 4 x 133 MHz in the earliest versions of the Pentium 4. In later versions the bus was pumped up to 4 x 200 MHz, and an update with 4 x 266 MHz is scheduled for 2005 (a worked bandwidth example follows below).
  • The processor was Hyper Threading-enabled, meaning that under certain circumstances it can operate as two logical CPU's.

    All of these factors are described elsewhere in the guide. The important thing to understand is that the Pentium 4 represents a completely new processor architecture.

    The four big changes seen in the Pentium 4.
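    As a rough check on what quad-pumping means in bandwidth terms, the small C program below multiplies four transfers per clock tick by the bus clock and by the width of the Pentium 4's front side bus (64 bits, i.e. 8 bytes). The figures are theoretical maximums; real transfers carry protocol overhead.

        #include <stdio.h>

        /* Theoretical bandwidth of a quad-pumped front side bus:           */
        /* 4 transfers per tick x bus clock (MHz) x bus width (8 bytes).    */
        int main(void)
        {
            double bus_clocks_mhz[] = { 100.0, 133.0, 200.0 };  /* base clocks from the text */
            for (int i = 0; i < 3; i++) {
                double mb_per_s = 4.0 * bus_clocks_mhz[i] * 8.0;
                printf("4 x %.0f MHz: about %.1f GB/sec\n", bus_clocks_mhz[i], mb_per_s / 1000.0);
            }
            return 0;
        }

    This gives roughly 3.2, 4.3 and 6.4 GB/sec for the three bus speeds mentioned above.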
    DATA AND INSTRUCTIONS

    Now it’s time to look more closely at the work of the CPU. After all, what does it actually do?

    Instructions and data

    Our CPU processes instructions and data. It receives orders from the software. The CPU is fed a steady stream of binary data via the RAM.

    These instructions can also be called program code. They include the commands which you constantly – via user programs – send to your PC using your keyboard and mouse. Commands to print, save, open, etc.

    Data is typically user data. Think about that email you are writing. The actual contents (the text, the letters) are user data. But when you and your software say "send", you are sending program code (instructions) to the processor:


    Instructions process the user data.

    Instructions and compatibility

    Instructions are binary code which the CPU can understand. Binary code (machine code) is the mechanism by which PC programs communicate with the processor.

    All processors, whether they are in PC’s or other types of computers, work with a particular instruction set. These instructions are the language that the CPU understands, and thus all programs have to communicate using these instructions. Here is a simplified example of some “machine code” – instructions written in the language the processor understands:

    proc near          ; start of a (near) procedure
    mov AX,01          ; load the value 1 into register AX
    mov BX,01          ; load the value 1 into register BX
    inc AX             ; increase AX by one (AX is now 2)
    add BX,AX          ; add AX to BX (BX is now 3)

    You can no doubt see that it wouldn't be much fun to have to use these kinds of instructions in order to write a program. That is why people use programming tools. Programs are written in a programming language (like Visual Basic or C++). But these program lines have to be translated into machine code – they have to be compiled – before they can run on a PC. The compiled program file contains instructions which can be understood by the particular processor (or processor family) the program has been "coded" for:


    The program code produced has to match the CPU’s instruction set. Otherwise it cannot be run.
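    As a rough illustration of this translation, here is a tiny C function together with the kind of x86 instructions a compiler might produce for it, shown as comments. The exact machine code varies with the compiler and its settings, so the assembly is only a plausible sketch, not the output of any particular compiler.

        /* A trivial C function, with a sketch of possible machine code in the comments. */
        int add_numbers(void)
        {
            int a = 1;      /* a compiler might emit:  mov AX,01 */
            int b = 1;      /*                         mov BX,01 */
            a = a + 1;      /*                         inc AX    */
            b = b + a;      /*                         add BX,AX */
            return b;       /*                         (result handed back in a register) */
        }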

    The processors from AMD and Intel which we have been focusing on in this guide are compatible, in that they understand the same instructions.

    There can be big differences in the way two processors, such as the Pentium and Pentium 4, process the instructions internally. But externally – from the programmer’s perspective – they all basically function the same way. All the processors in the PC family (regardless of manufacturer) can execute the same instructions and hence the same programs.

    And that’s precisely the advantage of the PC: Regardless of which PC you have, it can run the Windows programs you want to use.

    The x86 instruction set is common to all PC's.

    As the years have passed, changes have been made in the instruction set along the way. A PC with a Pentium 4 processor from 2002 can handle very different applications to those which an IBM XT with an 8088 processor from 1985 can. But on the other hand, you can expect all the programs which could run on the 8088 to still run on a Pentium 4 and on an Athlon 64. The software is backwards compatible.

    The entire software industry built up around the PC is based on the common x86 instruction set, which goes back to the earliest PC's. Extensions have been made, but the original instruction set from the late 1970's is still being used.

    x86 and CISC

    People sometimes differentiate between RISC and CISC based CPU’s. The (x86) instruction set of the original Intel 8086 processor is of the CISC type, which stands for Complex Instruction Set Computer.

    That means that the instructions are quite diverse and complex. The individual instructions vary in length from 8 to 120 bits. The set was designed for the 8086 processor, which had just 29,000 transistors. The opposite of CISC is RISC.

    RISC stands for Reduced Instruction Set Computer, which is fundamentally a completely different type of instruction set to CISC. RISC instructions can all have the same length (e.g. 32 bits). They can therefore be executed much faster than CISC instructions. Modern CPU’s like the AthlonXP and Pentium 4 are based on a mixture of RISC and CISC.

    PC’s running Windows still work with the old fashioned CISC instructions.

    In order to maintain compatibility with the older DOS/Windows programs, the later CPU’s still understand CISC instructions. They are just converted to shorter, more RISC-like, sub-operations (called micro-ops), before being executed. Most CISC instructions can be converted into 2-3 micro-ops.


    The CISC instructions are decoded before being executed in a modern processor. This preserves compatibility with older software.
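    As a loose sketch of this decoding step (the real decoders are far more complicated, and the actual micro-op formats are not public), the block below models how one complex CISC instruction that adds a register to a value in memory could be split into three simpler micro-ops: load, add and store. The enum and function names are invented purely for illustration.

        #include <stdio.h>

        /* Simplified micro-operations a decoder might emit. */
        typedef enum { UOP_LOAD, UOP_ADD, UOP_STORE } micro_op;

        /* Decode "add [mem], reg" into RISC-like micro-ops (illustrative only). */
        static int decode_add_mem_reg(micro_op out[])
        {
            out[0] = UOP_LOAD;   /* read the operand from memory into a temporary */
            out[1] = UOP_ADD;    /* add the register to the temporary             */
            out[2] = UOP_STORE;  /* write the result back to memory               */
            return 3;            /* this one CISC instruction became 3 micro-ops  */
        }

        int main(void)
        {
            micro_op uops[3];
            printf("The instruction decoded into %d micro-ops\n", decode_add_mem_reg(uops));
            return 0;
        }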

    Extensions to the instruction set

    For each new generation of CPU’s, the original instruction set has been extended. The 80386 processor added 26 new instructions, the 80486 added six, and the Pentium added eight new instructions.

    At the same time, execution of the instructions was made more efficient. For example, it took an 80386 processor six clock ticks to add one number to a running summation. The 80486 could do the same task in just two clock ticks, due to more efficient decoding of the instructions.

    These changes have meant that certain programs require at least a 386 or a Pentium processor in order to run. This is true, for example, of all Windows programs. Since then, the MMX and SSE extensions have followed, which are completely new instruction sets which will be discussed later in the guide. They can make certain parts of program execution much more efficient.

    Another innovation is the 64-bit extension, which both AMD and Intel use in their top processors. Normally the PC operates in 32-bit mode, but one way to improve performance is to use a 64-bit mode. This requires new software, which is not yet widely available.

    INSIDE THE CPU

    Instructions have to be decoded, and not least, executed, in the CPU. I won’t go into details on this subject; it is much too complicated. But I will describe a few factors which relate to the execution of instructions. My description has been extremely simplified, but it is relevant to the understanding of the microprocessor. This chapter is probably the most complicated one in the guide – you have been warned! It’s about:


  • Pipelines
  • Execution units

    If we continue to focus on speeding up the processor’s work, this optimisation must also apply to the instructions – the quicker we can shove them through the processor, the more work it can get done.

    Pipelines

    As mentioned before, instructions are sent from the software and are broken down into micro-ops (smaller sub-operations) in the CPU. This decomposition and execution takes place in a pipeline.

    The pipeline is like a reverse assembly line. The CPU’s instructions are broken apart (decoded) at the start of the pipeline. They are converted into small sub-operations (micro-ops), which can then be processed one at a time in the rest of the pipeline:


  • First the CISC instructions are decoded and converted into more digestible micro instructions. Then these are processed. It all takes place in the pipeline.

    The pipeline is made up of a number of stages. Older processors have only a few stages, while the newer ones have many (from 10 to 31). At each stage "something" is done with the instruction, and each stage requires one clock tick from the processor.

    The pipeline is an assembly line (shown here with 9 stages), where each clock tick leads to the execution of a sub-instruction.

    Modern CPU’s have more than one pipeline, and can thus process several instructions at the same time. For example, the Pentium 4 and AthlonXP can decode about 2.5 instructions per clock tick.

    The first Pentium 4 has several very long pipelines, allowing the processor to hold up to 126 instructions in total, which are all being processed at the same time, but at different stages of execution. It is thus possible to get the CPU to perform more work by letting several pipelines work in parallel:

    Having two pipelines allows twice as many instructions to be executed within the same number of clock ticks.
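    To get a hedged feel for these numbers, the sketch below uses the textbook approximation that, once a pipeline is full, each pipeline completes roughly one instruction per clock tick, so the cycles needed are about stages + instructions/pipelines - 1. It ignores stalls, mispredictions and memory waits.

        #include <stdio.h>

        /* Idealised cycle count for n instructions on a CPU with a given number  */
        /* of pipeline stages and parallel pipelines (no stalls, no mispredicts). */
        static double cycles_needed(double n, double stages, double pipes)
        {
            return stages + n / pipes - 1.0;   /* fill time + one result per tick per pipe */
        }

        int main(void)
        {
            double n = 1000.0;   /* instructions to execute */
            printf("1 pipeline : %.0f cycles\n", cycles_needed(n, 9.0, 1.0));
            printf("2 pipelines: %.0f cycles\n", cycles_needed(n, 9.0, 2.0));
            return 0;
        }

    With 1000 instructions and the 9-stage pipeline from the figure above, two pipelines need roughly half as many clock ticks as one.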


    CPU                                  Instructions executed at the same time
    AMD K6-II                            24
    Intel Pentium III                    40
    AMD Athlon                           72
    Intel Pentium 4 (first generation)   126

    By making use of more, and longer, pipelines, processors can execute more instructions at the same time.

    The problems of having more pipelines

    One might imagine that the engineers at Intel and AMD could just make even more parallel pipelines in the one CPU. Perhaps performance could be doubled? Unfortunately it is not that easy.

    It is not possible to feed a large number of pipelines with data. The memory system is just not powerful enough. Even with the existing pipelines, a fairly large number of clock ticks are wasted. The processor core is simply not utilised efficiently enough, because data cannot be brought to it quickly enough.

    Another problem of having several pipelines arises when the processor can decode several instructions in parallel – each in its own pipeline. It is impossible to avoid the wrong instruction occasionally being read in (out of sequence). This is called misprediction and results in a number of wasted clock ticks, since another instruction has to be fetched and run through the “assembly line”.

    Intel has tried to tackle this problem using a Branch Prediction Unit, which constantly attempts to guess the correct instruction sequence.

    Length of the pipe

    The number of “stations” (stages) in the pipeline varies from processor to processor. For example, in the Pentium II and III there are 10 stages, while there are up to 31 in the Pentium 4.

    In the Athlon, the ALU pipelines have 10 stages, while the FPU/MMX/SSE pipelines have 15.

    The longer the pipeline, the higher the processor’s clock frequency can be. This is because in the longer pipelines, the instructions are cut into more (and hence smaller) sub-instructions which can be executed more quickly.

    CPU                     Number of pipeline stages    Maximum clock frequency
    Pentium                 5                            300 MHz
    Motorola G4             4                            500 MHz
    Motorola G4e            7                            1000 MHz
    Pentium II and III      12                           1400 MHz
    Athlon XP               10/15                        2500 MHz
    Athlon 64               12/17                        >3000 MHz
    Pentium 4               20                           >3000 MHz
    Pentium 4 "Prescott"    31                           >5000 MHz

    High clock frequencies require long "assembly lines" (pipelines).

    Note that the two AMD processors have different pipeline lengths for integer and floating point instructions. One can also measure a processor’s efficiency by looking at the IPC number (Instructions Per Clock), and AMD’s Athlon XP is well ahead of the Pentium 4 in this regard. AMD’s Athlon XP processors are actually much faster than the Pentium 4’s at equivalent clock frequencies.

    The same is even more true of the Motorola G4 processors used, for example, in Macintosh computers. The G4 only has a 4-stage pipeline, and can therefore, in principle, offer the same performance as a Pentium 4, with only half the clock frequency or less. The only problem is, the clock frequency can’t be raised very much with such a short pipeline. Intel have therefore chosen to future-proof the Pentium 4 by using a very long pipeline.
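    A hedged way to put numbers on this trade-off: to a first approximation, performance is IPC multiplied by clock frequency. The IPC values below are invented placeholder figures chosen only to illustrate the arithmetic, not measured values for any real CPU.

        #include <stdio.h>

        /* First-order model: instructions per second = IPC x clock frequency.     */
        /* The IPC numbers are made-up examples to show that a short-pipeline CPU  */
        /* can match a long-pipeline CPU running at a much higher clock frequency. */
        int main(void)
        {
            double short_pipe_ipc = 3.0, short_pipe_mhz = 1000.0;  /* a G4-like design */
            double long_pipe_ipc  = 1.5, long_pipe_mhz  = 2000.0;  /* a P4-like design */

            printf("Short pipeline: %.0f million instructions/sec\n", short_pipe_ipc * short_pipe_mhz);
            printf("Long pipeline : %.0f million instructions/sec\n", long_pipe_ipc  * long_pipe_mhz);
            return 0;
        }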

    Execution units

    What is it that actually happens in the pipeline? This is where we find the so-called execution units. And we must distinguish between two types of unit:

  • ALU (Arithmetic and Logic Unit)
  • FPU (Floating Point Unit)

    If the processor has a brain, it is the ALU unit. It is the calculating device that does operations on whole numbers (integers). The computer’s work with ordinary text, for example, is looked after by the ALU.

    The ALU is good at working with whole numbers. When it comes to decimal numbers and especially numbers with many decimal places (real numbers as they are called in mathematics), the ALU chokes, and can take a very long time to process the operations. That is why an FPU is used to relieve the load. An FPU is a number cruncher, specially designed for floating point operations.

    There are typically several ALU’s and FPU’s in the same processor. The CPU also has other operation units, for example, the LSU (Load/Store Unit).

    An example sequence

    We know that the processor core is right beside the L1 cache. Imagine that an instruction has to be processed:

  • The processor core fetches a long and complex x86 instruction from the L1 instruction cache.
  • The instruction is sent into the pipeline where it is broken down into smaller units.
  • If it is an integer operation, it is sent to an ALU, while floating point operations are sent to an FPU.
  • After processing the data is sent back to the L1 cache.

    This description applies to the working cycle in, for example, the Pentium III and Athlon. As a diagram it might look like this:

  • The passage of instructions through the pipeline.

    But the way the relationship between the pipeline and the execution units is designed differs greatly from processor to processor. So this entire examination should be taken as a general introduction and nothing more.

    Pipelines in the Pentium 4

    In the Pentium 4, the instruction cache has been placed between the "Instruction fetch/Translate" unit and the ALU/FPU. Here the instruction cache (Execution Trace Cache) doesn't store the actual instructions, but rather the "half-digested" micro-ops.


  • In the Pentium 4, the instruction cache stores decoded micro instructions.

    The actual pipeline in the Pentium 4 is longer than in other CPU’s; it has 20 stages. The disadvantage of the long pipeline is that it takes more clock ticks to get an instruction through it. 20 stages require 20 clock ticks, and that reduces the CPU’s efficiency. This was very clear when the Pentium 4 was released; all tests showed that it was much slower than other processors with the same clock frequency.

    At the same time, the cost of reading the wrong instruction (misprediction) is much greater – it takes a lot of clock ticks to fill up the long assembly line again.
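    A rough rule of thumb is that a misprediction costs about one clock tick per pipeline stage, since the whole "assembly line" has to be refilled. The sketch below just converts that rule into nanoseconds at an example clock frequency; real penalties depend on where in the pipe the branch is resolved.

        #include <stdio.h>

        /* Approximate misprediction penalty: roughly 1 clock tick per pipeline stage. */
        int main(void)
        {
            double clock_ghz = 3.0;            /* example clock frequency          */
            int depths[] = { 10, 20, 31 };     /* P6-class, Pentium 4, "Prescott"  */

            for (int i = 0; i < 3; i++) {
                printf("%2d-stage pipeline: ~%d wasted ticks (about %.1f ns) per misprediction\n",
                       depths[i], depths[i], depths[i] / clock_ghz);
            }
            return 0;
        }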

    The Pentium 4’s architecture must therefore be seen from a longer-term perspective. Intel expects to be able to scale up the design to work at clock frequencies of up to 5-10 GHz. In the “Prescott” version of Pentium 4, the pipeline was increased further to 31 stages.

    AMD’s 32 bit Athlon line can barely make it much above a clock frequency of 2 GHz, because of the short pipeline. In comparison, the Pentium 4 is almost ”light years” ahead.

    THE L2 CACHE

    The level 2 cache is normally much bigger than the L1 cache (and unified), such as 256, 512 or 1024 KB. The purpose of the L2 cache is to constantly read in slightly larger quantities of data from RAM, so that these are available to the L1 cache.

    In the earlier processor generations, the L2 cache was placed outside the chip: either on the motherboard (as in the original Pentium processors), or on a special module together with the CPU (as in the first Pentium II’s).

    An old Pentium II module. The CPU is mounted on a rectangular printed circuit board, together with the L2 cache, which is two chips here. The whole module is installed in a socket on the motherboard. But this design is no longer used.

    As process technology has developed, it has become possible to make room for the L2 cache inside the actual processor chip. Thus the L2 cache has been integrated and that makes it function much better in relation to the L1 cache and the processor core.

    The L2 cache is not as fast as the L1 cache, but it is still much faster than normal RAM.

    CPU                                                               L2 cache
    Pentium, K5, K6                                                   External, on the motherboard
    Pentium Pro                                                       Internal, in the CPU
    Pentium II, Athlon                                                External, in a module close to the CPU
    Celeron (1st generation)                                          None
    Celeron (later gen.), Pentium III, Athlon XP, Duron, Pentium 4    Internal, in the CPU

    It has only been during the last few CPU generations that the level 2 cache has found its place, integrated into the actual CPU.

    Traditionally the L2 cache is connected to the front side bus. Through it, it connects to the chipset’s north bridge and RAM:


    The way the processor uses the L1 and L2 cache has crucial significance for its utilisation of the high clock frequencies.

    The level 2 cache takes up a lot of the chip’s die, as millions of transistors are needed to make a large cache. The integrated cache is made using SRAM (static RAM), as opposed to normal RAM which is dynamic (DRAM).

    While DRAM can be made using one transistor per bit (plus a capacitor), it costs 6 transistors (or more) to make one bit of SRAM. Thus 256 KB of L2 cache requires more than 12 million transistors. It has therefore only been since fine process technology (such as 0.13 and 0.09 microns) was developed that it became feasible to integrate a large L2 cache into the actual CPU. The transistor counts quoted later in this guide include the CPU's integrated cache.
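    The 12-million figure follows directly from the cell counts just quoted. The check below assumes exactly 6 transistors per SRAM bit and ignores the cache's tag and control logic, so the real number is somewhat higher.

        #include <stdio.h>

        /* Check the claim: 256 KB of SRAM cache at 6 transistors per bit. */
        int main(void)
        {
            long bits        = 256L * 1024L * 8L;   /* 256 KB expressed in bits    */
            long transistors = bits * 6L;           /* 6 transistors per SRAM cell */

            printf("256 KB of L2 cache = %ld bits = about %.1f million transistors\n",
                   bits, transistors / 1e6);
            return 0;
        }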

    Powerful bus

    The bus between the L1 and L2 cache is presumably THE place in the processor architecture which has the greatest need for high bandwidth. We can calculate the theoretical maximum bandwidth by multiplying the bus width by the clock frequency. Here are some examples:

    CPU                 Bus width    Clock frequency    Theoretical bandwidth
    Intel Pentium III   64 bits      1400 MHz           11.2 GB/sec
    AMD Athlon XP+      64 bits      2167 MHz           17.3 GB/sec
    AMD Athlon 64       64 bits      2200 MHz           17.6 GB/sec
    AMD Athlon 64 FX    128 bits     2200 MHz           35.2 GB/sec
    Intel Pentium 4     256 bits     3200 MHz           102 GB/sec

    Theoretical calculations of the bandwidth between the L1 and L2 cache.
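    The numbers in the table can be reproduced with the formula just given: bytes transferred per clock tick multiplied by ticks per second. The helper below assumes one transfer per clock tick and no protocol overhead, which is why the results are only theoretical maximums.

        #include <stdio.h>

        /* Theoretical cache-bus bandwidth in GB/sec from bus width and clock. */
        static double bandwidth_gb(double bus_bits, double clock_mhz)
        {
            return (bus_bits / 8.0) * clock_mhz / 1000.0;   /* bytes/tick x MHz -> GB/sec */
        }

        int main(void)
        {
            printf("Pentium III : %.1f GB/sec\n", bandwidth_gb( 64.0, 1400.0));
            printf("Athlon 64 FX: %.1f GB/sec\n", bandwidth_gb(128.0, 2200.0));
            printf("Pentium 4   : %.1f GB/sec\n", bandwidth_gb(256.0, 3200.0));
            return 0;
        }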

    Different systems

    There are a number of different ways of using caches. Both Intel and AMD have saved on L2 cache in some series, in order to make cheaper products. But there is no doubt that the better the cache – both L1 and L2 – the more efficient the CPU will be and the higher its performance.

    AMD have settled on a fairly large L1 cache of 128 KB, while Intel continue to use relatively small (but efficient) L1 caches.

    On the other hand, Intel uses a 256 bit wide bus on the "inside edge" of the L2 cache in the Pentium 4, while AMD only has a 64-bit bus.


    CPU’s with very different designs.

    AMD uses exclusive caches in all their CPU's. That means that the same data can't be present in both caches at the same time, and that is a clear advantage. Intel's caches are not exclusive, so the same data may be held in both the L1 and the L2 cache at once.

    However, the Pentium 4 has a more advanced cache design with Execution Trace Cache making up 12 KB of the 20 KB Level 1 cache.

    CPU                           L1 cache    L2 cache
    Athlon XP                     128 KB      256 KB
    Athlon XP+                    128 KB      512 KB
    Pentium 4 (I)                 20 KB       256 KB
    Pentium 4 (II, "Northwood")   20 KB       512 KB
    Athlon 64                     128 KB      512 KB
    Athlon 64 FX                  128 KB      1024 KB
    Pentium 4 (III, "Prescott")   28 KB       1024 KB

    The most common processors and their caches.

    Latency

    A very important aspect of all RAM – cache included – is latency. All RAM storage has a certain latency, which means that a certain number of clock ticks (cycles) must pass between, for example, two reads. L1 cache has less latency than L2; which is why it is so efficient.

    When the cache is bypassed to read directly from RAM, the latency is many times greater. The table below shows the latencies (wasted clock ticks) for various CPU's. Note that when the processor core has to fetch data from the actual RAM (when both L1 and L2 have failed), it costs around 150 clock ticks. This situation is called stalling and needs to be avoided.

    Note that the Pentium 4 has a much smaller L1 cache than the Athlon XP, but it is significantly faster. It simply takes fewer clock ticks (cycles) to fetch data:

    Latency     Pentium II    Athlon      Pentium 4
    L1 cache    3 cycles      3 cycles    2 cycles
    L2 cache    18 cycles     6 cycles    5 cycles

    Latency leads to wasted clock ticks; the fewer there are of these, the faster the processor will appear to be.
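    These latencies can be combined with the hit rates mentioned later in this chapter (around 96-98% for the L1 cache) into a rough average access time. The 80% L2 hit rate below is an invented example value, and the 150-tick RAM penalty is the figure quoted above, so the result is only a sketch.

        #include <stdio.h>

        /* Rough average memory access time in clock ticks, weighted by hit rates. */
        int main(void)
        {
            double l1_hit = 0.97, l2_hit = 0.80;                 /* hit probabilities        */
            double l1_lat = 3.0, l2_lat = 6.0, ram_lat = 150.0;  /* latencies in clock ticks */

            double avg = l1_hit * l1_lat
                       + (1.0 - l1_hit) * (l2_hit * l2_lat + (1.0 - l2_hit) * ram_lat);

            printf("Average access time: about %.1f clock ticks\n", avg);
            return 0;
        }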

    Intelligent ”data prefetch”

    In CPU’s like the Pentium 4 and Athlon XP, a handful of support mechanisms are also used which work in parallel with the cache. These include:

    A hardware auto data prefetch unit, which attempts to guess which data should be read into the cache. This device monitors the instructions being processed and predicts what data the next job will need.

    Related to this is the Translation Look-aside Buffer, which is also a kind of cache. It contains information which constantly supports the supply of data to the L1 cache, and this buffer is also being optimised in new processor designs. Both systems contribute to improved exploitation of the limited bandwidth in the memory system.


    The WCPUID program reports on cache in an Athlon processor.

    Conclusion

    L1 and L2 cache are important components in modern processor design. The cache is crucial for the utilisation of the high clock frequencies which modern process technology allows. Modern L1 caches are extremely effective. In about 96-98% of cases, the processor can find the data and instructions it needs in the cache. In the future, we can expect to keep seeing CPU's with larger L2 caches and more advanced memory management, as this is the way forward if we want to achieve more effective utilisation of the CPU's clock ticks. Here is a concrete example:

    In January 2002 Intel released a new version of their top processor, the Pentium 4 (with the codename, “Northwood”). The clock frequency had been increased by 10%, so one might expect a 10% improvement in performance. But because the integrated L2 cache was also doubled from 256 to 512 KB, the gain was found to be all of 30%.

    CPU                             L2 cache    Clock freq.    Improvement
    Intel Pentium 4 (0.18 micron)   256 KB      2000 MHz       –
    Intel Pentium 4 (0.13 micron)   512 KB      2200 MHz       +30%

    Because of the larger L2 cache, performance increased significantly.

    In 2002 AMD updated the Athlon processor with the new "Barton" core. Here the L2 cache was also doubled from 256 to 512 KB in some models. In 2004 Intel followed with the "Prescott" core with 1024 KB of L2 cache, which is the same size as in AMD's Athlon 64 processors. Some Extreme Editions of the Pentium 4 even use 2 MB of cache.

    Xeon for servers

    Intel produces special server models of their Pentium III and Pentium 4 processors. These are called Xeon, and are characterised by very large L2 caches. In an Intel Xeon the 2 MB L2 cache uses 149,000,000 transistors.

    Xeon processors are incredibly expensive (about Euro 4,000 for the top models), so they have never achieved widespread distribution.

    They are used in high-end servers, in which the CPU only accounts for a small part of the total price.

    There is also Intel's 64 bit server CPU, the Itanium. The processor is supplied in modules which include 4 MB of L3 cache, made up of some 300 million transistors.

    Multiprocessors

    Several Xeon processors can be installed on the same motherboard, using special chipsets. By connecting 2, 4 or even 8 processors together, you can build a very powerful computer.

    These MP (Multiprocessor) machines are typically used as servers, but can also be used as powerful workstations, for example, to perform demanding 3D graphics and animation tasks. AMD has the Opteron processors, which are server-versions of the Athlon 64. Not all software can make use of the PC’s extra processors; the programs have to be designed to do so. For example, there are professional versions of Windows NT, 2000 and XP, which support the use of several processors in one PC.

    See also the discussion of Hyper Threading, which allows a Pentium 4 processor to appear as an MP system. Both Intel and AMD are also working on dual-core processors.


    THE CACHE

    Speed conflict

    The CPU works internally at very high clock frequencies (like 3200 MHz), and no RAM can keep up with these.

    The most common RAM speeds are between 266 and 533 MHz. And these are just a fraction of the CPU’s working speed. So there is a great chasm between the machine (the CPU) which slaves away at perhaps 3200 MHz, and the “conveyor belt”, which might only work at 333 MHz, and which has to ship the data to and from the RAM. These two subsystems are simply poorly matched to each other.

    If nothing could be done about this problem, there would be no reason to develop faster CPU’s. If the CPU had to wait for a bus, which worked at one sixth of its speed, the CPU would be idle five sixths of the time. And that would be pure waste.

    The solution is to insert small, intermediate stores of high-speed RAM. These buffers (cache RAM) provide a much more efficient transition between the fast CPU and the slow RAM. Cache RAM operates at higher clock frequencies than normal RAM. Data can therefore be read more quickly from the cache.

    Data is constantly being moved

    The cache delivers its data to the CPU registers. These are tiny storage units which are placed right inside the processor core, and they are the absolute fastest RAM there is. The size and number of the registers is designed very specifically for each type of CPU.


    Cache RAM is much faster than normal RAM.

    The CPU can move data in different sized packets, such as bytes (8 bits), words (16 bits), dwords (32 bits) or blocks (larger groups of bits), and this often involves the registers. The different data packets are constantly moving back and forth:


  • from the CPU registers to the Level 1 cache.
  • from the L1 cache to the registers.
  • from one register to another
  • from L1 cache to L2 cache, and so on…

    The cache stores are a central bridge between the RAM and the registers which exchange data with the processor’s execution units.

    The optimal situation is if the CPU is able to work constantly and fully utilise all its clock ticks. This would mean that the registers would always be able to fetch the data which the execution units require. But this is not the reality, as the CPU typically only utilises about 35% of its clock ticks. However, without a cache, this utilisation would be even lower.

    Bottlenecks

    CPU caches are a remedy against a very specific set of "bottleneck" problems. There are lots of "bottlenecks" in the PC – transitions between fast and slower systems, where the fast device has to wait before it can deliver or receive its data. These bottlenecks can have a very detrimental effect on the PC's total performance, so they must be minimised.


  • A cache increases the CPU's capacity to fetch the right data from RAM.

    The absolute worst bottleneck exists between the CPU and RAM. It is here that we have the heaviest data traffic, and it is in this area that PC manufacturers are expending a lot of energy on new development. Every new generation of CPU brings improvements relating to the front side bus.

    The CPU’s cache is “intelligent”, so that it can reduce the data traffic on the front side bus. The cache controller constantly monitors the CPU’s work, and always tries to read in precisely the data the CPU needs. When it is successful, this is called a cache hit. When the cache does not contain the desired data, this is called a cache miss.

    Two levels of cache

    The idea behind cache is that it should function as a “near store” of fast RAM. A store which the CPU can always be supplied from.

    In practice there are always at least two close stores. They are called Level 1, Level 2, and (if applicable) Level 3 cache. Some processors (like the Intel Itanium) have three levels of cache, but these are only used for very special server applications. In standard PC's we find processors with L1 and L2 cache.

    The cache system tries to ensure that relevant data is constantly being fetched from RAM, so that the CPU (ideally) never has to wait for data.

    L1 cache

    Level 1 cache is built into the actual processor core. It is a piece of RAM, typically 8, 16, 20, 32, 64 or 128 Kbytes, which operates at the same clock frequency as the rest of the CPU. Thus you could say the L1 cache is part of the processor.

    L1 cache is normally divided into two sections, one for data and one for instructions. For example, an Athlon processor may have a 32 KB data cache and a 32 KB instruction cache. If the cache is common for both data and instructions, it is called a unified cache.

    MOORE'S LAW

    This development was actually described many years ago, in what we call Moore's Law.

    Right back in 1965, Gordon Moore predicted (in the journal Electronics) that the number of transistors in processors (and hence their speed) would double every 18 months.

    Moore expected that this regularity would apply at least up until 1975. But he was too cautious; the development continues to follow Moore's Law today.

    In 1968, Gordon Moore helped found Intel.

    If we try to look ahead in time, we can work out that in 2010 we should have processors containing 3 billion transistors. And with what clock frequencies? You’ll have to guess that for yourself.


    Moore's Law.
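    As a rough cross-check of the 3 billion figure, the little program below applies the 18-month doubling rule to two transistor counts taken from the die-size table later in this guide (42 million for the Pentium 4 in 2000, 55 million in 2002). It is only a projection exercise, not a prediction.

        #include <stdio.h>
        #include <math.h>

        /* Moore's Law as stated above: transistor counts double every 18 months. */
        static double project(double transistors, double from_year, double to_year)
        {
            double doublings = (to_year - from_year) * 12.0 / 18.0;
            return transistors * pow(2.0, doublings);
        }

        int main(void)
        {
            printf("From 42 million in 2000: about %.1f billion transistors in 2010\n",
                   project(42e6, 2000.0, 2010.0) / 1e9);
            printf("From 55 million in 2002: about %.1f billion transistors in 2010\n",
                   project(55e6, 2002.0, 2010.0) / 1e9);
            return 0;
        }

    Depending on the starting point, the rule gives somewhere between about two and four billion transistors in 2010 – the same ballpark as the figure above.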

    Process technology

    The many millions of transistors inside the CPU are made of, and connected by, ultra thin electronic tracks. By making these electronic tracks even narrower, even more transistors can be squeezed into a small slice of silicon.

    The width of these electronic tracks is measured in microns (or micrometers), which are millionths of a metre.

    For each new CPU generation, the track width is reduced, based on new technologies which the chip manufacturers keep developing. At the time of writing, CPU’s are being produced with a track width of 0.13 microns, and this will be reduced to 0.09 and 0.06 microns in the next generations.


    CPU’s are produced in extremely high-technology environments (“clean rooms”). Photo courtesy of AMD.

    In earlier generations aluminium was used for the current carrying tracks in the chips. With the change to 0.18 and 0.13-micron technology, aluminium began to be replaced with copper. Copper is cheaper, and it carries current better than aluminium. It had previously been impossible to insulate the copper tracks from the surrounding silicon, but IBM solved this problem in the late 1990’s.

    AMD became the first manufacturer to mass-produce CPU's with copper tracks, in their chip factory Fab 30 in Dresden, Germany. A new generation of chips requires new chip factories (fabs) to produce it, and these cost billions of dollars to build. That's why the manufacturers like a few years to pass between each successive generation – the old factories have to have time to pay for themselves before new ones are brought into use.


    AMD’s Fab 30 in Dresden, which was the first factory to mass-produce copper-based CPU’s.

    A grand new world …

    We can expect a number of new CPU’s in this decade, all produced in the same way as they are now – just with smaller track widths. But there is no doubt that we are nearing the physical limits for how small the transistors produced using the existing technology can be. So intense research is underway to find new materials, and it appears that nanotransistors, produced using organic (carbon-based) semiconductors, could take over the baton from the existing process technology.

    Bell Labs in the USA has produced nanotransistors with widths of just one molecule. It is claimed that this process can be used to produce both CPU’s and RAM circuits up to 1000 times smaller than what we have today!

    Less power consumption

    The types of CPU’s we have today use a fairly large amount of electricity when the PC is turned on and is processing data. The processor, as you know, is installed in the motherboard, from which it receives power. There are actually two different voltage levels, which are both supplied by the motherboard:


  • One voltage level which powers the CPU core (the core voltage).
  • Another voltage level which powers the CPU’s I/O ports, which is typically 3.3 volts.

    As the track width is reduced, more transistors can be placed within the same area, and hence the voltage can be reduced.

    As a consequence of the narrower process technology, the core voltage has been reduced from 3 volts to about 1 volt in recent years. This leads to lower power consumption per transistor. But since the number of transistors increases by a corresponding amount in each new CPU generation, the end result is often that the total power consumption is unchanged.
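    A hedged way to see why these two effects roughly cancel: in the standard first-order approximation for CMOS logic, dynamic power grows with the number of switching transistors, with the square of the supply voltage, and with the clock frequency. The numbers below are round illustrative values, not measurements of any real chip.

        #include <stdio.h>

        /* First-order CMOS power scaling: P is proportional to N x V^2 x f.    */
        /* Voltage drops from 3 V to 1 V while the transistor count grows, so   */
        /* (at the same clock) the total power stays roughly unchanged.         */
        int main(void)
        {
            double v_old = 3.0, v_new = 1.0;   /* core voltage, volts                  */
            double n_ratio = 10.0;             /* ~10x more transistors (illustrative) */
            double f_ratio = 1.0;              /* same clock, to isolate the effect    */

            double power_ratio = n_ratio * (v_new * v_new) / (v_old * v_old) * f_ratio;
            printf("Relative total power: about %.1f x the old design\n", power_ratio);
            return 0;
        }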

  • A powerful fan. Modern CPU's require something like this.

    It is very important to cool the processor; a CPU can easily burn 50-120 Watts. This produces a fair amount of heat in a very small area, so without the right cooling fan and motherboard design, a Gigahertz processor could quickly burn out.

    Modern processors contain a thermal diode which can raise the alarm if the CPU gets too hot. If the motherboard and BIOS are designed to pay attention to the diode's signal, the processor can be shut down temporarily so that it can cool down.


  • The temperatures on the motherboard are constantly reported to this program.

    Cooling is a whole science in itself. Many “nerds” try to push CPU’s to work at higher clock speeds than they are designed for. This is often possible, but it requires very good cooling – and hence often huge cooling units.

    30 years of development

    Higher processor speeds require more transistors and narrower electronic tracks in the silicon chip. In the overview below you can see the course of developments in this area.

    Note that the 4004 processor was never used for PC’s. The 4004 was Intel’s first commercial product in 1971, and it laid the foundation for all their later CPU’s. It was a 4-bit processor which worked at 108 KHz (0.1 MHz), and contained 2,250 transistors. It was used in the first pocket calculators, which I can personally remember from around 1973-74 when I was at high school. No-one could have predicted that the device which replaced the slide rule, could develop, in just 30 years, into a Pentium 4 based super PC.

    If, for example, the development in automobile technology had been just as fast, we would today be able to drive from Copenhagen to Paris in just 2.8 seconds!

    Year         Intel CPU              Technology (track width)
    1971         4004                   10 microns
    1979         8088                   3 microns
    1982         80286                  1.5 microns
    1985         80386                  1 micron
    1989         80486                  1.0/0.8 microns
    1993         Pentium                0.8/0.5/0.35 microns
    1997         Pentium II             0.28/0.25 microns
    1999         Pentium III            0.25/0.18/0.13 microns
    2000-2003    Pentium 4              0.18/0.13 microns
    2004-2005    Pentium 4 "Prescott"   0.09 microns

    The high clock frequencies are the result of new process technology with smaller electronic ”tracks”.

    A conductor which is 0.09 microns (or 90 nanometres) thick is 1150 times thinner than a normal human hair. These are tiny things we are talking about here.

    Wafers and die size

    Another CPU measurement is its die size. This is the size of the actual silicon sheet containing all the transistors.

    At the chip factories, the CPU cores are produced in so-called wafers. These are round silicon sheets which typically contain 150-200 processor cores (dies).

    The smaller one can make each die, the more economical production can become. A big die is also normally associated with greater power consumption and hence also requires cooling with a powerful fan.

    A technician from Intel holding a wafer. This slice of silicon contains hundreds of tiny processor cores, which end up as CPU’s in everyday PC’s.

    You can see the measurements for a number of CPU’s below. Note the difference, for example, between a Pentium and a Pentium II. The latter is much smaller, and yet still contains nearly 2½ times as many transistors. Every reduction in die size is welcome, since the smaller this is, the more processors can fit on a wafer. And that makes production cheaper.

    CPU            Track width (microns)    Die size    Number of transistors
    Pentium        0.80                     294 mm²     3.1 mil.
    Pentium MMX    0.28                     140 mm²     4.5 mil.
    Pentium II     0.25                     131 mm²     7.5 mil.
    Athlon         0.25                     184 mm²     22 mil.
    Pentium III    0.18                     106 mm²     28 mil.
    Pentium III    0.13                     80 mm²      28 mil.
    Athlon XP      0.18                     128 mm²     38 mil.
    Pentium 4      0.18                     217 mm²     42 mil.
    Pentium 4      0.13                     145 mm²     55 mil.
    Athlon XP+     0.13                     115 mm²     54 mil.
    Athlon 64 FX   0.13                     193 mm²     106 mil.
    Pentium 4      0.09                     112 mm²     125 mil.

    The smaller the area of each processor core, the more economical chip production can be.
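    To see why the die size matters economically, the sketch below simply divides the area of a wafer by the die area. It assumes a 200 mm wafer and ignores edge losses, scribe lines and yield, so the counts are only rough illustrations of the trend.

        #include <stdio.h>

        #define PI 3.14159265358979

        /* Very rough dies-per-wafer estimate: wafer area divided by die area. */
        static int dies_per_wafer(double wafer_diameter_mm, double die_area_mm2)
        {
            double radius = wafer_diameter_mm / 2.0;
            return (int)(PI * radius * radius / die_area_mm2);
        }

        int main(void)
        {
            /* Die sizes taken from the table above (Pentium 4 at 0.18 and 0.13 micron). */
            printf("217 mm2 die: about %d dies per 200 mm wafer\n", dies_per_wafer(200.0, 217.0));
            printf("145 mm2 die: about %d dies per 200 mm wafer\n", dies_per_wafer(200.0, 145.0));
            return 0;
        }

    Shrinking the Pentium 4 die from 217 to 145 mm² raises the rough count from about 140 to well over 200 dies per wafer, which fits the 150-200 cores per wafer mentioned above.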

    The modern CPU generations

    As mentioned earlier, the various CPU's are divided into generations.

    At the time of writing, we have started on the seventh generation. Below you can see the latest processors from Intel and AMD, divided into these generations. The transitions can be a bit hazy. For example, I’m not sure whether AMD’s K6 belongs to the 5th or the 6th generation. But as a whole, the picture is as follows:

    Generation    CPU's
    5th           Pentium, Pentium MMX, K5, K6
    6th           Pentium Pro, K6-II, Pentium II, K6-3, Athlon, Pentium III
    7th           Pentium 4, Athlon XP
    8th           Athlon 64 FX, Pentium 5

    The latest generations of CPU's.
