Dr. Tom Butler 2004
Pentium Generations: A History and Technical Overview of Intel’s Pentium Processors
The 8086, Intel’s first-generation 16-bit processor, was introduced in the late 1970s. Its initial business uses were limited to dedicated word processing machines and low-end mini-computers. However, in the early 1980s, Compaq adopted the CPU for use in its new Deskpro range, which consisted of cloned versions of the IBM PC/XT range of personal computers. Intel enhanced the 8086 with the introduction of the 80186 and 80188 CPUs, and added a math coprocessor, a floating point unit (FPU), in the form of the Intel 8087. Intel’s second-generation 80286 appeared in 1982 and formed the core of the first powerful PC, the IBM PC/AT. The Intel 80287 provided FPU co-processing functionality in high-end platforms. While the i286 provided protected mode operation and support for up to 16 MB of RAM, Intel’s third-generation CPU, the i386, was its first 32-bit CPU. It came with core and system bus speeds of 16-33 MHz and had an accompanying 80387 FPU. A variant of the 386, the SX, was the first Intel chip to have an internal L1 cache. Intel’s fourth-generation family appeared in 1989 with the release of the i486. This 32-bit CPU had 8 KB of L1 cache and a built-in FPU. While it initially ran at speeds of 20-50 MHz both internally and externally (i.e. core CPU and bus speeds), the release of the i486 DX/2 in 1992 saw the bus speed multiplied by 2 for internal core CPU operation (25/50, 33/66 and 40/80 MHz). These speeds were tripled in 1994 with the release of the i486 DX/4 (a sleight of hand by Intel, as the CPUs ran at 25/75, 33/100 and 40/120 MHz). Importantly, the 486’s L1 cache was doubled to 16 KB in an 8 KB instruction + 8 KB data configuration.
The Pentium P54 was released in 1994. Running at core speeds of 50, 60 and 66 MHz, it was the first CPU to have dual data and instruction L1 caches (8 + 8 KB). The Pentium microprocessor was the last of Intel’s 5th generation microprocessors and had several basic units: the Bus Interface Unit (BIU); the I-Cache (8 KB of write-through static RAM, or SRAM); the Instruction Translation Lookaside Buffer (TLB); the D-Cache (8 KB of write-back SRAM); the Data TLB; the Clock Driver/Multiplier; the Instruction Fetch Unit; the Branch Prediction Unit; the Instruction Decode Unit; the Complex Instruction Support Unit; the Superscalar Integer Execution Unit; and the Pipelined Floating Point Unit. Figure 1 presents a block diagram of the original Pentium.
The Pentium was the first Intel chip to have a 64 bit external data bus which was split internally into two separate pipelines, each 32 bits wide. This allowed the Pentium to execute two instructions simultaneously; however, more than one instruction could be in the pipeline, thus increasing instruction throughput.
Heat dissipation is the enemy of chip designers: the greater the number of integrated transistors, the higher the speed of operation, and the higher the operating voltage, the more power is consumed and the more heat is generated. The first two Pentium versions (P54 and P54C, released in 1994) ran at 60 and 66 MHz respectively with an operating voltage of 5 V DC. Hence they ran quite hot. However, a change in package design (from Socket 5 to Socket 7 Pin Grid Array, PGA) and a reduction in operating voltage to 3.3 V in the Pentium P54C lowered power consumption and heat dissipation. Intel also introduced a clock multiplier which multiplied the external clock signal and enabled the Pentium to run at 1.5, 2, 2.5 and finally 3 times the bus speed. Thus, while the system bus ran at 50, 60, and 66 MHz, the CPU ran at 75-200 MHz.
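The multiplier arithmetic can be sketched in a few lines of Python. This is only an illustration: the function name and the particular bus/multiplier pairings are assumptions chosen for the example, and only the bus speeds and multiplier values themselves come from the text.

```python
# Illustrative sketch: core clock = external bus clock x multiplier.
# The function name and the sample pairings below are assumptions.

def core_clock_mhz(bus_mhz: float, multiplier: float) -> float:
    """Return the internal core clock for a given bus clock and multiplier."""
    return bus_mhz * multiplier


if __name__ == "__main__":
    for bus, mult in [(50, 1.5), (60, 1.5), (66, 2), (66, 2.5), (66, 3)]:
        print(f"{bus} MHz bus x {mult} = {core_clock_mhz(bus, mult):g} MHz core")
```

The nominal 66 MHz bus actually runs at about 66.67 MHz, which is why a multiplier of 3 corresponds to the marketed 200 MHz rather than 198 MHz.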
In 1997, Intel changed the Pentium design in several ways, the most significant being the inclusion of an MMX (multimedia extension) unit and 16 KB instruction and data caches. The MMX unit contained eight new 64-bit registers and 57 ‘simple’ hardwired MMX instructions that operated on 4 new data types. The internal architecture and external operation of the Pentium family evolved from the Pentium MMX through the Pentium Pro, Pentium II and Pentium III. However, major design changes came with the Pentium IV. Modifications and design changes centered on (a) the physical package; (b) the process by which instructions were decoded and executed; (c) support for memory beyond the 4 GB limit; (d) the integration and enhancement of L1 and L2 cache performance and size; (e) the addition of a new cache; (f) the speed of internal and external operation. Each of these issues receives attention in the following subsections.
Two terms are employed to describe the packaging used for the Pentium family of processors: the first refers to the motherboard connection, and the second to the actual package itself. For example, the original Pentium P5 was fitted to the Socket 5 type connection on the motherboard using a Staggered Pin Grid Array (SPGA) for the die’s I/O (die is the technical term for the physical structure that incorporates the chip). Later variants used the Socket 7 connector. The Pin Grid Array (PGA) family of packages is associated with different socket types. A pin grid array is simply an array of metal pin connectors used to form an electrical connection between the internal electronics of the CPU (packaged on the die) and other system components such as the system chipset. The pins plug into corresponding receptacle pinholes in the CPU’s socket on the motherboard. The different types of PGA reflect the type of packaging (e.g. ceramic or plastic), the number of pins, and how they are arrayed. The Pentium Pro used an SPGA with a staggering 387 pins for connection to the motherboard socket, called Socket 8. The Pentium Pro was the first Intel processor to have an L2 cache connected to the CPU via a backside bus, but on a separate die. This was a significant technical achievement in packaging. When Intel designed the Pentium II, it changed the packaging significantly and introduced a Single Edge Contact Connector (SECC) package with three variants (SECC for the Pentium II, SECC2 for the Pentium II and SEPP for the Celeron), each of which plugged into the Slot 1 connector on the motherboard. However, later variants of the Celeron and Pentium III used PGA packaging for certain applications: the Celeron used the Plastic PGA, and the Celeron III and Pentium III the Flip-Chip Pin Grid Array (FC-PGA). Both used the 370-pin socket. The Pentium IV saw a full return to the PGA for all chips. Here a Flip-Chip Pin Grid Array (FC-PGA) was employed in a 478 PCPGA package.
Figure 1 Pentium CPU Block Diagram
Overall Architectural Comparison of the Pentium Family of Microprocessors
The Pentium (P54) was first shipped in 1993 and had 3.1 million transistors. It used a 5 V supply to power its core and I/O logic, used a PGA package on Socket 4, had a 2 x 8 KB L1 cache, and operated at 50, 60 and 66 MHz. The system bus also operated at these speeds. The Pentium (P54C) was released in 1994; it used a PGA package on Socket 5 and 7 and a 3.3 V supply for core and I/O logic. It was also the first to use a multiplier to give processor speeds of 75, 90, 100, 120, 133, 150, 166 and 200 MHz. The last version of the first member of this sub-generation was the Pentium MMX (P55C). This had 4.1 million transistors, fitted Socket 7, and had a 2 x 16 KB L1 cache with improved branch prediction logic. It operated at 2.8 V for its core logic and 3.3 V for I/O logic. Its 60 and 66 MHz system clock speed was multiplied on board the CPU to give CPU clock speeds of 120-300 MHz. Overall features included:
Superscalar architecture: Two integer (U (slow) and V (fast)) and one floating point pipelines. The U and V pipelines contained five stages of instruction execution, while the floating point pipeline had 8 stages. The U and V pipelines were served by two 32 byte prefetch buffers. This allowed overlapping execution of instructions in the pipelines.
Dynamic branch prediction using the Branch Target Buffer. The Pentium’s branch prediction logic helped speed up program execution by anticipating branches and ensuring that branched-to code was available in cache.
An Instruction and a Data Cache each of 8 Kbyte capacity
A 64 bit system data bus and 32 bit address bus
Dual processing capability
On-board Advanced Programmable Interrupt Controller
The Pentium MMX version contained an additional MMX unit that sped up multimedia and 3D applications. Processing multimedia data involves instructions operating on large volumes of packed data. Intel proposed a new approach, single instruction multiple data (SIMD), in which one instruction operates on several data items at once, such as video pixels or Internet audio streams. The MMX unit contained eight new 64-bit registers and 57 ‘simple’ hardwired MMX instructions that operated on 4 new data types. To leverage the features of the MMX unit, applications had to be programmed to include the new instructions.
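The single instruction multiple data idea can be illustrated with a small Python emulation. The sketch below packs eight unsigned bytes (for example, pixel brightness values) into one 64-bit value and adds two such values element by element, clamping each result at 255 rather than letting it wrap around. It is similar in spirit to a packed saturating byte add, but it is plain Python and not MMX code; all function names are hypothetical.

```python
# Plain-Python emulation of one SIMD-style operation on a 64-bit value
# holding eight unsigned bytes. Function names are illustrative assumptions.

def pack_bytes(values):
    """Pack eight values in 0..255 into one 64-bit integer (index 0 = lowest byte)."""
    if len(values) != 8 or any(not 0 <= v <= 255 for v in values):
        raise ValueError("expected eight values in the range 0..255")
    word = 0
    for i, v in enumerate(values):
        word |= v << (8 * i)
    return word


def unpack_bytes(word):
    """Split a 64-bit integer back into its eight bytes."""
    return [(word >> (8 * i)) & 0xFF for i in range(8)]


def packed_add_saturate(a, b):
    """Add two packed values byte by byte, clamping each sum at 255."""
    sums = [min(x + y, 255) for x, y in zip(unpack_bytes(a), unpack_bytes(b))]
    return pack_bytes(sums)


if __name__ == "__main__":
    pixels = pack_bytes([10, 200, 250, 0, 128, 255, 64, 32])
    brighten = pack_bytes([16] * 8)
    print(unpack_bytes(packed_add_saturate(pixels, brighten)))
    # prints [26, 216, 255, 16, 144, 255, 80, 48]
```

In hardware, the eight additions happen in a single instruction; the Python loop only mimics the result, not the speed.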
Pentium Pro: An Architecture for 6th Generation Processors
The Pentium Pro was designed around the 6th generation P6 architecture, which was optimized for 32-bit instructions and 32-bit operating systems such as Windows NT and Linux. It was the first of the P6 family, which included the Pentium II, the Celeron variants, and the Pentium III. The physical package was also a significant advance, as was the incorporation of additional RISC features. However, aimed as it was at the server market, the Pentium Pro did not incorporate MMX technology. It was expensive to produce, as it included the L2 cache on its substrate (but on a separate die) and had 5.5 million transistors at its core and over 8 million in its L2 cache. Its core logic operated at 3.3 V. The microprocessor was still, however, chiefly CISC in design, and optimized for 32-bit operation. The chief features of the Pentium Pro were:
A partly integrated L2 cache of up to 512 KB (on a specially manufactured SRAM separate die) that was connected via a dedicated ‘backside’ bus that ran at full CPU speed.
Three 12-stage pipelines
Speculative execution of instructions
Out-of-order completion of instructions
40 renamed registers
Dynamic branch prediction
Multiprocessing with up to 4 Pentium Pros
An increased address bus size of 36 bits (up from 32) to enable up to 64 GB of memory to be used. (Note that the 4 extra address bits multiply the addressable space by 2^4 = 16; this gives 4 GB x 16 = 64 GB of memory.)
The following description, taken from Intel’s introduction to its microprocessor architecture, is relevant to all members of the P6 family, including the Celeron, Pentium II and III. The Intel Pentium Pro processor had a three-way superscalar architecture. The term “three-way superscalar” means that, using parallel processing techniques, the processor is able on average to decode, dispatch, and complete execution of (retire) three instructions per clock cycle. To handle this level of instruction throughput, the Pentium Pro processor used a decoupled, 12-stage superpipeline that supported out-of-order instruction execution. It did this by incorporating even more parallelism than the Pentium processor. The Pentium Pro processor provided Dynamic Execution (micro-data flow analysis, out-of-order execution, superior branch prediction, and speculative execution) in a superscalar implementation.
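The address-space arithmetic behind the move from 32 to 36 address bits can be checked with a short sketch. The function name is an illustrative assumption, and byte-addressable memory is assumed.

```python
# Illustrative sketch: bytes addressable with a given number of address bits,
# assuming each address selects one byte.

GIB = 2 ** 30  # one gigabyte in the binary sense used in the text


def addressable_bytes(address_bits: int) -> int:
    """Return how many distinct byte addresses fit in address_bits bits."""
    return 2 ** address_bits


if __name__ == "__main__":
    for bits in (32, 36):
        print(f"{bits} address bits -> {addressable_bytes(bits) // GIB} GB")
    # prints:
    # 32 address bits -> 4 GB
    # 36 address bits -> 64 GB
```

Each extra address bit doubles the addressable space, so four extra bits give a factor of 16.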
The centerpiece of the Pentium Pro processor architecture was an innovative out-of-order execution mechanism called “dynamic execution.” Dynamic execution incorporates three data-processing concepts:
• Deep branch prediction.
• Dynamic data flow analysis.
• Speculative execution.
Branch prediction is a concept found in most mainframe and high-speed RISC microprocessor architectures. It allows the processor to decode instructions beyond branches to keep the instruction pipeline full. In the Pentium Pro processor, the instruction fetch/decode unit used a highly optimized branch prediction algorithm to predict the direction of the instruction stream through multiple levels of branches, procedure calls, and returns.
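To make the idea of dynamic branch prediction concrete, the sketch below implements a classic textbook scheme: a two-bit saturating counter per branch address. This is an assumption chosen for illustration only; the text does not describe the Pentium Pro’s actual prediction algorithm, and the class and function names are hypothetical.

```python
# Textbook two-bit saturating counter predictor (illustrative only; not the
# Pentium Pro's actual algorithm). States 0-1 predict "not taken",
# states 2-3 predict "taken".

class TwoBitPredictor:
    def __init__(self):
        self.counters = {}  # branch address -> counter state (0..3)

    def predict(self, addr):
        """Return True if the branch at addr is predicted taken."""
        return self.counters.get(addr, 1) >= 2  # unseen branches start at state 1

    def update(self, addr, taken):
        """Move the counter one step towards the actual outcome."""
        state = self.counters.get(addr, 1)
        self.counters[addr] = min(state + 1, 3) if taken else max(state - 1, 0)


def accuracy(predictor, addr, outcomes):
    """Fraction of outcomes predicted correctly, training as we go."""
    correct = 0
    for taken in outcomes:
        if predictor.predict(addr) == taken:
            correct += 1
        predictor.update(addr, taken)
    return correct / len(outcomes)


if __name__ == "__main__":
    # A loop-closing branch: taken nine times, then falls through, repeated.
    loop_branch = ([True] * 9 + [False]) * 10
    print(accuracy(TwoBitPredictor(), 0x4000, loop_branch))
```

Because the counter needs two consecutive mispredictions to flip its prediction, a single loop exit does not disturb the prediction for the next pass through the loop.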
Figure 2 Functional Block Diagram of the Pentium Pro Processor Micro-architecture
Dynamic data flow analysis involves real-time analysis of the flow of data through the processor to determine data and register dependencies and to detect opportunities for out-of-order instruction execution. The Pentium Pro processor dispatch/execute unit can simultaneously monitor many instructions and execute these instructions in the order that optimizes the use of the processor’s multiple execution units, while maintaining the integrity of the data being operated on. This out-of-order execution keeps the execution units busy even when cache misses and data dependencies among instructions occur.
Speculative execution refers to the processor’s ability to execute instructions ahead of the program counter, but ultimately to commit the results in the order of the original instruction stream. To make speculative execution possible, the Pentium Pro processor microarchitecture decoupled the dispatching and executing of instructions from the commitment of results. The processor’s dispatch/execute unit used data-flow analysis to execute all available instructions in the instruction pool and store the results in temporary registers. The retirement unit then linearly searched the instruction pool for completed instructions that no longer had data dependencies with other instructions or unresolved branch predictions. When completed instructions were found, the retirement unit committed the results of these instructions to memory and/or the Intel Architecture registers (the processor’s eight general-purpose registers and eight floating-point unit data registers) in the order they were originally issued and retired the instructions from the instruction pool.
Through deep branch prediction, dynamic data-flow analysis, and speculative execution, dynamic execution removed the constraint of linear instruction sequencing between the traditional fetch and execute phases of instruction execution. It allowed instructions to be decoded deep into multi-level branches to keep the instruction pipeline full. It promoted out-of-order instruction execution to keep the processor’s six instruction execution units running at full capacity. And finally it committed the results of executed instructions in original program order to maintain data integrity and program coherency.
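The separation between out-of-order completion and in-order retirement can be sketched with a small data-flow calculation. The model below is a deliberate simplification under stated assumptions: unlimited execution units, perfect register renaming (so only true data dependencies matter), no limit on how many instructions retire per cycle, and made-up latencies. The instruction names and the schedule function are hypothetical.

```python
# Simplified data-flow model: instructions finish as soon as their inputs are
# ready (out of order), but results are committed in program order.
# Assumptions: unlimited execution units, only true data dependencies,
# invented latencies.

# Each entry: (name, destination register, source registers, latency in cycles)
PROGRAM = [
    ("load", "r1", [], 3),      # slow, e.g. waiting on a cache miss
    ("add", "r2", ["r1"], 1),   # depends on the load
    ("mul", "r3", [], 1),       # independent of the load
    ("sub", "r4", ["r3"], 1),   # depends on the mul
]


def schedule(program):
    """Return (finish cycles, completion order, retirement cycles)."""
    last_writer = {}  # register -> index of the most recent producer
    finish = []
    for i, (_name, dest, sources, latency) in enumerate(program):
        ready = max(
            (finish[last_writer[s]] for s in sources if s in last_writer),
            default=0,
        )
        finish.append(ready + latency)
        last_writer[dest] = i

    by_finish = sorted(range(len(program)), key=lambda i: (finish[i], i))
    completion_order = [program[i][0] for i in by_finish]

    # An instruction can retire only after every earlier one has retired.
    retire_cycle, latest = [], 0
    for f in finish:
        latest = max(latest, f)
        retire_cycle.append(latest)
    return finish, completion_order, retire_cycle


if __name__ == "__main__":
    finish, completed, retired = schedule(PROGRAM)
    print("completion order:", completed)
    print("retirement cycle per instruction (program order):", retired)
```

In this example the mul and sub complete before the slow load, yet they cannot retire until the load and the add ahead of them in program order have done so.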
Three instruction decode units worked in parallel to decode object code into smaller operations called “micro-ops” (microcode). These went into an instruction pool and, when interdependencies did not prevent it, were executed out of order by the five parallel execution units (two integer, two FPU and one memory interface unit). The Retirement Unit retired completed micro-ops in their original program order, taking account of any branches.
The power of the Pentium Pro processor was further enhanced by its caches: it had the same two on-chip 8 KB L1 caches as the Pentium processor, and also had a 256-512 KB L2 cache that was in the same package as, and closely coupled to, the CPU, using a dedicated 64-bit (“backside”) full clock speed bus. The L1 cache was dual ported, the L2 cache supported up to 4 concurrent accesses, and the 64-bit external data bus was transaction-oriented, meaning that each access was handled as a separate request and response, with numerous requests allowed while awaiting a response. These parallel features for data access worked with the parallel execution capabilities to provide a “non-blocking” architecture in which the processor was more fully utilized and performance was enhanced.
Pentium Pro Modes of Operation
The Intel IA-32 architecture supports three operating modes: protected mode, real-address mode, and system management mode. The operating mode determines which instructions and architectural features are accessible:
Protected mode. The native state of the processor. In this mode all instructions and architectural features are available, providing the highest performance and capability. This is the recommended mode for all new applications and operating systems. Among the capabilities of protected mode is the ability to directly execute “real-address mode” 8086 software in a protected, multi-tasking environment. This feature is called virtual-8086 mode, although it is not actually a processor mode. Virtual-8086 mode is actually a protected mode attribute that can be enabled for any task.
Real-address mode. Provides the programming environment of the Intel 8086 processor with a few extensions (such as the ability to switch to protected or system management mode). The processor is placed in real-address mode following power-up or a reset.
System management mode. A standard architectural feature unique to all Intel processors, beginning with the Intel386 SL processor. This mode provides an operating system or executive with a transparent mechanism for implementing platform-specific functions such as power management and system security. The processor enters SMM when the external SMM interrupt pin (SMI#) is activated or an SMI is received from the advanced programmable interrupt controller (APIC). In SMM, the processor switches to a separate address space while saving the entire context of the currently running program or task. SMM-specific code may then be executed transparently. Upon returning from SMM, the processor is placed back into its state prior to the system management interrupt.
The basic execution environment is the same for each of these operating modes.
The Pentium II incorporated many of the salient features of the Pentium Pro and Pentium MMX; however, its physical package was based on the SECC/Slot 1 interface and its 512 KB L2 cache ran at only half the processor’s internal clock rate. First generation Pentium II Klamath CPUs operated at 233, 266, 300 and 333 MHz with an FSB of 66 MHz and a core voltage of 2.8 V. In 1998, Intel introduced the Pentium II Deschutes, which operated at speeds of 350, 400 and 450 MHz with a 100 MHz (and later 66 MHz) FSB and a core voltage of 2.0 V. Its major improvements were:
16 KB L1 instruction and data caches
L2 cache with non-proprietary commercially available SRAM
Improved 16 bit capability through segment register caches
Standard Pentium II could only be used in dual multiprocessor configurations; however, Pentium Xeon Processors had up to 2 MB of L2 cache and could be used in multiprocessor configurations of up to 4 processors.
The Celeron began as a scaled-down version of the Pentium II and was designed to compete against similar offerings from Intel’s competitors. The Klamath-based Celeron Covington core ran at 266 and 300 MHz and was constructed without an L2 cache. However, adverse market reaction saw the introduction of the Deschutes-based Mendocino core, which had a 128 KB L2 cache and ran at 300, 333, 400, 433, 466, 500 and 533 MHz. Celerons had the same L1 cache as their bigger brothers, the Pentium II and III. The important distinction is that the Celeron’s L2 cache operated at the full CPU clock rate, unlike that of the Pentium II and the SECC-packaged Pentium III. (Later variants of the Pentium III had an on-die L2 cache which ran at the full CPU clock rate.) The Celeron III (Coppermine128 core) had the same internal features as the Pentium III, but with reduced functionality: a 66 MHz clock rate, no error correction codes for the data bus, no parity generation for the address bus, and a maximum of 4 GB of address space. Celeron III Coppermine128s with a 1.6 V core and a 100 MHz FSB were produced in 2001 and operated at core speeds of up to 1.1 GHz. Tualatin-core Celerons were put on the market in late 2001 and ran at 1.2 GHz. 2002 saw the final versions produced, running at 1.3 and 1.4 GHz.
The only significant difference between the Pentium III and its predecessor was the inclusion of 72 MMX instructions, known as the Internet Streaming Single Instruction Multiple Data Extensions (ISSE), which include integer and floating point operations. However, as with the original MMX instructions, application programmers must use the corresponding extensions if any benefit is to be gained from these instructions. The most controversial and short-lived addition was the CPU ID number, which could be used for software licensing and e-commerce. After protests from various sources, Intel disabled it by default, but did not remove it. Depending on the BIOS and motherboard manufacturer, it may remain disabled, but it can be enabled via the BIOS. In reality, Pentium III performance was based. The three variants of the Pentium III were the Katmai, Coppermine, and Tualatin. The Katmai first introduced the ISSE (MMX/2) as described, with an FSB of 100 MHz. The Coppermine introduced the Advanced Transfer Cache (ATC) for the L2 cache, which reduced cache capacity to 256 KB but saw the cache run at full processor speed. In addition, the 64-bit Katmai cache bus was quadrupled to 256 bits. The Coppermine also used an 8-way set associative cache, rather than the 4-way set associative cache in the Katmai and older Pentiums. Bringing the cache on-die also increased the transistor count to 30 million, from the 10 million on the Katmai. Another advance in the Coppermine was Advanced System Buffering (ASB), which simply increased the number of buffers to account for the increased FSB speed of 133 MHz. The Pentium III Tualatin had a reduced die size that allowed it to run at higher speeds. Tualatins used a 133 MHz FSB and had ATC and ASB.
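The difference between 4-way and 8-way set associativity can be made concrete by working out how a cache of a given size is divided into sets. The sketch below is an illustration under assumptions not stated in the text: a cache line size of 32 bytes, a 512 KB capacity for the older 4-way cache (the figure given earlier for the Pentium II), and a simple modulo mapping of addresses to sets. The function names are hypothetical.

```python
# Illustrative cache geometry for a set associative cache.
# Assumptions: 32-byte cache lines and a simple modulo mapping to sets.

def cache_sets(size_bytes: int, ways: int, line_bytes: int = 32) -> int:
    """Number of sets: total lines divided by the lines (ways) per set."""
    return size_bytes // (ways * line_bytes)


def set_index(address: int, size_bytes: int, ways: int, line_bytes: int = 32) -> int:
    """Set that a byte address maps to; it may occupy any way within that set."""
    return (address // line_bytes) % cache_sets(size_bytes, ways, line_bytes)


if __name__ == "__main__":
    KB = 1024
    print("512 KB, 4-way:", cache_sets(512 * KB, 4), "sets")
    print("256 KB, 8-way:", cache_sets(256 * KB, 8), "sets")
    print("address 0x12345 maps to set", set_index(0x12345, 256 * KB, 8))
```

With more ways per set, more lines that share the same set index can reside in the cache at once, which reduces conflict misses at the cost of comparing more tags on each lookup.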