The Intel Pentium Processor

Abstract:-

This term paper describes the performance of the Intel Pentium processor family. It also covers software optimization techniques & future generations of IA-32 high-performance processors. The information on performance results, tools & techniques for software optimization is intended to help managers, architects & engineers deliver industry-leading software performance. The paper also traces the evolution of Pentium processors over time.

Introduction:-

Throughout the history of Intel IA-32 processors, the early life cycle of each micro-architecture generation has delivered large performance gains. However, as a micro-architectural design matures, the performance gains start to diminish, & a new micro-architecture is required to maintain the performance trajectory expected by the marketplace. The Intel NetBurst micro-architecture is the latest micro-architecture from Intel that implements the IA-32 architecture. Together with several extensions to the IA-32 architecture, it allows the Pentium 4 processor to deliver the next-generation performance needed to enhance the experience of PC & workstation users in multimedia, internet & many other applications. This new micro-architecture allows performance to scale efficiently to higher frequencies, & it is the foundation for IA-32 processors to deliver industry-leading performance for the next several years.

Designed For Performance

A focused architectural pre-design effort was undertaken to assess the benefits of many advanced processor technologies & to determine the best approach to improving the overall performance of the processor for many years to come. The result was a design that raised frequency capability to more than 40% above that of the Pentium III processor's micro-architecture, known as the P6 micro-architecture, on the same manufacturing process. At the same time, the design effort aimed to deliver an average number of instructions executed per clock (IPC) within approximately 10% to 20% of the P6 micro-architecture. The design effort focused on the following:

The result of this design effort is a new micro-architecture that delivers significantly higher levels of performance & frequency, & provides frequency headroom for future IA-32 processors over the next several years. The design innovations of the Intel NetBurst micro-architecture are first realized in the Pentium 4 processor.
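The trade-off between frequency & IPC described above can be checked with simple arithmetic, since delivered performance is roughly proportional to clock frequency multiplied by IPC. The following is a minimal sketch of that first-order model using the figures quoted above; it ignores memory & I/O effects, treats "more than 40%" as exactly 40%, & the function name is purely illustrative.

```python
def relative_performance(freq_ratio, ipc_ratio):
    """First-order model: performance ~ frequency x IPC, both relative to P6."""
    return freq_ratio * ipc_ratio

# Assumption: frequency gain taken as exactly 1.40x (the text says "more than 40%").
FREQ_RATIO = 1.40

low = relative_performance(FREQ_RATIO, 0.80)   # IPC 20% below P6
high = relative_performance(FREQ_RATIO, 0.90)  # IPC 10% below P6

print(f"Estimated speedup over P6: {low:.2f}x to {high:.2f}x")
```

Under these assumptions the frequency gain outweighs the IPC loss, which is the reasoning behind accepting a lower IPC in exchange for a much higher clock rate.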

Intel NetBurst Micro-Architecture

The Pentium 4 processor, utilizing the Intel NetBurst micro-architecture, is a complete processor redesign that delivers new technologies & capabilities while advancing many of the innovative features, such as "out-of-order speculative execution" & "super-scalar execution," introduced on prior Intel architecture generations. Many of these innovations & advances were made possible by improvements in processor technology, transistor technology & circuit design, & could not previously have been implemented in high-volume, manufacturable solutions. The new technologies & innovative features introduced in the Intel NetBurst micro-architecture are listed below:

Hyper-Pipelined Technology: The hyper-pipelined technology of the NetBurst micro-architecture doubles the pipeline depth compared to the P6 micro-architecture, giving a 20-stage pipeline. This technology significantly increases the performance & frequency scalability of the base micro-architecture.

400-MHz System Bus: By quad-pumping data transfers over a system bus clocked at 100 MHz, together with a buffering scheme that allows sustained 400-MHz data transfers, the Pentium 4 processor supports the industry's highest-performance desktop system bus, delivering a data rate of 3.2 gigabytes per second (GB/s) in & out of the processor. This compares to 1.06 GB/s delivered on the Pentium III processor's 133-MHz system bus.
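The two data rates quoted above follow from the bus clock, the number of transfers per clock & the bus width. The sketch below reproduces them; the 64-bit (8-byte) data bus width is an assumption not stated in the text, & the function name is illustrative.

```python
def bus_bandwidth_gb_per_s(clock_mhz, transfers_per_clock, bus_width_bytes=8):
    """Peak bus bandwidth in GB/s (1 GB = 10**9 bytes).

    bus_width_bytes=8 assumes a 64-bit data bus, which the text does not state.
    """
    return clock_mhz * 1e6 * transfers_per_clock * bus_width_bytes / 1e9

pentium4 = bus_bandwidth_gb_per_s(100, 4)    # quad-pumped 100 MHz clock
pentium3 = bus_bandwidth_gb_per_s(133, 1)    # one transfer per 133 MHz clock

print(f"Pentium 4:   {pentium4:.2f} GB/s")
print(f"Pentium III: {pentium3:.2f} GB/s")
```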

Advanced Dynamic Execution: The Advanced Dynamic Execution engine is a very deep, out-of-order speculative execution engine that keeps the execution units busy. It does so by providing a very large window of instructions from which the execution units can choose, so that they can work around stalls caused by instructions that are not ready to execute because of an unmet dependency (such as waiting for data to be loaded from main memory). The NetBurst micro-architecture can have up to 126 instructions in this window (in flight), versus the P6 micro-architecture's much smaller window of 42 instructions.

The Advanced Dynamic Execution engine also delivers an enhanced branch prediction capability that allows the processor to predict program branches more accurately, reducing the number of branch mispredictions by about 33% compared to the P6 micro-architecture's branch prediction capability. It does this by implementing a 4-kilobyte (KB) branch target buffer that stores more detail on the history of past branches, & by implementing a more advanced branch prediction algorithm. This enhanced branch prediction capability is one of the key design elements that reduce the NetBurst micro-architecture's overall sensitivity to the branch misprediction penalty.

Rapid Execution Engine: Through a combination of architectural, physical & circuit designs, the Arithmetic Logic Units (ALUs) within the processor run at two times the frequency of the processor core. This allows the ALUs to execute certain instructions in ½ a core clock & results in higher execution throughput as well as reduced latency of execution.

Advanced Transfer Cache: The level 2 Advanced Transfer Cache is 256 KB in size & provides a much higher data throughput channel between the level 2 cache & the processor core. The Advanced Transfer Cache has a 256-bit (32-byte) interface that transfers data on each core clock. As a result, a 1.5-GHz Pentium 4 processor can deliver a data transfer rate of 48 GB/s (32 bytes x 1 data transfer per clock x 1.5 GHz = 48 GB/s). This compares to a transfer rate of 16 GB/s on a 1-GHz Pentium III processor, & contributes to the processor's ability to keep the high-frequency execution units busy executing instructions instead of sitting idle.
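The cache transfer rates above can be reproduced with the same formula used in the text. In this sketch the Pentium 4 figures come directly from the paragraph above; the Pentium III parameters (a 32-byte interface transferring on every other clock) are an assumption chosen to match the quoted 16 GB/s, & the function name is illustrative.

```python
def cache_bandwidth_gb_per_s(interface_bytes, transfers_per_clock, core_ghz):
    """Peak L2-to-core bandwidth in GB/s: bytes x transfers per clock x GHz."""
    return interface_bytes * transfers_per_clock * core_ghz

# Pentium 4: 32-byte interface, one transfer per core clock, 1.5 GHz (from the text).
pentium4_l2 = cache_bandwidth_gb_per_s(32, 1.0, 1.5)

# Pentium III: assumed 32-byte interface with one transfer every other clock at 1 GHz.
pentium3_l2 = cache_bandwidth_gb_per_s(32, 0.5, 1.0)

print(f"Pentium 4 at 1.5 GHz: {pentium4_l2:.0f} GB/s")
print(f"Pentium III at 1 GHz: {pentium3_l2:.0f} GB/s")
```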

Execution Trace Cache: The Execution Trace Cache is an innovative way to implement a level 1 instruction cache. It caches decoded IA-32 instructions (micro-ops), thus removing the latency associated with the instruction decoder from the main execution loops. In addition, the Execution Trace Cache stores these micro-ops in the order of program execution flow, so that the results of branches in the code are integrated into the same cache line. This increases the instruction flow from the cache & makes better use of the overall cache storage space (12K micro-ops), since the cache no longer stores instructions that are branched over & never executed. The net result is a means of delivering a high volume of instructions to the processor's execution units & a reduction in the overall time required to recover from mispredicted branches.

Streaming SIMD Extensions 2 (SSE2): With the introduction of SSE2, the NetBurst micro-architecture extends the SIMD capabilities of Intel MMX technology & the SSE extensions by adding 144 new instructions that perform 128-bit SIMD integer arithmetic operations & 128-bit SIMD double-precision floating-point (FP) operations. These new instructions allow programmers to execute a particular program task on Pentium 4 processors with fewer instructions & in less time. As a result, using the SSE2 extensions can contribute significantly to an overall performance increase.
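To illustrate why 128-bit SIMD operations reduce instruction counts, the sketch below computes how many data elements fit in one 128-bit register & models a single packed addition in plain Python. It is a conceptual model only, not real SSE2 code, & the names lanes & simd_add are hypothetical.

```python
REGISTER_BITS = 128  # width of one SSE2 register operand

def lanes(element_bits, register_bits=REGISTER_BITS):
    """Number of elements of the given width packed into one register."""
    return register_bits // element_bits

def simd_add(a, b):
    """Model one packed add: all lanes are added by a single instruction."""
    if len(a) != len(b):
        raise ValueError("operands must have the same number of lanes")
    return [x + y for x, y in zip(a, b)]

for name, bits in [("8-bit integers", 8), ("16-bit integers", 16),
                   ("32-bit integers", 32), ("double-precision FP", 64)]:
    print(f"{name}: {lanes(bits)} per 128-bit operation")

# Two double-precision additions performed as one packed operation.
print(simd_add([1.5, 2.0], [0.5, 1.0]))
```

A scalar loop would need one add per element, whereas the packed form handles a whole register of elements per instruction.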

Hardware Prefetcher: The automatic hardware prefetcher operates transparently, without requiring the programmer's active intervention. It is triggered by regular access patterns & helps predict future accesses, thereby overlapping memory latency with computation. By enabling concurrency between memory accesses & computation, it maximizes the computational benefit of higher Pentium 4 processor frequencies.

Desktop Performance Expectations

The scalability of application performance with higher processor frequencies varies greatly across applications. This is because different applications have different requirements & are coded differently. Application code can be divided into two categories: integer & basic office productivity applications, & floating-point & multimedia applications. The IPC achievable by these application categories varies greatly, & this variance is strongly affected by the number of branches that the application code typically takes & the predictability of those branches. The more branches taken & the lower their predictability, the greater the chance of predicting a branch incorrectly, & hence of performing nonproductive work.

Integer & basic office productivity applications, such as word processing & spreadsheets, tend to have many branches in the code, which reduces their overall IPC capability. Because of the associated branch penalties, performance on these applications does not generally scale as well with frequency & is more resistant to micro-architectural improvements such as deeper pipelines. However, significantly raising the performance level of these types of applications in basic, non-multitasking environments does not necessarily improve the user's experience, because the processing power they require tends to be satisfied by today's higher-end Pentium III processors.

Floating-point & multimedia applications tend to have branches that are very predictable, & thus naturally have a higher average IPC capability. As a result, these types of applications generally scale very well with frequency & tend to benefit greatly from deeper pipelines. In addition, the processing power required by these applications tends to be unbounded: the more performance that is available, the better the user's experience.

The Pentium 4 processor shows immediate performance improvements across most existing software available today, with performance levels varying depending on the application category type & the extent that an application is optimized for the new micro-architecture.

Intel Pentium Micro-Architecture

The original Pentium microprocessor was code-named "P5". Its product code was 80501 (80500 for the earliest steppings) & it operated at 60 MHz & 66 MHz. It contained 3.1 million transistors & measured 16.7 mm by 17.6 mm for an area of 293.92 mm². It was fabricated in a 0.8 µm BiCMOS process.

P54C

The P5 was followed by the P54C (80502), which operated at 75, 90 & 100 MHz. It employed an internal clock multiplier to let the internal circuitry run at a higher frequency than the front side bus, since raising the front side bus frequency is much more difficult. It also allowed two-way multiprocessing. It contained 3.3 million transistors & had an area of 163 mm². It was fabricated in a 0.5 µm (described by Intel as "0.6 µm") BiCMOS process.

P54CQS

The P54C was followed by the P54CQS, which operated at 120 MHz. Contrary to early rumors that it would be a CMOS design, it was fabricated in a 0.35 µm BiCMOS process, & it was the first commercial microprocessor to be fabricated in a 0.35 µm process. It had the same transistor count as the P54C &, despite the newer process, the same area as well. This was due to time-to-market requirements. The chip was connected to the package using wire bonding, which only allows connections along the edges of the chip. A smaller chip would have required a redesign of the package, as there is a limit on the length of the wires & the edges of the chip would have been further away from the pads on the package. The solution was to keep the chip at the same size, retain the existing pad-ring, & reduce only the size of the Pentium's logic circuitry to enable it to achieve higher clock frequencies.

P54CS

The P54CQS was followed by the P54CS, which operated at 133, 150, 166 & 200 MHz. It contained 3.3 million transistors, had an area of 90 mm² & was fabricated in a 0.35 µm BiCMOS process with four levels of interconnect.

Bugs & Problems

Early versions of the 60-100 MHz Pentiums had a flaw in the floating-point unit that produced incorrect results for some division operations. This bug, discovered in 1994 by Professor Thomas Nicely at Lynchburg College, Virginia, became known as the Pentium FDIV bug & caused embarrassment for Intel, which created an exchange program to replace the faulty processors. Soon afterwards, a bug was discovered that could allow a malicious program to crash a system without any special privileges; fortunately, operating systems were able to implement workarounds to prevent crashes.

The 60 & 66 MHz 0.8 µm versions of the Pentium processors also had high heat production due to their 5V operation, & were often known colloquially as "coffee warmers" or some similar nickname. The P54C used 3.3V & had significantly lower power draw. P5 Pentiums used Socket 4, while P54C started out on Socket 5 before moving to Socket 7 in later revisions. All desktop Pentiums from P54CS onwards used Socket 7.

Pentium OverDrive

The P24T Pentium OverDrive processors for 486 systems were released in 1995. They were based on 3.3 V, 0.6 µm versions of the Pentium & ran at 63 or 83 MHz. Since they used Socket 2/3, some modifications had to be made to compensate for the 32-bit data bus & slower on-board L2 cache of 486 motherboards. They were therefore equipped with a 32 KB L1 cache.

P55C, Tillamook

Intel Pentium MMX Micro-Architecture

The P55C (or 80503) was developed by Intel's Research & Development Center in Haifa, Israel. It was sold as Pentium with MMX Technology; although it was based on the P5 core, it featured a new set of 57 "MMX" instructions intended to improve performance on multimedia tasks, such as encoding & decoding digital media data. The Pentium MMX line was introduced on 22 October 1996.

The new instructions work on new data types: 64-bit packed vectors of either eight 8-bit integers, four 16-bit integers, two 32-bit integers, or one 64-bit integer. For example, the PADDUSB instruction adds two vectors, each containing eight 8-bit unsigned integers, pairwise; each addition that would overflow saturates, yielding 255, the maximum unsigned value that can be represented in a byte. These rather specialized instructions generally require special coding by the programmer in order to be used. The performance of the P55C was also improved over previous versions by doubling the Level 1 CPU cache from 16 KB to 32 KB.
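The saturating behaviour of PADDUSB described above can be modelled in a few lines. This is a plain-Python sketch of the instruction's arithmetic, not MMX code; the function name & sample values are illustrative.

```python
def paddusb(a, b):
    """Model PADDUSB: pairwise add of eight unsigned bytes with saturation at 255."""
    if len(a) != 8 or len(b) != 8:
        raise ValueError("each operand must hold exactly eight bytes")
    if any(not 0 <= v <= 255 for v in a + b):
        raise ValueError("every element must be an unsigned byte (0..255)")
    # A sum above 255 does not wrap around; it is clamped to 255.
    return [min(x + y, 255) for x, y in zip(a, b)]

a = [10, 200, 255, 0, 128, 100, 250, 1]
b = [20, 100, 1, 0, 128, 100, 10, 254]

print(paddusb(a, b))  # [30, 255, 255, 0, 255, 200, 255, 255]
```

Saturation is useful for pixel data: adding brightness to an almost-white pixel leaves it white rather than wrapping around to a dark value.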

It contained 4.5 million transistors & had an area of 140 mm². It was fabricated in a 0.28 µm CMOS process with the same metal pitches as the previous 0.35 µm BiCMOS process, so Intel described it as "0.35 µm" because of its similar transistor density. The process has four levels of interconnect.

Pentium P55C notebook CPUs used a "mobile module" that held the CPU. This module was a PCB with the CPU directly attached to it in a special smaller form factor. The module snapped to the notebook motherboard & typically a heat spreader plate was installed & made contact with the module. Such notebooks frequently used the Intel 430MX chipset, a feature-reduced 430FX. However, with the 0.25 µm Tillamook Mobile Pentium MMX, the module also held the 430TX chipset along with the system's 512 KB SRAM cache memory.

While the P55C is compatible with the common Socket 7 motherboard configuration, its voltage requirements differ from the standard Socket 7 specification. Because some manufacturers did not prepare for the introduction of MMX technology, most Socket 7 motherboards manufactured before the P55C standard was established do not supply the dual voltage required for proper operation of this chip. Intel temporarily manufactured a conversion kit called the OverDrive that was designed to compensate for this lack of planning on the motherboard manufacturers' part.

Performance Optimization Techniques For Pentium Processor

Many independent software vendors (ISVs) have software applications that deliver good performance with Pentium III processors. An important question for programmers & these ISVs is: how can my existing applications benefit from the performance potential of the Pentium 4 processor? The answer is that most applications will see immediate benefits from the higher processor clock rates & the many micro-architectural enhancements available in the Pentium 4 processor without any software optimization. To obtain an even greater performance gain, an application may be recompiled using a compiler that generates optimized code for the Pentium 4 processor, or linked with libraries optimized for the Pentium 4 processor. Finally, software vendors may achieve the highest performance gains by following the programming guidelines outlined in the Intel Pentium 4 Processor Optimization Reference Manual & by using the SIMD integer & double-precision floating-point instructions included in the SSE2 extensions.

Another important question for ISVs is: how much performance gain can optimizing for the Pentium 4 processor deliver for an ISV's application? Since Pentium 4 processors are designed to run at significantly higher frequencies than Pentium III processors manufactured on the same process technology, a useful metric for performance gain is to compare a Pentium 4 processor with a Pentium III processor when the Pentium 4 runs at 1.5 times the Pentium III's frequency. Two useful reference points for the performance gains of the Pentium 4 processor are SPECint2000 & SPECfp2000 (collectively known as SPEC CPU 2000).

SPECint2000 is a suite of workloads (including data compression utilities, a C compiler, a chess program, etc.) that are representative of many typical computational tasks implemented using integer code. The computational tasks selected for the SPECint2000 workload are similar to those in a wide range of commercial applications. The SPECint2000 workload provides a better measure of processor performance because it is not diluted by non-scaling code such as waiting for user input or input/output operations on peripheral devices. Using SPECint2000 as a reference, integer code can expect a performance gain of about 1.2 times (1.2X) on a Pentium 4 processor relative to a Pentium III processor at the frequency ratio described above. The SPECfp2000 workload represents a wide range of floating-point-intensive computational tasks (including shallow water modeling, 3D graphics, neural networks, computer vision, etc.). On a Pentium 4 processor, SPECfp2000 can achieve a 1.7X performance gain relative to a Pentium III processor. Other typical floating-point & multimedia code may expect performance gains in the range of 1.3X to 1.7X, depending on the details of individual code constructs.
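Because the comparison above is made at 1.5 times the Pentium III's frequency, dividing each quoted gain by 1.5 gives a rough indication of the work done per clock relative to the Pentium III. The sketch below performs that division; it is a back-of-the-envelope reading of the quoted figures rather than a measured result, & the function name is illustrative.

```python
FREQ_RATIO = 1.5  # Pentium 4 frequency relative to Pentium III, as stated in the text

def per_clock_ratio(speedup, freq_ratio=FREQ_RATIO):
    """Work done per clock relative to Pentium III: overall speedup / frequency ratio."""
    return speedup / freq_ratio

quoted_gains = {
    "SPECint2000": 1.2,
    "SPECfp2000": 1.7,
    "other FP/multimedia (low)": 1.3,
}

for name, gain in quoted_gains.items():
    print(f"{name}: {gain:.1f}X overall, {per_clock_ratio(gain):.2f} per clock")
```

On these figures, integer code does about 0.8 of the Pentium III's work per clock, which is consistent with the design target of an IPC within roughly 10% to 20% of the P6 micro-architecture, while SPECfp2000 does more work per clock than on the Pentium III.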

The SPEC CPU 2000 benchmark results illustrate the performance scaling of the Pentium 4 processor when the application code is generated by compilers that can produce optimized code for superscalar, out-of-order processors with some additional consideration for the Pentium 4 processor. Applications that have not been updated by such compilers can also benefit from recompiling and/or linking with optimized libraries. The actual performance gain in an application will depend on many factors, including the characteristics of the workload mix, the proportion of integer versus floating-point code, the ability to identify & correct coding pitfalls, hardware & software configurations, & test procedures.

Pentium II

The Pentium II brand refers to Intel's sixth-generation micro-architecture ("Intel P6") & x86-compatible microprocessors introduced on May 7, 1997. Containing 7.5 million transistors, the Pentium II featured an improved version of the first P6-generation core of the Pentium Pro, which contained 5.5 million transistors. However, its L2 cache subsystem was a downgrade compared to the Pentium Pro's. In early 1999, the Pentium II was superseded by the Pentium III.

In 1998, Intel stratified the Pentium II family by releasing the Pentium II-based Celeron line of processors for low-end workstations & the Pentium II Xeon line for servers & high-end workstations. The Celeron was characterized by a reduced or omitted (in some cases present but disabled) on-die full-speed L2 cache & a 66 MT/s FSB. The Xeon was characterized by a range of full-speed L2 cache (from 512 KB to 2048 KB), a 100 MT/s FSB, a different physical interface (Slot 2), & support for symmetric multiprocessing.

Pentium 4 Processor Performance Results

The case study of optimizing the MPEG-2 decoder illustrates the substantial application-level performance gain that software can realize on a Pentium 4 processor by strategically identifying & addressing critical code paths in an application. Typically, benchmarks deliver higher performance results on Pentium 4 processors due to the higher processor frequency & the features of the Intel NetBurst micro-architecture. In some cases, newer benchmarks deliver performance results that reflect the benefit of recompilation using compilers that generate optimized code for the Pentium 4 processor. This section presents the Pentium 4 processor performance results of several benchmarks & representative applications, with comments on the workload characteristics of these results.

Summary

The Intel Pentium 4 processor is designed to deliver next-generation performance for desktop & workstation clients. It is based on the Intel NetBurst micro-architecture, which enables significantly higher clock rates & better performance scaling efficiency. Many software applications deliver appreciable performance gains on the Pentium 4 processor by directly benefiting from higher clock rates & micro-architectural enhancements such as the Rapid Execution Engine & Execution Trace Cache, while others can gain dramatic improvements through recompilation using the latest optimizing compilers & libraries, or through assembler-level optimizations that specifically target the micro-architecture & use the SSE2 instruction set. Using a closed-loop performance tuning methodology, compilers that can generate optimized Pentium 4 processor code, & the "pDiff" tool in conjunction with the VTune analyzer, programmers can quickly identify critical code & opportunities to gain user-appreciable performance on Pentium III & Pentium 4 processors as well as future IA-32 processors.
