Since the very beginnings of computing, the quest for efficiency and faster performance has driven rapid advances in the design of computing hardware. From the 860-square-foot, vacuum-tube-powered monstrosity that was the ENIAC to the modern system-on-a-chip, there have been innumerable ways to relay and process data. Vacuum tubes were replaced by transistors, which in turn were incorporated into the microprocessor. In the last twenty years, personal computing devices have become increasingly common, with large corporations designing software for the everyday user. Until recently, performance gains were largely achieved by increasing processor clock speeds. However, CPU power dissipation and increased power consumption are problems that arise with the continued use of this approach (Hennessy & Patterson 2007, p. 18). Chip manufacturers have dealt with these concerns by designing processors with multiple 'cores'. This essay outlines multi-core machines with respect to their architecture, instruction sets and the design challenges currently faced.
1.1 What is a multi-core processor?
Before we define what a multi-core machine is, we take a look at a taxonomy which categorizes processor design based on the parallelism in instruction and data streams (Hennessy & Patterson 2007, p. 197, Appendix E, Appendix H).
1.1.1 Flynn's Taxonomy
• Single instruction stream, single data stream (SISD) – The uniprocessor.
• Single instruction stream, multiple data streams (SIMD) – Processors that exploit data-level parallelism.
• Multiple instruction streams, single data stream (MISD) – Systolic arrays/stream-based processing.
• Multiple instruction streams, multiple data streams (MIMD) – Processors that operate multiple threads in parallel.
1.1.2 Definition of a multi-core processor
A 'multi-core' machine is essentially an MIMD architecture (Hennessy & Patterson 2007, p. 198) that comprises two or more 'cores', or processing units, on a single integrated chip. The greatest advantage of a multi-core machine over a single-core machine is its ability to handle true thread-level parallelism.
1.2 Evolution of the multi-core processor
The evolution of the microprocessor has been characterised by the increase in clock speeds and the decrease in the size of hardware components. The past twenty years have seen Intel leading the pack of microprocessor manufacturers, at least for the desktop segment. The first ever microprocessor, the 4-bit 4004, was introduced by Intel in the early 1970s operating at a speed of 108 KHz (Stallings 2003, p. 35). Since then microprocessor power has grown exponentially.
1.2.1 Moore's Law
Figure 1.1 Moore's Law, Intel.
A rudimentary guiding principle in the design of processor architecture, Moore's law (1965) stated that the number of transistors on a chip would roughly double every two years (Stallings 2003, p. 29).
1.2.2 Microprocessors through the years
The 8-bit 8008 and 8080 were both developed by Intel in the early 1970s, while Motorola fabricated its 6800. The move to 16-bit processors saw Intel develop the 8086 and 8088 while Motorola developed the 68000. Intel's 16-bit 80286 was followed by the 32-bit 80386, a big revolution that led to the Pentium line of 32-bit personal computers, which were fabricated until 2007. A later move to 64-bit architectures also solved problems with addressable memory space.
1.2.3 Performance through the years
Processors saw massive increases in clock speed, from 108 kHz all the way to the 3.8 GHz Pentium 4, and processor frequency was directly related to performance. Pipelining (Patterson and Hennessy 2005, p. 370), in which multiple instructions are overlapped in execution, was developed to increase performance. Branch prediction, register renaming, trace caches, reorder buffers and dynamic scheduling were further techniques developed to increase performance (Schauer 2008, p. 3). However, increasing clock speeds brought design challenges owing to massive power consumption and CPU power dissipation, and multi-core machines became commercially available in 2007-2008.
2 Architecture of the Multi-Core Processor
In this section, we briefly discuss single core processor architecture first and then discuss multi-core architecture.
2.1 Pentium – A single-core architecture design by Intel
Figure 2.1 shows the internal architecture of a basic single-core Pentium microprocessor, released in the early 90s. On 22 March 1993, Intel released a microprocessor under the name P5 as its initial fifth-generation micro-architecture. A long series of Pentium processors followed, such as the Pentium Pro, II, III, 4, D and M.
Figure 2.1 Pentium Architecture, Black 2010.
The improvements in the Pentium architecture over previous Intel processors were the multiple integer pipelines, which Intel named the U-pipeline and V-pipeline, the pipelined floating-point unit, the BTB (Branch Target Buffer) for branch prediction, the separate data and instruction caches, the introduction of even parity checking for the 64-bit data bus, and TLBs (Translation Look-aside Buffers). The architecture of the early Pentium used around three million transistors on a large 294 mm² silicon die, making it among the largest microprocessors ever designed. Most of the overhead is possibly in the instruction decode, the microcode ROM and the control unit; the instruction set had its effects as well.
2.2 Multi-core processor architecture
We discuss a basic multi-core processor architecture that is not specific to any design. Hardware-wise, the basic idea is to build multiple replicas of the same core on one die. As clock speed can only be increased up to a certain level, numerous techniques have been developed to increase CPU performance, and the most successful has been to introduce multiple cores on a single silicon die. There is large variety in the composition of multi-core architectures: some designs consist of identical replicated cores, while others combine different processor cores, each designed for a specific purpose (Teich 2005).
Figure 2.2 Simple Multi-core CPU Architecture, Barbic 2007.
2.3 The SPARC64 VI Architecture
We discuss the internal architecture of the SPARC64 VI as an example. The SPARC64 VI is a dual-core microprocessor designed and jointly produced by Fujitsu and Sun Microsystems in July 2008. Each 'core' of this dual-core microprocessor is an enhanced SPARC64 V+ microprocessor; the SPARC64 V+ was shrunk so that it could be deployed on a single die along with a secondary cache, as shown below.
Figure 2.3 Block diagram of SPARC64 VI dual core microprocessor, Utrecht 2007.
There is no third-level cache in the SPARC64 VI, but there is a large 6 MB L2 cache on the chip, shared by both cores in the processor. The multi-threading technique used in the SPARC64 VI is coarse-grained multi-threading, which Fujitsu names vertical multi-threading (VMT). Each core executes two threads, but only one thread executes at a time: either time sharing decides which thread runs, or, if the running thread is performing a long-latency action, the pipeline switches to the other thread in a timely fashion. Multi-threading requires replication of the control register, integer registers, program counter and floating-point registers, so there is an individual set for each thread.
The internal architecture of the processor core of SPARC64 VI is shown in Figure 2.4.
Figure 2.4 Block diagram of the SPARC64 VI processor core, Utrecht 2007.
The L1 data cache (D-cache) and instruction cache (I-cache) are 128 KB each. An IB (Instruction Buffer) can hold up to 48 4-byte instructions and continues to supply the registers via the IWR (Instruction Word Register) when a miss occurs in the L1 I-cache. At most 4 instructions can be scheduled every cycle, finding their path to the registers via the RSE (reservation stations for execution units), RSA (reservation stations for address generation) and RSF (reservation stations for the integer and floating-point unit).
The general register files serve both EX-A and EX-B (the integer execution units) and EAG-A and EAG-B (the address generation units). EX-A and EX-B are not identical: for example, EX-A performs multiply and divide instructions while EX-B performs additions. There are two separate floating-point registers (FPRs) which feed the floating-point units FL-A and FL-B, and at most four floating-point results can be produced per cycle. Moreover, FL-A and FL-B can also execute divide and square-root operations, another enhancement over the SPARC64 V+, which had a separate component for these operations. The divide and square-root instructions are not pipelined because of their iterative nature. Results from the EUs (execution units) are returned to the registers through update buffers: the FUB (Floating-point Update Buffer) for the floating-point registers and the GUB (General register Update Buffer) for the general registers. Each RS (reservation station) can hold up to 10 instructions, which allows for the speculative dispatching of instructions: an instruction whose operand is not yet ready can be executed later, once the operand becomes ready. The assumption is that this allows a smoother flow of instructions towards the execution units.
3 Instruction Set Architecture
An instruction set architecture (ISA) defines the assembly instructions of a microprocessor. This includes the information needed to interact with the microprocessor: all the details a programmer needs to write a program for it. The ISA thus serves as the boundary between hardware and software (Patterson and Hennessy 2005, p. 22).
An instruction is an operation to be performed by the processor and includes one or more operands. Instruction sets vary with processor design; CISC, RISC and x86 are the main families in the current arena. An instruction is encoded as a binary word of 8, 16, 32 or 64 bits according to the processor's architecture.
8-bit microprocessors (e.g., the 8085) use dedicated registers to add two numbers. For example, ADD B specifies that the operation is performed between the content of the 8-bit register B and the 8-bit accumulator, with the result stored back in the accumulator. Single-operand instructions are predominant.
But in 32-bit microprocessors (e.g., the Pentium), it is assumed that both operands are stored in registers. For example, ADD BX, CX adds the 16-bit content of register CX to the content of BX and stores the result in BX. Two-operand instructions are predominant in these microprocessors.
Example commands for a simple accumulator-based instruction set could be:
• Load AC from memory: lw AC, (468)
• Store AC to memory: sw AC, (600)
• Add to AC from memory: add AC, (300).
3.1 Instruction set characteristics
An instruction consists of the following elements:
Operation Code (Opcode): The operation to be performed, for example ADD, SUB etc.
Source Operand Reference: The operands that serve as inputs to the operation.
Result Operand Reference: Where the result of the operation is stored.
Next Instruction Reference: Tells the CPU where to fetch the next instruction after the current one completes.
3.1.1 Types of instructions
These instructions can be divided into groups based on the type of operation they perform. The classification of assembly language instructions is as follows.
• Data Transfer Instructions – These instructions perform common operations to move data from one place to another but do not modify the data.
• Data Operation Instructions – These instructions modify the data. Arithmetic instructions make up a large part of this type.
• Program Control Instructions – These instructions control the flow of the program, e.g., JUMP.
3.1.2 Data types
A microprocessor may operate on more than one data type, e.g., integer, boolean or character.
3.1.3 Addressing modes
During the execution of an instruction, the microprocessor determines the operand and destination addresses. The manner in which it accomplishes this task is called the addressing mode.
3.1.4 Instruction formats
When an assembly language instruction is converted into machine code, it is represented as a binary value called the instruction code. In this code, different groups of bits represent different parts of the instruction, e.g., ADD A,B,C (A = B + C):
4 bits 2 bits 2 bits 2 bits
Opcode Operand 1 Operand 2 Operand 3
Equivalent instruction code: 1010 00 01 10.
3.2 Parallelism
In section 1.1.1, a brief taxonomy characterizing parallelism in instruction and data streams was discussed. To continue that remark, MIMD is the architectural class for parallel computing typically used in multiprocessor computers; multi-core machines thus follow the MIMD design (Hennessy & Patterson 2007, p. 198).
Instruction-level parallelism is not true parallelism; it is achieved by improvement techniques such as pipelining. Multi-core machines go beyond this and achieve thread-level parallelism. However, such parallelism is not easy to achieve just yet. We discuss the challenges of implementing thread-level parallelism on multi-core machines in the following section.
4 Challenges for multi-core processors
There are a few challenges with having multiple cores on the same die. The total power consumed by the processor multiplies with each core on the die; this can be reduced by lowering core frequencies. Heat dissipation is also a problem, and it is addressed by strategically placing the cores to reduce hot spots (Schauer 2008, p. 9).
With regards to parallel processing of instructions, there are three main challenges.
4.1 Cache coherence
Multi-level caches substantially reduce a processor's memory bandwidth requirements (Hennessy & Patterson 2007, pp. 204-237). In multi-core environments, however, the same cache is shared among the cores. This poses a significant problem: the most recent data needed by one core may have been changed by parallel processing on another core. To ensure coherence, it is essential that if a core operates on shared memory, the other cores know about the operation. Also, writes to the same memory location must be serialized to prevent illogical processing.
4.1.1 Snooping protocols
This state-based protocol maintains the coherence requirements as follows:
• Send all requests for data to all cores.
• Cores snoop to see if they have a copy and respond accordingly.
This protocol requires broadcasting to all cores, and it works very well with a bus medium, as seen on most small-scale computing machines in the market.
4.1.1.1 A Write-Invalidate Snooping Protocol
This is a common snooping protocol which works as follows:
A core gets exclusive access to a cache block, invalidating all other copies, before writing it. When another core then reads an invalidated cache block, it is forced to fetch a new copy. If two processors attempt to write simultaneously, one of them goes first, owing to the bus medium; the other must obtain a new copy, thereby enforcing write serialization.
4.2 Synchronisation
Another significant challenge arising from parallelism in multi-core machines is synchronizing processes (Hennessy & Patterson 2007, pp. 237-242). Take, for example, the classic bank account problem implemented on three processor cores:
Core 0:
lw $t0,balance
lw $t1,amount
add $t0,$t0,$t1
sw $t0,balance

Core 1:
lw $t2,balance
lw $t3,amount
sub $t2,$t2,$t3
sw $t2,balance

Core 2:
lw $t4,balance
lw $t5,amount
add $t4,$t4,$t5
sw $t4,balance
With an increase in the number of cores, invasion of a process's critical section becomes even more common. To overcome such a catastrophic situation, a 'spin lock' is built using hardware primitives (the coherence mechanisms of the multi-processor system).
4.3 Memory Consistency
Another challenge besetting multi-core designs is that different cores may not see writes at the same time. Sequential consistency is the property that all cores see all loads and stores happening in the same order (Hennessy & Patterson 2007, pp. 243-246). It can be implemented in a simple way by delaying the completion of any memory access until all invalidations caused by that access are complete, or by delaying the next memory access until the previous one completes. A relaxed implementation of these methods, using synchronization techniques, allows for greater efficiency.
4.4 Multithreaded programming
A very relevant challenge facing multi-core design is the development of multithreading and parallel-processing techniques that derive the most from a multi-core processor. With the possible exception of Haskell and perhaps Java, there aren't many programming languages that support multi-core extensions. To develop programs that use multi-core machines efficiently, programmers need to use the shared resources properly, and this requires the correct multi-core extensions for the particular programming language.
5 Conclusions
Most modern end-user machines in the market today have processors sporting multi-core designs. The last twenty years have seen herculean jumps in processor technology. Increasing clock speeds introduced new problems, which were solved by increasing the number of processor cores on a single chip. However, even this new design comes with its own issues.
Beyond the design challenges, there are other issues to address as well. Cache access times have largely remained the same despite the increase in the number of cores accessing the same cache via a bus. The number of cores on a single chip is expected to reach the hundreds in the next few years, considering that Intel recently announced a 48-core chip. The level of parallelism in most multi-threaded applications does not make full or balanced use of the multi-core design. However, we have already seen much improvement in some areas: there are excellent multi-core libraries available for GHC and javac, and there is much ongoing research by the big corporations into commercially viable software for multi-core systems. As has always been the case, these innovations will ensure that multi-core processors continue to develop and break new ground.