Introduction To Arm Cortex M3 Processor Information Technology Essay

Published: November 30, 2015 Words: 4193

A simple computer architecture, shown in figure 1, includes one or more input units (e.g. a keyboard), one or more output units (e.g. a visual display unit or VDU), a memory and a central processing unit (CPU). The memory and the CPU together form an integral part of a computer system, since input and output signals can be integrated within the memory system.

A memory is a physical device or a virtual unit that stores the instructions and data the CPU needs to carry out its functions, such as processing an instruction or running a computer application. The CPU interprets the instructions stored in the memory, performs the calculations, controls the flow of data along the data bus and determines which memory address to use. A system view of the ARM Cortex M3 chip or microcontroller unit (MCU) is shown in figure 2.

Figure 2: System View of ARM Cortex M3 MCU

As shown in figure 2, the Cortex M3 core/processor and the Debug system are exclusively developed by ARM Ltd, while other components can be developed by ARM, various design houses and chip manufacturers. Each Cortex M3 MCU-chip can be of different types and may comprise different memory sizes, peripherals and features. The Cortex-M3 processor includes a number of fixed internal debugging components, which provide various debugging operation supports and features, such as breakpoints, watchpoints, fault conditions and external debugging request input signals.

The ARM Cortex M3 processors belong to the next-generation Cortex M family, which is based on ARM architecture version 7 (ARMv7). ARMv7 is divided into the following profiles:

A profile (ARMv7-A) - It refers to application processors which are designed for high-performance open application platforms such as high-end embedded operating systems (e.g. Symbian, Linux, Windows Embedded, etc.).

R profile (ARMv7-R) - It refers to processors designed for high-end embedded systems in which real-time performance is needed. It caters to applications such as high-end braking systems and hard drive controllers in which high processing power and high reliability are essential and for which low latency is important.

M profile (ARMv7-M) - It refers to processors designed for deeply embedded microcontroller-type systems targeting low-cost applications in which processing efficiency is important and cost, power consumption, low interrupt latency, and ease of use are critical, as well as industrial control applications, including real-time control systems.

Therefore, 'M' refers to the M profile, while '3' identifies the particular processor within the Cortex-M processor family. The basic building blocks of the ARM core are shown in figure 3.

Figure 3: Basic building blocks of the ARM core

As shown in figure 3, the ARM core has the following basic building blocks:

Instruction Register or Instruction Buffer - The instructions stored in memory travel along the data bus to the CPU where they are loaded into the instruction register. The instruction register is not part of the main memory. The process of loading the instruction register from memory is known as a 'fetch'.

Instruction Decoder and Control Unit - The instructions are in 'machine code' and the instruction decoder determines the function of each instruction. The instruction decoder and control unit determine what the other parts of the CPU do. The control unit is also in charge of the control bus. The process of interpreting each instruction is known as the 'decode' cycle.

Arithmetic Logical Unit (ALU) - The arithmetic and logic unit or ALU performs the mathematical functions as required. These may be arithmetic such as add, subtract or multiply or logical such as AND, OR, XOR etc. The process of performing each instruction is known as the 'execute' cycle.

Address Register - The address register is a 32-bit storage element which holds a memory address value. During the 'fetch' cycle, this address is the memory location of the next instruction. During the 'execute' cycle, it is the memory location either containing data to be loaded into a register or where data from a register is to be stored.

Register Bank - The register bank is a local memory for the CPU. Each register can hold 32 bits of data. The registers are named r0, r1, r2, r3, ... etc. up to r15. They are used to hold data which is processed by the ALU and also hold the results of any calculation. Registers r13, r14 and r15 and a few more registers have special functions.

The block diagram of the ARM Cortex M3 core is shown in figure 4.

Figure 4: Block diagram of ARM Cortex M3 core

As shown in figure 4, the ARM Cortex M3 core is a 32-bit microprocessor. It has a 32-bit arithmetic logical unit (ALU) including a hardware divider and a single-cycle 32-bit multiplier to support hardware multiply and divide, a 32-bit register bank (not shown in the figure), and 32-bit memory interfaces (the instruction interface and the data interface). The processor has a Harvard architecture, which means that it has a separate instruction bus and data bus. However, the instruction and data buses share the same memory space (a unified memory system). The separate buses allow instruction and data accesses to happen simultaneously, which increases the performance of the processor since data accesses do not affect the instruction pipeline.

The control logic unit includes an instruction fetch unit and a decoder unit, while the Thumb & Thumb-2 decode box decodes Thumb and new Thumb2 instructions. The NVIC interface interacts with an interrupt controller called the Nested Vectored Interrupt Controller (NVIC), which supports nested interrupt, vectored interrupt, dynamic priority changes, interrupt masking, and reduction of interrupt latency. On the other hand, the ETM interface interacts with an Embedded Trace Macrocell (ETM) to allow instruction trace. The Trace information is output via the Trace Port Interface Unit (TPIU), and the debug host (usually a Personal Computer [PC]) can then collect the executed instruction information via external trace capturing hardware.

Registers in the ARM Cortex M3 core are shown in figure 5.

Figure 5: Registers in ARM Cortex M3 processor

As shown in figure 5, the ARM Cortex-M3 processor has registers R0 through R15, discussed below:

R0-R12: General-Purpose Registers: R0-R12 are 32-bit general-purpose registers for data operations. The R0 through R7 general purpose registers are also called low registers. They can be accessed by all 16-bit Thumb instructions and all 32-bit Thumb-2 instructions. The R8 through R12 registers are also called high registers. They are accessible by all Thumb-2 instructions but not by all 16-bit Thumb instructions. The registers R0 through R12 are of 32 bits and the reset values for all of them are unpredictable.

R13: Stack Pointers: The stack is a dynamic data structure operating as a 'last in first out' queue. It contains variable amounts of data so, in order to know where to add or remove data, a record of the memory address of the 'top' of the stack must be kept. The stack pointer holds a memory address indicating the top of the stack. The ARM Cortex-M3 core contains two stack pointers (R13). They are banked so that only one is visible at a time. The lowest 2 bits of the stack pointers are always 0, which means they are always word aligned. The two stack pointers are as follows:

• Main Stack Pointer (MSP): The default stack pointer, used by the operating system (OS) kernel and exception handlers.

• Process Stack Pointer (PSP): Used by user application code.
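The last-in-first-out behaviour and the word alignment of the stack pointer can be sketched in Python. This is a minimal model rather than the processor's implementation: the class name `WordStack` and the start address are hypothetical, and the model assumes the full-descending convention of the Cortex-M3, in which the pointer is decremented by 4 before each store.

```python
class WordStack:
    """Toy model of a word-aligned, full-descending stack (hypothetical helper)."""

    def __init__(self, initial_sp):
        # The lowest two bits of the stack pointer always read as 0.
        self.sp = initial_sp & ~0x3
        self.memory = {}

    def push(self, value):
        # Decrement first, then store one 32-bit word at the new top.
        self.sp -= 4
        self.memory[self.sp] = value & 0xFFFFFFFF

    def pop(self):
        # Read the word at the top, then move the pointer back up.
        value = self.memory.pop(self.sp)
        self.sp += 4
        return value


stack = WordStack(0x20001003)   # unaligned request, hypothetical RAM address
stack.push(0x11)
stack.push(0x22)
print(hex(stack.sp), hex(stack.pop()), hex(stack.pop()))
```

The last value pushed is the first one popped, and the pointer only ever moves in steps of one word.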

R14: The Link Register: When a branch and link (BL) instruction is executed, the memory address of the first instruction after the BL is stored in the link register. The value in the link register is a 'return address'; it can be used to allow the program to continue execution of a 'main program' after completing execution of a subroutine 'called' by the branch and link instruction. In general, when a subroutine is called, the return address is stored in the link register.
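The call-and-return mechanism can be sketched as follows. The function names are hypothetical, and the sketch adds one detail not stated above: in Thumb state, bit 0 of the value written to the link register is set to 1, and it is cleared again when the address is used for the return.

```python
def branch_and_link(bl_address, target):
    """Model BL: return (new program counter, link register value)."""
    return_address = bl_address + 4   # BL is a 32-bit instruction
    link_register = return_address | 1  # bit 0 marks Thumb state
    return target, link_register


def return_from_subroutine(link_register):
    """Model BX LR: recover the address execution resumes at."""
    return link_register & ~1


pc, lr = branch_and_link(0x00000100, 0x00000200)  # hypothetical addresses
print(hex(pc), hex(lr), hex(return_from_subroutine(lr)))
```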

R15: The Program Counter: The program counter holds the current program address, that is, the address of the next instruction to be fetched from memory. Writing to this register changes the program flow. When most instructions are executed, the program counter is incremented by either 2 or 4 (the size of the instruction in bytes) so that the next instruction in memory is fetched. A few instructions reload the program counter with a new value, for example the branch (mnemonic B) and 'branch and link' (BL) instructions.

Special Registers: The Cortex-M3 processor also has a number of special registers. They are as follows:

• Program Status registers (PSRs)

• Interrupt Mask registers (PRIMASK, FAULTMASK, and BASEPRI)

• Control register (CONTROL)

These registers have special functions and can be accessed only by special instructions. The functions are shown in Table 1. They cannot be used for normal data processing.

General Operation of ARM Cortex M3 Processor or Pipelining

The Cortex-M3 processor has a three-stage pipeline. The pipeline stages are instruction fetch, instruction decode, and instruction execution, as shown in figure 6.

Figure 6: ARM Cortex M3 processor pipelining

As shown in figure 6, when programs with mostly 16-bit instructions are run, the processor might not fetch instructions in every cycle. This is because each fetch is 32 bits wide, so the processor can fetch up to two 16-bit instructions in one go. Once two instructions have been fetched in one cycle and the buffer is full, the bus interface has nothing to fetch in the next cycle and goes idle. In addition, some instructions take multiple cycles to execute, in which case the pipeline is stalled.
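The effect of 32-bit-wide fetches can be illustrated by counting how many 32-bit words a straight-line run of instructions touches. This is an idealised sketch: the function name `word_fetches` is hypothetical, and the model ignores branches, stalls and the exact behaviour of the instruction buffer.

```python
def word_fetches(start_address, sizes):
    """Count the distinct 32-bit words covered by a run of instructions.

    sizes lists each instruction's length in bytes (2 or 4).
    """
    words = set()
    address = start_address
    for size in sizes:
        # Visit every halfword of the instruction and note its word index.
        for offset in range(0, size, 2):
            words.add((address + offset) // 4)
        address += size
    return len(words)


print(word_fetches(0x1000, [2] * 8))  # eight 16-bit instructions
print(word_fetches(0x1000, [4] * 8))  # eight 32-bit instructions
```

With only 16-bit instructions, half as many word fetches are needed as there are instructions, which is why the bus interface can sit idle on alternate cycles.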

When a branch instruction is executed, the pipeline is flushed, so the processor has to fetch instructions from the branch destination to fill up the pipeline again. However, the Cortex-M3 processor supports a number of conditional execution instructions in the v7-M architecture, so some short-distance branches can be avoided by replacing them with conditionally executed code.

When the program counter is read during an instruction execution, the read value is the address of the instruction plus 4 so that the program is compatible with Thumb codes while following the pipeline nature of the ARM Cortex M3 processor. If the program counter is used for address generation for memory accesses, the word aligned value of the instruction address plus 4 would be used. This offset of 4 is constant, independent of the combination of 16-bit Thumb instructions and 32-bit Thumb-2 instructions. This ensures consistency between Thumb and Thumb-2 instruction set architectures.
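The two rules in this paragraph can be written as a short sketch. The function names are hypothetical; the arithmetic follows the description above.

```python
def pc_read_value(instruction_address):
    """Value seen when an instruction reads the program counter."""
    return (instruction_address + 4) & 0xFFFFFFFF


def pc_relative_base(instruction_address):
    """Base address used when the PC generates a memory address:
    the read value rounded down to a word boundary."""
    return pc_read_value(instruction_address) & 0xFFFFFFFC


for address in (0x1000, 0x1002):   # hypothetical instruction addresses
    print(hex(address), hex(pc_read_value(address)), hex(pc_relative_base(address)))
```

An instruction at a halfword address such as 0x1002 reads 0x1006 from the program counter, but a PC-relative memory access from the same instruction is based on 0x1004.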

Inside the instruction pre-fetch unit of the processor core, there is also an instruction buffer or instruction register shown in figure 7.

Figure 7: Instruction buffer inside the instruction fetch unit

As shown in figure 7, the instruction buffer allows additional instructions (C1) to be queued before they are needed and prevents the pipeline being stalled when the instruction sequence contains 32-bit Thumb-2 instructions that are not word aligned. Since the instruction buffer does not add an extra stage to the pipeline, there is no increase in the branch penalty.

Instruction Set Architecture

An instruction set or instruction set architecture (ISA) is a set of operation codes, which forms commands in machine language for a processor to execute certain operations. The type of ISA is characterized by:

Size or the number of bits used to define an operation code

Code density

The greater the size of the operation code, the more instructions can be defined. However, a large operation code decreases the code density, which is a measure of the size of a computer program in memory for a given function. Good code density produces smaller programs, leading to lower memory cost and less power dissipation in memory. There are a number of factors that affect code density:

Number of bits in each machine code instruction

Functionality of individual machine code instruction.

Performance of the compiler

Types of Instruction Set Architecture (ISA) for ARM processors

There are three types of ISAs for ARM processors:

ARM code ISA

The ARM code ISA is a base 32-bit ISA used in early processor architectures such as ARMv4T, ARMv5TEJ and ARMv6 architectures. It gives excellent performance with instructions executed on most clock cycles. This architecture is used in applications requiring high performance, or for handling hardware exceptions such as interrupts and processor start-up. The ARM 32-bit ISA is also supported in the Cortex-A and Cortex-R profiles of the Cortex architecture for performance critical applications, and for legacy code.

Thumb code ISA

When ARM7TDMI microprocessors based on ARMv4/ARMv4T architectures were developed, a 16-bit instruction set, called Thumb, was introduced. This Thumb code ISA or Thumb technology is an extension to the 32-bit ARM code ISA. The Thumb instruction set features a subset of the most commonly used 32-bit ARM instructions which have been compressed into 16-bit wide operation codes. On execution, these 16-bit instructions are decompressed transparently to full 32-bit ARM instructions in real time without performance loss. Therefore, although Thumb code uses approximately 40% more instructions for a given task than ARM code, it still has better code density: it occupies only about 70% of the memory space used by the equivalent ARM code and hence uses about 30% less external memory power.
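The 70% figure follows directly from the two numbers quoted above: 40% more instructions, each half the width. A minimal check, with a hypothetical function name:

```python
ARM_INSTRUCTION_BITS = 32
THUMB_INSTRUCTION_BITS = 16


def thumb_relative_size(extra_instruction_fraction):
    """Size of Thumb code relative to ARM code for the same task."""
    instruction_ratio = 1 + extra_instruction_fraction
    width_ratio = THUMB_INSTRUCTION_BITS / ARM_INSTRUCTION_BITS
    return instruction_ratio * width_ratio


print(round(thumb_relative_size(0.40), 2))
```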

Note that ARM code is 40% faster than Thumb code if instructions are fetched on a 32-bit bus, so in a system where performance is paramount, ARM code and a 32-bit memory system are used. However, in a 16-bit memory system, Thumb code is 45% faster than ARM code. Therefore, in a system where memory cost and power consumption are important, a 16-bit memory system and Thumb code would be a better choice. Having both the 16-bit Thumb and 32-bit ARM instruction sets gives designers the flexibility to emphasise performance or code size at subroutine level as their applications require.

Problem of Switching Overhead in Traditional ARM Processors

In the Thumb state the instructions are 16 bits wide, so the instruction code density is much higher. However, the Thumb state does not have all the functionality of ARM instructions and may require more instructions to complete certain types of operations. For this reason, many applications mix ARM and Thumb code. However, the mixed-code arrangement does not always work best.

Figure 8: Switching overhead between ARM Code and Thumb Code in traditional ARM Processors

As shown in figure 8, there is overhead (in terms of both execution time and instruction space) to switch between the states, and ARM and Thumb code might need to be compiled separately in different files. This increases the complexity of software development and reduces the maximum efficiency of the CPU core. The Thumb-2 ISA used in ARMv7 architectures can handle all processing requirements in one operation state, so there is no need to switch between the two. The ARM Cortex-M3 does not support ARM code and handles even interrupts in the Thumb state.

Thumb-2 code ISA

The Thumb-2 technology or Thumb-2 code ISA extends the Thumb code ISA into a highly efficient and powerful instruction set that delivers significant benefits in terms of ease of use, code size, and performance. The extended instruction set in Thumb-2 is a superset of the previous 16-bit Thumb instruction set, with additional 16-bit instructions alongside 32-bit instructions. It allows more complex operations to be carried out in the Thumb state, thus allowing higher efficiency by reducing the number of switches between ARM state and Thumb state.

Thumb-2 code ISA supports both 16-bit and 32-bit instructions, so there is no need to switch the processor between Thumb state (16-bit instructions) and ARM state (32-bit instructions). For example, in ARM7 or ARM9 family processors, a switch to ARM state might be required to carry out complex calculations or a large number of conditional operations with good performance, whereas in the ARM Cortex-M3 processor, 32-bit instructions can be mixed with 16-bit instructions without switching state, giving high code density and high performance with no extra complexity. Since there is no need to switch between states, the Cortex-M3 processor has a number of advantages over traditional ARM processors, such as:

There is NO state switching overhead, thereby saving both execution time and instruction space.

There is NO need to separate ARM code and Thumb code source files, thereby making software development and maintenance easier.

Since there is NO switching between ARM and Thumb codes to get the best density/performance, it is easier to get the best efficiency and performance, in turn making it easier to write software.

Questions and Answers for further understanding

Find the mnemonic(s) for an instruction or instructions from the Thumb2 instruction set (as implemented on the ARM Cortex M3 microprocessor) for each of the following actions. If more than one instruction can be used, identify which instruction would be preferred.

Move 125 (decimal) into register r5

Solution: 0 ≤ (d = 5) ≤ 7 ; 0 ≤ (imm = 125) ≤ 255

MOV r5, #125 ; Write value of 125 to r5

MOVS r5, #125 ; Write value of 125 to r5, flags get updated

MOV.N r5, #125 ; Write value of 125 to r5, force 16-bit operation

MOVS.N r5, #125 ; Write value of 125 to r5, flags get updated, force 16-bit operation

MOVS{.N} Rd, #imm8 ; 0 ≤ d ≤ 7, 0 ≤ imm8 ≤ 255

MOV{S}{.W} Rd, #ExpandedImm ; 0 ≤ d ≤ 15

MOVW Rd, #imm16 ; 0 ≤ d ≤ 15, 0 ≤ imm16 ≤ 65535

MOV{.N} Rd, Rm ; 0 ≤ d ≤ 15, 0 ≤ m ≤ 15

MOVS{.N} Rd, Rm ; 0 ≤ d ≤ 7, 0 ≤ m ≤ 7

MOV{S}{.W} Rd, Rm ; 0 ≤ d ≤ 15, 0 ≤ m ≤ 15
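The register and immediate ranges listed above suggest a simple way to pick a form for a move-immediate. The sketch below is a deliberate simplification: the function name is hypothetical, it only considers the imm8 and imm16 forms, and it ignores the expanded-immediate form as well as the restrictions on using SP and PC as the destination.

```python
def mov_immediate_form(rd, imm):
    """Pick a move-immediate form from the ranges listed above (simplified)."""
    if 0 <= rd <= 7 and 0 <= imm <= 255:
        return "MOVS.N"   # 16-bit form: low register, 8-bit immediate
    if 0 <= rd <= 15 and 0 <= imm <= 65535:
        return "MOVW"     # 32-bit form: any register, 16-bit immediate
    return None           # not covered by this simplified sketch


print(mov_immediate_form(5, 125))
print(mov_immediate_form(9, 125))
print(mov_immediate_form(5, 1000))
```

Under these assumptions, moving 125 into r5 fits the 16-bit form, which is normally preferred for code size when updating the flags is acceptable.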

Add 6 to the value in register r4 and put the sum in register r7

Solution: 0 ≤ (d = 7) ≤ 7 ; 0 ≤ (imm = 6) ≤ 7 ; low registers, 3-bit immediate value

ADD r7, r4, #6 ; Action [r7 = r4 + 6]

ADD.N r7, r4, #6 ; Action [r7 = r4 + 6], force 16-bit operation

ADD.W r7, r4, #6 ; Action [r7 = r4 + 6], force 32-bit operation

ADDS r7, r4, #6 ; Action [r7 = r4 + 6], flags get updated

ADDS.N r7, r4, #6 ; Action [r7 = r4 + 6], flags get updated, force 16-bit operation

ADDS.W r7, r4, #6 ; Action [r7 = r4 + 6], flags get updated, force 32-bit operation

ADDW r7, r4, #6 ; Action [r7 = r4 + 6], specify the 'wide' constant 0 to 4095

ADDW.N r7, r4, #6 ; not encodable: ADDW exists only as a 32-bit instruction

ADDW.W r7, r4, #6 ; Action [r7 = r4 + 6], specify the 'wide' constant 0 to 4095, force 32-bit operation

ADDS{.N} Rd, Rm, #imm3 ; 0 ≤ d, m ≤ 7, 0 ≤ imm3 ≤ 7

ADDS{.N} Rd, Rd, #imm8 ; 0 ≤ d ≤ 7, 0 ≤ imm8 ≤ 255

ADD{S}{.W} Rd, Rm, #ExpandedImm ; 0 ≤ d, m ≤ 15

ADDW Rd, Rm, #imm12 ; 0 ≤ d, m ≤ 15, 0 ≤ imm12 ≤ 4095

ADDS{.N} Rd, Rm, Rn ; 0 ≤ d, m, n ≤ 7, Rd = Rm + Rn

ADD{.N} Rd, Rd, Rm ; 0 ≤ d, m ≤ 15, Rd = Rd + Rm

ADD{S}{.W} Rd, Rm, Rn ; 0 ≤ d, m, n ≤ 15, Rd = Rm + Rn

Subtract 2912 (decimal) from the value in register r9 and put the difference in register r2

Solution:

RSB.W Rd, Rn, #immed ; Rd = #immed - Rn

RSB.W r2, r9, #0xB60 ; r2 = 0xB60 - r9

SUB r2, r9, #0xB60 ; Action [r2 = r9 - 2912]

SUB.N r2, r9, #0xB60 ; not encodable as 16-bit: r9 is a high register and 2912 exceeds the 8-bit immediate range

SUB.W r2, r9, #0xB60 ; Action [r2 = r9 - 2912], force 32-bit operation

SUBS r2, r9, #0xB60 ; Action [r2 = r9 - 2912], flags get updated

SUBS.N r2, r9, #0xB60 ; not encodable as 16-bit, for the same reasons as SUB.N

SUBS.W r2, r9, #0xB60 ; Action [r2 = r9 - 2912], flags get updated, force 32-bit operation

SUBW r2, r9, #0xB60 ; Action [r2 = r9 - 2912], specify the 'wide' constant 0 to 4095

SUBW.N r2, r9, #0xB60 ; not encodable: SUBW exists only as a 32-bit instruction

SUBW.W r2, r9, #0xB60 ; Action [r2 = r9 - 2912], specify the 'wide' constant 0 to 4095, force 32-bit operation

Subtract the value in register r8 from 2,868,947,712 (decimal) and put the difference in register r1

Solution:
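A useful first step for this question is to check whether the constant can be written as an immediate operand at all. The constant 2,868,947,712 equals 0xAB00AB00, and the sketch below tests whether a 32-bit value matches one of the patterns accepted for the 'ExpandedImm' operand of 32-bit data-processing instructions. The function name is hypothetical, and the pattern rules are written from the ARMv7-M modified-immediate scheme as understood here, so they should be checked against the architecture manual before being relied on.

```python
def is_modified_immediate(value):
    """Return True if value fits a Thumb-2 modified-immediate pattern (sketch)."""
    value &= 0xFFFFFFFF
    low_byte = value & 0xFF
    second_byte = (value >> 8) & 0xFF

    if value == low_byte:                                  # 0x000000XY
        return True
    if low_byte and value == low_byte * 0x00010001:        # 0x00XY00XY
        return True
    if second_byte and value == second_byte * 0x01000100:  # 0xXY00XY00
        return True
    if low_byte and value == low_byte * 0x01010101:        # 0xXYXYXYXY
        return True

    # An 8-bit value with its top bit set, rotated right by 8 to 31 places.
    for rotation in range(8, 32):
        unrotated = ((value << rotation) | (value >> (32 - rotation))) & 0xFFFFFFFF
        if 0x80 <= unrotated <= 0xFF:
            return True
    return False


print(hex(2868947712), is_modified_immediate(2868947712))
print(hex(2912), is_modified_immediate(2912))
```

If the constant is accepted, a single reverse-subtract with an immediate operand (the RSB form, Rd = #immed - Rn) is a candidate instruction; otherwise the constant would first have to be loaded into a register.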

Multiply the value in register r4 with the value in register r0 and put the product in register r4

Solution:

Find the machine code for the five instructions given above. Where more than one instruction is possible, only give the machine code for the instruction you have identified as the preferred instruction.

Solution:

Explain the difference between the divide instructions UDIV and SDIV. Illustrate your answer by performing the following calculations; X/Z and Y/Z, using both UDIV and SDIV where X is your student ID number (e.g. 200799999), Y=-X in 32 bit 2's complement format and Z is the 0x000000zz where zz is the ASCII code for the first letter of your family name.

Solution: The UDIV (unsigned divide) and SDIV (signed divide) instructions are hardware divide instructions, new to ARM processors with this generation. They are only available in the Thumb-2 instruction set. UDIV treats the values in Rn and Rm as unsigned 32-bit integers, whereas SDIV treats them as signed two's complement integers, so the same bit pattern can give different quotients. These instructions contribute to the ARM Cortex-M3 processor's efficiency of 1.25 DMIPS/MHz, where DMIPS is Dhrystone* million instructions per second.

*Dhrystone is the benchmark for system programming involving no floating point operations and indicates a system output in number of iterations of the main loop per second.

Syntax for SDIV and UDIV

SDIV{cond} {Rd,} Rn, Rm

UDIV{cond} {Rd,} Rn, Rm

where:

cond is an optional condition code.

Rd is the destination register and is optional. If Rd is omitted, the destination register is Rn.

Rn is the register holding the value to be divided.

Rm is a register holding the divisor.

For both the UDIV (unsigned division) and the SDIV (signed division) instructions, if the value in Rn is not divisible by the value in Rm, the result is rounded toward zero.
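The difference between the two instructions can be modelled in a few lines. This is a sketch of the arithmetic only: the function names are hypothetical, Z = 0x41 (the ASCII code of 'A') is an assumed example rather than a value from the question, and the divide-by-zero result of 0 assumes that divide-by-zero trapping is disabled.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(value):
    """Interpret a 32-bit pattern as a two's complement integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def udiv(rn, rm):
    """Unsigned divide: both operands are treated as 0 .. 2**32 - 1."""
    rn &= MASK32
    rm &= MASK32
    if rm == 0:
        return 0  # assumption: divide-by-zero trapping is disabled
    return rn // rm


def sdiv(rn, rm):
    """Signed divide, rounded toward zero; returns a 32-bit pattern."""
    dividend, divisor = to_signed32(rn), to_signed32(rm)
    if divisor == 0:
        return 0  # assumption: divide-by-zero trapping is disabled
    quotient = abs(dividend) // abs(divisor)
    if (dividend < 0) != (divisor < 0):
        quotient = -quotient
    return quotient & MASK32


X = 200799999        # example ID number quoted in the question
Y = (-X) & MASK32    # -X as a 32-bit two's complement pattern
Z = 0x41             # hypothetical: ASCII code of 'A'

print(udiv(X, Z), to_signed32(sdiv(X, Z)))
print(udiv(Y, Z), to_signed32(sdiv(Y, Z)))
```

For the positive value X both instructions agree. For Y the results differ: SDIV sees a negative dividend, while UDIV sees a large positive number slightly below 2^32.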

The following programme performs a mathematical operation on the values in r1 and r2 to give the value in r0. Find the mathematical function that describes the value in r0 and plot it as a graph of r0 against r1 for r1 in the range -20 to 20 (as a two's complement number) and for r2=0, 1 and 2.

Code segment returns value in r0 as a function of r1 and r2

ADD r3, r1, #2 ; comments omitted

MUL r4, r3, r3

CMP r3, #0

ITTE MI

SUBMI r5, r2, #1

ADDMI r5, r5, r5

RSBPL r5, r2, #1

MUL r6, r5, r4

ADD r0, r6, #6

Finish

Solution:
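One way to read the segment is to trace it instruction by instruction in a small model. The sketch below assumes 32-bit wrap-around arithmetic and that the MI condition is taken when r3 = r1 + 2 is negative; the function names are hypothetical. On that reading, r0 = 2(r2 - 1)(r1 + 2)^2 + 6 when r1 + 2 < 0, and r0 = (1 - r2)(r1 + 2)^2 + 6 otherwise.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(value):
    """Interpret a 32-bit pattern as a two's complement integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def segment_r0(r1, r2):
    """Trace the code segment and return r0 as a signed integer."""
    r3 = (r1 + 2) & MASK32          # ADD r3, r1, #2
    r4 = (r3 * r3) & MASK32         # MUL r4, r3, r3
    if to_signed32(r3) < 0:         # CMP r3, #0 ; ITTE MI
        r5 = (r2 - 1) & MASK32      # SUBMI r5, r2, #1
        r5 = (r5 + r5) & MASK32     # ADDMI r5, r5, r5
    else:
        r5 = (1 - r2) & MASK32      # RSBPL r5, r2, #1
    r6 = (r5 * r4) & MASK32         # MUL r6, r5, r4
    return to_signed32(r6 + 6)      # ADD r0, r6, #6


for r2 in (0, 1, 2):
    print(r2, [segment_r0(r1, r2) for r1 in range(-20, 21, 5)])
```

The values produced by `segment_r0` for r1 from -20 to 20 can be passed to any plotting tool to draw the three curves. For r2 = 1 the multiplier is zero on both branches, so the graph is the horizontal line r0 = 6.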

Write a simple programme using Thumb2 mnemonics that calculates the Fibonacci number sequence (use 32 bit unsigned integer format) and store these numbers at consecutive addresses starting from 0x00800000. The calculation should stop automatically when the Fibonacci number exceeds 32 bits. The Fibonacci sequence is 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, etc. and is calculated by adding the last two numbers in the sequence to find the next number. Comment your programme appropriately.

Solution:

c = a + b

a = b

b = c
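The three-step update above can be checked with a short model before it is translated into Thumb-2 mnemonics. This is a sketch of the algorithm, not the requested assembly programme: the function name is hypothetical, memory is modelled as a dictionary of word addresses, and the test for a sum above 0xFFFFFFFF stands in for checking the carry flag after a flag-setting add.

```python
def fibonacci_table(base_address=0x00800000):
    """Store 32-bit Fibonacci numbers at consecutive word addresses."""
    memory = {}
    a, b = 0, 1
    address = base_address

    # The first two numbers of the sequence are stored directly.
    memory[address] = a
    memory[address + 4] = b
    address += 8

    while True:
        c = a + b                 # c = a + b
        if c > 0xFFFFFFFF:        # result no longer fits in 32 bits: stop
            break
        memory[address] = c
        address += 4
        a, b = b, c               # a = b ; b = c
    return memory


table = fibonacci_table()
last_address = max(table)
print(len(table), hex(last_address), table[last_address])
```

In this model the loop stores 48 values, the last being 2971215073 at address 0x008000BC, because the next sum would exceed 32 bits.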

.W & .N [53]

With the new instructions in Thumb-2 technology, some of the operations can be handled by either a Thumb instruction or a Thumb-2 instruction. For example, R0 = R0 + 1 can be implemented as a 16-bit Thumb instruction or a 32-bit Thumb-2 instruction. With UAL, you can specify which instruction you want by adding suffixes:

ADDS R0, #1 ; Use 16-bit Thumb instruction by default for smaller size

ADDS.N R0, #1 ; Use 16-bit Thumb instruction (N=Narrow)

ADDS.W R0, #1 ; Use 32-bit Thumb-2 instruction (W=wide)

The .W (wide) suffix specifies a 32-bit instruction. If no suffix is given, the assembler tool can choose either instruction but usually defaults to 16-bit Thumb code to get a smaller size. Depending on tool support, you may also use the .N (narrow) suffix to specify a 16-bit Thumb instruction. Again, this syntax is for ARM assembler tools; other assemblers might have slightly different syntax.

In most cases, applications will be coded in C, and the C compilers will use 16-bit instructions if possible due to smaller code size. However, when the immediate data exceed a certain range or when the operation can be better handled with a 32-bit Thumb-2 instruction, the 32-bit instruction will be used. The 32-bit Thumb-2 instructions can be half word aligned. For example, you can have a 32-bit instruction located in a half word location.

0x1000 : LDR r0,[r1] ; a 16-bit instruction (occupies 0x1000-0x1001)

0x1002 : RBIT.W r0 ; a 32-bit Thumb-2 instruction (occupies 0x1002-0x1005)
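The address ranges in this example follow from laying instructions end to end with no padding. A minimal sketch, with a hypothetical function name and the instruction sizes supplied by hand:

```python
def layout(start_address, instructions):
    """Place instructions back to back; return (text, first byte, last byte)."""
    placed = []
    address = start_address
    for text, size in instructions:   # size in bytes: 2 or 4
        placed.append((text, address, address + size - 1))
        address += size
    return placed


program = [("LDR r0,[r1]", 2), ("RBIT.W r0", 4)]
for text, first, last in layout(0x1000, program):
    print(f"{first:#06x}-{last:#06x} : {text}")
```

Because the 16-bit instruction ends at 0x1001, the 32-bit instruction that follows starts at the halfword address 0x1002 rather than at the next word boundary.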

Most of the 16-bit instructions can only access registers R0-R7; 32-bit Thumb-2 instructions do not have this limitation. However, use of PC (R15) might not be allowed in some of the instructions. Refer to the ARM v7-M Architecture Application Level Reference Manual [Ref. 2] (section A4.6) if you need to find out more detail in this area.

W

ADDW Add wide (#immed_12) [56]

MOVW Move wide (write a 16-bit immediate value to register) [57]

SUBW Subtract wide (#immed_12) [57]

ADDW Rd, Rn,#immed ; Rd = Rn + #immed ADD register with 12-bit immediate value [65]