Programmers generally write algebraic expressions in their simplest form, but not always; simplification deals with this. A set of algorithms applies arithmetic identities, attempting to find the equivalent expression that needs the fewest instructions. This is not simple, and doing it in general is not possible. For example, would you know that:
sqrt(exp(alog(x)/y)) = exp(alog(x)/(2*y))
Your compiler probably will, but only because this calculation occurs in the Whetstone benchmark, so compiler designers have tuned for it.
Constant folding
Constant folding finds out at compile time that an expression is a constant. For example, if you write 2 * PI * x / 4 it will be reduced to 1.570796327 * x. This is very useful, especially because every compiler has performed it for many years, so programmers know it happens and rely on it, making programs clearer.
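To give a feel for how this is done, here is a minimal, hypothetical constant folder written in C as a bottom-up pass over an expression tree; the node layout and the names are inventions for illustration, not any particular compiler's.
#include <stdio.h>

/* A minimal expression tree: a constant, a variable, or a binary operator. */
typedef enum { CONST, VAR, ADD, MUL, DIV } Kind;

typedef struct Expr {
    Kind kind;
    double value;           /* valid when kind == CONST */
    char name;              /* valid when kind == VAR   */
    struct Expr *l, *r;     /* valid for ADD, MUL, DIV  */
} Expr;

/* Fold constants bottom-up: if both operands of an operator are
   constants, replace the operator node with the computed constant. */
static void fold(Expr *e)
{
    if (e->kind == CONST || e->kind == VAR)
        return;
    fold(e->l);
    fold(e->r);
    if (e->l->kind == CONST && e->r->kind == CONST) {
        double a = e->l->value, b = e->r->value;
        e->value = (e->kind == ADD) ? a + b
                 : (e->kind == MUL) ? a * b
                 :                    a / b;
        e->kind = CONST;
    }
}

int main(void)
{
    /* (2 * PI) * x -- the constant subtree 2 * PI folds to 6.283185... */
    Expr two = { CONST, 2.0 }, pi = { CONST, 3.14159265358979 };
    Expr x = { VAR, 0, 'x' };
    Expr cmul = { MUL, 0, 0, &two, &pi };
    Expr root = { MUL, 0, 0, &cmul, &x };

    fold(&root);
    printf("left operand folded to %f\n", root.l->value); /* 6.283185 */
    return 0;
}
A real compiler runs this kind of pass over its intermediate representation, together with the algebraic simplifications described above.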
Common sub-expression elimination (CSE)
Programmers often repeat equations or parts of equations, e.g.
x = sqrt(M) * cos(θ);
y = sqrt(M) * sin(θ);
In this case the common subexpression is the "sqrt(M)" part. The CSE stage will store the result of the first sqrt(M) and reuse it instead of recalculating it. Unlike the other two optimizations I have mentioned, there is a problem with CSE: it can make a program slower in certain circumstances. When the duplicated code is eliminated its result must be stored so it can be reused, and in some cases (if it is stored in memory) storing it and loading it back can take more time than simply redoing the operation. Problems like this must be accounted for; in general most optimizations run the risk of making things worse. Common subexpression elimination is very important in array address calculations, since parts of those calculations can be reused. You may have heard of GCSE; global CSE is elimination across basic blocks.
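In source terms the transformation just described amounts to introducing a temporary; a sketch in C (the function and variable names are made up):
#include <math.h>

void polar_to_cartesian(double M, double theta, double *x, double *y)
{
    /* Before CSE the two statements each computed sqrt(M):
     *     *x = sqrt(M) * cos(theta);
     *     *y = sqrt(M) * sin(theta);
     * After CSE the shared value is computed once and reused,
     * ideally staying in a register.
     */
    double t = sqrt(M);
    *x = t * cos(theta);
    *y = t * sin(theta);
}
Note that the compiler can only merge the two calls itself if it knows sqrt has no side effects.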
At this point it is worth saying a little about the word `global', which unfortunately has a confused meaning. When people speak about `global optimizations' they usually mean optimizations over whole programs, which is similar to the idea of global variables. `Global CSE', though, merely means `over many basic blocks', as does global register allocation.
Data flow
Most optimizations deal with data flow, so code is brought into a form that makes the flow explicit: a directed acyclic graph (dag). Here is the dag of some expressions:-
d = a + b
f = c - d
f = e / b
g = b + a
d = f + e
a + b is common to d = a + b and g = b + a, so instead of creating a new node for it the existing node is reused; this is how common subexpressions are found. Here the code building the dag knows that + is commutative; it may also apply other algebraic rules. The f in f = c - d is overwritten by a later assignment and is therefore dead code, so dead code can be found by looking for nodes with no live variables attached.
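A minimal sketch of building such a dag with node reuse, assuming a made-up node table and treating only + as commutative; real compilers hash the (operator, operand, operand) triples rather than scanning a list:
#include <stdio.h>

/* One dag node: a leaf (variable) or an operator with two operands. */
typedef struct {
    char op;            /* '\0' for a leaf, else '+', '-', '/', ... */
    int  left, right;   /* indices of operand nodes                 */
    char leaf;          /* variable name for leaves                 */
} Node;

static Node nodes[64];
static int  nnodes;

static int leaf(char name)
{
    for (int i = 0; i < nnodes; i++)
        if (nodes[i].op == '\0' && nodes[i].leaf == name)
            return i;                        /* reuse the existing leaf */
    nodes[nnodes] = (Node){ '\0', 0, 0, name };
    return nnodes++;
}

static int binop(char op, int l, int r)
{
    /* + is commutative, so normalise the operand order first. */
    if (op == '+' && l > r) { int t = l; l = r; r = t; }
    for (int i = 0; i < nnodes; i++)
        if (nodes[i].op == op && nodes[i].left == l && nodes[i].right == r)
            return i;                        /* common subexpression found */
    nodes[nnodes] = (Node){ op, l, r, 0 };
    return nnodes++;
}

int main(void)
{
    int d = binop('+', leaf('a'), leaf('b'));   /* d = a + b */
    int g = binop('+', leaf('b'), leaf('a'));   /* g = b + a */
    printf("%s\n", d == g ? "a + b reused" : "new node built");
    return 0;
}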
Copy propagation
Copy propagation deals with copies to temporary variables, such as a = b. Compilers generate lots of such copies themselves in intermediate form. Copy propagation is the process of removing them and replacing later uses with references to the original value. It often reveals dead code.
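A hypothetical before/after pair in C, with t standing in for the kind of copy the compiler introduces itself:
int before(int b, int c, int *d)
{
    int t = b;          /* compiler-introduced copy            */
    int a = t + c;      /* every later use of t ...            */
    *d = t * 2;
    return a;
}

int after(int b, int c, int *d)
{
    int a = b + c;      /* ... is replaced by a reference to b */
    *d = b * 2;
    return a;           /* the copy t = b is now dead code     */
}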
Inlining
Inlining is a conceptually simple optimization: when a call to a short subroutine is found, it is replaced by the contents of the subroutine. It is nonetheless complex to perform, for several reasons. Firstly, the same function may be inlined into several different subroutines, and different variables are assigned to registers in each of them. This means the only way to inline efficiently is to expand the subroutine in the context where it appears, each time it appears (using different registers, for example). The second, related problem is one of a group of "phase ordering problems": doing one optimization before another sometimes has the combined effect of producing inferior code. Specifically, code should be inlined early in the compilation process so that later passes can optimize the inlined version, yet whether inlining is a good idea depends on the final length of the inlined code, which is not known until close to the end of the process. There are several partial solutions to this problem:
Advance compilation through the other passes and find out how much code is generated.
Estimate from available information how much code will be generated.
Use several inlining passes at different stages of compilation.
Using these methods reasonable inlining can be performed most of the time, at the cost of quite a lot of complexity. Note also that in C, where all functions are global and may be declared in another file, a non-inlined version is always needed in case this happens. It may subsequently be removed by the linker if no such declarations occur.
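A sketch of the transformation at source level; square and hypot2 are invented examples, and a real compiler performs this on its intermediate form rather than on the C text:
static double square(double v) { return v * v; }

double hypot2(double x, double y)
{
    return square(x) + square(y);   /* two calls to a short routine */
}

double hypot2_inlined(double x, double y)
{
    return (x * x) + (y * y);       /* the bodies expanded in place; each
                                       copy can use whatever registers are
                                       free at that point */
}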
The form of generated code
The forms of compiler generated code vary, but most are similar. Specifically they:
Have a small number of registers for commonly used values.
Perform arithmetic operations (i.e. +,-,*,/) on integers and floating point numbers.
Apply Boolean operations, such as `or' or `and' to registers.
Load and store data to addresses in memory.
Jump to a piece of code at some address in memory.
Make decisions, normally by jumping or not jumping on the basis of the value of a register or flag.
As far as I am aware all modern compilers assume roughly this model of a processor throughout most of their code. Extra instructions the architecture supports are either ignored, used only in libraries, or produced by recognizing groups of simpler instructions just before assembly language is generated.
In order to store global variables compilers simply use static memory space. To store the return addresses of subroutines modern compilers use a stack. A stack is like one of those spring-loaded plate-warmers; if you're reading this you probably know what I mean. Local variables, parameters to be passed, temporaries that won't fit in registers and return values are also stored on the stack. At the beginning of a subroutine, code is inserted to allocate space for them, the `activation record'. At least the parameter-passing part of the activation record must be standard; if it weren't, object code and libraries compiled with different compilers wouldn't be compatible.

Compilers use the registers to store values, and across calls these registers must be saved somehow, so they too are put on the stack. There are generally two methods: either the function being called saves them by storing them on its stack and restoring them before returning (`callee save'), or the function doing the calling saves them (`caller save'). There are trade-offs with both. Doing too much work in callees makes the lowest-level functions inefficient, but there are a lot of callers too. The answer is a compromise: a group of registers is set aside as caller saved, and if a subroutine is short enough to get away with using only those it need do no saving itself. The rest are callee saved; this normally includes infrequently used registers such as floating point and vector registers. The code that does the saving and restoring at the beginning and end of a subroutine is called the prologue and epilogue respectively. All of this must be specified to allow code to inter-operate; such a specification is an ABI - an Application Binary Interface.
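As a purely conceptual sketch, an activation record can be pictured as a struct pushed onto the stack for every call; the fields, their order and which registers get saved are dictated by the ABI and differ between machines, so this is illustration only:
/* Illustration only: the real layout is fixed by the ABI, not chosen here. */
struct activation_record {
    long   incoming_args[4];    /* parameters that didn't fit in registers  */
    void  *return_address;      /* where to resume in the caller            */
    long   saved_registers[8];  /* callee-saved registers this routine uses */
    long   locals_and_temps[4]; /* local variables and spilled temporaries  */
};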
Whole program optimization can improve this further. The `linker' is normally just a means of joining object files together to form an executable; whole program optimization involves burying a whole or partial compiler in it. This allows all the calls internal to a program to be optimized in any way the compiler sees fit. To make this possible the object files are sometimes not really object files at all, but a binary representation of the compiler's internal representation.
Tail call optimizations
As I said earlier subroutines are dealt with by pushing the return address onto the stack and popping it off at the end of the subroutine. This creates an interesting possibility. Consider the code:-
sub fred {
jim;
sheila;
}
`jim' and `sheila' are calls to other subroutines, and `sheila' is the last piece of code before the end of the subroutine, a `tail call'. Normally the address of the instruction after `sheila' (the first instruction of the epilogue) would be placed on the stack and then sheila would be called. As sheila finishes it pops that return address off the stack and uses it, returning to the end of `fred', where the address fred itself was given is then popped off the stack. A trick can be done here: sheila can be jumped to without pushing anything onto the stack. If this is done, returning by popping the stack still works correctly, because at the end of sheila the return address fred was given is at the top of the stack, so sheila returns straight to fred's caller.
Okay, what's the point? Calls aren't that expensive. A `self-recursive call' is a subroutine calling itself. A `self tail call' is a recursive call that is also the last call in the routine. A special optimization can be applied in this case (which isn't that rare): the call can be replaced by a straight jump. Because the call is executed many times this is a profitable trick, and it is an interesting optimization because its effects ripple far beyond the reduced call overhead. If any arguments are passed then, once the call is optimized, the activation record isn't rebuilt each time. Imagine a recursive routine that is called a billion times and has one integer in its activation record: it will use over 4GB of memory, whereas after the optimization has been done it will use 4 bytes. So the routine will crash or seg-fault without the optimization and work with it. As you may imagine, portable code can't rely on this. Lisps have had this optimization for decades and Lisp programmers often rely on it, a very useful state of affairs.
e.g.
void foo(void)
{
    if (l == 'X') {     /* l is some condition tested on each call */
        bar();
        foo();          /* self tail call */
    }
}
becomes
void foo(void)
{
L:  if (l == 'X') {
        bar();
        goto L;         /* the recursive tail call becomes a jump */
    }
}
Loop optimizations
Very commonly, most of the work of a program is done in one or two small loops that are iterated a huge number of times. The following high-level optimizations target loops specifically because of the large amount of work loops do. In mathematical code the situation described above, where most of the work is done in a loop, is almost universal; for this reason loop optimizations are the most important ones. Compilers for supercomputers and number-crunching clusters spend most of their time, and have most of their complexity, in loop optimizations. Loops often access arrays, and many loop optimizations rely on this fact. Programmers know the importance of loops and pay attention mainly to loop code when evaluating algorithms.
Loop invariant hoisting
This optimization is often referred to as "strength reduction", although properly strength reduction is a more general term. Anyway, the operations in the body of a loop are done many more times than those outside it, so it is better to do as much work as possible outside the loop. Loop invariant hoisting moves loop-invariant calculations out of the loop. Advanced implementations work out whether it is better (or possible) to move the invariant part to before the loop or after it.
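For example (a sketch with made-up names):
#include <math.h>

void scale(double *a, int n, double m)
{
    /* Before hoisting, sqrt(m) is loop invariant but computed n times:
     *     for (int i = 0; i < n; i++) a[i] = a[i] * sqrt(m);
     * After hoisting it is computed once, outside the loop:
     */
    double s = sqrt(m);
    for (int i = 0; i < n; i++)
        a[i] = a[i] * s;
}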
Loop unrolling
The motivation for this optimization is the same as for invariant hoisting: code in the middle of a loop is executed many times, so more resources should be spent optimizing it. Unrolling means copying out the body of the loop several times and changing the loop index correspondingly. Doing this once halves the time spent executing the branch instruction at the end of the loop and updating the index. This may appear a trivial gain, but if the loop body is small it can be very significant indeed. For instance, the following assembly language adds the value in floating point register F2 to every element of an array:
Loop:
LD F0, 0(R1) ; load the array element at address R1 into F0
ADDD F4, F0, F2 ; perform the add
SD 0(R1), F4 ; store the result back where it came from
SUBI R1, R1, #8 ; R1 = R1 - 8; 8 bytes = 1 double precision floating point number
BNEZ R1, Loop ; branch back to Loop if R1 is not zero
This example is drawn from [1]. Using the very crude assumption that the processor always executes one instruction per clock cycle, two cycles per loop iteration in the code above are spent performing the subtract and branch instructions. Unrolling the body of the loop once gives:
Loop:
LD F0, 0(R1)
ADDD F4, F0, F2
SD 0(R1), F4
LD F6, -8(R1)
ADDD F8, F6, F2
SD -8(R1), F8
SUBI R1, R1, #16
BNEZ R1, Loop
Under the crude assumption given before (one instruction per cycle) this loop will execute in 80% of the time of the original version. This may be taken further; the loop could be unrolled three or more times. Each unrolling reduces the per-element cost of the SUBI and BNEZ instructions, the "loop overhead". There is of course a limit past which further unrolling gains little benefit, but it varies from machine to machine and from loop to loop.
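The same transformation at source level might look like the following sketch; note the clean-up loop it needs when the element count isn't a multiple of the unrolling factor, a detail the assembly above quietly assumes away:
void add_scalar(double *a, int n, double k)
{
    int i = 0;
    /* Main loop unrolled by two: half as many index updates and branches. */
    for (; i + 1 < n; i += 2) {
        a[i]     = a[i]     + k;
        a[i + 1] = a[i + 1] + k;
    }
    /* Clean-up iteration for an odd element count. */
    for (; i < n; i++)
        a[i] = a[i] + k;
}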
Loop fusion and Loop fission
Programmers don't always split work across loops in an optimal way. For instance, sometimes two loops iterate across the same range of the same data; in this case the loops can be fused into one, removing some loop overhead and improving cache behavior.
/* Before */
for (i = 0; i < M; i = i + 1) a[i] = b[i] / c[i];
for (i = 0; i < M; i = i + 1) d[i] = a[i] + c[i];
/* After */
for (i = 0; i < M; i = i + 1) {
a[i] = b[i] / c[i];
d[i] = a[i] + c[i];
}
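Fission is the opposite transformation: splitting one loop into two. A sketch, assuming (as here) that the two statements are independent of one another; this can pay off when they compete for the cache or when only one of them can be vectorized:
void fission(double *a, const double *b, const double *c,
             double *e, const double *f, const double *g, int M)
{
    int i;
    /* Before: one loop doing two unrelated jobs.
     *     for (i = 0; i < M; i = i + 1) { a[i] = b[i] / c[i]; e[i] = f[i] * g[i]; }
     * After fission:
     */
    for (i = 0; i < M; i = i + 1) a[i] = b[i] / c[i];
    for (i = 0; i < M; i = i + 1) e[i] = f[i] * g[i];
}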
Induction variable elimination
Sometimes the loop variable isn't needed, for instance in the following loop
for(i = 0; i < M; i = i + 1) a[i] = 0;
i isn't actually needed; the loop can be implemented as:
Loop:
SW 0(R1), R0 ; store zero (R0 always holds zero)
SUBI R1, R1, #4 ; R1 = R1 - 4; 4 bytes = 1 integer
BNEZ R1, Loop ; branch back to Loop if R1 is not zero
Notice that this loop also runs backwards, which brings me to:
Iteration order reversal
Doing a loop in the order the program indicates isn't always a good idea. In particular, iterating from a low number to a high one requires a comparison at the end of each iteration of the loop. If the loop instead counts down and ends on zero, things are easier: no separate comparison is needed, since most machines can branch directly on whether a register is zero.
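At source level the last two ideas combine into something like this sketch, which is only valid because the iterations are independent of each other and of the direction they run in:
void clear(int *a, int M)
{
    /* Forward, with an index compared against M every iteration:
     *     for (int i = 0; i < M; i = i + 1) a[i] = 0;
     * Backwards, counting down to zero so the loop test is just
     * "branch if not zero":
     */
    for (int i = M; i != 0; i = i - 1)
        a[i - 1] = 0;
}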
Algorithm | Time Complexity | Search Type | Approach | Key Ideas
Aho-Corasick | Linear | Prefix | Automaton-based | Finite automaton that tracks the partial prefix match.
Rabin-Karp | Linear | Prefix | Hashing-based | Compares the text and the patterns via their hash values.
Bit-Parallel (Shift-OR) | Linear | Prefix | Bit-parallelism based | Bit-parallelism and q-grams for prefix matching.
Commentz-Walter | Linear | Prefix | Automaton + heuristic based | Shifts by an amount determined by the longest proper prefix of the pattern.
Wu-Manber | Sub-linear | Suffix | Heuristics based | Determines the shift distance from a block of characters in the suffix of the search window.
Algorithm Name | Author | Comparison Order | Preprocessing | Searching Time Complexity
Boyer-Moore | R.S. Boyer and J.S. Moore | From right to left | Yes | O(mn)
Horspool | Nigel Horspool | Not relevant | Yes | O(mn)
Brute Force | - | Not relevant | No | O(mn)
Knuth-Morris-Pratt | D.E. Knuth, J.H. Morris and V.R. Pratt | From left to right | Yes | O(n+m), independent of the alphabet size
Quick Search | Sunday | Not relevant | Yes | O(mn)
Karp-Rabin | Michael O. Rabin and Richard M. Karp | From left to right | Yes | O(mn)
Zhu-Takaoka | R.F. Zhu and T. Takaoka | From left to right | Yes | O(mn)
Index Based (IFBMPM) Model | - | From left to right | Yes | O(mn)