Optimizing Program Performance

How to make programs run faster:

  1. Select appropriate algorithms and data structures
  2. Write code with the compiler in mind

When writing code we must consider:

  • how it will be used
  • what critical factors affect it

We must balance:

  • readability
  • speed
  • modularity

Compilers have limitations

  • optimization blockers: aspects of a program whose behavior may depend on the execution environment, which severely limit a compiler's opportunities to optimize
  • programmers must assist the compiler by writing code that can be optimized

Code Optimization

Step One

  • Eliminate unnecessary work
  • Make the code perform its intended task as efficiently as possible.

This includes removing unnecessary

  • function calls
  • condition tests
  • memory references

To maximize speed, we need to know how instructions are processed by the CPU (including pipelining) and the timing characteristics of the different operations.

Step Two

  • Exploit the capabilities of processors to provide instruction-level parallelism.

A code profiler is a tool that measures the performance of different parts of a program, helping to focus optimization effort. In practice, code optimization is a series of code transformations combined with trial and error.

A good strategy:

  • start by looking carefully at the code for the inner loops
    • look for performance-reducing attributes such as excessive memory references and poor use of registers
  • rewrite the code until the compiler produces optimized machine code
    • a better approach than writing in assembly – more portable

GCC Optimization Options

GCC allows different levels of optimization (see Man page).

-O0  Reduce compilation time and make debugging produce the expected results. This is the default.

-O, -O1  Optimize. The compiler tries to reduce code size and execution time, without performing any optimizations that take a great deal of compilation time. Optimizing compilation takes somewhat more time, and a lot more memory for a large function.

-O2  Optimize even more. GCC performs nearly all supported optimizations that do not involve a space-speed tradeoff. As compared to -O, this option increases both compilation time and the performance of the generated code.

-O3 Optimize yet more. Turns on all optimizations specified by -O2, plus additional ones.

-Os Optimize for size. -Os enables all -O2 optimizations that do not typically increase code size. It also performs further optimizations designed to reduce code size.
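
For example, a typical invocation selecting level-2 optimization (file names hypothetical):

gcc -O2 prog.c -o prog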

There are limitations to compiler optimization. The book demonstrates some code that actually performs better when not optimized.

Optimization Blockers

Are these two functions equivalent?

void twiddle1(long *xp, long *yp) {
    *xp += *yp;
    *xp += *yp;
}

void twiddle2(long *xp, long *yp) {
    *xp += 2 * *yp;
}

  • twiddle1 performs 6 memory references: 2 reads each of *xp and *yp, plus 2 writes of *xp.
  • twiddle2 performs only 3: read *xp, read *yp, write *xp.

The compiler cannot optimize twiddle1 to behave as twiddle2 because the two functions are not equivalent. Consider the case where xp and yp point to the same address, that is, yp is an alias of xp.

  • twiddle1 increases the value at xp by a factor of 4 (x → 2x → 4x).
  • twiddle2 increases the value at xp by a factor of 3 (x → x + 2x = 3x).

Thus one optimization blocker arises when a compiler cannot determine whether two pointers may be aliased.
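
A quick driver makes the aliased case concrete (a minimal test harness, not from the text):

#include <stdio.h>

void twiddle1(long *xp, long *yp) { *xp += *yp; *xp += *yp; }
void twiddle2(long *xp, long *yp) { *xp += 2 * *yp; }

int main(void) {
    long a = 5, b = 5;
    twiddle1(&a, &a);            // aliased: 5 -> 10 -> 20
    twiddle2(&b, &b);            // aliased: 5 -> 15
    printf("%ld %ld\n", a, b);   // prints "20 15"
    return 0;
}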

Another example

x = 1000;
y = 3000;
*q = x;
*p = y;
t1 = *q;

What is the value of t1?  It depends on whether p and q are aliased: if they point to the same location, t1 is 3000; otherwise t1 is 1000.  Since the compiler cannot rule out aliasing, it must reload *q rather than reuse the value 1000.
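
The same pattern wrapped in a function (hypothetical driver, not from the text) shows both outcomes:

#include <stdio.h>

long assign(long *p, long *q) {
    long x = 1000;
    long y = 3000;
    *q = x;
    *p = y;
    return *q;                        // t1
}

int main(void) {
    long a, b;
    printf("%ld\n", assign(&a, &b));  // p, q distinct: prints 1000
    printf("%ld\n", assign(&a, &a));  // p aliases q:   prints 3000
    return 0;
}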

A Second Optimization Blocker

Consider the following functions.

long f();          // right now we don't know what f() does

long func1() {
    return f() + f() + f() + f();
}
long func2() {
    return 4 * f();
}

At first glance, they look equivalent.  But it really depends on f().  Suppose f() is defined as follows:

long counter = 0;

long f() {
    return counter++;
}

f() has a side effect: it changes global state.  With this definition, func1 returns 0 + 1 + 2 + 3 = 6, while func2 returns 4 * 0 = 0.

Most compilers do not attempt to determine whether a function is free of side effects, so they leave such function calls intact.
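
A minimal runnable check (the driver is assumed, not from the text):

#include <stdio.h>

long counter = 0;

long f(void) { return counter++; }

long func1(void) { return f() + f() + f() + f(); }
long func2(void) { return 4 * f(); }

int main(void) {
    printf("func1: %ld\n", func1());  // 0 + 1 + 2 + 3 = 6 (any call order)
    counter = 0;                      // reset the global state
    printf("func2: %ld\n", func2());  // 4 * 0 = 0
    return 0;
}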

Expressing Program Performance

CPE (cycles per element): expresses loop performance as clock cycles per element processed, rather than per loop iteration.

void psum1(float a[], float p[], long n) {
    long i;
    p[0] = a[0];
    for (i = 1; i < n; i++)
        p[i] = p[i-1] + a[i];
}

void psum2(float a[], float p[], long n) {
    long i;
    p[0] = a[0];
    for (i = 1; i < n-1; i+=2) {
        float mid = p[i-1] + a[i];
        p[i] = mid;
        p[i+1] = mid + a[i+1];
    }
    // For even n, finish the remaining element
    if (i < n)
        p[i] = p[i-1] + a[i];
}

  • psum1 runs in approximately 368 + 9.0n clock cycles (CPE = 9.0).
  • psum2 runs in approximately 368 + 6.0n clock cycles (CPE = 6.0), much quicker as n grows large.

Example in Text

Gives code for a vector data structure along with get_vec_element(…) and combine1(…) functions.

Shows that simply compiling with the -O1 flag roughly halves the run time.

Eliminating Loop Inefficiencies

A general class of optimizations is known as code motion.  It involves identifying a computation that is performed multiple times (for example, on every iteration of a loop) but whose result never changes, and moving that computation to a place in the code that is executed less often.

Example

/* Data Structure */

typedef struct {
    long len;
    data_t *data;
} vec_rec, *vec_ptr;

long vec_length(vec_ptr v) {
    return v->len;
}

/* Code utilizing data structure*/

vec_ptr v;
. . .
long i;
for (i = 0; i < vec_length(v); i++) {
    . . .
}

A change utilizing code motion would be to compute the length once and store it in a local variable.

vec_ptr v;
. . .
long i;
long len = vec_length(v);

for ( i = 0; i < len; i++) {
    . . .
}

Example

char* s;
. . .
long i;
for (i = 0; i < strlen(s); i++) {
    . . .
}

long i;
long len = strlen(s);
for (i = 0; i < len; i++) {
    . . .
}

For strings of length 500,000, the first method takes roughly 270 CPU seconds, compared to about 0.002 seconds for the second.  Calling strlen on every iteration makes the first loop quadratic in the string length, while the second is linear.

But when would the inefficiency be noticed?  Often only after the code has been deployed.

Reducing Procedure Calls

int get_vec_element(vec_ptr v, long index, data_t *dest){
    if (index < 0 || index >= v->len)
        return 0;
    *dest = v->data[index];
    return 1;
}

/**  Method in use **/
vec_ptr v;
. . .
long i;
long len = vec_length(v);
for (i = 0; i < len; i++) {
    data_t val;
    get_vec_element(v, i, &val);
    . . .
}

Since the loop index can never go out of bounds, the bounds checking inside get_vec_element() is redundant, so we can access the array directly.

vec_ptr v;
. . .
long i;
long len = vec_length(v);
data_t *data = v->data;

for (i = 0; i < len; i++) {
    data_t val = data[i];
    . . .
} 

Eliminating Unneeded Memory References

Code that repeatedly reads and writes memory can often be optimized by keeping the running value in a register.

data_t *dest;
...
long i;
*dest = 0;

for (i = 0; i < len; i++) {
    data_t val = data[i];
    *dest = *dest + val;
}

Since the value stored at dest is read back in the next iteration of the loop, there is no reason to dereference the pointer repeatedly.  Accumulating in a temporary variable (which the compiler can hold in a register) is faster.

data_t *dest;
...
long i;
data_t acc = 0;

for (i = 0; i < len; i++) {
    acc = acc + data[i];
}
*dest = acc;

Note that the compiler cannot make this transformation on its own: if dest happens to point into the data array, the two versions produce different results.  Always consider how aliasing can affect your code.
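
A small demonstration of the hazard (a hypothetical, self-contained example using plain arrays instead of the vector type):

#include <stdio.h>

int main(void) {
    long a[3] = {1, 2, 100};
    long b[3] = {1, 2, 100};
    long acc = 0;
    long i;

    // Accumulate through a pointer that aliases a[2]
    long *dest = &a[2];
    *dest = 0;                   // clobbers a[2] before it is read
    for (i = 0; i < 3; i++)
        *dest = *dest + a[i];
    printf("%ld\n", *dest);      // prints 6, not 103

    // Accumulate in a local variable: unaffected by the aliasing
    dest = &b[2];
    for (i = 0; i < 3; i++)
        acc = acc + b[i];
    *dest = acc;
    printf("%ld\n", *dest);      // prints 103
    return 0;
}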

Understanding Modern Processors

In an actual processor, the execution of a program may be far different from what the machine-level program suggests.  Some operations are evaluated simultaneously, a capability referred to as instruction-level parallelism.  In some processors there can be 100 or more instructions "in flight" at once.

Two lower bounds characterize the maximum performance of a program.

  • The latency bound is encountered when a sequence of operations must be performed in a particular order.
  • The throughput bound characterizes the raw computing capacity of the processor’s functional units.

Overall Operation

A superscalar CPU can

  1. Perform multiple operations on every clock cycle
  2. Perform operations out of order (compared to the machine-level program)

<<Block diagram of an out-of-order processor>>

Instruction Control Unit (ICU)

  • Reads a sequence of instructions from memory and generates a set of primitive operations to perform
  • The ICU fetches well ahead of the currently executing instructions, reading them from an instruction cache.
    • A technique called branch prediction attempts to guess whether or not a branch will be taken, and if taken, where it will go.
    • Speculative execution fetches, decodes, and executes instructions at the location where the processor believes the branch will go.
    • If the prediction proves incorrect, the processor discards the speculative results, resets the state, and begins execution at the correct location.
  • The instruction decoding logic converts program instructions into a set of primitive CPU operations (sometimes called micro-operations).  For example:
addq %rax, %rdx          // one op
addq %rax, 8(%rdx)       // one to load, one to add, one to store

Execution Unit (EU)

  • Dispatches primitive operations to functional units
  • Functional unit types
    • Load unit: Read from memory, has adder to compute memory addresses
    • Store unit: Write to memory, has adder to compute memory addresses
    • Arithmetic units: often handle several kinds of operations (e.g., integer and floating-point arithmetic) so that each unit stays busy
  • Partial list of the functionality of the eight units (0-7) in the Intel Core i7 Haswell:

  0. Integer arithmetic (addition, bitwise ops, shifting), floating-point multiplication, integer and floating-point division, branches (e.g., jne)
  1. Integer arithmetic, floating-point addition, integer multiplication, floating-point multiplication
  2. Load, address computation
  3. Load, address computation
  4. Store
  5. Integer arithmetic
  6. Integer arithmetic, branches (e.g., jne)
  7. Store address computation

Functional Unit Performance

                 Integer                     Floating Point
Operation        Latency  Issue  Capacity    Latency  Issue  Capacity
Addition         1        1      4           3        1      1
Multiplication   3        1      1           5        1      2
Division         3-30     3-30   1           3-15     3-15   1
  • Latency – total time required to perform the operation
  • Issue time – minimum number of clock cycles between two successive, independent operations of the same type
  • Capacity – number of functional units capable of performing the operation

An Abstract Model of Processor Operation

To analyze performance we use a data-flow model: a graphical notation showing how the data dependencies between the different operations constrain the order in which they are executed.  These constraints lead to critical paths in the graph, putting a lower bound on the number of clock cycles required to execute a set of machine instructions.

Computing the product or sum of n elements requires about L * n + K clock cycles, where L is the latency of a single product or sum operation and K is the overhead of calling the function and initiating and terminating the loop.

Why is the measured CPE equal to the latency L rather than L + K/n?  Because the overhead K is amortized across all n elements.
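
In symbols (a short derivation; T(n), introduced here, denotes the total cycle count):

    \mathrm{CPE} = \frac{T(n)}{n} = \frac{L \cdot n + K}{n} = L + \frac{K}{n} \longrightarrow L \quad \text{as } n \to \infty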

From Machine Level Code to Data Flow Diagram

Consider the following assembly code for the inner loop of a combining function (data_t = double, OP = *).

Note: acc (the running product) in %xmm0,  data+i (pointer to the current element) in %rdx,  data+length (end of array) in %rax

.L25:
    vmulsd   (%rdx), %xmm0, %xmm0    // Mult acc by data[i]
    addq     $8, %rdx                // Increment data + i
    cmpq     %rax, %rdx              // Check if at end of array
    jne      .L25                    // if != loop again

The 4 instructions are expanded into 5 operations: the vmulsd instruction is split into separate load and multiply operations.

Note:  %rdx is used as both input and output.  %rax is used only for input.

For a code segment forming a loop, we can classify the registers that are accessed into 4 categories.

  • Read-only:  Used as source values, either as data or to compute memory addresses.  They are not modified. (%rax)
  • Write-only:  Used only for the destination of operations. (none here)
  • Local: Values that are updated and used within the loop, with no dependency from one iteration to another (for example, the condition code registers).
  • Loop: These are used as both source and destination (%rdx and %xmm0)

As we will see, the chains of operations between loop registers determine the performance limitations.

Now we (a) rearrange the data-flow diagram below to make the dependencies clearer, and (b) remove the operations that do not affect the flow of data between loop registers.

<< images a and b >>

We see in (b) that two registers carry dependencies from one iteration of the loop to the next (%xmm0 and %rdx).

Chaining multiple iterations we get the following diagram.

<< insert diagram >>

This loop has two chains of data dependencies: one updating %xmm0 by multiplication and one updating %rdx by addition.  Since floating-point multiplication has a latency of 5 cycles and integer addition has a latency of 1 cycle, the chain on the left forms a critical path requiring 5n cycles to execute.

Loop Unrolling

Loop unrolling is a program transformation that reduces the number of iterations for a loop by increasing the number of elements computed on each iteration.

Loop unrolling can improve performance because it

  • reduces the number of operations that do not directly contribute to the result, such as loop indexing and conditional branching
  • exposes further transformations that reduce the number of operations on the critical paths

/* 2 x 1 loop unrolling */
void combine(vec_ptr v, data_t *dest) {
    long i;
    long length = vec_length(v);
    long limit = length - 1;
    data_t *data = get_vec_start(v);
    data_t acc = IDENT;

    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i+=2) {
        acc = (acc OP data[i]) OP data[i+1];
    }

    /* Finish any remaining elements */
    for (; i < length; i++) {
        acc = acc OP data[i];
    }
    *dest = acc;
}

We can generalize this idea to unroll a loop by any factor of k, yielding k x 1 loop unrolling.
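
For example, the inner loops of a 3 x 1 version look like this (a sketch reusing the variables of combine above, with limit = length - 2):

/* 3 x 1 loop unrolling: combine three elements per iteration */
for (i = 0; i < limit; i += 3) {
    acc = ((acc OP data[i]) OP data[i+1]) OP data[i+2];
}
/* Finish any remaining elements */
for (; i < length; i++) {
    acc = acc OP data[i];
}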

The assembly for the 2 x 1 inner loop.  Note: i in %rdx,  data in %rax,  limit in %rbx,  acc in %xmm0

.L35:
    vmulsd   (%rax, %rdx, 8),  %xmm0, %xmm0         // mult acc by data[i]
    vmulsd   8(%rax, %rdx, 8), %xmm0, %xmm0         // mult acc by data[i+1]
    addq     $2, %rdx                               // increment i by 2
    cmpq     %rdx, %rbx                             // compare to limit
    jg       .L35                                   // if >, goto loop

Method           Integer (+)  Integer (*)
No unrolling     1.27         3.01
2 x 1 unrolling  1.01         3.01
3 x 1 unrolling  1.01         3.01
Latency for OP   1.0          3.0

Loop unrolling can easily be performed by a compiler; GCC performs some loop unrolling at optimization level -O3.

Enhancing Parallelism

Remember that we have multiple functional units that can do addition and multiplication, and that these units can work in parallel.  Also, these units are fully pipelined, meaning each can start a new operation every clock cycle even when its latency is greater than 1 cycle.

We can achieve performance gains beyond the latency bound by rewriting our function to use this parallelism.

Multiple Accumulators

For a combining operation that is associative and commutative, such as integer addition or multiplication, we can improve performance by partitioning the set of combining operations into two or more subsets, processing them independently, and combining the results at the end.

Suppose Pn = a1 * a2 * … * an.

Then Pn = POn * PEn, where POn is the product of the elements at odd indices and PEn is the product of the elements at even indices: multiply the odds and the evens separately, then multiply the two results.

The following version of the combine method uses two-way loop unrolling as well as two-way parallelism.  It is therefore referred to as 2 x 2 loop unrolling.

/* 2 x 2 loop unrolling */
void combine6(vec_ptr v, data_t *dest) {
    long i;
    long length = vec_length(v);
    long limit = length - 1;
    data_t *data = get_vec_start(v);
    data_t acc0 = IDENT;
    data_t acc1 = IDENT;

    /* Combine two elements at a time, in two different accumulators */
    for (i = 0; i < limit; i+=2) {
        acc0 = acc0 OP data[i];
        acc1 = acc1 OP data[i+1];
    }

    /* Finish any remaining */
    for (; i < length; i++) {
        acc0 = acc0 OP data[i];
    }

    *dest = acc0 OP acc1;
}

Method                  Integer (+)  Integer (*)  FP (+)  FP (*)
No unrolling            1.27         3.01         3.01    5.01
2 x 1 unrolling         1.01         3.01         3.01    5.01
2 x 2 unrolling         0.81         1.51         1.51    2.51
Issue time              1.00         1.00         1.00    1.00
Capacity                4            1            1       2
Throughput bound (CPE)  0.5          1.00         1.00    0.5

The 2 x 2 unrolling gains performance because the issue time is 1.00: a new operation can be started every cycle.  Increasing the number of accumulators in the loop improves performance up to the operation's throughput bound.  In general, a program can reach the throughput bound only when it keeps all of the usable functional units busy.

For an operation with latency L and capacity C, an unrolling factor k >= C * L achieves maximum performance (the throughput bound).

            Integer (+)  Integer (*)  FP (+)  FP (*)
Latency     1            3            3       5
Capacity    4            1            1       2
k >= C * L  4            3            3       10

Floating-point addition and multiplication are not associative.  Thus a solution that uses 2 x 2 unrolling may, in rare cases, produce a different result from one that uses 2 x 1 unrolling.

A difference may arise if, for example, all the numbers at even indices are very large, so that their partial sum or product overflows, while all the numbers at odd indices are close to zero, so that their partial sum or product underflows.

Since such cases are rare, check your data before applying this method.

Reassociation Transformation

acc = (acc OP data[i]) OP data[i+1];    // 2 x 1 loop unrolling
// vs.
acc = acc OP (data[i] OP data[i+1]);    // 2 x 1a loop unrolling (reassociated)

In the reassociated version, only the second OP of each iteration (the one combining the pairwise result into acc) lies on the data-dependency chain between loop registers.  Thus the critical path contains half as many operations.

Method            Integer (+)  Integer (*)  FP (+)  FP (*)
2 x 1 unrolling   1.01         3.01         3.01    5.01
2 x 1a unrolling  1.01         1.51         1.51    2.51

k x 1a unrolling gives results similar to k x k unrolling.

In general, the authors found that unrolling a loop and accumulating multiple values in parallel is the more reliable way to achieve improved performance.