In this article I describe an optimization technique I've used to squeeze an extra 10% or so out of C/C++ code.


A common idiom in many programming languages is the following, in which you have an array of some type (possibly just an array of integers) and an index into that array:

T   array[100];
int index;
  
void foo() {
  // do something with array[index] and increment index by 1
}

Often the code above will be encapsulated in a class, and the pattern may not be immediately obvious. I've run into this idiom in several problems lately at home and at work, so I thought I'd explain how to optimize it to get better performance.

You see, there's a hidden penalty in this code, which is insignificant unless foo() is invoked frequently (say, in a loop). The penalty is that the compiler has to multiply the index by the size of T, in order to find the actual byte offset into the array. That is, when you write array[index], the compiler really has to compute ((byte*)array)[sizeof(T) * index]. For most types T — pointers, integers, structures, ... — this requires a multiplication or bit shift.

When this array indexing occurs in a loop, we talk about the loop stride. The stride is the amount by which the index is increased each time through the loop. Consider, for example:

for (int i = 0; i < 100; i++) {
  // do something with array[i];
}

In this example, we're incrementing the loop variable i by 1 each time through the loop, so the stride is 1. Except that to the compiler, the stride is actually 1*sizeof(T) — that's the distance we're actually advancing through the array. In this example, the index variable is local to the function, so the compiler can automatically pre-multiply it by the sizeof(T), generating code that looks like this:

int stride = sizeof(T);
for (int i = 0; i < 100; i += stride) {
  // do something with ((byte *)array)[i]
}

This hoists the multiplication out of the loop, which can be a big savings for large loops.

However, whenever the index is visible to other functions (for example, if it's a class member), the compiler often can't perform this optimization. If the index is never or rarely accessed by outside code, then you can manually optimize the loop stride just like the compiler would have done. Going back to the original example, you would change it to the following:

T   array[100];
int index;
  
void someInternalFunction() {
  // do something with ((byte*)array)[index] and increment index by sizeof(T)
}

You can see this optimization used in my implementation of the random number generator R250. When compiling into PPC code using g++ and full compiler optimizations, this simple optimization eliminated four shlwi instructions and improved the performance of R250 by about 12%.


Disclaimer and copyright information