You are a victim of the branch predictor.
What is branch prediction?
Consider a railway junction:
Image by Mecanismo, via Wikimedia Commons
Now, for this analogy to make any sense, suppose we are back in the 1800s, before radio and other means of long-distance communication.
You are the junction operator and you see a train coming. You have no idea which way it intends to go or what its destination is. So you stop the train, ask the driver for the destination, set the switch, and let the train depart.
The problem with this method is that a train is a very heavy beast, full of inertia, and stopping and restarting it can take several minutes.
Is there a better way? Yes! Guess which way the train will go, without asking.
- If you guess right, the train keeps going and everybody wins.
- If you guess wrong, the train has to stop, back up, and set off again.
If you always guess right, the train never has to stop.
If you guess wrong too often, the train wastes a lot of time stopping and backing up, which is almost as bad as always stopping to ask.
I admit the analogy may not be perfect; after all, the train could signal its direction with a flag or something. But in a computer there is no way to know which direction your code will take until execution reaches it. This is where the branch predictor comes in.
But first of all, what is a branch?
Consider this conditional, a branch:
if (array[i] < 500)
    r += array[i];
To decide whether r += array[i]
will or will not be executed, the processor must first read the value of array[i]
and check whether it is smaller than 500
. The problem here is that modern processors can execute several instructions at the same time, as long as there are no dependencies between them. For this to work well, the processor needs to know what the next instructions are, and waiting for the conditional to be resolved means underusing its computational power.
But modern processors do better than that. They blindly guess and take one of the paths before the result of the comparison is ready. When the result arrives, they check it:
- If the guess was right, execution just keeps going.
- If the guess was wrong, everything in flight must be stopped, the wrongly executed work undone, and execution resumed from the point right after the branch.
If it always guesses right, execution never has to stop.
If it guesses wrong too often, execution wastes time stopping and backing up, which is almost as bad as stopping to wait for the decision.
What processors do here is try to identify a pattern and follow it, to increase the chance of a correct guess. The junction operator may notice, for example, that in the morning most trains go one way, while red-painted trains usually go the other way.
When your list is not sorted, the distribution of values is close to random and uniform. Thus the result of array[i] < 500
is totally unpredictable, and the predictor will be wrong about 50% of the time.
But if your list is sorted, for the first values of i
the result will always be true
, since all those values are low. The predictor learns that the condition is always true and guesses accordingly. The moment execution passes the middle of the array, the results start coming out false
, and the predictor errs for a few iterations until it notices the change and starts assuming the condition will always be false. This gives an almost perfect hit rate.
The trick is to make all branches inside critical sections easily predictable, which can yield a big performance gain, even at the cost of sorting your data beforehand.
Although I don’t recommend writing code like this unless it’s really necessary, you can avoid relying on the predictor altogether by writing the same code without any branch:
if (array[i] < 500)
r += array[i];
int flag = (array[i] - 500) >> 31; // right shift on signed values is not portable
r += flag & array[i]; // if the flag is zero, this adds zero
There are no conditions involved here, only bit manipulation. The first problem is that the behavior of this code is not immediately obvious to the reader, unlike the original. The second, and perhaps most important, is that it only works if the right shift is implemented as an arithmetic shift, which preserves the sign bit and fills the whole word with 1s when the value is negative.
Testing with GCC 4.8.1, without optimizations:
- Random, with branch: 22.82s
- Sorted, with branch: 8.483s
- Random, without branch: 9.474s
- Sorted, without branch: 9.448s
Note that although ordering no longer makes any difference, the result is slower than when the predictor was in action and guessing right, since there are more operations to perform.
But what if we turn on optimizations?
Then the test numbers change a bit:
- Random, with branch: 3.249s
- Sorted, with branch: 3.242s
- Random, without branch: 2.565s
- Sorted, without branch: 2.569s
Here GCC is smart enough to turn the branch into a conditional move:
mov ecx, DWORD PTR [esi+edx*4] ; a = array[i]
cmp ecx, 499 ; c = a <= 499
lea ebx, [ecx+eax] ; t = r + a
cmovle eax, ebx ; if (c) r = t
The magic here is that the last instruction, cmovle, is not a branch: it is executed as just another instruction and does not stop the processor from starting the instructions that come next. There are no jumps to other parts of the code; it is all linear.
Still, it wasn’t good enough to beat our bit-operations hack, which compiles to this:
mov ebx, DWORD PTR [esi+edx*4] ; a = array[i]
lea ecx, [ebx-500] ; b = a - 500
sar ecx, 31 ; b = b >> 31
and ecx, ebx ; b = b & a
add eax, ecx ; r += b
Although it has more instructions, it runs faster in my tests.
You can also help the compiler and the predictor by annotating the code with the likely outcome of a branch. The syntax differs between compilers, but in GCC you can write the following:
#define likely(x) __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
if (likely(array[i] < 950))
r += array[i];
This technique is widely used in the Linux kernel. But you have to test to know whether it will really be faster than leaving the job to the predictor in your specific case. It can also enable compiler optimizations.
Note that all tests here can and will vary from processor to processor. Results that are better in mine may not be better in yours. For most cases it is best to write simple, readable code and leave it to the compiler to decide how to run it.
Based on an answer from Stack Overflow
Great find. It’s simply the most-voted question on Stack Overflow.
– bfavaretto
I tried not simply to translate, but to rewrite in my own way. I love this question.
– Guilherme Bernal
So the idea of Stack Overflow in Portuguese is to copy questions from the English site to gain reputation?
– McLeary
@Mcleary I’m not trying to gain reputation; I just think the question is relevant and will bring visibility to the site. And the number of "translated" questions should always be insignificant compared to the flow of original questions. I see no problem.
– Guilherme Bernal
@Mcleary This has already been debated in the community, and bringing highly relevant information here, with due credit, is encouraged. The idea is not to bring over the whole of Stack Overflow, but bringing over really good content that doesn’t exist in Portuguese is officially encouraged.
– Maniero
Some people bring over irrelevant questions; they just don’t get votes. But there are cases, like this one, that really deserve the votes. After all, it’s not that easy to bring over a question like this, let alone do the work of translating it. Besides, from what I’ve observed since the private beta, Guilherme knows what he’s talking about and doesn’t need this to earn points.
– utluiz