You are a victim of the branch predictor.
What is branch prediction?
Consider a railway junction:
Image by Mecanismo, via Wikimedia Commons
Now, for this analogy to make any sense, suppose we are back in the 1800s, before radio and other means of long-distance communication.
You are the junction operator and you see a train coming. You have no idea which way it intends to go or what its destination is. So you stop the train, ask the driver for the destination, set the switch, and let the train depart.
The problem with this method is that a train is a very heavy beast, full of inertia, and stopping and restarting it can take several minutes.
Is there a better way? Yes! Guess which way the train will go, without asking.
- If you guess right, the train keeps going and everybody wins.
- If you guess wrong, the train has to stop, back up, and set off again.
If you always guess right, the train never has to stop.
If you guess wrong too often, the train wastes a lot of time stopping and backing up, which is almost as bad as always stopping to ask.
I admit the analogy may not be perfect; after all, the train could signal its direction with a flag or something. But in a computer there is no way to know which direction your code will take until execution reaches it. This is where the branch predictor comes in.
But first of all, what is a branch?
Consider this conditional, a branch:
if (array[i] < 500)
    r += array[i];
To decide whether r += array[i]
will or will not be executed, the processor must first read the value of array[i]
and check whether it is smaller than 500
. The problem here is that modern processors can execute several instructions at the same time, as long as there are no dependencies between them. For this to work well, the processor needs to know what the next instructions are, and waiting for the conditional to be resolved means underusing its computational power.
But modern processors do better than that. They blindly guess and take one of the paths before the result of the comparison is ready. When the result arrives, they check it:
- If the guess was right, execution just keeps going.
- If the guess was wrong, everything in flight must be stopped, the wrongly executed work undone, and execution resumed from the point right after the branch.
If it always guesses right, execution never has to stop.
If it guesses wrong too often, execution wastes time stopping and backing up, which is almost as bad as stopping to wait for the decision.
What processors do here is try to identify a pattern and follow it, to increase the chance of a correct guess. The junction operator may notice, for example, that in the morning most trains go one way, while red-painted trains usually go the other way.
When your list is not sorted, the distribution of values is close to random and uniform. Thus the result of array[i] < 500
is totally unpredictable, and the predictor will be wrong about 50% of the time.
But if your list is sorted, for the first values of i
the result will always be true
, since all those values are low. The predictor learns that the condition is always true and guesses accordingly. The moment execution passes the middle of the array, the results start coming out false
, and the predictor errs for a few iterations until it notices the change and starts assuming the condition will always be false. This gives an almost perfect hit rate.
The trick is to make all branches inside critical sections easily predictable, which can yield a big performance gain, even at the cost of sorting your data beforehand.
Although I don’t recommend writing code like this unless it’s really necessary, you can avoid relying on the predictor altogether by writing the same code without any branch:
if (array[i] < 500)
r += array[i];
int flag = (array[i] - 500) >> 31; // right shift on signed values is not portable
r += flag & array[i]; // if the flag is zero, this adds zero
There are no conditions involved here, only bit manipulation. The first problem is that the behavior of this code is not immediately obvious to the reader, unlike the original. The second, and perhaps most important, is that it only works if the right shift is implemented as an arithmetic shift, which preserves the sign bit and fills the whole word with 1s when the value is negative.
Testing with GCC 4.8.1, without optimizations:
- Random, with branch: 22.82s
- Sorted, with branch: 8.483s
- Random, without branch: 9.474s
- Sorted, without branch: 9.448s
Note that although ordering no longer makes any difference, the result is slower than when the predictor was in action and guessing right, since there are more operations to perform.
But what if we turn on optimizations?
Then the test numbers change a bit:
- Random, with branch: 3.249s
- Sorted, with branch: 3.242s
- Random, without branch: 2.565s
- Sorted, without branch: 2.569s
Here GCC is smart enough to turn the branch into a conditional move:
mov ecx, DWORD PTR [esi+edx*4] ; a = array[i]
cmp ecx, 499 ; c = a <= 499
lea ebx, [ecx+eax] ; t = r + a
cmovle eax, ebx ; if (c) r = t
The magic here is that the last instruction, cmovle, is not a branch: it is executed as just another instruction and does not stop the processor from starting the instructions that come next. There are no jumps to other parts of the code; it is all linear.
Still, it wasn’t good enough to beat our bit-operations hack, which compiles to this:
mov ebx, DWORD PTR [esi+edx*4] ; a = array[i]
lea ecx, [ebx-500] ; b = a - 500
sar ecx, 31 ; b = b >> 31
and ecx, ebx ; b = b & a
add eax, ecx ; r += b
Although it has more instructions, it runs faster in my tests.
You can also help the compiler and the predictor by annotating the code with the likely outcome of a branch. The syntax differs between compilers, but in GCC you can write the following:
#define likely(x) __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
if (likely(array[i] < 950))
r += array[i];
This technique is widely used in the Linux kernel. But you have to test to know whether it will really be faster than leaving the job to the predictor in your specific case. It can also enable compiler optimizations.
Note that all tests here can and will vary from processor to processor. Results that are better in mine may not be better in yours. For most cases it is best to write simple, readable code and leave it to the compiler to decide how to run it.
Based on an answer from Stack Overflow
Great find. It’s simply the most-voted question on Stack Overflow.
– bfavaretto
I tried not simply to translate, but to rewrite in my own way. I love this question.
– Guilherme Bernal
So the idea of Stack Overflow in Portuguese is to copy questions from the English site to gain reputation?
– McLeary
@Mcleary I’m not trying to gain reputation; I just think the question is relevant and will bring visibility to the site. And the number of "translated" questions should always be insignificant compared to the flow of original questions. I see no problem.
– Guilherme Bernal
@Mcleary This has already been debated in the community, and bringing highly relevant information here, with due credit, is encouraged. The idea is not to bring over the whole of Stack Overflow, but bringing over really good content that doesn’t exist in Portuguese is officially encouraged.
– Maniero
Some people bring over irrelevant questions; they just don’t get votes. But there are cases, like this one, that really deserve the votes. After all, it’s not that easy to bring over a question like this, let alone do the work of translating it. Besides, from what I’ve observed since the private beta, Guilherme knows what he’s talking about and doesn’t need this to earn points.
– utluiz