This is a controversial topic, as you can see in this question. Read everything, including the comments; there is very good material there, including examples of how things actually play out in practice.
It seems to me you already understand that measuring an algorithm's complexity is not the same as measuring its speed. There is a relationship between the two, but it is not that direct.
Of course, with a large enough data set, an O(n·m) algorithm will not beat an O(1) one on any hardware.
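To make that concrete, here is a minimal sketch, assuming arbitrary sizes and a hypothetical `time_ns` helper, that compares an O(n) linear scan of a `std::vector` with an O(1)-on-average `std::unordered_set` lookup. For tiny inputs the scan can win thanks to contiguous memory and no hashing overhead; once the data grows, the better complexity wins regardless of the hardware. The exact numbers depend entirely on the machine and compiler.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <unordered_set>
#include <vector>

// Hypothetical helper: average time of `f` in nanoseconds over `reps` runs.
template <typename F>
long long time_ns(F f, int reps = 1000) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i) f();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count() / reps;
}

int main() {
    // Sizes chosen only to illustrate the crossover, not as real benchmarks.
    for (std::size_t n : {8u, 64u, 100000u}) {
        std::vector<int> vec(n);
        std::unordered_set<int> set;
        for (std::size_t i = 0; i < n; ++i) { vec[i] = int(i); set.insert(int(i)); }
        int needle = int(n - 1);  // worst case for the linear scan

        volatile bool found = false;  // keep the optimizer from discarding the work
        long long linear = time_ns([&] { found = std::find(vec.begin(), vec.end(), needle) != vec.end(); });
        long long hashed = time_ns([&] { found = set.count(needle) != 0; });

        std::printf("n=%zu  O(n) scan: %lld ns   O(1) hash: %lld ns\n", n, linear, hashed);
    }
}
```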
Complexity analysis, the way people usually do it, does not take hardware conditions into account.
The most common approach is to classify efficiency in a neutral environment. In general, optimizations are not considered.
Imagine running this on an overloaded computer. All the theory evaporates. Imagine a machine that spends more time swapping to disk than computing. What good does knowing the algorithm's theoretical efficiency do then?
In theory you can take hardware into account: you can add variables that model these conditions. But it is rarely worth the effort. In general this is done only in very specific academic or analytical settings, or when building something that will be used heavily and potentially needs a lot of performance; in those cases it does make sense to consider all the variables. On a day-to-day basis you do a superficial efficiency analysis to pick a suitable algorithm and then run real speed tests with typical and extreme data to see what actually happens under real conditions (the sketch below illustrates the idea). That usually gives more adequate results from an engineering point of view, even if from a scientific point of view it is not necessarily the best.
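A minimal sketch of that kind of test, assuming a hand-written `insertion_sort` as the hypothetical candidate being compared against `std::sort`, with arbitrary input sizes; the numbers only mean something on the machine where they were measured:

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

// Hypothetical candidate algorithm, used only for the comparison below.
void insertion_sort(std::vector<int>& v) {
    for (std::size_t i = 1; i < v.size(); ++i) {
        int key = v[i];
        std::size_t j = i;
        while (j > 0 && v[j - 1] > key) { v[j] = v[j - 1]; --j; }
        v[j] = key;
    }
}

// Runs `sorter` on a copy of `data` and reports the elapsed time in ms.
template <typename Sorter>
double run_ms(Sorter sorter, std::vector<int> data) {
    auto start = std::chrono::steady_clock::now();
    sorter(data);
    std::chrono::duration<double, std::milli> elapsed = std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

int main() {
    std::mt19937 rng(42);
    std::vector<int> typical(20000), sorted_asc(20000), sorted_desc(20000);
    for (std::size_t i = 0; i < typical.size(); ++i) {
        typical[i] = int(rng());
        sorted_asc[i] = int(i);
        sorted_desc[i] = int(typical.size() - i);
    }

    auto report = [&](const char* name, const std::vector<int>& data) {
        std::printf("%-15s insertion: %8.2f ms   std::sort: %8.2f ms\n", name,
                    run_ms(insertion_sort, data),
                    run_ms([](std::vector<int>& v) { std::sort(v.begin(), v.end()); }, data));
    };
    report("typical", typical);            // random data: the usual case
    report("already sorted", sorted_asc);  // extreme case that favors insertion sort
    report("reverse sorted", sorted_desc); // extreme case that punishes it
}
```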
I have seen people say that there are formal models along these lines, but I have never seen anyone actually show them. I know they exist, but they are well hidden :)
That said, in my experience caching makes more difference than branch prediction in most cases. Algorithms with fewer branches can be more efficient on simpler machines.
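Here is a sketch of both effects, with an arbitrary array size and a threshold of 128 chosen only so the branch is taken about half the time; note that at higher optimization levels the compiler may turn the branch into a conditional move or vectorize the loop, which hides the branch-prediction effect:

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

// Measures one call to `f` in milliseconds.
template <typename F>
double ms(F f) {
    auto start = std::chrono::steady_clock::now();
    f();
    std::chrono::duration<double, std::milli> d = std::chrono::steady_clock::now() - start;
    return d.count();
}

int main() {
    std::mt19937 rng(1);
    const std::size_t n = 1 << 24;  // ~16M ints, far larger than any cache
    std::vector<int> data(n);
    for (auto& x : data) x = int(rng() % 256);

    // Branch prediction: the same loop, run over sorted vs. shuffled data.
    volatile long long sum = 0;
    auto branchy = [&] {
        long long s = 0;
        for (int x : data) if (x >= 128) s += x;  // taken ~half the time
        sum = s;
    };
    std::sort(data.begin(), data.end());
    double sorted_ms = ms(branchy);    // branch is easy to predict here
    std::shuffle(data.begin(), data.end(), rng);
    double shuffled_ms = ms(branchy);  // same work, unpredictable branch
    std::printf("branchy sum   sorted: %.1f ms   shuffled: %.1f ms\n", sorted_ms, shuffled_ms);

    // Caching: the same amount of work, sequential vs. large-stride access.
    volatile long long sink = 0;
    auto sweep = [&](std::size_t stride) {
        long long s = 0;
        for (std::size_t start = 0; start < stride; ++start)
            for (std::size_t i = start; i < n; i += stride) s += data[i];
        sink = s;
    };
    std::printf("plain sum     stride 1: %.1f ms   stride 4096: %.1f ms\n",
                ms([&] { sweep(1); }), ms([&] { sweep(4096); }));
}
```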
One algorithm may be the most suitable when the data has to be manipulated on disk, while for the same data in memory another algorithm may be better. Imagine how much things could improve if the L1 cache, or even the registers, could hold all the data to be processed.
Remember that for small amounts of data (low N), in practice, it barely matters which algorithm is used.
Take a look at External and internal memory sorting algorithms.
The processor is so complex that nobody really knows the answer except by testing. But these sorting algorithms are so different that it is highly unlikely that the hardware will change the outcome of the comparison you are trying to measure. The important thing is to use the same hardware for every measurement.
– fabio di luca