Is there a difference between finding the maximum value directly in SQL and checking it with PHP?

I saw, in a question on this site, a user getting the maximum value of a given column of a database table as follows:

$result_max = mysqli_query($conn, "SELECT CGenericas1 FROM AvaliacaoDesempenho");

$maior = 0;
while ($rows_cursos1 = mysqli_fetch_array($result_max)) {
    if ($maior < $teste = $rows_cursos1['CGenericas1']) {
        $maior = $teste = $rows_cursos1['CGenericas1'];
    }
}

That is, it selects all the records from the table, iterates over them in PHP, and compares the values until it finds the largest.

I know that we can also get the same result directly in SQL, with the MAX() function:

$result_max = mysqli_query($conn, "SELECT max(DCGenericas) AS max_page FROM AvaliacaoDesempenho");
$rows = mysqli_fetch_array($result_max);
$maximo = $rows["max_page"];

Which form performs better, consumes fewer resources, is faster, etc.? If one is better, what are the reasons?

1 answer


Short answer: the second is faster and consumes fewer resources, because it has less data to transmit and less redundant processing.


Now, a basic answer to meet the needs of the question's author and of a user whose comment (since deleted for lack of relevance to the current version) said they wanted a basic answer (despite the question asking for reasons) and that my answer was too complex:

Letting the database do the bulk of the processing beats doing it in PHP, in both memory use and speed.

Which form performs better?

The second. The database does all the processing needed to find the largest value and delivers a single piece of information, which is very cheap to transmit, and the application does minimal work to receive and use it. The first repeats in PHP heavy processing that has already been done in the database, besides transmitting far too much data.
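To make the two forms concrete, here is a minimal, self-contained sketch. It uses Python with sqlite3 instead of PHP/MySQL purely so it can run anywhere without a server; the table and column names are invented for the example:

```python
import sqlite3

# In-memory database with a table analogous to AvaliacaoDesempenho.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (value INTEGER)")
conn.executemany("INSERT INTO scores (value) VALUES (?)",
                 [(v,) for v in [7, 42, 3, 19, 42, 8]])

# First form: fetch every row and find the maximum in the application.
maior = None
for (value,) in conn.execute("SELECT value FROM scores"):
    if maior is None or value > maior:
        maior = value

# Second form: let the database compute it and return a single row.
(maximo,) = conn.execute("SELECT MAX(value) FROM scores").fetchone()

print(maior, maximo)  # both are 42
```

Both forms arrive at the same answer, but the first pulls every row across the driver boundary, while the second pulls exactly one.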

Which form consumes the fewest resources?

In the database the consumption is essentially the same. In PHP the second form consumes far fewer resources (I assume we are talking mainly about memory), because it only has to handle a single piece of information, versus a huge amount of information in the first form.

Which form is faster?

I imagine the answer is the same as for the first question; after all, "performs better" usually points to speed. If the question is clarified to mean something else, I can change this.

Etc.?

I don't know what else this could refer to, but I can improve the answer if it is made explicit.

What are the reasons?

The first has to transmit all the data (every row resulting from the query) so that PHP can process it, and this alone weighs much more. On top of that, every row has to be processed twice: once in the database and once in PHP.


Now a somewhat deeper answer, though far from academic, for those who want to understand better why this happens.

First, I have to say that the outcome can always differ for a number of reasons, and in some cases there is no real difference: think, for example, of a table with only one row; what difference will it make? Essentially zero. But under normal conditions, with a high volume of data, processing in the database will be absurdly faster, because in most cases the bulk of the time is spent transferring the data, not running the query itself.

In either case the database will have to read all the rows of the table to get the necessary data, whether to send them to PHP or to calculate the maximum.

Calculating the maximum is slightly slower in the database, because there is extra processing to do there. But the difference is negligible, likely well below 0.1%, since accessing the data costs much more. That, however, only considers the work on the database side; everything changes when you add up the whole process.

Imagine having to transmit thousands or millions of table rows to the application just to find out which value is the largest. Compare that with transmitting only the largest, a single piece of information. The difference seems brutal, right?

And finally, in PHP, I think it is fairly obvious that the second form has a very small, almost negligible cost in processing and memory. The first has to process all the data, so it will consume more memory (of course, much of it is recycled at each step, as allocations are freed when no longer needed, so total consumption is not that large, but there will be a lot of allocation and deallocation, which in some cases can be even worse for the application's performance).

To understand this we have to look into something called algorithm complexity (warning: the accepted answer there is quite academic; I don't advise following it unless you have a strong computing background, but it is great for a "mathematical" deepening of the question). What I say here is just a quick brushstroke on the subject.

Of course, no beginner is obliged to know this, but at some point knowing this subject (among others, of course) is what separates the professional programmer from the talker who only memorizes recipes and cannot make proper decisions on his own. Tip: most performance problems (some not even noticeable) come from lacking a basic understanding of this, and you don't need to go deep to use it with great benefit. And of course everyone has the right to ignore the subject, at their own risk.

I will use the Big O notation to indicate each complexity (there will only be two). You can better understand this in the excellent reply from Luiz Vieira.

I will talk about time complexity (execution). I will not go into space complexity (memory), because in practice it makes little difference here beyond the obvious.

There are three complexities to be verified in all this process:

  • processing in the database
  • the transmission of data
  • the processing of the application

Processing in the database

Both potentially have linear complexity (O(n)); that is, for each existing row the algorithm has to execute one step. With N rows, there will be N steps.

Keep in mind that complexity does not define time: one can be faster than the other, but the difference won't be great. As said earlier, calculating the maximum has a slightly higher cost, because each step includes a comparison and possibly an assignment, but this is done at the moment the data is already being read, and that processing is very small.
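The linear scan can be sketched with an explicit step counter, to show that the work grows exactly with the number of rows (illustrative Python; this is not what MySQL actually executes internally):

```python
def max_with_steps(values):
    """Linear scan: one comparison, and maybe one assignment, per element."""
    steps = 0
    maior = None
    for v in values:
        steps += 1
        if maior is None or v > maior:
            maior = v
    return maior, steps

m, n = max_with_steps([5, 1, 9, 4])
print(m, n)  # 9 found in 4 steps: the step count equals the number of rows
```

Doubling the input doubles the step count, which is what O(n) means in practice.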

Data transmission

Roughly, the first has linear complexity, that is, O(n). The second has constant complexity, therefore O(1): it takes one step no matter how many rows were scanned. This is where the secret to the brutal difference in performance lies. The first can take many milliseconds, or even seconds, minutes, in extreme cases hours or days, while the second will usually take less than 1 millisecond. Always considering only the transmission of data.

Note that it is not so easy to do this calculation, because the way data is transmitted depends on a number of factors and there are some amortizations. In fact, what matters most for determining the time is the space occupied during transmission. If pushed, the calculation becomes too complex, and unnecessary for most scenarios.

These Big O notations usually work only with the basic, best-known formulas, to obtain an approximation of the real complexity. This is useful for almost every case where we analyze this kind of thing, but keep in mind that the real complexity is an O(some complex formula).

In short: if there are 1 billion rows, roughly speaking, the first form will take a long time to transmit all those rows, while the second will be a single small packet (though, because of overhead, it won't be a billion times faster). If there are 10 rows, both will potentially take the same time, because a single packet will likely hold either the row data or the calculated maximum. Note that if the query returns many columns the picture changes, but that doesn't apply to this case.
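A back-of-the-envelope calculation of the transmission cost; the 30 bytes per row and the 1 Gbit/s link are invented numbers, just to show the scale:

```python
rows = 1_000_000_000          # 1 billion rows
bytes_per_row = 30            # assumed average wire size per row
link = 1_000_000_000 / 8      # assumed 1 Gbit/s link, in bytes per second

full_transfer = rows * bytes_per_row / link   # send every row to the app
single_value = bytes_per_row / link           # send only the MAX() result

print(f"{full_transfer:.0f} s vs {single_value * 1000:.6f} ms")
```

With these made-up numbers the first form spends about four minutes on the wire and the second a fraction of a millisecond, before any per-packet overhead.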

All this is true if you don't need the rows in the application, only the maximum, which is what the question seems to imply. If you need the rows for something else, then the two start to look the same. Not quite, but that is another scenario and I won't go into details here.

Whenever you can reduce the amount of data traveling between the DB and the application, you will have some gain.

Application processing

Here the difference is also brutal: again, the first is O(n) versus the second's O(1). It is the second biggest source of the difference, and in some very specific, extreme cases it can even become the main one. PHP runs a loop in the first form, and I think it is fairly obvious to everyone that this will be slower than taking the single datum, the maximum, plain and simple.
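A rough timing sketch of the application-side difference (Python and sqlite3 stand in for PHP and MySQL; the absolute times depend on the machine, only the ordering matters):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (v INTEGER)")
conn.executemany("INSERT INTO t (v) VALUES (?)", ((i,) for i in range(200_000)))

# O(n) in the application: every row crosses into the app and is compared.
t0 = time.perf_counter()
maior = 0
for (v,) in conn.execute("SELECT v FROM t"):
    if v > maior:
        maior = v
t_loop = time.perf_counter() - t0

# O(1) in the application: the scan happens inside the DB engine.
t0 = time.perf_counter()
(maximo,) = conn.execute("SELECT MAX(v) FROM t").fetchone()
t_max = time.perf_counter() - t0

print(maior == maximo, t_loop > t_max)
```

Both sides still read every row, but the in-app loop pays interpreter and driver overhead per row, while the MAX() query pays it once.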

I haven't even mentioned that in the DB the comparison runs optimally, since it was implemented in a highly efficient language by good programmers, while PHP has nowhere near the same efficiency and perhaps the programmer isn't as good; but this comparison is more curious than useful. I even find the assignment to $teste in this code strange, which makes the comparison even more unfair, but it shows that it is easier to write something that hurts performance in PHP than in the database. One of the wonders of a DBMS is that it optimizes how the query is executed; it helps the weaker programmer get by, although it still requires certain care.
