General considerations on data processing
First of all, it is important to note that there is no single form of data processing that is the fastest for every data volume.
When tuning application performance, it is usually necessary to analyze how, when and in what context the most critical points of the system are used. From this analysis it is possible to identify solutions ranging from adjustments in the database (denormalization, creation of indexes, changes of data types), through the way the data is retrieved (cursors, buffers, sorting), up to the use of local or distributed caches.
Ultimately, the solution depends on many variables and there is no general rule.
Retrieving a single row
Small tables hardly require any optimization. You can read the whole table and it will not affect the overall performance of the application.
Unless, of course, it is read thousands of times per second. In that case you can keep a copy of the data in memory.
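For that high-read scenario, a local cache can be enough. Below is a minimal sketch, assuming the APCu extension is available; the Config table, its columns and the 60-second TTL are just illustrative:
function load_config(mysqli $mysqli): array
{
    // Try the in-memory copy first
    $cached = apcu_fetch('config_table', $found);
    if ($found) {
        return $cached; // no query executed
    }

    // Cache miss: read the small table in full and store it for 60 seconds
    $rows = [];
    $res = $mysqli->query("SELECT chave, valor FROM Config");
    while ($row = $res->fetch_assoc()) {
        $rows[$row['chave']] = $row['valor'];
    }
    $res->free();

    apcu_store('config_table', $rows, 60);
    return $rows;
}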
Now, if you want to retrieve a single row out of thousands, the best solution is to have a good index that matches your WHERE clause exactly.
Consider the following query:
select * from Pessoa where tipo_pessoa = ? and CPF = ?
In this case, it would be ideal to have an index on both fields, tipo_pessoa and CPF.
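To illustrate, the index and the query could look more or less like this in PHP with mysqli; the composite index, the column types and the sample values are assumptions, so adapt them to your schema:
// One composite index covering exactly the WHERE clause (assumed DDL)
$mysqli->query("CREATE INDEX idx_pessoa_tipo_cpf ON Pessoa (tipo_pessoa, CPF)");

// Lookup with bound parameters; both columns are assumed to be strings here
$stmt = $mysqli->prepare("SELECT * FROM Pessoa WHERE tipo_pessoa = ? AND CPF = ?");
$tipo = 'F';             // example value
$cpf  = '00000000000';   // example value
$stmt->bind_param('ss', $tipo, $cpf);
$stmt->execute();
$pessoa = $stmt->get_result()->fetch_assoc(); // get_result() requires mysqlnd
$stmt->close();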
Retrieving all rows from a table
Reading an entire table at once, without having an idea of how many records it holds, is a challenge.
The solutions basically fall into two types: Full Reading and Partial Reading.
Full Reading
Reading the table in full will be the most appropriate solution if there is enough memory for it. After all, there will be no need to "go back and forth" to the database to retrieve new values.
However, depending on what is in the table, a volume of 1 million records will probably occupy more memory than we wish.
At first we might even conclude that we can keep all these records in memory. However, in most cases, multiple users will be accessing the system at the same time. That raises the question: how many users do we want to serve?
Suppose the records occupy 100 megabytes in memory. If we have a server with 1 gigabyte free, then, doing some very simple math, our system would serve 10 users well. Beyond that, memory paging to disk would likely degrade the program's performance until it becomes unusable.
Another problem is that loading all the data into memory takes time. The user would feel the difference between a system that reads 1 million records and then writes everything to the response at once, and another that delivers partial data, even if the total time is a bit longer. This is why pagination ends up being necessary in many systems.
Therefore, a full read of all the data in a large table is almost always unsuitable for web systems.
Partial Reading
To avoid the problems with memory and response time, there is the alternative of reading the data partially.
There is more than one way to do this.
Retrieving blocks with LIMIT
One approach is to perform several queries that retrieve different blocks of records using LIMIT.
This means that you set a block size and run several consecutive queries. An example of the executed queries:
select * from Pessoa where tipo_pessoa = ? and CPF = ? LIMIT 0, 10
select * from Pessoa where tipo_pessoa = ? and CPF = ? LIMIT 10, 10
select * from Pessoa where tipo_pessoa = ? and CPF = ? LIMIT 20, 10
select * from Pessoa where tipo_pessoa = ? and CPF = ? LIMIT 30, 10
....
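In PHP, that sequence of queries can be produced by a simple loop. A minimal sketch, with the WHERE clause omitted and a block size of 10 only for illustration:
$blockSize = 10;
$offset = 0;

do {
    // Each iteration asks the database for the next block of records
    $res = $mysqli->query("SELECT * FROM Pessoa LIMIT $offset, $blockSize");
    while ($row = $res->fetch_assoc()) {
        // process the record here (e.g. write it to a file)
    }
    $read = $res->num_rows; // rows returned in this block
    $res->free();
    $offset += $blockSize;
} while ($read === $blockSize); // an incomplete block means we reached the end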
The problem with this approach is that there is an overhead for each execution. The database has extra work to do for each block it returns.
Full query, partial results
Another way to retrieve partial results is to use a query that selects all records from the table but does not return them all at once.
This avoids the need for the database to create several result sets and also avoids loading all the data into memory at once.
The idea is that the database keeps a kind of cursor that reads the data as we retrieve it.
In the client code (PHP), as we consume the records, we discard the variables so that the memory can be released.
The weakness of this approach is that it keeps a resource open on the server for longer.
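As a rough sketch of this idea with PDO (the mysqli equivalents appear in the next section), assuming a MySQL connection whose credentials and output file name are only placeholders; turning MYSQL_ATTR_USE_BUFFERED_QUERY off asks the driver not to buffer the whole result on the client:
$pdo = new PDO('mysql:host=localhost;dbname=teste', 'usuario', 'senha');
$pdo->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);

$out = fopen('pessoas.txt', 'w');
$stmt = $pdo->query('SELECT * FROM Pessoa');
while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
    fwrite($out, implode(';', $row) . "\n"); // consume the record right away
    unset($row);                             // and let its memory be reclaimed
}
fclose($out);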
Full and Partial Readings in PHP
The PHP manual has a topic on Buffered result sets and Unbuffered result sets.
Buffered result sets load all returned rows into memory.
Example:
// Buffered query: the whole result set is transferred and kept in memory
$res = $mysqli->query("SELECT id FROM test ORDER BY id ASC");

// Because everything is in memory, we can jump to any row; here we iterate backwards
for ($row_no = $res->num_rows - 1; $row_no >= 0; $row_no--) {
    $res->data_seek($row_no);
    $row = $res->fetch_assoc();
    echo " id = " . $row['id'] . "\n";
}
As everything is in memory, you can move to any position of the result set using the data_seek(n) method.
Unbuffered result sets, on the other hand, traverse the results without storing them in memory, and are recommended when there is not enough memory available.
Example:
// Unbuffered query: rows are fetched from the server as we iterate
$mysqli->real_query("SELECT id FROM test ORDER BY id ASC");
$res = $mysqli->use_result();

echo "Result set order...\n";
while ($row = $res->fetch_assoc()) {
    echo " id = " . $row['id'] . "\n";
}
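One caveat worth adding here (my note, not part of the manual example): while an unbuffered result set is open the connection is busy, so free it before running another query.
$res->free(); // without this, the next query fails with "commands out of sync"
$total = $mysqli->query("SELECT COUNT(*) AS n FROM test"); // connection reusable again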
Case study
A few days ago, here where I work, an application was having performance problems reading the result of a query that returned approximately 30,000 rows and then generating a text file from it. The original running time was approximately 20 minutes.
First, we checked whether the query itself was the problem. We found some issues, but nothing that justified the 20 minutes, since the query took approximately 1 minute.
Second, we checked the generation of the file and also ruled that out as the problem.
Third, we found that the data was all being loaded into memory, but that was an acceptable load for the server in use.
Finally, we identified that the problem was the time Java took to read the results of the query.
Researching the driver for the Oracle database, we saw that by default it buffers 10 records per fetch. This means that, to go through the 30,000 records, the driver transfers 10 at a time.
We changed the buffer parameter to 100 and performance improved a lot. Instead of 3,000 round trips to the database (10 at a time), there were only 300 (100 at a time). We ran several tests and settled on a value of 300 buffered records for that environment. More or less than that made performance worse. In addition, the results were not the same in other environments with different amounts of available memory.
The final time dropped to two minutes. With the same code, just by modifying one parameter, we completely changed the behavior of that functionality.
Unfortunately, I did not find a parameter similar to the one mentioned above for PHP.
Other tips
In addition to everything mentioned above, some tips may be helpful:
- Select as few columns as possible in your query to minimize memory usage and transfer time
- Always sort the data in a way that uses an index
- It is possible to count how many records the table has and use an alternative algorithm depending on the situation (see the sketch below)
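A minimal sketch of that last tip, counting first and then choosing between a buffered and an unbuffered read; the 10,000-row threshold and the Pessoa table are only examples:
$total = $mysqli->query("SELECT COUNT(*) AS n FROM Pessoa")->fetch_assoc()['n'];

if ($total < 10000) {
    // Small table: buffered read, the whole result fits comfortably in memory
    $res = $mysqli->query("SELECT * FROM Pessoa");
} else {
    // Large table: unbuffered read, rows are fetched as we consume them
    $mysqli->real_query("SELECT * FROM Pessoa");
    $res = $mysqli->use_result();
}

while ($row = $res->fetch_assoc()) {
    // process each record here
}
$res->free();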
Conclusion
There are several mechanisms for reading a lot of data.
However, the only absolute conclusion is that testing will be necessary to determine the best solution. Do not forget to consider the environment where the system will run (CPU, memory, disk, etc.).
So I think the best option is a single query. (Not posting this as an answer because I don't have much else to say.)
– bfavaretto
From which programming language do you get this data? The answer to my question is decisive for answering yours. If you are using C#, for example, the most efficient is a single SELECT, using a DataReader to process the results. If it is Delphi, you do a server-side cursor query and also a single query. If it is a procedure in MySQL itself, there are specific methods. It is also important to consider what you will do with the data obtained from the query.
– Caffé
@Jorgeb. In the case of many records, the approach in my answer is exactly what you need, because you can already write the file while the records are returning from the database. You take advantage of doing two heavy operations in parallel. And this approach will not be slower in the case of few records. It remains to be seen whether PHP offers this feature. From what I've seen, the solution is in this documentation: http://php.net/manual/en/pdostatement.fetch.php. You can test these cursors and find out which one returns control immediately to the code while the query is still running.
– Caffé
@Jorgeb. If I now understand the spirit of your original question better, I believe you were concerned with the database's behavior regarding the three alternatives you proposed. Have no doubt that in this scenario (1 table, all records) the only one that makes sense for the database is the third one (do a single SELECT). If at some point you noticed some advantage in the other options, that was precisely due to the environment or the way you tested (the strategy for executing the query in the application, the consumption of the obtained result, and the time measurement technique).
– Caffé