What is the most efficient way to select items from a table?

5

Having the following table:

| id | name | email | country |
+----+------+-------+---------+

What is the most efficient (fastest) way to select ALL the rows of the table, knowing that the table may have 1 row or millions of rows?

  • loop and do one SELECT per row?
  • loop with N SELECTs of X rows each?
  • a single SELECT of all the rows in the table?
  • another way? Which?

Note: if I use a limit of X rows per SELECT, how many rows should each SELECT fetch if I have 10,000 records? And if I have 1 million? Is there an algorithm for that?

I limited the question to MySQL and InnoDB so that it is not too broad.

[EDIT]

Note that these rows will be used in PHP to export an .xml file containing all the table's fields.

  • So I think the best option is a single query. (Not posting this as an answer because I don't have much else to say.)

  • From which programming language do you get this data? The answer to that is decisive for answering your question. If you use C#, for example, the most efficient approach is a single SELECT, using a DataReader to process the results. If it is Delphi, you use a server-side cursor and likewise do a single query. If it is a stored procedure inside MySQL itself, there are specific methods. It is also important to consider what you will do with the data obtained from the query.

  • @Jorgeb. In the case of many records, the approach in my answer is exactly what you need, because you can start writing the file while the records are still being returned by the database. You gain time by doing two heavy operations in parallel, and this approach will not be slower in the case of few records. It remains to be seen whether PHP offers this feature; from what I've seen, the solution is in this documentation: http://php.net/manual/en/pdostatement.fetch.php. You can test these cursors and find out which one returns control to the code immediately while the query is still running.

  • @Jorgeb. Now that I better understand the spirit of your original question, I believe you were concerned with the database's behaviour regarding the three alternatives you proposed. Have no doubt that in this scenario (one table, all records) the only one that makes sense for the database is the third (a single SELECT). If at some point you noticed an advantage in the other options, that was due to the environment and the way you tested (query execution strategy in the application, consumption of the returned result, and time measurement technique).

3 answers

5


General considerations on data processing

First of all, it is important to mention that in most cases there is no single form of data processing that is fastest for every data volume.

When tuning application performance, it is usually necessary to analyse how, when and in what context the most critical points of the system are used. From this analysis it becomes possible to identify solutions ranging from adjustments in the database (denormalization, creation of indexes, changes of data types), through the way the data is retrieved (cursors, buffers, sorting), up to the use of local or distributed caches.

Finally, the solution depends on many variables and there is no general rule.

Retrieving a single row

Small tables hardly require any optimization. You can read them in full and it will not affect the overall performance of the application.

Unless, of course, it is read thousands of times per second. In this case you can store a copy of the data in memory.

Now, if you want to retrieve a single row out of thousands, the best solution is to have a good index that exactly matches your WHERE clause.

Consider the following query:

select * from Pessoa where tipo_pessoa = ? and CPF = ?

In this case, it would be ideal to have an index on both fields tipo_pessoa and CPF.
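
Purely as an illustration (the index name is an assumption, not something stated in this answer), such a composite index could be created from PHP like this:

// Sketch only: assumes $mysqli is an open connection, as in the examples further below.
// Creates a composite index matching the WHERE clause of the query above.
$mysqli->query("CREATE INDEX idx_pessoa_tipo_cpf ON Pessoa (tipo_pessoa, CPF)");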

Recovering all rows from a table

Reading an entire table without knowing in advance how many records it holds is a challenge.

The solutions fall basically into two types: Full Reading and Partial Reading.

Full Reading

Reading the table in full will be the most appropriate solution if there’s enough memory for that. After all, there will be no need to "go back and forth" to the database to recover new values.

However, depending on what is in the table, a volume of 1 million records will probably occupy more memory than we wish.

At first we might even conclude that we can keep all these records in memory. However, in most cases, multiple users will be accessing the system at the same time. Which raises the question: how many users do we want to serve?

Suppose the records occupy 100 megabytes of memory. If we have a server with 1 gigabyte free, then by a very simple calculation our system would serve 10 users well. Beyond that, paging memory to disk would likely degrade the program's performance to the point of making it unusable.

Another problem is that loading all the data into memory takes time. The user would notice the difference between a system that reads 1 million records and only then writes everything out at once, and another that sends partial data as it goes, even if the latter's total time is a little longer. That is why many systems end up implementing pagination.

Therefore, reading all the data of a large table at once is almost always unsuitable for web systems.

Partial Reading

To avoid the problem with memory and response time, there is then the alternative of reading the data partially.

There is more than one way to do this.

Recovering blocks with LIMIT

One approach is to perform various queries that retrieve different blocks of records using LIMIT.

This means that you should set a block size and run several consecutive queries. An example of executed queries is:

select * from Pessoa where tipo_pessoa = ? and CPF = ? LIMIT 0, 10
select * from Pessoa where tipo_pessoa = ? and CPF = ? LIMIT 10, 10
select * from Pessoa where tipo_pessoa = ? and CPF = ? LIMIT 20, 10
select * from Pessoa where tipo_pessoa = ? and CPF = ? LIMIT 30, 10
....

The problem with this approach is the overhead of each execution: the database has extra work to do to produce each block.
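
As a rough sketch (assuming an open mysqli connection in $mysqli and a placeholder table name minha_tabela with the columns from the question), the block-by-block reading could look like this:

$blockSize = 1000;   // arbitrary block size; tune it for your environment
$offset = 0;

do {
    // each iteration issues one more query, which is where the extra overhead lies
    $res = $mysqli->query(
        "SELECT id, name, email, country FROM minha_tabela ORDER BY id LIMIT $offset, $blockSize"
    );
    while ($row = $res->fetch_assoc()) {
        // process the row here (for example, write it to the output file)
    }
    $rowsRead = $res->num_rows;
    $res->free();
    $offset += $blockSize;
} while ($rowsRead === $blockSize);   // a short (or empty) block means the end was reached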

Full query, partial results

Another way to retrieve partial results is to use a query that selects all the records of the table but does not return them all at once from the database.

This avoids the need for the database to create several result sets and also avoids loading all the data into memory.

The idea is that the database keeps a kind of cursor that reads the data as we retrieve it.

In the client code (PHP), as we consume the records, we discard the variables so that they can be freed from memory.

The weakness of this approach is that it keeps a resource open on the server for longer.

Full and Partial Readings in PHP

The PHP manual has a topic on Buffered result sets and Unbuffered result sets.

Buffered result sets load all the returned rows into memory.

Example:

$res = $mysqli->query("SELECT id FROM test ORDER BY id ASC");
for ($row_no = $res->num_rows - 1; $row_no >= 0; $row_no--) {
    $res->data_seek($row_no);
    $row = $res->fetch_assoc();
    echo " id = " . $row['id'] . "\n";
}

As everything is in memory, you can jump to any position in the result set using the data_seek(n) method.

Unbuffered result sets, on the other hand, iterate over the results without storing them in memory, and are indicated when not enough memory is available.

Example:

$mysqli->real_query("SELECT id FROM test ORDER BY id ASC");
$res = $mysqli->use_result();

echo "Result set order...\n";
while ($row = $res->fetch_assoc()) {
    echo " id = " . $row['id'] . "\n";
}
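
Since the question mentions generating an .xml file in PHP, here is a hedged sketch of how an unbuffered result set could be streamed straight into a file with XMLWriter (the file, table and element names are assumptions, not something from the question or the manual):

// Sketch: stream rows from an unbuffered query directly into an XML file,
// so that neither the whole result set nor the whole document sits in memory.
// Assumes $mysqli is an open connection; table and element names are placeholders.
$xml = new XMLWriter();
$xml->openUri('export.xml');
$xml->startDocument('1.0', 'UTF-8');
$xml->startElement('rows');

$mysqli->real_query("SELECT id, name, email, country FROM minha_tabela ORDER BY id");
$res = $mysqli->use_result();              // unbuffered: rows are fetched on demand

while ($row = $res->fetch_assoc()) {
    $xml->startElement('row');
    foreach ($row as $field => $value) {
        $xml->writeElement($field, (string) $value);
    }
    $xml->endElement();                    // </row>
    $xml->flush();                         // push what is ready to disk and free the buffer
}

$res->free();
$xml->endElement();                        // </rows>
$xml->endDocument();
$xml->flush();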

Case study

A few days ago, here where I work, an application was suffering from performance problems when reading the result of a query that returned approximately 30,000 rows and then generating a text file from it. The original running time was approximately 20 minutes.

First, we checked whether the query itself was the problem. We found some issues, but nothing that justified the 20 minutes, since the query took approximately 1 minute.

Second, we checked the generation of the file and also ruled out that this was the problem.

Third, we also found that the data was all loaded in memory, but it was an acceptable load for the server in use.

Finally, we identified that the problem was the time Java took to read the results of the query.

Researching the driver for the Oracle database, we saw that by default it buffers 10 records per query. This means that to iterate over the 30,000 records, the driver transfers them 10 at a time.

We changed the buffer parameter to 100 and performance improved a lot. Instead of 3,000 round trips to the database (10 by 10), there were now only 300 (100 by 100). We ran several tests and arrived at a value of 300 buffered records for that environment; more or less than that made performance worse. In addition, the results were not the same in other environments with different amounts of available memory.

The final time came down to two minutes. With the same code, just by modifying one parameter, we completely changed the behaviour of that feature.

Unfortunately, I did not find a parameter similar to the one mentioned above for PHP.

Other tips

In addition to everything mentioned above, some tips may be helpful:

  • Select as few columns as possible in your query, to minimize memory usage and transfer time
  • Always sort the data in a way that uses an index
  • It is possible to count how many records the table has and use an alternative algorithm depending on the situation (a sketch follows this list)
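
For the last tip, a purely illustrative sketch (the 100,000-row threshold and the table name are arbitrary assumptions, not recommendations from this answer):

// Sketch: count the rows first and choose a reading strategy accordingly.
// Assumes $mysqli is an open connection, as in the examples above.
$count = (int) $mysqli->query("SELECT COUNT(*) FROM minha_tabela")->fetch_row()[0];

if ($count <= 100000) {
    // small enough: a buffered read of everything at once is simplest
    $res = $mysqli->query("SELECT id, name, email, country FROM minha_tabela");
} else {
    // large table: fall back to an unbuffered query (or to LIMIT blocks)
    $mysqli->real_query("SELECT id, name, email, country FROM minha_tabela");
    $res = $mysqli->use_result();
}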

Conclusion

There are several mechanisms for reading a lot of data.

However, the only absolute conclusion is that testing will be necessary to determine the best solution. Do not forget to consider the environment where the system will run (CPU, memory, disk, etc.).

  • Excellent reply, @utluiz.

2

The technique for processing, in the application, a large volume of data obtained from a DBMS is:

  1. Fire the query (SELECT) asynchronously,
  2. Immediately start processing the returned records,
  3. Continue processing records as they are delivered by the server (usually the server has not even finished selecting all the records),
  4. Discard already processed records in order to free memory for new records that are still arriving,
  5. Detect that there are no more records to be delivered by the server and end the process.

Naturally, this technique only applies when the records are obtained for reading only (their processing leads to changes in other tables, or in none at all) and when the processing logic allows a single forward pass (no intention of re-processing a record that has already been read).

How to implement this technique varies according to the programming language used.

In C#, for example, with a DataReader:

using (connection)
{
    OleDbCommand command = new OleDbCommand(
      "SELECT CategoryID, CategoryName FROM Categories;",
      connection);
    connection.Open();

    // here the SELECT is executed against the database
    OleDbDataReader reader = command.ExecuteReader();

    // control is returned to the code immediately,
    // that is, the line below may run even before the server
    // has finished selecting the records
    if (reader.HasRows)
    {
        // the line below reads one row
        while (reader.Read())
        {
            Console.WriteLine("{0}\t{1}", reader.GetInt32(0),
                reader.GetString(1));

        } // rows already read become available to be freed from memory
    }
    reader.Close();
}

Note that this approach is similar to your idea of repeating the SELECT several times while limiting the number of records returned each time; but this one, by using the programming platform's resources, performs much better.

Programming purely in MySQL (stored procedures), it is also possible to implement this technique using cursors.

Update: since the question was updated with the PHP tag, here is an example of a PHP cursor using PDO:

<?php
function readDataForwards($dbh) {
  $sql = 'SELECT hand, won, bet FROM mynumbers';
  try {
    $stmt = $dbh->prepare($sql, array(PDO::ATTR_CURSOR => PDO::CURSOR_SCROLL));
    $stmt->execute();
    while ($row = $stmt->fetch(PDO::FETCH_NUM)) {
      $data = $row[0] . "\t" . $row[1] . "\t" . $row[2] . "\n";
      print $data;
    }
    $stmt = null;
  }
  catch (PDOException $e) {
    print $e->getMessage();
  }
}
?>

I adapted this example from the PHP manual.

  • But the question remains: which of the approaches is the fastest to run? Is this one the fastest of all?

  • Can you prove what you are saying? I have tested it here and the results vary greatly with the number of rows... For a few rows the fastest is option 1, for hundreds of rows the fastest is option 3, and for thousands of rows the fastest is option 2, if you choose a more or less decent N and X.

  • @Jorgeb. Yes, I could prove it, but I don't need to and I don't have the time. You are on the right track: running tests to find out what is best for you. Considering the scenario I described in my answer, this is undoubtedly the best-performing approach.

  • The problem is finding the fastest method that fits all the conditions...

  • @Jorgeb. Unfortunately there is no single fastest method for all situations. Take sorting algorithms as an example: Bubble Sort, one of the most "stupid" algorithms, is faster than Quick Sort, considered far more efficient, for small amounts of data.

-2

It depends on what you want to look for.

A specific row? A certain set of rows? All rows?

Here is a brief explanation with suggestions for the study and evaluation of your case.

When you filter for a specific row using an index, the query tends to be more efficient, since it returns little information (only one row) and the lookup goes through the index structure using binary search, which performs much better.

If you want to fetch a set of rows according to some other condition, running a one-row query several times will undoubtedly be more expensive, even with indexes. The ideal is to write an efficient query, using the minimum of joins, functions, etc., perhaps leaving that work to your program, thus taking load off the database and moving it to the application. Other aspects should also be evaluated: if your queries for a given set of rows always use the same column(s), turn them into primary (clustered) indexes for unique values and secondary (nonclustered) indexes for columns whose values repeat.

To fetch all rows, just do not filter. Is fetching all rows at once the fastest? Yes, but it returns a large volume of data, consuming a lot of physical resources such as memory, and can even crash your application because of the volume. The ideal way to select all rows (or many rows, even with a filter) is to paginate them using LIMIT X,Y, which is a simple operation applied on top of the processed result: it discards the first X rows and returns only the next Y rows.

There are many query optimization techniques, ranging from database modeling, normalization and server configuration to capturing slow-query logs, good practices, the use of primary and secondary indexes, and the cautious use of commands that strain the database (JOIN, for example), etc.

The topics mentioned in the previous paragraph are suggestions for study, so that your analytical skills become refined enough to understand the needs of specific databases, each with its own particularities.

  • All the rows, as the question says

  • For all rows, just do not use the WHERE clause. But think about it: what would your app do with a million, or even 10,000, records at once? The ideal in this case really is pagination, bringing enough for the user to see; if they want to see more, they navigate to the second page, the third, the fourth, up to page n, regardless of the number of records. It does not penalize the application with an unnecessary information overload, since no one can look at 10,000 records at once, for example, and it does not overload the database since, as said, using LIMIT is not costly.

  • I don't think you understood the question. Jorge does not want to know how to select all rows in SQL, so it is no use saying it is just a matter of "not using the WHERE clause". He wants to know how to do it efficiently. For example, you will hardly have enough memory on the server to load a million rows, so you would have to read them in pieces. What is the use case? Example: an accounting system used in a market or store needs to generate files and reports with all the products sold in the month. I have seen a case of 3 million, and that was just a grocery store.

  • In my answer I talk about that; just read it carefully. Regards.
