COUNT(*) x COUNT(1) x COUNT(id)

Asked

Viewed 499 times

6

I would like to understand better about the difference between these ways of using the counter:

select COUNT(*) from tabela

select COUNT(1) from tabela

select COUNT(id) from tabela

This question deals with the performance between two of the forms, but I would like to detail better what differs each one, besides adding to the question a third way.

  • There is a (real) difference in performance?
  • There may be a difference in results?
  • It is possible to map the best use for each of the forms?

1 answer

7


select COUNT(*) from tabela

Take all table columns during the query, then all data will be available in memory for SGDB to do the counting operation and other things. If there’s no optimization. But many Dbs optimize this in some way to a simple expression like this, some can do this even in constant time O(1). Otherwise O(n).

select COUNT(1) from tabela

It takes a constant, that is, a value that is already in memory, so in theory it is to be absurdly faster by not having to load anything from the database, but it depends on having some optimization. It always depends on the implementation of the database. It can be O(1), but a majority will be O(n) even. The difference is that the load of each of the elements will be potentially smaller.

select COUNT(id) from tabela

Here it reads only one column, it is usually faster than the first (some cases of tables with few and short columns may be the same), and it will be the same as the previous one for having to read all the lines, even if it does not bring anything beyond the indicator to the counter, but again, unless you have some optimization, in this case this may be much slower than the previous one, but it may not be. It may be O(n) or O(1).

I put in the Github for future reference.

More details

Some databases have an optimization where the total count without filter (without WHERE) has already stored automatically and is guaranteed to be always updated, in this case the complexity will be O(1). The most scalable Dbs don’t usually have this because of technical difficulties. I can cite the example of Myisam from Mysql that always knows the COUNT() no filter or other criteria being used. But in most cases it will be O(n). If you have WHERE or other information on SELECT or make a JOIN or has other characteristics that can affect the count will certainly be O(n).

If it is O(n) the difference in performance will be small, especially in tables with very short lines. Because it will have to count even, it will have to load all lines to count. In very large lines there can be difference when using the COUNT(*), but only in these cases. And then the other 2 examples should be the same, because read the line just to tell without using any information and read a simple id same.

To understand this all depends on understanding how the internal storage of engine responsible for this in your SGDB and also how it works file system in general. And to understand that each page that the software makes to hit the disk (or SSD or other form of storage) is absurdly slower (it gets to be 3 orders of magnitude) than to take a data that is already in memory, no matter the size of the data, although it changes a little the proportion in some cases. And then understand when you need to go to secondary storage or not. So the cache can make an absurd difference, in some extreme cases it can go from less than 1 millisecond when everything is cached, to more than 1 second when nothing is cached.

There is a (real) difference in performance?

It depends a little on implementation whether it is a performance problem or not, because it is possible to optimize knowing that in practice, this example, makes no difference, after all the load of the data has zero function in this specific query. Not charging even telling you to do it doesn’t change anything. There just testing in each case to know if you do it or not. And the test can change at each run. Contrary to what people imagine testing database performance is very complicated. People expect linear answers, but in something that has so many optimizations this doesn’t happen.

There may be a difference in results?

Should not in this case in most Dbs, in other cases can.

It is possible to map the best use for each of the forms?

I believe so, whatever that means. But the basic is to test each one where you will use and see which one is faster (doing in various scenarios, considering the cache). In other examples it may be more a matter of doing what you expect or not, but it can only be analyzed case by case.

  • And the SELECT SUM(1)? How would you feel about others? About performance.

  • @Joãopedroschmitz is to be instantaneous because it is doing nothing in DB, but again depends on optimization, which I do not doubt that it ceases to exist in some DB.

Browser other questions tagged

You are not signed in. Login or sign up in order to post.