Calculate Geometric Mean in SQL Query

Asked

Viewed 574 times

1

Table Example

Campo - Valor
A     - 30,21
B     - 15,03
C     -  5,06
B     - 10,20
B     - 12,89
C     -  1,12

Geometric average calculation

A = (30,21)^1/1
B = (15,03*10,20*12,89)^1/3
C = (5,06*1,12)^1/2

Results

A = 30,21
B = 12,23
C =  2,37

I tried something like this

POWER(POWER(10.00,SUM(LOG10(CAST(ANSDET.VALOR AS DECIMAL(30,2))))), 1/(CAST(COUNT(ANSDET.CAMPO)AS DECIMAL(10,2)))) VALOR,

The table has about 4000 records, I’m receiving the message:

An invalid floating point operation has occurred.

I need the SQL formula that gives me these average results... Any idea?

2 answers

4


I am glad that I was not the only person to go through the need of SQL production through Portuguese-speaking lands.

As @Motta spoke in his confusing and unattractive response, you can use the fact that the product log is equal to the sum of the logs and then turn your productive for summation (with some adaptations that need to be cancelled later, of course).

So let’s dive into the land of logarithms?


This answer will be divided into a few sections, to make it easier to manage and to be able to segment the reading. They are:

  1. from products to sums and back again, with SQL
  2. Product log is the sum of the log, but why?
  3. say no to unnecessary power: multiplying log(x) by a constant

NOTE: as logarithms are only well defined with real picture for numbers > 0, that response fails miserably if the dataset has at least one number in that situation, it will need to make an adaptation.

You can check on Sqlfiddle my answer to the data reported in the question. Noted that the answer to the case B is different from the expected? Well, my answer is the true geometric mean, maybe it was a floating point error/accuracy when writing the expected result.


From products to sums and back again, with SQL

You have a problem that is the calculation of a product. An example of productive well basic is the following, for those who are not used to:

produto de elementos em x do índice 0 ao 10

What does that mean? Does that mean y = x0 * x1 * x2 ... * x10. The product of the elements of the set, a counterpoint to the sum of the elements of the set.

Note how extremely similar it is to a sum:

exemplo de somatório

In this case, the product is represented by the Greek letter pi, whose sound is reminiscent of P of Portuguese and denotes product. The sum is represented by the Greek letter sigma, whose sound is reminiscent of S of Portuguese and denotes summing up.

The sum, then, is the same idea of production but using the sum operator: y = x0 + x1 + x2 ... + x10

Assuming as truth that log(A*B) = log(A) + log(B), we have to log(x0 * x1 * x2 ... * xn) = log(x0) + log(x1 * x2 ... * xn) = log(x0) + log(x1) + log(x2) ... + log(xn). So, we can make a sum of the logarithms, however, to completely replace, we need to undo the logarithm of the play.

But how? Simple, by its definition.

In the definition of logarithm, we have to e^x = n, with e and n known. The logarithm is the operation used to find the x of the question, therefore e^x = n <==> x = log(n).

From that definition, we have to e^(log(x0 * x1 * x2 ... * xn)) = x0 * x1 * x2 ... * xn. Precisely the definition of logarithm. But thanks to the property log(A*B) = log(A) + log(B), we can replace the product by sum without losing semantics: e^(log(x0) + log(x1) + log(x2) ... + log(xn)) = x0 * x1 * x2 ... * xn.

How to represent this in SQL? Using the exponential sum of the logarithms of a given column:

SELECT
  campo,
  exp(sum(log(valor))) as produtorio
FROM
  dados
GROUP BY
  campo

Note that, as it is not necessary to explain the basis, I am not using nor LOG10 nor POWER(10, expoente), yet simple LOG and EXP

Product log is the sum of the log, but why?

Be A and B any two positive real numbers. No other imposition on them is made. That means there is some a = log2(A) and b = log2(B) which are the logarithms of A and B respectively in base 2.

If I ever multiply A*B I’ll get a number C, also real and positive. How a = log2(A), that means that A = 2^a. Similarly we have B = 2^b. Soon, A*B = C = 2^a * 2^b. We can simplify the power of 2 like this: C = 2^(a+b).

And what if I took the log2 of everything?

Well, knowing that C = 2^(a+b), we have to log2(C) = a + b. But we must not forget that C = A*B. Soon, log2(A*B) = a + b. Replacing the initial formulas of a and b:

log2(A*B) = log2(A) + log2(B)

The calculations were made on base 2 by illustration, but could be any basis, even φ.

Say no to unnecessary power: multiplying log(x) by a constant

But you don’t just want the product, you want the geometric average. The product was just a means to reach the geometric average. The formula for the geometric mean of a set is given by the following formula:

fórmula da média geométrica perfeitamente maquiada para esta resposta

That is, it is the exponentiation of the production by the inverse of the quantity of elements in the production.

If you’re going to transport exactly that to SQL it looks something like this:

SELECT
  campo,
  power(exp(sum(log(valor))), 1/count(*)) as media_geo
FROM
  dados
GROUP BY
  campo

But it gets weirder to exponent twice in a row, don’t you think? Besides appearing a lot of computational demand for nothing? And, yes, you’re right: there’s work being done unnecessarily here.

It has a very important property of exponentiation that we’re going to use it now: it’s distributive about multiplication:

propriedade distributiva da exponenciação sobre a multiplicação

And there’s another cool property of logarithms: log(a^b) = b * log(a). From here, we can quietly arrive at this transformation:

derivando a fórmula para usar logaritmos, somatórios, multiplicação e um único exponencial

mesma coisa do de cima, porém focando apenas nos elementos extremos

In SQL, it would look like this:

SELECT
  campo,
  exp(sum(log(valor))/count(*)) as media_geo
FROM
  dados
GROUP BY
  campo
  • Using these formulas I get the following error, "Invalid floating point operation occurred."

  • I already got this error when I passed negative numbers or zeroes to the log. I will test more calmly the query against the database provided in your question.

  • The reply of Quesado was 10 , I will keep mine as a reminder , never respond from the tablet laziness makes you edit badly and dirty .. this solution via logarithm and potentiation has the problem of negative or very large values.

  • In the values provided by the question worked. The problem may be the presence of null or negative numbers, for which the log is not defined in real. By the way, the geometric mean of the case B is 12.5489..., as I found in my consultation

  • @Motta, I got used to editing on the smartphone... I think the tablet is even easier. Obviously this answer here was not written from the smartphone, but from the notebook. In compensation, this one here was just smatphone, this other I used the smartphone a lot, but I can’t remember if it was just to create the images or if it was for everything

  • @henriquew what data set used?

  • Good afternoon, my error was the following, within the various records that had, havian numbers of type (0.1 , 0.8 , 0.59), in "fórumula Exp(sum(log(value))/Count(*)) as media_geo" then occurred the error "An invalid floating point operation occurred." Thank you all!

  • @henriquew had no problems with this data in the formula. I put these 3 values. I tested with 15 values of this suit and got no problems

  • Try the following values (167.00 68.00 1876.00 225.00 2732.00 3754.00 9999.00 231.00 340.00 946.00 163.00 3348.00 819.00 1426.00 148.00 618.00 5721.00 1753.00&Xa;1978.00&Xa;3828.00 1104.00) or (0.57 0.88 1.63 1.38 0.32 0.57 1.32 1.38 2.22 0.79 3.85 0.46 0.63 0.86 0.88 0.86 2.19 1.47 0.32 0.82 2.40) in the formula "Exp(sum(sum(ANSDET.VALUE)/Count(*)) as VALUE"

  • Average of 882.011621537022, and average of 0.9984505031117724. I used the grouping D for the first set of numbers, and E for the second set

  • My results were not perfect, but I did the best I could in the operations that SQL Server provides. Each floating point calculation implies losing precision 1 2 3.

  • See the result for D, and to E. Note that I only lost accuracy in D, in E was a satisfactory response

Show 7 more comments

0

--li ponderada e nao geometrica
--usando a propriedade dos logs se tem o produto
exp(sum(log(valor)))
--calcule a raiz
power(exp(sum(log(valor))),(1/(count(*)))
--isto deve resolver para números pequenos
  • Using these formulas I get the following error, "Invalid floating point operation occurred."

Browser other questions tagged

You are not signed in. Login or sign up in order to post.