Negative variance in R? Floating point error propagation

11

Consider the following function for computing the variance:

variancia <- function(x) {
  n <- length(x)
  (1/(n^2-n))*(n*(sum(x^2))-(sum(x)^2))
}

Note that it agrees with the built-in function var in most cases:

teste <- 1:5
var(teste)
[1] 2.5
variancia(teste)
[1] 2.5
all.equal(var(teste),variancia(teste))
[1] TRUE

Or in this other example:

set.seed(1)
x1 <- rnorm(100, 10, 100)
var(x1)
[1] 8067.621
variancia(x1)
[1] 8067.621
all.equal(variancia(x1), var(x1))
[1] TRUE

However, in the case below it returns an impossible (negative) value:

set.seed(1)
x2 <- runif(1000) + 10^12
variancia(x2)
[1] -140878367
var(x2)
[1] 0.08316728

Why do the two functions differ? How can I make sure the function variancia returns the correct value in the last example?

  • Doesn't 10^12 exceed the largest integer representable in R?

  • In this case they are not integers but double-precision floating-point numbers. They have 64 bits: 1 for the sign, 11 for the exponent and 52 for the significand, which gives a precision of approximately 16 decimal digits.

2 answers

11


Your function was the victim of catastrophic cancellation. This can happen when subtracting two numbers that are close in value and have the same sign. In your function those are:

sum(x2^2)
[1] 1e+27

sum(x2)^2 / length(x2)
[1] 1e+27

With the formula used in the function variancia, this usually happens when the variance of the vector is much smaller than the square of its mean.

I will propose two inefficient but simple solutions:

  • Use another formula:
variancia2 <- function(x) {
  n <- length(x)
  media <- mean(x)
  sum((x - media)^2) / (n - 1)
}

variancia2(x2)
[1] 0.08316728
  • Use your formula, but subtract the mean from the vector first; this does not change the variance.
variancia(x2 - mean(x2))
[1] 0.08316727
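
The second suggestion works with any shift that lies close to the data, not only the mean. As a minimal sketch (the name variancia_shift is hypothetical and not part of the answer above), shifting by the first element avoids the extra call to mean(), assuming the first element is not an outlier far from the rest of the data:

```r
# Formula from the question, repeated so the example is self-contained.
variancia <- function(x) {
  n <- length(x)
  (1/(n^2-n))*(n*(sum(x^2))-(sum(x)^2))
}

# Hypothetical helper: shift by the first element instead of the mean.
# The variance is unchanged by the shift, but the sums stay small.
variancia_shift <- function(x) {
  variancia(x - x[1])
}

set.seed(1)
x2 <- runif(1000) + 10^12
variancia_shift(x2)   # positive, close to var(x2)
all.equal(variancia_shift(x2), var(x2), tolerance = 1e-6)
```

If the first element happens to be far from the bulk of the data, the cancellation can come back, so a shift near the mean remains the safer choice.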

4

Complementing Marcos Banik’s answer.

A floating point number of type double (64 bits) can be roughly summarized in 3 parts:

Double-precision floating point: sign (1 bit), exponent, i.e. order of magnitude (11 bits), and significand, i.e. precision (52 bits)

This format can represent magnitudes up to about 10^308, but with a precision of only about 16 decimal digits (details on how base R handles numbers are in the help page ?.Machine). In addition, irrational numbers and fractions whose denominator is not a power of 2 can only be approximated.
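
These limits can be checked directly in base R. The following is only an illustrative sketch; the comments describe the values expected on a standard IEEE 754 platform:

```r
.Machine$double.digits   # 53: significand bits, counting the implicit leading bit
.Machine$double.eps      # 2^-52, about 2.2e-16: relative spacing of doubles near 1
.Machine$double.xmax     # about 1.8e308: largest finite double

0.1 + 0.2 == 0.3         # FALSE: these decimals are not exactly representable
2^53 + 1 == 2^53         # TRUE: above 2^53 not every integer is representable
```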

So a very large number can be represented as a double, but only its leading digits are reliable. This can cause major problems with operations such as addition and subtraction. In the third example each squared term in variancia is of the order of (10^12)^2 = 10^24, and after summing 1000 of them and multiplying by n the two quantities being subtracted are of the order of 10^30, while we only have 52 bits for the significant digits (anything beyond that is lost to rounding). When we subtract one from the other, the "good digits" cancel and only the "bad digits" remain, which produces the absurd result.
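
To make the loss of the "good digits" concrete, the sketch below compares the size of the two terms with the approximate spacing between neighbouring doubles at that magnitude. The variable names a, b, gap and needed are illustrative only:

```r
set.seed(1)
x2 <- runif(1000) + 10^12
n <- length(x2)

a <- n * sum(x2^2)   # first term of the formula, about 1e30
b <- sum(x2)^2       # second term of the formula, about 1e30

# Rough spacing between adjacent doubles near 1e30.
gap <- .Machine$double.eps * 1e30   # about 2.2e14

# The difference the formula needs to resolve: variance * (n^2 - n).
needed <- var(x2) * (n^2 - n)       # about 8.3e4

c(a = a, b = b, gap = gap, needed = needed)
needed < gap   # TRUE: the true difference is far below the resolution of a and b
```

Since a and b can only differ by whole steps of that spacing, a - b comes out as zero or as a multiple of roughly 10^14. The value -140878367 in the question is consistent with a single step of -2^47 divided by n^2 - n = 999000.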

One way to solve the problem is to look for numerically more stable algorithms, such as those proposed by Marcos. But assuming that is not possible, you can use arbitrary-precision numbers.

In R, the package Rmpfr (Multiple Precision Floating-Point Reliable) provides numbers with arbitrary precision, at the cost of more memory and running time, so depending on your computer and the problem it is not always feasible.

So, if it were not possible to improve the algorithm used in variancia, we could use Rmpfr. We would need more than 30 digits of precision, which means more than log2(10^30) ≈ 99.66 bits. Rounding up to 128 bits:

library(Rmpfr)
x2.mpfr <- mpfr(x2, 128)
variancia(x2.mpfr)
1 'mpfr' number of precision  128   bits 
[1] 0.0831672741323709434253943823576340475867
