ks.test and p-value < 2.2e-16

I am trying to compare two distributions, but when I apply ks.test to both, I only get the value of D, and the p-value is reported as the same value for both: '< 2.2e-16'. I tried removing the zero values to see what would happen, and ks.test then reported the values properly. But unfortunately, for this analysis, I have to keep the values equal to zero.

Has anyone ever had this problem, or any idea how to proceed? I need an actual p-value in order to accept or reject the null hypothesis.


My data set is large, so I had not posted it before. Here it is:

d<-c(4.1,3.7,11.1,15.0,5.1,12.3,0.1,0.2,0.0,0.4,0.0,23.2,0.0,0.0,13.2,0.0,0.0,0.0,0.0,18.6,3.3,0.2,4.2,0.1,0.0,0.7,11.6,1.0,28.9,0.0,0.0,0.0,2.3,10.5,9.7,1.7,0.0,0.5,0.0,1.9,16.7,26.4,9.2,1.2,1.4,9.0,35.3,8.6,0.6,0.0,0.0,0.1,0.5,2.9,27.2,0.0,0.0,0.0,0.0,15.4,0.0,0.0,5.3,1.3,2.1,0.3,22.1,0.0,0.0,5.7,4.2,68.5,1.7,8.7,0.0,9.6,0.0,15.6,0.0,1.9,14.8,0.1,2.4,0.0,0.0,1.1,22.0,1.8,39.4,0.0,0.1,29.5,14.0,0.0,4.5,0.0,37.2,0.0,0.0,21.6,0.0,21.6,1.3,24.5,1.9,1.8,14.1,12.1,0.0,0.1,0.0,0.0,0.2,15.4,1.2,0.4,0.0,0.0,0.0,0.0,0.1,18.9,0.2,0.7,0.8,0.6,17.2,0.0,0.0,0.1,0.1,0.0,0.0,0.1,0.0,0.7,21.2,35.7,0.0,0.0,.8,1.7,10.4,0.0,4.9,0.0,0.9,0.6,6.2,2.2,0.0,0.7,7.6,0.1,1.8,29.4,5.4,0.0,0.0,0.0,0.1,34.4,0.6,11.2,0.0,0.6,1.7,0.3,0.0,8.4,2.6,0.2,27.6,2.6,0.4,0.0,18.5,0.0,25.5,0.9,0.0,0.0,0.2,0.1,0.1,0.0,1.1,0.0,0.0,0.0,0.0,0.1,0.3,0.0,0.0,1.1,0.0,0.9,0.8,1.2,2.6,0.0,6.6,0.0,0.8,15.1,2.6,2.1,4.0,2.2,0.0,15.5,15.0,0.1,1.9,12.8,31.6,0.0,0.0,0.0,25.9,0.0,0.0,1.3,0.0,0.3,0.0,0.0,0.1,0.0,0.1,10.9,1.3,0.0,0.0,1.8,4.4,0.0,2.1,20.2,0.0,12.5,0.1,0.0,0.7,0.0,4.0,46.8,27.1,0.0,0.0,0.0,16.9,0.0,23.7,29.8,0.0,0.0,5.5,0.0,23.8,0.0,0.1,4.4,0.1,43.2,15.4,9.5,0.9,0.0,1.2,7.0,15.9,0.0,9.9,3.5,12.0,0.0,0.5,0.0,0.1,1.1,2.6,0.1,0.0,0.0,0.0,0.0,1.4,18.4,4.5,5.2,4.1,4.3,0.0,3.5,0.0,0.0,0.2,0.0,0.0,2.2,0.0,0.7,0.0,0.0,0.0,14.5,3.1,0.0,0.0,0.1,5.7,0.5,0.1,0.2,0.0,0.0,6.8,0.0,0.2,18.3,0.0,0.2,0.0,0.0,2.5,40.9,4.4,0.0,0.0,0.8,1.0,4.5,0.1,0.0,0.0,0.0,0.0,0.0,0.3,0.4,11.9,0.0,0.0,0.6,12.2,0.0,0.0,0.3,9.3,9.3,1.6,6.1,0.0,19.0,0.0,0.0,0.0,1.4,0.0,0.1,0.0,8.2,5.3,0.0,0.0,3.4,0.0,0.0,0.0,24.1,0.2,15.7,0.0,0.0,12.1,4.1,5.8,13.2,1.0,64.2,0.0,0.5,10.6,0.0,7.0,4.3,0.0,0.0,16.7,29.8,49.3,57.8,4.3,1.2,0.0,0.0,0.0,0.0,6.8,10.6,3.7,2.2,0.0,0.1,5.1,0.0,0.0,1.0,4.3,0.0,43.5,5.6,0.0,7.7,0.0,0.0,18.7,0.3,0.2,0.4,0.0,0.0,23.0,0.0,0.0,0.2,9.5,0.0,5.1,6.4,0.0,28.0,0.0,0.0,3.2,0.0,0.5,1.2,2.3,42.3,0.0,0.0,1.8,0.0,0.2,5.8,30.8,3.1,2.7)

The line of reasoning was as follows:

n <- length(d[!is.na(d)])            # sample size, ignoring NAs
media <- mean(d)                     # mean
desvio <- sd(d)                      # standard deviation
vetor <- as.vector(d[!is.na(d)])     # data without NAs
variancia <- var(vetor)*(n - 1)/n    # population variance
alfa <- media^2/variancia            # method-of-moments gamma shape
beta <- variancia/media              # method-of-moments gamma scale

ks.test(vetor,"pgamma",shape=alfa, scale=beta)
D = 0.3792, p-value < 2.2e-16
alternative hypothesis: two-sided

Compared to a normal:

ks.test(vetor,"pnorm",mean=media, sd=desvio)

D = 0.3002, p-value < 2.2e-16
alternative hypothesis: two-sided

I ran the test because I wanted to compare my data against both distributions, Gamma and Normal, so that in the end I could compare the two p-values and see which one fits my data best. But both keep reporting the p-value as '< 2.2e-16'.

  • Welcome to SOpt. Take the Tour to better understand how the site works.

  • Iara, the Kolmogorov-Smirnov test (ks.test) checks whether a sample follows a given continuous probability distribution, or whether two samples follow the same continuous distribution. It is a non-parametric test in which, to reject or not reject the null hypothesis, we compare the value of the statistic D with critical values from a table that depend on the sample size and the significance level (neither of which you reported). Although your question doesn't seem to be about the R language, if you provide us with a piece of your data, we may be able to help you better.

  • The results you obtained are not equal; both are merely smaller than 2.2e-16. To see why R displays them this way, see the help page help(".Machine") and print the value of .Machine$double.eps. A sketch of how to retrieve the exact number is shown below.
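A minimal sketch of how to see the exact number (this is generic R behavior, not specific to these data): the p-value is stored in the returned htest object, and print() merely formats anything below the display threshold as '< 2.2e-16'.

res <- ks.test(rnorm(100), "pnorm", mean = 5)  # toy example; any test object will do
res$p.value                 # the exact numeric value, possibly 0
format.pval(res$p.value)    # how print() would display it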

2 answers



What I write here probably won't completely answer the question, but the comment space is too small for what I have to say.

It does not seem right to hypothesize that these data are normal. See the histogram:

[Histogram of d]
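The plotting code is not in the original answer, but something along these lines reproduces the figure (the number of breaks is my choice):

hist(d, breaks = 30, prob = TRUE)  # tall bar at zero, long right tail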

And this is exactly what the Kolmogorov-Smirnov test is telling you. By testing the hypotheses

H_0: d is gamma
H_1: d is not gamma

and

H_0: d is normal
H_1: d is not normal

you reject both null hypotheses. That is, your data are neither gamma with the given alfa and beta, nor normal with the given mean and standard deviation. So there is nothing wrong here.

The problem now is to find out what the distribution of your data is. Note that the zero bar in the histogram is very tall. Noticing that, I ran

table(d > 0)
FALSE  TRUE 
  171   280

which counts how many zeros and how many non-zeros there are in the data set: 171 zeros and 280 non-zero values. This looks like a mixture of distributions, where one distribution is responsible for the positive measurements and another only for the zeros; a sketch of that idea follows.
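One way to explore the mixture idea (a sketch of mine, not part of the original answer; the object names alfa_pos and beta_pos are illustrative) is to treat the zeros as a point mass and fit a distribution to the positive part only:

p0  <- mean(d == 0)   # estimated probability of an exact zero (about 0.38)
pos <- d[d > 0]       # the 280 positive observations

# Method-of-moments gamma parameters for the positive part alone
alfa_pos <- mean(pos)^2 / var(pos)
beta_pos <- var(pos) / mean(pos)
ks.test(pos, "pgamma", shape = alfa_pos, scale = beta_pos)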

Another idea we can test is to fit some distribution to the data with the package fitdistrplus:

library(fitdistrplus)
fitdist(d, "gamma")
<simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, 
  obs = data,     gr = gradient, ddistnam = ddistname, hessian = TRUE, 
  method = meth,     lower = lower, upper = upper, ...): function 
  cannot be evaluated at initial parameters>
Error in fitdist(d, "gamma") : 
  the function mle failed to estimate the parameters, 
            with the error code 100

Note that not even this package can find suitable parameters for a gamma to fit these data.
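Presumably the zeros are to blame: the gamma log-density cannot be evaluated at zero, so optim() receives a non-finite objective at the starting values. A quick check:

dgamma(0, shape = 2, rate = 1)    # 0   -> log-density is -Inf
dgamma(0, shape = 0.5, rate = 1)  # Inf -> log-density is +Inf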

However, we can try an exponential:

fitdist(d, "exp")
Fitting of the distribution ' exp ' by maximum likelihood 
Parameters:
      estimate  Std. Error
rate 0.1867882 0.008795259

Now things get more interesting: at least the estimate of the exponential's parameter converged. However, when we plot the exponential density on top of the histogram, the result is not great:

[Histogram of d with the fitted exponential density overlaid]
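The answer does not show the plotting code, but a sketch like this reproduces the overlay (breaks is my choice):

hist(d, prob = TRUE, breaks = 30)
curve(dexp(x, rate = 0.1867882), from = 0, to = 70,
      add = TRUE, col = "red")   # fitted exponential density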

This is confirmed when we run the Kolmogorov-Smirnov test against this exponential: again we reject H_0:

ks.test(d, "pexp", 0.1867882)

    One-sample Kolmogorov-Smirnov test

data:  d
D = 0.43562, p-value < 2.2e-16
alternative hypothesis: two-sided

Warning message:
In ks.test(d, "pexp", 0.1867882) :
  ties should not be present for the Kolmogorov-Smirnov test

That is, these data also do not follow an exponential distribution with parameter 0.1867882.

So you have three options here:

1) Keep trying asymmetric distributions with the package fitdistrplus. If the estimation works, run the Kolmogorov-Smirnov test to confirm that the data do, in fact, follow the distribution found.

2) Ask yourself why you have so many zeros in your data set. 171 out of 451 observations (38%) equal to zero is not generally expected. Where did these data come from? Is this amount of zeros actually expected for this collection? Could the equipment or the person who collected the data have done something wrong?

3) Work with mixture distributions, which is a slightly more complicated area; see the sketch below.
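As a minimal sketch of option 3 (my own illustration, assuming a zero-inflated gamma; pzigamma and the other names are hypothetical), the mixture CDF puts mass p0 at zero and spreads the rest with a gamma. Because of the point mass the distribution is not continuous, so ks.test is not appropriate here; the fit is best judged graphically:

p0  <- mean(d == 0)                  # weight of the point mass at zero
pos <- d[d > 0]
sh  <- mean(pos)^2 / var(pos)        # method-of-moments shape for the positive part
sc  <- var(pos) / mean(pos)          # method-of-moments scale

# CDF of the zero-inflated gamma mixture
pzigamma <- function(q) {
  ifelse(q < 0, 0, p0 + (1 - p0) * pgamma(q, shape = sh, scale = sc))
}

plot(ecdf(d))                        # empirical CDF
curve(pzigamma(x), from = 0, to = 70, add = TRUE, col = "red")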

  • Thank you very much, Marcus Nunes. The strange part was exactly that: rejecting the hypotheses for both distributions. I really hadn't thought about mixture distributions. And yes, I did suspect the problem was the amount of zeros in the sample, which would keep the test from 'working', at least for one of them. But again, thank you so much!

  • It's great to know that my answer helped you in some way. Please consider upvoting and accepting the answer, so that in the future other people who run into the same problem have a reference for solving it.


First of all, I think you're getting off on the wrong foot. You should not decide at the outset that you will compare the distribution of the data with this or that parametric distribution.

You should begin by looking at the data. Start with the basic descriptive statistics given by the function summary.

summary(d)
# Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#0.000   0.000   0.500   5.354   5.350  68.500

This shows an asymmetric distribution: note that the minimum and the first quartile are equal. Another indication is the difference between the mean and the median. Yet another is that the mean, a statistic very sensitive to extreme values (outliers), is above the 3rd quartile.
We can also see that there are no NA values, but since you seem concerned about them, so much so that you created the vector vetor from d by removing any missing values, here is a way to check whether there are any and how many:

sum(is.na(d))
#[1] 0

And to see the distribution there is the always useful histogram.

hist(d, prob = TRUE)    # look at the data

These data are certainly not Gaussian.
So let's move on to the gamma distribution. Your parameter calculation was wrong; the right way is this.

media <- mean(d)
variancia <- var(d)
alfa <- media^2/variancia
beta <- media*variancia

ks.test(d, "pgamma", shape = alfa, scale = beta)
#
#   One-sample Kolmogorov-Smirnov test
#
#data:  d
#D = 0.47659, p-value < 2.2e-16
#alternative hypothesis: two-sided
#
#Warning message:
#In ks.test(d, "pgamma", shape = alfa, scale = beta) :
#  ties should not be present for the Kolmogorov-Smirnov test

As for what repeated values mean and how they affect the Kolmogorov-Smirnov test, see Cross Validated and the manual of the function ks.test:

The presence of ties always generates a warning, since continuous distributions do not generate them. If the ties arose from rounding the tests may be approximately valid, but even modest amounts of rounding can have a significant effect on the calculated statistic.
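A common workaround (my addition, not from the manual, and only defensible when the ties come from rounding) is to break them with jitter. Since d is recorded to one decimal place, half the rounding step is 0.05:

set.seed(1)  # jitter is random, so fix the seed for reproducibility
ks.test(jitter(d, amount = 0.05), "pgamma", shape = alfa, scale = beta)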

It is also possible, and more natural, to use the function fitdistr from the package MASS to estimate the parameter values. Since the data contain many zeros and this function refuses to fit a gamma when the data contain zeros, I will add a very small value to each zero.

vetor <- d
inx <- vetor == 0                                   # positions of the zeros
vetor[inx] <- vetor[inx] + .Machine$double.eps^0.5  # shift zeros by a tiny positive amount

params <- MASS::fitdistr(vetor, "gamma")

Now the Kolmogorov-Smirnov test.

sh <- params$estimate["shape"]
ra <- params$estimate["rate"]

ks.test(vetor, "pgamma", shape = sh, rate = ra)
#
#   One-sample Kolmogorov-Smirnov test
#
#data:  vetor
#D = 0.26847, p-value < 2.2e-16
#alternative hypothesis: two-sided
#
#Warning message:
#In ks.test(vetor, "pgamma", shape = sh, rate = ra) :
#  ties should not be present for the Kolmogorov-Smirnov test

Finally, the histogram with the density curves for the parameters calculated above.

hist(vetor, prob = TRUE)
curve(dgamma(x, shape = alfa, scale = beta), 
      from = 0, to = 70, add = TRUE, col = "blue")
curve(dgamma(x, shape = sh, rate = ra), 
      from = 0, to = 70, add = TRUE, col = "red")

[Histogram of vetor with the moment-based gamma density in blue and the maximum-likelihood gamma density in red]

I believe it will be very difficult to find a parametric distribution that fits these data; you should look for models that handle this many zeros. Although the plots are not bad, both tests rejected the null hypothesis.

  • Thank you Rui Barradas.
