Warning "In sqrt(diag(object$vcov)): NaNs produced" in hurdle model

Hello!

I have a data set with which I want to analyze the influence of several predictor variables on a response variable. Because the response has many zeros (766 zeros out of 2830 sample units), I decided to use a hurdle model. In R, I ran these commands:

fórmula <- dados$BC ~ dados$z_primeiro_artigo +
 dados$z_capacidade_científica + dados$z_tamanho_corporal +
 z_reproduções_por_ano + dados$Red_List_Status +
 dados$Tipo_de_desenvolvimento | dados$z_capacidade_científica +
 dados$z_tamanho_corporal + z_reproduções_por_ano +
 dados$Red_List_Status + dados$Tipo_de_desenvolvimento

resultado <- hurdle(formula = fórmula, dist = "negbin", data = dados, na.action = "na.fail")
summary(resultado)

Call:
hurdle(formula = fórmula, data = dados, na.action = "na.fail", dist = "negbin")

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-1.1840 -0.6896 -0.2369  0.1864 16.3096 

Count model coefficients (truncated negbin with log link):
                                        Estimate Std. Error z value Pr(>|z|)    
(Intercept)                           89.2998674  0.1065855 837.824  < 2e-16 ***
dados$z_primeiro_artigo               -0.0475314         NA      NA       NA    
dados$z_capacidade_científica          0.0751863  0.0048415  15.530  < 2e-16 ***
dados$z_tamanho_corporal               0.0020403  0.0006407   3.185  0.00145 ** 
z_reproduções_por_ano                  0.1797664  0.0761702   2.360  0.01827 *  
dados$Red_List_StatusEN               -0.4140505  0.1725280  -2.400  0.01640 *  
dados$Red_List_StatusLC                0.2434877  0.1372437   1.774  0.07604 .  
dados$Red_List_StatusNT               -0.2326801  0.1856711  -1.253  0.21014    
dados$Red_List_StatusVU                0.0002679  0.1702307   0.002  0.99874    
dados$Tipo_de_desenvolvimentoLarval    0.4254052  0.0928358   4.582  4.6e-06 ***
dados$Tipo_de_desenvolvimentoVivípara  0.0109588  0.3846127   0.028  0.97727    
Log(theta)                            -1.1538934  0.1093832 -10.549  < 2e-16 ***
Zero hurdle model coefficients (binomial with logit link):
                                        Estimate Std. Error z value Pr(>|z|)    
(Intercept)                            1.3147054  0.2712539   4.847 1.25e-06 ***
dados$z_capacidade_científica          0.0682073  0.0100039   6.818 9.23e-12 ***
dados$z_tamanho_corporal               0.0015036  0.0008404   1.789   0.0736 .  
z_reproduções_por_ano                  0.3522174  0.2009335   1.753   0.0796 .  
dados$Red_List_StatusEN               -0.4264203  0.1776977  -2.400   0.0164 *  
dados$Red_List_StatusLC               -0.1618832  0.1555683  -1.041   0.2981    
dados$Red_List_StatusNT               -0.2458956  0.2064901  -1.191   0.2337    
dados$Red_List_StatusVU               -0.2674147  0.1880392  -1.422   0.1550    
dados$Tipo_de_desenvolvimentoLarval    0.0385487  0.0989498   0.390   0.6968    
dados$Tipo_de_desenvolvimentoVivípara  0.1392403  0.4588545   0.303   0.7615    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Theta: count = 0.3154
Number of iterations in BFGS optimization: 27 
Log-likelihood: -5853 on 22 Df
Warning message:
In sqrt(diag(object$vcov)): NaNs produced

Note that the standard error, z value and p-value of the variable z_primeiro_artigo appear as NA, and I do not understand the warning message at the end: "In sqrt(diag(object$vcov)): NaNs produced". Does anyone know how to help me?

1 answer

Generalized linear models don't do magic. It's no use having data, fitting a model to them, and expecting everything to just work. It is also very difficult (perhaps impossible) to give a definitive answer without working with the same data you are using. Still, I can raise some hypotheses about what may be happening.

0) Before you throw the data into a model, do an exploratory analysis. Plot the variables. Compute simple statistics such as means and standard deviations for the quantitative variables and frequency tables for the categorical ones. This will help you find better ways to solve problems that come up in your analysis.
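As a minimal sketch of that kind of exploratory pass: the data frame below is simulated and only stands in for dados. The column names are borrowed from the question, but the values and the "CR" level of Red_List_Status are made up for illustration.

```r
# Toy data standing in for `dados`; every value here is simulated.
set.seed(1)
n <- 200
toy <- data.frame(
  BC = rnbinom(n, size = 0.3, mu = 5),   # overdispersed count response
  z_tamanho_corporal = rnorm(n),         # a quantitative predictor
  Red_List_Status = factor(sample(c("CR", "EN", "LC", "NT", "VU"),
                                  n, replace = TRUE))
)

# Proportion of zeros in the response
zero_share <- mean(toy$BC == 0)

# Mean and standard deviation of each quantitative column
num_cols <- sapply(toy, is.numeric)
num_summary <- sapply(toy[num_cols], function(x) c(mean = mean(x), sd = sd(x)))

# Frequency table for the categorical column
status_counts <- table(toy$Red_List_Status)

print(zero_share)
print(num_summary)
print(status_counts)

# Histogram of the response (only drawn in an interactive session)
if (interactive()) hist(toy$BC, breaks = 30, main = "Response", xlab = "BC")
```

With the real data, the same calls would be applied to dados instead of toy.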

1) I counted 6 variables in the count part of the model and 5 in the zero part. Is that correct? Is there a reason to leave z_primeiro_artigo out of the zero part? Does the zero part need to be this complex? In any case, with 6 covariates it is possible that some pair of them is highly correlated. This creates a problem called multicollinearity. Read up on it and see how it can affect your regression.
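One way to look for multicollinearity is sketched below, using only base R. The predictors x1, x2 and x3 are simulated, and x2 is built as a near copy of x1 on purpose so the problem shows up. The variance inflation factor is computed by hand here as 1 / (1 - R^2), which is an illustration rather than what any particular package does internally.

```r
# Simulated predictors; x2 is deliberately almost identical to x1.
set.seed(2)
n <- 300
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)
x3 <- rnorm(n)
X <- data.frame(x1, x2, x3)

# Pairwise correlations; values near +1 or -1 flag suspicious pairs
cor_mat <- cor(X)
print(round(cor_mat, 2))

# Variance inflation factor of each predictor: regress it on all the
# other predictors and compute 1 / (1 - R^2)
vif <- sapply(names(X), function(v) {
  f <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})
print(round(vif, 1))
```

Large values for x1 and x2 next to a value close to 1 for x3 would point at the collinear pair. With the real data, X would hold the numeric covariates from dados.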

2) z_primeiro_artigo has a standard error of NA. This means the standard error of this estimate could not be computed. Check whether z_primeiro_artigo is constant; a covariate with no variation may be the reason this happens.
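A quick check for constant columns could look like the sketch below. The data frame and its column names (constant_col, varying_col) are hypothetical and only show the idea.

```r
# Toy columns: `constant_col` has no variation, `varying_col` does.
toy <- data.frame(
  constant_col = rep(0.5, 10),
  varying_col  = c(1.2, 0.4, -0.3, 2.1, 0.0, 0.9, -1.1, 0.7, 1.5, -0.6)
)

# A column is constant if it has at most one distinct non-missing value
is_constant <- sapply(toy, function(x) length(unique(x[!is.na(x)])) <= 1)
print(is_constant)

# With the real data the equivalent checks would be, for example:
# length(unique(dados$z_primeiro_artigo))
# sd(dados$z_primeiro_artigo)
```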

3) The warning In sqrt(diag(object$vcov)): NaNs produced means that some diagonal elements of the matrix object$vcov are negative, so their square roots are undefined. Check whether diag(resultado$vcov) contains negative numbers. If it does, the Hessian matrix of the model is not positive definite. One way to address this is to check whether your variables are on the same scale. For example, some covariates may be on the order of units and others on the order of hundreds. This very often causes problems when fitting models. Look into how to standardize the data. Just keep in mind that inferences made from transformed data are different from inferences made on the original data.
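The checks described in this item can be sketched as follows. The matrix V is a hand-made stand-in for resultado$vcov with one negative diagonal entry, and x_raw is an invented covariate; neither comes from the question's data.

```r
# Hypothetical symmetric matrix playing the role of resultado$vcov.
# The second diagonal entry is negative on purpose.
V <- matrix(c(0.010,  0.002, 0.001,
              0.002, -0.004, 0.000,
              0.001,  0.000, 0.020),
            nrow = 3, byrow = TRUE)
coef_names <- c("(Intercept)", "x1", "x2")   # made-up coefficient names
dimnames(V) <- list(coef_names, coef_names)

# Which estimated variances are negative?
d <- diag(V)
names(d) <- coef_names
neg_var <- names(d)[d < 0]
print(neg_var)

# A positive definite matrix has only positive eigenvalues
ev <- eigen(V, symmetric = TRUE, only.values = TRUE)$values
is_pd <- all(ev > 0)
print(is_pd)

# The square root of a negative variance is NaN, which triggers the warning
se <- suppressWarnings(sqrt(d))
print(se)

# Standardizing a covariate: subtract the mean, divide by the standard deviation
x_raw <- c(120, 340, 560, 210, 980)
x_std <- as.numeric(scale(x_raw))
print(round(x_std, 3))
```

With the fitted model from the question, the first checks would be run on resultado$vcov instead of V.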

As you can see, none of my answers is definitive. The problem is not simple, and it is impossible to give an accurate diagnosis without access to the data. Finally, I don't think sample size is the problem: 2830 is a very reasonable size for this case, with this number of covariates.
