What is a rank-deficient fit and how do I get around it?

I ran a linear regression with lm(), in which I declared some variables as factors, and some of the betas came back as NA, for example:

citySão José
NA

When I ran the prediction, it did produce values, but I received the following warning:

Warning message:
In predict.lm(modeloAIC, matriz_de_estimação) :
  prediction from a rank-deficient fit may be misleading

I am left wondering how to get around this and how predictions were made for the observations with the factor level São José.

1 answer

The general linear regression formula is given by

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip} + \epsilon_i$$

It can be represented in a matrix form through the relation

$$Y = X\beta + \epsilon$$

where $Y$ and $\epsilon$ are vectors of $n$ elements and $X$ is a matrix given by

$$X = \begin{pmatrix} 1 & X_{11} & X_{12} & \cdots & X_{1p} \\ 1 & X_{21} & X_{22} & \cdots & X_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n1} & X_{n2} & \cdots & X_{np} \end{pmatrix}$$

The least squares estimator of the $\beta$ parameters can be obtained from the expression

$$\hat{\beta} = (X'X)^{-1}X'Y$$

where $X'$ is the transpose of $X$ and $(X'X)^{-1}$ is the inverse of $X'X$.
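
As a minimal sketch (the use of mtcars and these column choices are my own, for illustration), the expression can be verified directly in R: building the design matrix by hand and applying the formula reproduces the coefficients that lm() returns.

X <- cbind(1, mtcars$wt, mtcars$hp)  # design matrix with an intercept column
y <- mtcars$mpg

# Least squares estimate: (X'X)^(-1) X'Y
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y

beta_hat
coef(lm(mpg ~ wt + hp, data = mtcars))  # same values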

For the inverse $(X'X)^{-1}$ to exist, $X'X$ must be a full-rank matrix. $X'X$ has full rank if, and only if, its columns are not linear combinations of one another. In that case, the determinant of the matrix is non-zero and the matrix is invertible.

When the columns of a matrix are linear combinations of one another, we say that the matrix is rank deficient. The problem is that such matrices are not invertible. Therefore, it is not possible to estimate the regression parameters with the formula shown above, because $(X'X)^{-1}$ does not exist.
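
A minimal sketch of what goes wrong numerically (the matrix below is made up for illustration): with a column that is an exact multiple of another, $X'X$ becomes singular and solve() fails.

X <- cbind(1, c(1, 2, 3, 4), 2 * c(1, 2, 3, 4))  # third column = 2 * second

qr(X)$rank        # 2, less than ncol(X) = 3: rank deficient
det(t(X) %*% X)   # zero (up to floating point), so no inverse
solve(t(X) %*% X) # error: the system is computationally singular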

It is impossible to prescribe a solution to a regression problem with a rank-deficient matrix without looking at the data. However, there are some things that can be tried:

1) One of the predictor variables is a linear combination of the others. That is, some variable in your model is redundant. Search for multicollinearity in regression and for how to remove variables from your model. In particular, see what the Variance Inflation Factor means (a sketch of it follows the example below).

The example below, constructed specifically to be rank-deficient, shows behavior similar to your problem: of the two variables, one is exactly twice the other, so they are linearly dependent.

ajuste <- lm(mpg ~ wt + I(2*wt), data=mtcars)
predict(ajuste, mtcars)

Warning message:
In predict.lm(ajuste, mtcars) :
  prediction from a rank-deficient fit may be misleading
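
To detect less obvious collinearity, the Variance Inflation Factor mentioned above can be computed. A sketch, assuming the car package is installed (this model is my own illustration, not from the question):

library(car)

# vif() refuses fits with exactly aliased coefficients like the one above,
# so inspect a model without the exact duplicate
ajuste2 <- lm(mpg ~ wt + hp + disp, data = mtcars)
vif(ajuste2)  # values near 1 indicate little collinearity; above ~10 is a common red flag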

2) Perhaps the sample is not large enough for the model to be fitted. At least two points are required to define a straight line. However, if I supply only one point, with a single x and y coordinate, R will fit a linear model to it without complaining:

x <- 1
y <- 3

ajuste <- lm(y ~ x)

predict(ajuste, data.frame(x=1.5))

The warning only appears at prediction time. It may therefore be that your model has too many parameters for the sample size. See the following case, where there are two predictor variables:

x <- c(1, 2)
y <- c(3, 1)
z <- c(5, 0)

ajuste <- lm(z ~ x + y)

predict(ajuste, data.frame(x=1.5, y=2.5))

This fit is also rank-deficient because there is too little data. See how the problem goes away when I increase the sample size:

x <- c(1, 2, 1)
y <- c(3, 1, 1)
z <- c(5, 0, 1)

ajuste <- lm(z ~ x + y)

predict(ajuste, data.frame(x=1.5, y=2.5))

The general rule is to have at least as many data points as parameters to be estimated in the model. This ensures that the matrix will not be rank-deficient. Even so, it is not ideal, as other problems may occur. Run the command below and note that it was not possible to construct hypothesis tests for the parameters, even though the matrix is not rank-deficient: with three observations and three parameters the fit is exact, leaving zero residual degrees of freedom, so standard errors and t statistics cannot be computed.

summary(ajuste)

And if the predictor variables are categorical, there is an additional complication, because the dimension of the matrix $X'X$ grows with the number of levels: each level beyond the reference adds a dummy column to $X$. The rule I stated above is only valid if the predictor variables are quantitative.
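
A coefficient returned as NA for a factor level, as in the question, typically arises this way. A minimal sketch (the data are made up for illustration), in which the dummy column of one factor is an exact sum of dummies of another:

d <- data.frame(
  y = c(1, 2, 3, 4),
  f = factor(c("a", "a", "b", "c")),
  g = factor(c("x", "x", "y", "y"))
)

# The dummy for g == "y" equals the sum of the dummies for f == "b" and
# f == "c", so one coefficient is aliased and reported as NA
ajuste <- lm(y ~ f + g, data = d)
coef(ajuste)

predict(ajuste, d)  # works, with the rank-deficient warning

When predicting, R simply ignores the aliased column, which is why numeric predictions are returned instead of NA.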

In short:

  1. Simplify your model; or

  2. Collect more data; or

  3. Read a good book on multiple linear regression

  • Thankfully, my problem was that some categories of one of my variables had only one sample. I found it interesting that R returned a real number for a prediction in which one of the betas was NA; I expected it to return NA for that particular prediction as well.

  • Ah, so it was a sample size problem. Not the total sample size, but the size within some categories.
