The general linear regression formula is given by

$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i, \qquad i = 1, \dots, n.$$

It can be represented in matrix form through the relation

$$Y = X\beta + \varepsilon,$$

where $Y$ and $\varepsilon$ are vectors of $n$ elements and $X$ is the matrix given by

$$X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix}.$$

The least squares estimator of the $\beta$ parameters can be obtained from the formula

$$\hat{\beta} = (X'X)^{-1}X'Y,$$

where $X'$ is the transpose of $X$ and $(X'X)^{-1}$ is the inverse of $X'X$.
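As a minimal sketch of this formula in R (my own example; the mtcars data and the wt predictor are assumptions, not something from the question):

X <- cbind(1, mtcars$wt)                      # design matrix: intercept column plus wt
Y <- mtcars$mpg
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% Y  # (X'X)^(-1) X'Y
beta_hat
coef(lm(mpg ~ wt, data = mtcars))             # lm() reproduces the same estimates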
For the inverse $(X'X)^{-1}$ to exist, $X'X$ must be a full rank matrix. $X'X$ will have full rank if, and only if, none of its columns is a linear combination of the others. In that case, the determinant of the matrix is non-zero and it is invertible.
When the columns of a matrix are linear combinations of each other, we say that the matrix is rank deficient. The problem is that matrices like this are not invertible. Therefore, it is not possible to estimate the regression parameters according to the formula shown above, because $(X'X)^{-1}$ does not exist.
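A quick numerical check in R (again my own construction, reusing mtcars): duplicating a column of the design matrix drops its rank, and solve() then refuses to invert X'X.

X <- cbind(1, mtcars$wt, 2 * mtcars$wt)  # third column is exactly twice the second
qr(X)$rank                               # 2 instead of 3: rank deficient
try(solve(t(X) %*% X))                   # fails: X'X is singular, no inverse exists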
It is impossible to give a definitive solution to a regression problem with a rank-deficient matrix without looking at the data. However, there are some things that can be tried:
1) One of the predictor variables is a linear combination of the others. That is, some variable in your model is redundant. Look up multicollinearity in regression and how to remove variables from your model; in particular, see what the variance inflation factor (VIF) means. A way to spot the redundant term is sketched right after the example below.
The example below, created especially to be rank-deficient, shows behavior similar to that of your problem: of its two predictor variables, one is exactly twice the other, so they are a linear combination of each other.
ajuste <- lm(mpg ~ wt + I(2*wt), data=mtcars)
predict(ajuste, mtcars)
Warning message:
In predict.lm(ajuste, mtcars) :
prediction from a rank-deficient fit may be misleading
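To locate the redundant term in a fit like this one, base R's alias() can help (a small sketch; the exact output layout varies by R version):

alias(ajuste)   # the "Complete" entry shows that I(2 * wt) equals 2 times wt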
2) Perhaps the sample is not large enough for the model to be fitted. At least two points are required to define a straight line. However, if I supply only one point, with a single x coordinate and a single y coordinate, R will fit a linear model to it without complaining:
x <- 1
y <- 3
ajuste <- lm(y ~ x)
predict(ajuste, data.frame(x=1.5))
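Inspecting the coefficients shows why the prediction is unreliable:

coef(ajuste)   # the slope is NA: a single point cannot determine a line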
The warning only appears at prediction time. Therefore, it may be that your model has too many parameters for too small a sample. See the following case, where there are two predictor variables:
x <- c(1, 2)
y <- c(3, 1)
z <- c(5, 0)
ajuste <- lm(z ~ x + y)
predict(ajuste, data.frame(x=1.5, y=2.5))
This fit is also rank-deficient because there is too little data. See how the problem is solved when I increase the sample size:
x <- c(1, 2, 1)
y <- c(3, 1, 1)
z <- c(5, 0, 1)
ajuste <- lm(z ~ x + y)
predict(ajuste, data.frame(x=1.5, y=2.5))
The general rule is to have at least as many data points as parameters to be estimated in the model. This guarantees that the matrix will not be rank-deficient. Even so, this bare minimum is not ideal, as other problems may occur: with exactly as many points as parameters, there are no degrees of freedom left to estimate the residual variance. Run the command below and see that it was not possible to construct hypothesis tests for the parameters, even though the matrix is not rank-deficient.
summary(ajuste)
And if the predictor variables are categorical, there is another aggravating factor, because the dimension of the matrix $X'X$ grows with the number of levels. The rule I stated above is only valid if we consider that the predictor variables are quantitative.
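A hypothetical illustration of this aggravating factor (the data and names below are invented, not taken from the question): when one level of a factor appears in a single observation, the interaction slope for that level cannot be estimated, its coefficient comes out NA, and predict() still returns a number together with the rank-deficiency warning.

grupo <- factor(c("a", "a", "b", "b", "c"))  # level "c" has only one observation
x <- c(1, 2, 1, 2, 1)
y <- c(1.1, 2.0, 2.9, 4.1, 5.2)
ajuste <- lm(y ~ grupo * x)                  # 6 parameters for only 5 points
coef(ajuste)                                 # the grupoc:x coefficient is NA
predict(ajuste, data.frame(grupo = "c", x = 2))  # a number, plus the warning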
In short:
Simplify your model; or
Collect more data; or
Read a good book on multiple linear regression
Indeed, my problem was that some categories of one of my variables had only one observation. I found it curious that R returned a real number for a prediction in which one of the betas was NA; I expected it to return NA for that particular prediction as well.
– Márcio Mocellin
Ah, so it was a sample size problem. Not the total sample size, but within some of the categories.
– Marcus Nunes