How to do Multiple Linear Regression?

Asked

Viewed 3,369 times

4

-- Rephrased question --

I have the following code

mdl = LinearModel.fit(X_train,y_train);
pred = predict(mdl,X_test);

in which X_train is a matrix [m x n] with m examples and n Features (past time series values). y_train is a column vector with the response (values at the next instant). Note that my team series is univariate type, that is, I only have the record of a variable over time.

Issues:

  1. Is this correct? By Luiz’s answer, it is.
  2. The y_train can it contain the value of the P signal steps forward? Ex: f(t+p)
  3. Can you explain to me the output of the model?
  4. The linear function is approximated using the method of least squares, right?
  5. There is some reason for this method to give better results when compared to the SVR algorithm with both the linear kernel and RBF?

EDIT1:

clc; close all;
clear all;

T = 20;


% Pode inicial aqui fazendo RV = SuaVariavel
RV = 1:T;
RV = RV';

%plot(RV)

%# Construindo as variaveis com 3 defasagens

N = 3;
first = 1;
for i=N:-1:1
    if first == 1
        X = RV(i:end+(i-1)-N);
        first = 0;
    else
        X = horzcat(X,RV(i:end+(i-1)-N) );
    end
end

HP = 1; % <-----------------------------------------------------

X = horzcat(ones(length(X),1),X);
y = RV(N+HP:end,end);

lim = length(X);

X_train = X(round(1:lim*0.7),:);
y_train = y(round(1:lim*0.7));

X_test = X(round(lim*0.7+0.5):end-(HP-1),:);
y_test = y(round(lim*0.7+0.5):end);


%# MQO
mdl = regress(y_train, X_train);
mdl


% Prever
for i=1:length(X_test)

    pred(i)=sum(mdl'.*X_test(i,:)); 

end


figure()
hold on;
plot(pred,'r*-')
plot(y_test,'go-')
hold off

EDIT 2: In this link is an image with the autocorrelation graph (top) and the PACF (bottom). My conclusions are that it proves that an AR model works because the value of self-correction slowly decreases as the lag value increases. In addition, the maximum lag that can be considered is 6. What happens if we consider more? Simply the forecasts in the future will not be so certain (may contain more noise due to the inclusion of more moments past) ?

  • 1

    Hello, welcome(a) to SOPT. Not to sound crass, but have you ever looked the documentation of Matlab? There are a number of examples. Your question is more likely to be answered if you make it more focused and objective. Try to post what you already tried (including the code of a minimum example, if you have it) and focus on your specific difficulty, particularly including details of the problem domain you want to solve.

  • Hello. I’m still reading some things but since to be honest I don’t understand any regression yet I haven’t been able to write any code so any help will be welcome.

  • I understand. But the point is that SOPT is not a forum, but a question and answer site lenses (has already done the [tour]? ). If you have difficulties with mathematics, I suggest studying it before starting learning with Matlab. Take a look at the examples, try to solve something simple and increase the complexity. And when you have any specific questions (example: what does such a command do? why is the answer to that code? etc.), come here and post the question (you can post more than one question). As the question stands today, it is too broad to have objective answers.

  • Ah, it’s important to remember that Matlab is within the scope of the site, but math alone is not. SOPT is about programming. If you’re easy with English and want to post a question just about math, I suggest another site from the Stackexchange group, the Mathematics.

  • Well ... in research on this subject I found the Linearmodel.fit(X,y) method that creates a linear model, right? My X is a matrix [m x n] where m is the number of examples and n is the number of values passed (t-n),(t-n+1),...,(t) and y is a column vector [m x 1]. Can I do this? mdl = Linearmodel.fit(X_train,y_train); y_pred = Predict(mdl,X_test); One more question: the values I pass on y_train are those of the next instant (t+1). Can I spend other moments, for example (t+4) and when I do the Predict I get the forecast 4 times in the future? I hope you can understand me.

    1. Yes you can. 2) The values of y_train MUST be the actual classes/results of the training data, and not what you want to predict 3) Can, provided you have the vector of the example of that moment in the future. -- That’s why I said, edit your question to make it more specific about your question and provide examples of your data/code. Readings that can be useful: reading 1 and reading 2.
  • 1

    @Luizvieira, thanks for your help. I already edited the question and added some more questions :)

  • 1

    What it is trying to do is an autoregressive model, where lags of the variable itself are used as explanations of future values. I don’t have Matlab installed on this pc, but look in the manual http://www.mathworks.com/help/econ/autoregressive-model.html?searchHighlight=Autoregressive%20Modeling

  • @Robert, I was already seeing this method however it is a little difficult to implement because of the parameters. According to the documentation m = ar(y,na). y are the values of the time series and na is how many moments past (say N ) we want to consider ( t-N, t-N+1, ..., t ) right? How can I make predictions with this model? I’ll keep looking but if I may give an example I thank you :)

  • HP has to be used also to update the matrix X, if not, by changing this value to 2 (or 3) will be mistakenly relating the lags, for example, y4 with with Y2 and below, disregarding Y3.

  • @Robert That would be the idea ... have for example the following case [1 3 2 1] in which (3,2,1) are the past instants and the output be 5 or 6 or even more. In other words it would be directly predicting 2 instants (in case the output is equal to 5) directly in the future instead of predicting the 4 and only then the 5. What I want to do is multi-step Ahead Prediction (HP >=2) but also experience the one-step Ahead Prediction (HP = 1). What do you think? It’s really not possible?

  • To determine the order of lags look at the autocorrelation function. If there is no autocorrelation your prediction may be very poor.

  • @Robert, please see Edit 2. However, I still have the same question I had when I wrote the previous comment. Can you explain it better?

Show 8 more comments

2 answers

2

Since you did not provide sample data from your problem domain (although I requested it twice), I had to use some publicly available data source on the Internet. I chose to use data from the rabbit population in the Chihuahuan Desert (on the border between the US and Mexico) between 1989 and 1994. Such data are publicly available here, and your documentation can be accessed here.

Observing: This data source actually contains the values populations of different animals (birds, rabbits, lizards, etc) collected on different days. To build the example I had to import the DAT file in Excel and filter only rabbit data from it. I did also the sum of the counts per year, to have as a result the following table:

Ano   Número de Coelhos
1989  32
1990  98
1991  90
1992  91
1993  105
1994  134

So I’ve prepared the following Matlab code example to illustrate the explanations.

% Exemplo de dados de treinamento
x = [1989 1990 1991 1992 1993 1994] % Ano de medição
y = [32 98 90 91 105 134]           % Número de coelhos no ano

% Plota o gráfico de dispersão (em uma janela com 80% do tamanho da tela)
screen = get(groot,'ScreenSize');
w = 0.8 * screen(3)
h = 0.8 * screen(4)
figure('OuterPosition', [screen(3)/2-w/2 screen(4)/2-h/2 w h], 'Name', 'Gráfico Ilustrativo - SOPT', 'NumberTitle', 'off')

hold on
scatter(x, y, 240, 'k', 's', 'filled')
axis([1988 1996 0 180])

title('\fontsize{25}Crescimento Populacional de Coelhos no Deserto de Chihuahuan (EUA/México)')
xlabel('\fontsize{16}Ano')
ylabel('\fontsize{16}Número de Coelhos')

% Adiciona ao gráfico as curvas de tendência linear (vermelho) e quadrática (verde)
fit_linear = polyfit(x, y, 1);
x2 = 1989:1994
y2 = polyval(fit_linear, x2)

fit_quad = polyfit(x(2:end), y(2:end), 2); % NOTA: Primeira medida intencionalmente ignorada (possível outlier?)
x3 = 1990:1994                             % Aqui também!
y3 = polyval(fit_quad, x3)

plot(x2, y2, 'b--o', 'MarkerFaceColor', 'b')
plot(x3, y3, 'r--o', 'MarkerFaceColor', 'r')

% Faz previsões para o ano 1995 e plota no gráfico
ano = 1995
prev_linear = polyval(fit_linear, ano)
prev_quad = polyval(fit_quad, ano)

plot(ano, prev_linear, 'b*', 'MarkerSize', 22)
plot(ano, prev_quad, 'r+', 'MarkerSize', 22)

legend('Dados Originais', 'Tendência Linear', 'Tendência Quadrática (ignorando primeira medição/possível outlier)', 'Previsão (Linear) para 2000', 'Previsão (Quadrática) para 2000', 'Location', 'northwest')

This code generates the following graph:

inserir a descrição da imagem aqui

Note in the code the following excerpt:

fit_linear = polyfit(x, y, 1);
x2 = 1989:1994
y2 = polyval(fit_linear, x2)

fit_quad = polyfit(x(2:end), y(2:end), 2); % NOTA: Primeira medida intencionalmente ignorada (possível outlier?)
x3 = 1990:1994                             % Aqui também!
y3 = polyval(fit_quad, x3)

What occurs there are precisely two regressions, one linear and one nonlinear (using a quadratic function as a basis). The function polyfit creates a model for training data (values of x and y) using a polynomial with the degree given in the third parameter (1 for linear, 2 for quadratic, 3 for cubic, etc). And the function polyval use that one model to estimate one or more new values. In this code snippet, the second parameter in the two calls (x2 or x3) are vectors, so he will estimate a result for each value of this vector, so also resulting in a vector (estimates). Further down I add in the graph an estimate for the year 1995, and you will see that the value passed is a scalar (a single value, rather than a vector). In this case, the function also returns a single scalar.

Such a "model" is nothing more than a function "extracted" from training data. Consider the example of linear regression. The idea is to try to "infer" a function that, given the variables that characterize the "problem" (in this illustrative example, only the year - value x), calculates the result based on this variable (in this example, the number of rabbits - value y). I won’t go into math detail because that is not the focus of SOPT, but one way to do this is as you mentioned: the factors a and b of the linear equation in the form ax+b are estimated from the quadratic error reduction between the original point (in the training data) and the point calculated by the function (the "estimated" value). Thus, the best equation is one in which the distance between the points (in the same graph - is only to compare the original points in black with the points of the function, in blue in linear and red in quadratic), accumulated, is the smallest possible.

If you test the above table data on an online linear regression site like this, for example, will realize that the result (such "model") is nothing more than a function. In this example, the resulting linear function is:

y = 15.2 x - 30179.133333333

inserir a descrição da imagem aqui

Where x is the year (the variable, characteristic or Feature an example in the problem area) and y is the number of rabbits (the desired estimated value).

Note that in the second nonlinear regression, the value of the first measurement (number of rabbits in the year 1989) was intentionally ignored because looking at the graph it seems to be a "point outside the curve" literally (a outlier). Maybe it’s a measurement error... or maybe it’s not. I don’t know why I don’t know in detail the domain of the problem and why the number of data (6 examples) is too small for a better decision. This is the kind of analysis that you need to do while developing your solution. Plotting the data on a graph and analyzing trends is a great way to do it.

This is very important because the choice completely changes the result of the estimation, as you will notice in the values of the chart estimated for the year 1995 (which vary greatly depending on the choice by the linear model or by the quadratic model ignoring the first example).

To close, there are other important details. First, this is merely an example illustrative. There is hardly any correlation between the year and the number of rabbits in the desert. It is natural that their number grows over time (and therefore the linear model may seem the most correct). But, sometimes the number of these animals has low due to food shortage (which may have occurred there between 1991 and 1992, justifying these data better match a quadratic model) and this is probably better explained by other variables. In fact, perhaps the problem is really linear with a more appropriate variable.

Also, to make my life easier (and to make it more didactic) I used only one variable in this model (the value of x as the year). Your problem may be more complex than this, and require more than one variable (i.e., need to be represented by an equation of the type g = x^2 + y^2 + z^2, with three variables x, y, and z). If you have a problem with two variables, your graph is already three-dimensional (because it will have an axis for each variable and one for the result), and the "curve" is no longer a line to be a plane. With more variables, this dimension grows and you can’t even visualize easily anymore (you’ll need to build visualizations for each pair of variables - as illustrated on Iris Dataset, on Wikipedia). But the principle of the thing is the same: only in the code its input vector (the x) will go from being a vector to being a matrix, as you mention yourself, with a column for each variable of interest.

The class you’re wearing (LinearModel) is more appropriate for multivariable regression, and that’s probably why you’re using it. The function fit makes model construction (i.e., "learn" function from data) and function predict uses this model to estimate a new result from new entries. That’s why you can’t confuse the data (y_train has the results for the data you have, and will use to build the model).

About using other algorithms, take some care. SVR is a regression built from the SVM algorithm (if you haven’t seen it yet, see this my other answer). Only this algorithm was originally built with the intention of serving as a classifier (that is, instead of predicting a value as a result of a function for the application of new example data, it indicates to which class of two possible belongs the example characterized by that data). As I explain there in the other answer, the SVR uses multiple tests to estimate, by means of probabilities, a regression value. In this case, it is more practical (and has better performance) if you use regression algorithms directly.

  • Thank you in advance for your very full reply. However, as I said my data is of a univariate time series, I have values with a fixed interval. I used the class LinearModel because I built my dataset so that each example had N past values and the current value. Ex: N=3 1 | 2 | 3 | 4 -> 5 (output value) X = [1,2,3,4] y = [5] Now the questions I haven’t figured out yet: 1. The value of y can be for example 6 (ie, would output 2 instants after the current and when we predict X we get a forecast for that instant) ?

  • The point is, in your example, you try to predict the population of rabbits according to the year, and my problem is that you predict the population of rabbits based on the history of the population of rabbits.

  • By univariate you mean you have only one variable? What is it? When I mentioned that you didn’t explain your domain it’s because you didn’t. What do these numbers represent? If I knew what they are, it might be easier to help. The value of y cannot be 6, because it is not the correct answer to the data [1, 2, 3, 4] (assuming that they are 4 different variables of an example/measure in time).

  • Anyway, you’re still confused and I’m not sure I understand what you need. But even so, we are entering a domain (time series) where I have very little experience, I don’t know if I can help anymore. But take a look at this article from the Matlab documentation, it may be useful to you: http://www.mathworks.com/help/econ/examples/time-series-regression-i-linear-models.html

  • P.S.: It occurred to me now that maybe you’re calling "Features" the values of the same variable over time. If so, it’s wrong. You have only one Feature, but with various values over time (as in my example). So, your x should not be a matrix, but a vector (again, as in my example) and it makes no sense to use the current value as y because it is not an "answer", but rather a value of its mass of data (at least in my understanding). But, anyway, again, I don’t know much about time series. I suggest trying to ask there in Math (which I suggested earlier).

  • You’re right about the "Features" part. Yes, it’s just one but measured over time. The data refer to the concentration of an acid (I only have this information) over time and I need to know at each new moment the concentration after N instants (in the future as is obvious). Why can’t I model this according to the past values of the same variable? Why y can’t be 6? Would be the exit in the instant t+2 assuming that the value 4 was obtained at the time t ... Anyway I will read the documentation you sent to see if I understand this better.

  • Because it’s not the same thing. In the example I gave, the number of rabbits (the y) is a value that is given as a function of another (the year, the x). So it doesn’t seem to me that you can use "the same modeling". But as I said, from there we are already entering a domain that I don’t know. I may be wrong. But for me, it doesn’t make much sense (especially using regression). Good luck. :)

Show 2 more comments

1

I’ll generate a random RV variable to show the procedure. To determine the number of lags you have to do some tests, you can search some text or here in the OS. For example this

%# Gerar uma variável aleatória autocorrelacionada em 3 períodos.
T = 1000;
%
RV = zeros(T, 1); RV(1)=10+randn(1); RV(2)=1.5+.8*RV(1)+randn(1);
RV(3)=1.5+.8*RV(1)-.3*RV(1)+randn(1);
for c = 4:T ; RV(c)=1.5+.8*RV(c-1)-.3*RV(c-2)+.25*RV(c-3)+randn(1); end

% Pode inicial aqui fazendo RV = SuaVariavel

plot(RV)

%# Construindo as variaveis com 3 defasagens
yout = RV(4:end); %# dependente
X1 = RV(3:end-1); %# defasagens da dependente (1er regressor)
X2 = RV(2:end-2);
X3 = RV(1:end-3);
%# matriz de regressores incluindo constante
X_train = [ones(length(X1), 1), X1, X2, X3];

%# MQO
mdl = regress(yout, X_train);
mdl

# Prever
X_test = [1, RV(T-1), RV(T-2), RV(T-3)];
%pred = predict(mdl,X_test);
sum(mdl'.*X_test) ; % Y em T+1
  • Thank you in advance for the example. I made a few minor modifications to the code and would like to know if the HP variable can assume values greater than 1 (see EDIT1). But I’d like to be sure.

Browser other questions tagged

You are not signed in. Login or sign up in order to post.