How to use auto.arima to predict 24 periods or more in R?

I made a prediction using auto.arima, where my data consists of monthly values from Jan/2018 to Sep/2019.

My training set runs from Jan/2018 to Jun/2019:

VL_TR_treino_5S = window(VL_TR_TS_5S, start=c(2018,1), end=c(2019,6))
VL_TR_teste_5S = window(VL_TR_TS_5S, start=c(2019,6))

And to apply auto.arima, as an example I used A, B and C in the xreg:

VL_TR_modelo_5S = auto.arima(VL_TR_treino_5S, xreg = cbind(A,B,C), trace = T, stepwise = T, approximation = T, seasonal = T)

Then I ran the forecast with a horizon of 24 months:

VL_TR_Prev5S = forecast(VL_TR_modelo_5S, xreg = cbind(A,B,C), h = 24)

But when I view the data in VL_TR_Prev5S, instead of showing me 24 predicted values (which would go up to Dec/2020), it shows only 13 values, from Feb/2019 to Feb/2020.

print(VL_TR_Prev5S)

         Point Forecast   Lo 20   Hi 20
Feb 2019        7649351 7634063 7664639
Mar 2019        8260246 8244958 8275534
Apr 2019        8950091 8934803 8965380
May 2019        8657965 8642677 8673253
Jun 2019        8534740 8519451 8550028
Jul 2019        8349148 8333859 8364436
Aug 2019        7596208 7580920 7611496
Sep 2019        8515507 8500218 8530795
Oct 2019        8103160 8087871 8118448
Nov 2019        8143330 8128042 8158619
Dec 2019        7393488 7378199 7408776
Jan 2020        7007616 6992328 7022905
Feb 2020        6819635 6804346 6834923

When I run the auto.arima script, although the algorithm runs normally, R gives me the following warning:

Warning message:
The chosen seasonal unit root test encountered an error when testing for the first difference.
From stl(): series is not periodic or has less than two periods
0 seasonal differences will be used. Consider using a different unit root test. 

I do not know whether this warning has any bearing on the issue, but I mention it for completeness. Searching some forums, it seems that using covariates in xreg can limit the number of forecast periods, but I do not know why, nor how to avoid it. Anyway, how can I use auto.arima to forecast 24 periods or more?
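
For reference, a minimal, self-contained sketch of what those forums seem to describe (the series y and the regressors A, B and C below are invented placeholders, not my actual data): the number of forecast periods appears to follow the number of rows of the future xreg passed to forecast(), rather than h.

library(forecast)

# Invented example data: 36 monthly observations and three placeholder covariates
set.seed(1)
y      <- ts(rnorm(36, mean = 100), start = c(2016, 1), frequency = 12)
x_past <- cbind(A = rnorm(36), B = rnorm(36), C = rnorm(36))

fit <- auto.arima(y, xreg = x_past)

# To obtain 24 forecasts, 24 *future* rows of the covariates are needed;
# passing the training-period xreg again yields only as many forecasts as it has rows
x_future <- cbind(A = rnorm(24), B = rnorm(24), C = rnorm(24))
fc <- forecast(fit, xreg = x_future, h = 24)
length(fc$mean)  # 24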

  • Just so I understand: the training set has 13 observations and, from those 13 observations, you want to project 24 periods into the future? And on top of that, you are trying to impose a seasonality (12 months, I suppose)?

  • That’s right; if I’ve made a mistake, you can tell me.

  • Right. Seasonality in ARIMA models is obtained through differencing of the form (X_t - X_{t-k}), where k is the seasonal period. Since in your case there are 13 observations and k = 12, the seasonally differenced series will have only one observation. It is therefore impossible to apply seasonality with this amount of data.

  • Besides, I have been working with time series for over 15 years and I have never seen anyone use 13 observations to forecast 24 steps ahead. Even if it were possible to obtain results this way, they would not be reliable. After all, what you are basically saying is "I have one year of behaviour and I want to generalise it, predicting the next two years". As a rough comparison, it would be like flipping a coin once and trying to predict the next two results. Is it possible? Of course it is. Will it be a reliable model? I don’t think so.

  • Thanks for the clarifications, Marcus Nunes. Unfortunately my hands are tied regarding the number of observations; we often have to perform miracles with what we are given. At least now I have the arguments I need.

1 answer

I will try to answer, as best I can, the questions posed in the bounty on this question.

I understand that Data Science is not perfect, and when dealing with real-world data it becomes apparent that missing or insufficient data can be a problem. In writing this question I have come to understand that it is not enough to create algorithms; the data are essential for the algorithm to work.

Excellent observation. But I would like to expand on the phrase "it is not enough to create algorithms; the data are essential for the algorithm to work". For me, "to work" means not only that the algorithm converges to some answer, but that it converges to the answer closest to reality.

In the case of my question, I used a small dataset that is not enough for the ARIMA algorithm to capture the seasonality.

Exactly. Whenever the ARIMA model has to account for seasonality (thus becoming a SARIMA model), it loses a certain number of observations. This is due to the differencing applied to the series: a series with seasonal period k loses k observations.
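
A quick check of this in R, with an invented 13-observation monthly series like the one in your question:

# 13 invented monthly observations, seasonal period 12
x  <- ts(rnorm(13), start = c(2018, 1), frequency = 12)
dx <- diff(x, lag = 12)   # seasonal differencing: X_t - X_{t-12}
length(x)    # 13
length(dx)   # 1 -- only one observation left after the seasonal difference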

In such cases, is the ARIMA forecast unreliable?

It is known that we need at least p + 1 observations to fit a model with p parameters to our data. For example, a line of the form y = a*x + b has two parameters and needs at least 3 points to be fitted. With two points it is possible to draw a line, but it is not possible to calculate the standard errors of the estimators, so at least 3 points are required.
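
This can be verified directly in R; the points below are made up just to illustrate the idea:

# With only 2 points, the line y = a*x + b fits perfectly, but there are no
# degrees of freedom left to estimate the standard errors of a and b
x2 <- c(1, 2); y2 <- c(3, 5)
summary(lm(y2 ~ x2))$coefficients   # standard errors come out as NaN

# With 3 points the standard errors can be computed
x3 <- c(1, 2, 3); y3 <- c(3, 5, 6)
summary(lm(y3 ~ x3))$coefficients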

According to Hyndman and Kostenko (2007), at least 16 observations are required to fit a SARIMA(0,1,1)(0,1,1)_{12} model. Your model would be, in the worst case, a SARIMA(p,d,q)(P,D,Q)_{12}, which requires at least p + q + P + Q + d + m*D + 1 observations to be fitted to the data. And such a model would be about as good as a regression line fitted to 3 points; in other words, it would not be worth much.
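
Just to make this count explicit, a small helper function (the name is mine, not from the paper) that encodes it:

# Minimum number of observations following Hyndman & Kostenko (2007):
# parameters to estimate, plus observations lost to differencing, plus one
min_obs_sarima <- function(p, d, q, P, D, Q, m) {
  p + q + P + Q + d + m * D + 1
}

min_obs_sarima(p = 0, d = 1, q = 1, P = 0, D = 1, Q = 1, m = 12)  # 16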

Moreover, this calculation does not take the variability of the data into account. p + q + P + Q + d + m*D + 1 observations would only be enough to model a SARIMA with little variability. If the data are ill-behaved, even more observations will be required.

I’m not familiar with any study that calculates minimum sample sizes for time series estimation, but the literature I’ve read in the area (and the researchers' common sense) suggests something between 50 and 100 observations, at least.

Should I apply some kind of treatment to the data?

No. If you want to use ARIMA, collect more data.

Or should I use another algorithm for prediction?

No. Collect more data. If someone tells you it is possible to use some model or algorithm that can predict two years into the future using only one year of observations, that person is deceiving you. It is mathematically impossible to perform such a miracle.

To really understand why such a prediction is impossible, I suggest the book by Brockwell and Davis (1991). It is an old book, from before the emergence of R, but it is very strong mathematically. And its mathematics has not changed in 28 years. From a theoretical standpoint, it is the best treatment of time series I have ever read. In it, time series are treated as Hilbert spaces of dimension n, and the forecast m steps ahead is the projection of these vector spaces of dimension n onto spaces of dimension n + m. It becomes clear why any method that tries to predict many steps ahead is doomed to fail: the forecasting error grows greatly, making any kind of usable conclusion impossible.
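
The growth of the forecast error can be seen even in a toy example; the random walk below is simulated data, unrelated to your series:

library(forecast)

# Simulated random walk, fitted with the correct model, forecast 24 steps ahead
set.seed(123)
y   <- ts(cumsum(rnorm(100)))
fit <- Arima(y, order = c(0, 1, 0))
fc  <- forecast(fit, h = 24)

# The 95% prediction interval keeps widening as the horizon grows
fc$upper[1, "95%"]  - fc$lower[1, "95%"]   # width 1 step ahead
fc$upper[24, "95%"] - fc$lower[24, "95%"]  # much wider 24 steps ahead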

  • Thank you, Marcus Nunes, for the clarifications and the suggestion of supporting material; this book by Brockwell and Davis seems very good, I believe it is exactly what I need.
