Basic split into training and test sets in R

I have a dataset in which my response variable is the date and the explanatory variable is the flow of a city's water source.

I fitted a time series model to try to understand how reliable my data is; however, the AIC alone has not been effective for this.

The idea was to split my data into training and test sets and try to forecast the test period, which would give me more confidence in its reliability. My data is:

Data       Fonte Férrea
jan/18     160,11
fev/18     NA
mar/18     150,88
abr/18     NA
mai/18     127,52
jun/18     171,25
jul/18     111,24
ago/18     111,26
set/18     109,79
out/18     295,12
nov/18     361
dez/18     365
jan/19     118,29
fev/19     112,18
mar/19     204,4
abr/19     109,95
mai/19     122,93
jun/19     130,43
jul/19     80,33
ago/19     96,52
set/19     83,46
out/19     101,71
nov/19     58,63
dez/19     119,67
jan/20     136,61

The question is: how do I split this data into training and test sets?

The idea was to keep the last 4 observations (Oct/19 to Jan/20) in the test set and the rest in the training set. However, I do not know how to specify those last 4 observations in the R function.

The R function that generates the training and test data is:

treino=window(basededados,end=)
teste=window(basededados,start=,end=)
  • Is the response variable really the date? Is your goal to predict a future date as a function of the flow? Also, take a look at this link (especially the use of the dput function) to see how to ask a reproducible question in R. That way, people who want to help will be able to do so in the best possible way.

  • The response variable is the date. My goal is to forecast the data. For this, I am splitting my data into training and test sets. I just could not understand how to give the window function a start and end date when the data is in month/year format.

  • Are you sure about this? Please explain what "to forecast the data" means. Is it to predict future dates, or to predict the flow value on future dates? In the second case, the response variable is the flow. Also, please share the data as described in the link I posted above, to make it easier for us to help you.

  • The forecast here is only meant to give a reliable indication of how well the model predicts new data; it is more a matter of reliability. My response variable is the date, because I want to know how the flow is explained over time. The difficulty is how to split the data into training and test sets, because my dates are in month/year format. I did not understand how to pass them to the window function so that it reads them correctly.

  • i <- 1:(nrow(dados) - 4); train <- dados[i, ]; test <- dados[-i, ]

1 answer

You can use the functions head and tail to choose the observations.

Here is how to reproduce the question data:

dados <- tibble::tribble(
 ~Data, ~Fonte_Férrea,
 "jan/18", 160.11,
 "fev/18", NA,
 "mar/18", 150.88,
 "abr/18", NA,
 "mai/18", 127.52,
 "jun/18", 171.25,
 "jul/18", 111.24,
 "ago/18", 111.26,
 "set/18", 109.79,
 "out/18", 295.12,
 "nov/18", 361,
 "dez/18", 365,
 "jan/19", 118.29,
 "fev/19", 112.18,
 "mar/19", 204.4,
 "abr/19", 109.95,
 "mai/19", 122.93,
 "jun/19", 130.43,
 "jul/19", 80.33,
 "ago/19", 96.52,
 "set/19", 83.46,
 "out/19", 101.71,
 "nov/19", 58.63,
 "dez/19", 119.67,
 "jan/20", 136.61)

Then just define how many observations you want in the test set, say 4, and create the objects.

n <- 4
treino <- head(dados, -n)
teste <- tail(dados, n)
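If you would rather keep the window() approach from the question, here is a minimal sketch, assuming the flow values are stored as a monthly ts object starting in January 2018. The names vazao, serie, treino_ts and teste_ts are only illustrative, and the AR(1) model at the end is an arbitrary choice to show the hold-out check, not a recommendation for this series.

```r
# Flow values in chronological order (jan/18 to jan/20), copied from the question
vazao <- c(160.11, NA, 150.88, NA, 127.52, 171.25, 111.24, 111.26, 109.79,
           295.12, 361, 365, 118.29, 112.18, 204.4, 109.95, 122.93, 130.43,
           80.33, 96.52, 83.46, 101.71, 58.63, 119.67, 136.61)

# Monthly series: frequency = 12, first observation in January 2018
serie <- ts(vazao, start = c(2018, 1), frequency = 12)

# For monthly data, window() takes dates as c(year, month)
treino_ts <- window(serie, end = c(2019, 9))     # jan/18 to set/19
teste_ts  <- window(serie, start = c(2019, 10))  # out/19 to jan/20

# Illustrative hold-out check: fit on the training window only,
# forecast the test window, and compare with the observed values.
# AR(1) is an assumption made purely for the example.
ajuste   <- arima(treino_ts, order = c(1, 0, 0))
previsao <- predict(ajuste, n.ahead = length(teste_ts))$pred
mae      <- mean(abs(teste_ts - previsao))  # mean absolute error on the test set
```

The c(year, month) form is what lets window() understand month/year data, as long as the series was created with frequency = 12.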

If you want something more elaborate and structured that connects with models, I recommend taking a look at the recipes package.
