Help with variable prediction with machine learning and unbalanced classes in R (caret)

I’m learning machine learning techniques to predict leaf size values (numeric) from multiple predictors (numeric). However, leaf size depends on life form (trees or grasses), and the two classes are not balanced. At the moment, I split the data by life form, partition each subset on the leaf size values (the variable I want to predict), and fit a separate model for each class. My question is: do I need to build a separate model for each class, or can I split the data into training and test sets within the existing classes and fit a single model that predicts leaf size while taking the class (forma_vida) into account? And if anyone has a tip, for someone who has never dealt with ML before, on how to handle the fact that the classes are not balanced, I would appreciate it.

library(caret)
library(dplyr)  # provides %>% and filter()

# Sample of the data, from dput(head(df))
df <- structure(list(tam_folha = c(4L, 5L, 3L, 1L, 2L), forma_vida = structure(c(1L, 2L, 1L, 2L, 1L), .Label = c("arvore", "grama"), class = "factor"),
X1036 = c(0.349, 0.342, 0.383, 0.325, 0.309), X1037 = c(0.349,
0.342, 0.383, 0.325, 0.309), X1038 = c(0.349, 0.342, 0.383,
0.325, 0.309), X1039 = c(0.349, 0.342, 0.383, 0.325, 0.309
), X1040 = c(0.349, 0.342, 0.383, 0.325, 0.31), X1041 = c(0.349,
0.342, 0.383, 0.326, 0.31)), .Names = c("tam_folha", "forma_vida", "X1036", "X1037", "X1038", "X1039", "X1040", "X1041"), row.names = c(NA, 5L), class = "data.frame")

# Filter by class (trees only)
arvores <- df %>% dplyr::filter(forma_vida == "arvore")

# Data partition
index <- createDataPartition(arvores$tam_folha, p = 0.7, list = FALSE)
train_data <- arvores[index, ]
test_data  <- arvores[-index, ]

# Repeated 10-fold CV; `repeats` only takes effect with method = "repeatedcv"
controle <- trainControl(method = "repeatedcv", number = 10, repeats = 5, selectionFunction = "oneSE")
mod1 <- train(tam_folha ~ ., data = train_data,
              method = "pls",
              metric = "RMSE",
              tuneLength = 4,
              trControl = controle)

## Repeat for the other factor level (forma_vida == "grama")

1 answer

Unfortunately I don’t have enough reputation to comment, so...

From what I understand, the goal is just to predict the size of a leaf, which can come from a tree or a grass, correct?

If so, a separate regression model should be built for each type of plant, tree or grass. That is, two models should be created.

Class imbalance matters mainly in classification methods. There, yes, one could think about balancing the classes, for example when predicting which plant a given leaf belongs to, tree or grass.

I suggest using Cross Validated for questions of this kind.

As for automating the routine, a for loop could probably generate both models automatically. Also be careful with the function createDataPartition: it samples at random, so call set.seed() before it to get reproducible results.
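A minimal sketch of that loop, under some assumptions: the data frame below is simulated and purely hypothetical (the sample in the question has only five rows), the seed values are arbitrary, and the cross-validation settings are reduced so it runs quickly. It needs the caret and pls packages installed. The names dados_classe, modelos and rmse_teste are illustrative, not from the question.

```r
library(caret)

# Hypothetical simulated data standing in for the real df:
# 90 trees and 30 grasses, three numeric predictors, numeric leaf size.
set.seed(123)
n <- 120
df <- data.frame(
  forma_vida = factor(rep(c("arvore", "grama"), times = c(90, 30))),
  X1036 = runif(n), X1037 = runif(n), X1038 = runif(n)
)
df$tam_folha <- 2 + 3 * df$X1036 - df$X1037 + rnorm(n, sd = 0.3)

# Reduced CV settings (assumption) to keep the example fast
controle <- trainControl(method = "repeatedcv", number = 5, repeats = 2)

modelos <- list()   # one fitted model per life form
rmse_teste <- c()   # held-out RMSE per life form

for (classe in levels(df$forma_vida)) {
  # Keep only this life form and drop the now-constant grouping column
  dados_classe <- df[df$forma_vida == classe, names(df) != "forma_vida"]

  # Fix the seed so the partition and the CV folds are reproducible
  set.seed(42)
  index <- createDataPartition(dados_classe$tam_folha, p = 0.7, list = FALSE)
  train_data <- dados_classe[index, ]
  test_data  <- dados_classe[-index, ]

  modelos[[classe]] <- train(tam_folha ~ ., data = train_data,
                             method = "pls",
                             metric = "RMSE",
                             tuneLength = 2,
                             trControl = controle)

  # Evaluate on the rows that were held out for this class
  pred <- predict(modelos[[classe]], newdata = test_data)
  rmse_teste[classe] <- RMSE(pred, test_data$tam_folha)
}

print(rmse_teste)
```

After the loop, modelos$arvore and modelos$grama hold the two fitted models, and rmse_teste gives one test error per class, so the smaller class is evaluated on its own rather than being averaged in with the larger one.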
