How does train_test_split work in Scikit Learn?

I am learning machine learning, and most examples use the method train_test_split() without a very precise explanation of it (at least not in the articles I have read).

I know its job is to split the dataset, but I have some questions:

  • Why does the data need to be divided?
  • What is the purpose of the returned train and test variables?
  • Is this done only for convenience, or does it increase training accuracy?
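For context, here is a minimal sketch of the call being asked about, assuming scikit-learn is installed (the data are made up for illustration; test_size and random_state are the usual optional arguments):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(16).reshape(8, 2)   # 8 samples, 2 features each (toy data)
y = np.arange(8)                  # one label per sample

# 25% of the rows go to the test set; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

print(X_train.shape, X_test.shape)  # (6, 2) (2, 2)
```

The function simply shuffles the rows and returns the four pieces; the corresponding rows of X and y stay paired.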

3 answers

Well, to answer this question it helps to start from the conceptual difference between supervised and unsupervised machine learning models. Starting with the latter: unsupervised models try to make estimates in a context where the response variable is not known. The classic case is principal component models. In these models, components are built from the correlations among the variables in the database, although in most cases the practical representation or concrete meaning of these components is unknown.
In supervised models, on the other hand, the dependent variable (also called the output, explained variable, or response variable) is known. An example would be a model that tries to predict women's participation in the labor market from variables such as age, education, and number of children, among others. The dependent variable in this case is a dummy that takes the value 1 if the woman is in the labor market and 0 if she is not. When you fit a model to make this prediction, it is important to know its predictive ability beyond the data it was trained on. For this reason, it is common to split your database into training and test sets: the data in the training base are used to fit the model, while the data in the test base measure the model's performance out of sample (on new data).
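The supervised setup described above can be sketched roughly like this, with synthetic stand-in data (the features and the dummy response here are invented purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))            # stand-ins for age, education, children
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # dummy: 1 = in the labor market (toy rule)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression().fit(X_train, y_train)
# out-of-sample accuracy, estimated on data the model has never seen
print(round(model.score(X_test, y_test), 2))
```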
It is important to note that there will always be a performance difference between the two bases (training and test), and this difference is always in favor of the training base (after all, the model already "knows" those data). From this difference we can draw another relevant distinction in machine learning: underfitting versus overfitting. As shown in the figure below, taken from Andreas Müller's book, a model is underfitting when its performance is poor on both the training base and the test base. As we increase the model's complexity, its performance improves on both. However, a very complex model becomes overly adjusted to the training base: it fits the training data very well but has little power to generalize. That is what we call overfitting. Note that, from the data scientist's perspective, the challenge is to maximize accuracy without losing the ability to generalize.
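The train/test gap described above can be shown with a small sketch (the noisy data are invented for illustration): an unconstrained decision tree memorizes the training base but generalizes worse.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = (X[:, 0] > 0).astype(int)
y = np.where(rng.random(300) < 0.2, 1 - y, y)  # flip 20% of labels as noise

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# a fully grown tree: complex enough to memorize every training row
tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
train_acc = tree.score(X_train, y_train)  # the model already "knows" these rows
test_acc = tree.score(X_test, y_test)     # performance on new data is lower
print(train_acc, round(test_acc, 2))
```

Capping the tree's depth (e.g. max_depth=3) would trade some training accuracy for better generalization, which is exactly the balance the answer describes.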

[figure from Andreas Müller's book: training and test accuracy as a function of model complexity]

In short, separating the base into a training base and a test base is key to knowing:
1) the accuracy of the model, and
2) how much we can improve it without losing generalization capacity.

That is how I understand it; it would be nice to see other views.

To be very didactic and direct: the training set is the data you give your AI so that it can learn, so this is a set it already knows. After training an AI, it is common to need to assess how good it is; for that, you provide a set of data it has never been trained on and see how well it classifies them. If the same training set is used for testing, the AI tends to score better than it actually would when exposed to other data of the same type.

The process is analogous to a teacher who teaches students and then, to test them, applies an exam with different questions. If he applies the same questions the students have already done or seen, they tend to get better grades.

Why does the data need to be divided?

An ML algorithm is expected to learn from the training set, but then how do we know whether the model works? Whether it works with new data? How do we compare it with other models?

The answer is simple: we look at the score (accuracy) on the test set. This score tells us how well the model will behave with new data.
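As one illustration of comparing models by their test score, here is a sketch assuming scikit-learn and its bundled breast-cancer dataset (the choice of these two models is arbitrary):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

candidates = {
    "logistic": make_pipeline(StandardScaler(), LogisticRegression()),
    "tree": DecisionTreeClassifier(random_state=0),
}
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)                 # learn from the training part only
    scores[name] = model.score(X_test, y_test)  # compare on the same test set
print(scores)
```

Because both models are scored on the same held-out data, the comparison is fair: neither has seen those rows during training.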

What is the purpose of the returned train and test variables?

The purpose of these variables is to use the train data set to train the model and then, with a data set it has never seen before (the test set), check how the model handles new data.
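A small sketch of how the four returned variables are typically used (the iris dataset and the stratify option are just one common choice, not part of the question):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y  # keep class proportions
)

knn = KNeighborsClassifier().fit(X_train, y_train)  # fit on the train part only
print(round(knn.score(X_test, y_test), 3))          # evaluate on the test part only
```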

Is this done only for convenience, or does it increase training accuracy?

By performing several trainings with different train and test sets, we can find the hyperparameters for the model that maximize the average accuracy across the various test sets.

So dividing into training and test improves the final model, which should then be trained on the entire data set with the hyperparameters already established.
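One way to sketch this idea is with scikit-learn's GridSearchCV, which runs repeated train/test splits (cross-validation) to pick the hyperparameter and, with its default refit=True, retrains the best model on the entire data set at the end (the dataset and parameter grid here are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7]},
    cv=5,  # 5 different train/test partitions per candidate value
)
search.fit(X, y)  # refit=True (the default): the winning hyperparameter is
                  # then used to retrain the final model on all the data
print(search.best_params_, round(search.best_score_, 3))
```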
