Question about OpenCV training

In machine learning, the data should be split into three sets: one for training, one for validation, and one for testing, typically in proportions of about 70%, 15%, and 15% respectively. My question, however, is about the actual number of images needed for good recognition training.
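A minimal sketch of the 70/15/15 split described above, assuming the dataset is simply a list of image paths. The `split_dataset` helper, the file names, and the fixed seed are illustrative assumptions and not part of OpenCV.

```python
import random


def split_dataset(items, train=0.70, val=0.15, seed=0):
    """Shuffle items and split them into train/validation/test lists.

    Whatever is left after the train and validation shares goes to test.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * val)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


# Hypothetical file names, for illustration only.
images = [f"img_{i:03d}.png" for i in range(100)]
train_set, val_set, test_set = split_dataset(images)
print(len(train_set), len(val_set), len(test_set))
```

Keeping the test list completely out of training is what makes the later comparison between runs meaningful.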

I used opencv_traincascade for the training, but unfortunately I do not have a large number of positive or negative images for my project, so I need to know the minimum number of each type. What would it be? As for the stages, I know that the more stages there are, the more specific the recognition becomes, which can even lead to overtraining. Is the ideal number of stages somewhere between 10 and 20?

Finally, if I run one training that generates a given cascade.xml, and then run another that complements this first cascade, is it possible to merge them somehow without retraining everything from scratch?

Below is an example command to make the training parameters easier to see:

    opencv_traincascade -data 'cascade directory' \
        -vec 'path to positive images POS.vec' \
        -bg 'path to negative images NEG.txt' \
        -numPos x \
        -numNeg y \
        -numStages z \
        -w 24 -h 24 \
        -mode ALL
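When several runs with different parameters are needed, it can help to build the same command programmatically. The sketch below uses only the flags shown in the command above; the `build_traincascade_cmd` helper, the paths, and the sample counts are hypothetical placeholders, not recommended values.

```python
import shlex


def build_traincascade_cmd(data_dir, vec_file, bg_file,
                           num_pos, num_neg, num_stages,
                           width=24, height=24, mode="ALL"):
    """Return the opencv_traincascade argument list for one training run."""
    return [
        "opencv_traincascade",
        "-data", data_dir,
        "-vec", vec_file,
        "-bg", bg_file,
        "-numPos", str(num_pos),
        "-numNeg", str(num_neg),
        "-numStages", str(num_stages),
        "-w", str(width),
        "-h", str(height),
        "-mode", mode,
    ]


# Placeholder paths and counts, for illustration only.
cmd = build_traincascade_cmd("cascade_out", "pos.vec", "neg.txt", 200, 400, 10)
print(shlex.join(cmd))
# To actually launch the training: subprocess.run(cmd, check=True)
```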

1 answer

There is no exact formula for how much training a model can receive before it falls into "excessive training", or overfitting, which, despite being an English term, is the most widespread name for it here. You need to test gradually to find out which configuration gives the best performance.

I imagine your question arose from fear of causing overfitting, which is basically the name given to a model that has become so "used to" the patterns in its training dataset that it has difficulty identifying or predicting any sample outside that dataset.
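One rough way to spot this symptom is to compare the score on the training samples with the score on held-out samples. The sketch below is only an illustration: the `looks_overfit` helper, the example scores, and the 0.10 threshold are arbitrary assumptions, not values taken from OpenCV.

```python
def looks_overfit(train_score, test_score, max_gap=0.10):
    """Flag a model whose training score exceeds its held-out score by more than max_gap.

    Scores are assumed to be in [0, 1], higher is better (for example, hit rate).
    """
    return (train_score - test_score) > max_gap


# Made-up scores, for illustration only.
memorized = looks_overfit(train_score=0.99, test_score=0.70)
balanced = looks_overfit(train_score=0.92, test_score=0.90)
print(memorized, balanced)
```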

In your case it is easy to identify the optimal amount of training because, as you mentioned, 15% of the samples are set aside for testing, i.e., they are samples that will not be present in your training dataset. This ensures that the reported results are not influenced by the training samples (provided that the test samples are not identical to the training samples). So you can start by training with 10 stages and then compare the results with a training run done with 20. If the result with 20 is better, there may still be room for improvement, so try 30. If the result with 10 is better, look for the best result between 10 and 20. Continue this way until you are satisfied with the results.
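The procedure above can be sketched as a coarse-to-fine search. This is a heuristic sketch, not a guaranteed optimum: `find_best_stage_count` and `toy_evaluate` are hypothetical names, and the toy scoring function merely stands in for "train with n stages, then score the cascade on the held-out samples".

```python
def find_best_stage_count(evaluate, start=10, step=10, max_stages=50):
    """Coarse-to-fine search for the stage count with the best held-out score.

    evaluate(n) is assumed to train with n stages and return a score
    where higher is better. Results are cached so no count is trained twice.
    """
    scores = {}

    def score(n):
        if n not in scores:
            scores[n] = evaluate(n)
        return scores[n]

    # Coarse pass: keep adding `step` stages while the score improves.
    best = start
    n = start + step
    while n <= max_stages and score(n) > score(best):
        best = n
        n += step

    # Fine pass: halve the step and probe both sides of the current best.
    while step > 1:
        step //= 2
        for candidate in (best - step, best + step):
            if 1 <= candidate <= max_stages and score(candidate) > score(best):
                best = candidate

    return best, score(best)


def toy_evaluate(num_stages):
    """Made-up stand-in for a real train-and-test run; it peaks at 17 stages."""
    return -(num_stages - 17) ** 2


best, best_score = find_best_stage_count(toy_evaluate)
print(best, best_score)
```

In real use, `evaluate` would launch a training run and measure the resulting cascade on samples that were never used for training, so each call can be slow; the cache avoids repeating runs.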

Note: I used 10 and 20 stages as the example because they were mentioned in the question. Ideally you would start with a much wider range and narrow it down as the results come in.
