Is it possible to train a model when I only have one of the mapped classes?


I have a large dataset (~1,700,000 observations) that I would like to classify. I also have a not-so-small sample (~8,000 observations) labeled with one of the classes (say, condition A), but I have none (zero) labeled with the other classes (say, conditions B to Z). In addition, all variables are categorical.

Although there are many classes, I am interested in only one of them (the one for which I have a sample, condition A).

Can I train a model with only type-A observations? If not, how should I overcome this problem?

Is it reasonable to recast the problem as binary classification (type A would be TRUE and all other types FALSE)? In that case, can I randomly take some of the unclassified observations and assume they are FALSE? I know that most unclassified observations would be of types B to Z (FALSE in the binary case).

Thanks in advance.

  • It does not seem reasonable to recast the problem as a binary classifier, because you do not have training data for the two "classes" (A and non-A). With data from only one class, a model has no way to learn the distinction between classes.

  • Perhaps a useful approach would be to apply a clustering algorithm, such as K-Means, to the large dataset. You don't know what the classes are, but you seem to know how many there are (that would be the value of K in K-Means). That way you may obtain a separation that is useful at least to reduce the amount of data and allow a later analysis (and then, yes, the construction of a binary classifier).
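    Since all the variables are categorical, K-Means needs a numeric encoding first. A minimal sketch of this suggestion with scikit-learn, using one-hot encoding and toy data (the column names and values here are illustrative, not from the question):

    ```python
    # Sketch: cluster an unlabeled categorical dataset, choosing K as the
    # suspected number of classes. Data below is a tiny illustrative stand-in.
    import pandas as pd
    from sklearn.cluster import KMeans

    df = pd.DataFrame({
        "var1": ["x", "y", "x", "z", "y", "x"],
        "var2": ["p", "p", "q", "q", "p", "q"],
    })

    # K-Means works on numeric vectors, so one-hot encode the categorical columns
    X = pd.get_dummies(df).to_numpy(dtype=float)

    # K = number of suspected classes (3 here; would be ~26 for A..Z)
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
    labels = kmeans.fit_predict(X)
    print(labels.shape)  # one cluster id per row
    ```

    Note that K-Means with one-hot encoded data is a rough heuristic; for purely categorical variables a dedicated method (e.g. k-modes) may separate the data better.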

1 answer

You can turn the problem into a binary one if your assumption holds that most of the unclassified observations are FALSE, as you say in your question. (Strictly, the ideal would be to have no positives among the unclassified ones, but if their proportion is very small it will probably not hurt much.)

"I know that most unclassified observations would be of types B to Z (FALSE in the binary case)."

In fact, many classifiers use exactly this idea in the one-vs-rest strategy.
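    To illustrate the one-vs-rest idea: scikit-learn's `OneVsRestClassifier` fits one binary classifier per class, each treating its class as positive and everything else as negative. A small sketch with generated toy data (nothing here comes from the question's dataset):

    ```python
    # One-vs-rest: one binary classifier per class ("this class" vs "the rest")
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier

    X, y = make_classification(
        n_samples=300, n_classes=3, n_informative=4, random_state=0
    )

    clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
    print(len(clf.estimators_))  # one fitted binary estimator per class
    ```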

Following the discussion in the comments, I highlight:

  • If there are observations of condition A within your 1.7M dataset and your 8,000 sample is not a subsample of that 1.7M, this is probably not the best approach.
  • If the number of condition-A observations in the 1.7M set is really small, this method, although biased, will be more accurate than randomly assigning a class.
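The approach described above (take the labeled class-A rows as TRUE and a random sample of unlabeled rows as FALSE) can be sketched as follows. All names and data are illustrative stand-ins, assuming few hidden A's among the unlabeled rows:

```python
# Sketch: build a binary classifier from positives + a random "assumed
# negative" sample of the unlabeled data. Toy data, not the question's.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-ins: the 8,000 labeled class-A rows and the 1.7M unlabeled rows
positives = pd.DataFrame({
    "var1": rng.choice(["x", "y"], 80, p=[0.9, 0.1]),
    "var2": rng.choice(["p", "q"], 80, p=[0.8, 0.2]),
})
unlabeled = pd.DataFrame({
    "var1": rng.choice(["x", "y"], 500),
    "var2": rng.choice(["p", "q"], 500),
})

# Assume a random unlabeled sample is (mostly) not class A
negatives = unlabeled.sample(n=len(positives), random_state=0)

X = pd.get_dummies(pd.concat([positives, negatives]), dtype=float)
y = np.array([1] * len(positives) + [0] * len(negatives))

clf = LogisticRegression().fit(X, y)

# Score the full unlabeled set, aligning its encoding with the training columns
X_unlabeled = pd.get_dummies(unlabeled, dtype=float).reindex(
    columns=X.columns, fill_value=0.0
)
probs = clf.predict_proba(X_unlabeled)[:, 1]  # estimated P(class A) per row
print(probs.shape)
```

The noise from hidden positives mislabeled as FALSE is exactly the bias discussed in the bullets above: small if class A is rare among the unlabeled rows, larger otherwise.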
  • But how will he train the binary classifier if he only has samples from one of the classes? (From what I understand, he has no samples of classes B to Z.)

  • @Luizvieira, from what I understand, he has a dataset of 1,700,000 but only 8,000 classified cases, all of class A. Moreover, those 8,000 classified cases correspond to (practically) all cases of class A among the 1,700,000.

  • Yes, but in that set of 1,700,000 he does not know the classes (and wants to use a classifier to separate them). If he uses the 1,700,000 as "the rest", the chances of his classifier going wrong seem high to me. No?

  • Moreover, I do not think he said that the 8,000 cases of class A are the only ones (that is, that there are no samples of A among the 1,700,000). In fact, he says: "I know that most unclassified observations would be of types B to Z (FALSE in the binary case)." That is, he knows that most are not type A, but there may still be A samples among the 1,700,000. :/

  • I understood the question in a totally different way, but I think you're right. Do you think it's best to delete the answer?

  • Anyway, a classifier built this way is better than random, or at least no worse.

  • No, leave the answer. It's still useful, because you're right that it might still be better than choosing randomly. But I still think the OP needs to do a better analysis of the data (I suggested using clustering). If he can separate negative examples to use in training, his classifier will be much better. :)

  • Maybe just edit to add your final observation (that a classifier built like this may not look good, but it will be better than choosing randomly). In fact, if the 1,700,000 really contain few samples of class A, the classifier will err more toward false negatives than false positives. Maybe that will be very useful to the OP.
