Is it possible to train a model when I only have one of the mapped classes?


I have a large dataset (~1,700,000 observations) that I would like to classify. I also have a not-so-small sample (~8,000 observations) labeled with one of the classes (say, condition A), but I have none (zero) labeled with the other classes (say, conditions B to Z). In addition, all variables are categorical.

Although there are many classes, I am interested in only one of them (the one for which I have a sample, condition A).

Can I train a model with only type-A observations? If not, how should I overcome this problem?

Is it reasonable to recast the problem as binary classification (type A would be TRUE and all other types FALSE)? In that case, can I randomly take some of the unclassified observations and assume they are FALSE? I know that most unclassified observations would be of types B to Z (FALSE in the binary case).

Thanks in advance.

  • It does not seem reasonable to recast the problem as a binary classifier, because you do not have training data for the two "classes" (A and non-A). With data from only one class, a model has no way to learn the distinction between classes.

  • Perhaps a useful approach would be to apply a clustering algorithm, such as K-Means, to the large dataset. You don't know what the classes are, but you seem to know how many there are (that would be the value of K in K-Means). That way you may obtain a separation that is useful at least to reduce the amount of data and allow a later analysis (and then, yes, the construction of a binary classifier).
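    Since all the variables are categorical, K-Means needs a numeric encoding first. A minimal sketch of this suggestion with scikit-learn, using one-hot encoding and toy data (the column names and values here are illustrative, not from the question):

    ```python
    # Sketch: cluster an unlabeled categorical dataset, choosing K as the
    # suspected number of classes. Data below is a tiny illustrative stand-in.
    import pandas as pd
    from sklearn.cluster import KMeans

    df = pd.DataFrame({
        "var1": ["x", "y", "x", "z", "y", "x"],
        "var2": ["p", "p", "q", "q", "p", "q"],
    })

    # K-Means works on numeric vectors, so one-hot encode the categorical columns
    X = pd.get_dummies(df).to_numpy(dtype=float)

    # K = number of suspected classes (3 here; would be ~26 for A..Z)
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
    labels = kmeans.fit_predict(X)
    print(labels.shape)  # one cluster id per row
    ```

    Note that K-Means with one-hot encoded data is a rough heuristic; for purely categorical variables a dedicated method (e.g. k-modes) may separate the data better.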

1 answer

You can turn the problem into a binary one if your assumption holds that most of the unclassified observations are FALSE, as you say in your question. (Strictly, the ideal would be to have no positives among the unclassified ones, but if their proportion is very small it will probably not hurt much.)

"I know that most unclassified observations would be of types B to Z (FALSE in the binary case)."

In fact, many classifiers use exactly this idea in the one-vs-rest strategy.
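    To illustrate the one-vs-rest idea: scikit-learn's `OneVsRestClassifier` fits one binary classifier per class, each treating its class as positive and everything else as negative. A small sketch with generated toy data (nothing here comes from the question's dataset):

    ```python
    # One-vs-rest: one binary classifier per class ("this class" vs "the rest")
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier

    X, y = make_classification(
        n_samples=300, n_classes=3, n_informative=4, random_state=0
    )

    clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
    print(len(clf.estimators_))  # one fitted binary estimator per class
    ```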

Following the discussion in the comments, I highlight:

  • If there are observations of condition A within your 1.7M dataset and your 8,000 sample is not a subsample of that 1.7M, this is probably not the best approach.
  • If the number of condition-A observations in the 1.7M set is really small, this method, although biased, will be more accurate than randomly assigning a class.
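The approach described above (take the labeled class-A rows as TRUE and a random sample of unlabeled rows as FALSE) can be sketched as follows. All names and data are illustrative stand-ins, assuming few hidden A's among the unlabeled rows:

```python
# Sketch: build a binary classifier from positives + a random "assumed
# negative" sample of the unlabeled data. Toy data, not the question's.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-ins: the 8,000 labeled class-A rows and the 1.7M unlabeled rows
positives = pd.DataFrame({
    "var1": rng.choice(["x", "y"], 80, p=[0.9, 0.1]),
    "var2": rng.choice(["p", "q"], 80, p=[0.8, 0.2]),
})
unlabeled = pd.DataFrame({
    "var1": rng.choice(["x", "y"], 500),
    "var2": rng.choice(["p", "q"], 500),
})

# Assume a random unlabeled sample is (mostly) not class A
negatives = unlabeled.sample(n=len(positives), random_state=0)

X = pd.get_dummies(pd.concat([positives, negatives]), dtype=float)
y = np.array([1] * len(positives) + [0] * len(negatives))

clf = LogisticRegression().fit(X, y)

# Score the full unlabeled set, aligning its encoding with the training columns
X_unlabeled = pd.get_dummies(unlabeled, dtype=float).reindex(
    columns=X.columns, fill_value=0.0
)
probs = clf.predict_proba(X_unlabeled)[:, 1]  # estimated P(class A) per row
print(probs.shape)
```

The noise from hidden positives mislabeled as FALSE is exactly the bias discussed in the bullets above: small if class A is rare among the unlabeled rows, larger otherwise.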
  • But how will he train the binary classifier if he only has samples from one of the classes? (From what I understand, he has no samples of classes B to Z.)

  • @Luizvieira, from what I understand, he has a dataset of 1,700,000 but only 8,000 classified cases, all of class A. Moreover, those 8,000 classified cases correspond to (practically) all cases of class A among the 1,700,000.

  • Yes, but in that set of 1,700,000 he does not know the classes (and wants to use a classifier to separate them). If he uses the 1,700,000 as "the rest", the chances of his classifier going wrong seem high to me. No?

  • Moreover, I do not think he said that the 8,000 cases of class A are the only ones (that is, that there are no samples of A among the 1,700,000). In fact, he says: "I know that most unclassified observations would be of types B to Z (FALSE in the binary case)." That is, he knows that most are not type A, but there may still be A samples among the 1,700,000. :/

  • I understood the question in a totally different way, but I think you're right. Do you think it's best to delete the answer?

  • Anyway, a classifier built this way is better than random, or at least no worse.

  • No, leave the answer. It's still useful, because you're right that it might still be better than choosing randomly. But I still think the OP needs to do a better analysis of the data (I suggested using clustering). If he can separate negative examples to use in training, his classifier will be much better. :)

  • Maybe just edit to add your final observation (that a classifier built like this may not look good, but it will be better than choosing randomly). In fact, if the 1,700,000 really contain few samples of class A, the classifier will err more toward false negatives than false positives. Maybe that will be very useful to the OP.
