I have a large dataset (~1,700,000 observations) that I would like to classify. I also have a fairly large sample (~8,000 observations) labeled as one of the classes (say, condition A), but I have no labeled examples (zero) of the other classes (say, conditions B to Z). In addition, all variables are categorical.
Although there are many classes, I am interested in only one of them (condition A, the one for which I have a labeled sample).
Can I train a model with only type A observations? If not, how can I overcome this problem?
Is it reasonable to recast the problem as binary classification (type A would be TRUE and the other types FALSE)? In that case, can I randomly take some of the unlabeled observations and assume that their condition is FALSE? I know that most unlabeled observations would be types B to Z (FALSE in the binary case).
Thanks in advance.
It does not seem reasonable to recast the problem as a binary classifier, because you do not have training data for both "classes" (A and non-A). With data from only one class, a model has nothing to contrast it with, so it cannot learn the distinctions needed to decide the class.
– Luiz Vieira
One approach that might be useful is to run a clustering algorithm, such as K-Means, on the large dataset. You don't know what the classes are, but you seem to know how many there are (that would be the value of K in K-Means). This way you may get a separation that is useful at least to reduce the amount of data and allow a later analysis (and then, yes, the construction of a binary classifier).
– Luiz Vieira
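To make the clustering suggestion above concrete, here is a minimal sketch in plain Python. It is only an illustration under several assumptions that are not in the original posts: the toy rows, the `known_a` indices, the one-hot encoding of the categorical variables, and the deterministic farthest-first initialization (chosen for reproducibility) are all hypothetical choices. It clusters the rows, finds the cluster that holds most of the labeled A rows, and lists the unlabeled rows that fall into that same cluster as candidates for A.

```python
from collections import Counter

def one_hot(rows):
    """Encode categorical rows as 0/1 vectors, one slot per (column, value) pair."""
    slots = sorted({(j, v) for row in rows for j, v in enumerate(row)})
    index = {slot: i for i, slot in enumerate(slots)}
    vectors = []
    for row in rows:
        vec = [0.0] * len(slots)
        for j, v in enumerate(row):
            vec[index[(j, v)]] = 1.0
        vectors.append(vec)
    return vectors

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(vectors, k, n_iter=20):
    """Plain K-Means; returns one cluster label per vector."""
    # Assumption: farthest-first initialization instead of random starts,
    # so that the result is reproducible in this small example.
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        far = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(list(far))
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                  for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical toy data: three categorical variables per observation.
rows = [
    ("red", "small", "x"),
    ("red", "small", "y"),
    ("red", "small", "x"),
    ("blue", "large", "z"),
    ("blue", "large", "w"),
    ("blue", "large", "z"),
]
known_a = [0, 1]  # hypothetical indices of the rows already labeled A

labels = k_means(one_hot(rows), k=2)

# Cluster that contains most of the labeled A rows.
a_cluster = Counter(labels[i] for i in known_a).most_common(1)[0][0]

# Unlabeled rows in that cluster are candidates for condition A.
candidates = [i for i, lab in enumerate(labels)
              if lab == a_cluster and i not in known_a]
print(candidates)
```

A pure-Python loop like this would be far too slow for about 1,700,000 rows, so in practice a library implementation (for example `KMeans` or `MiniBatchKMeans` from scikit-learn) would probably be the better choice. Whether Euclidean distance on one-hot vectors is a good fit for purely categorical data is also a judgment call.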