Data Mining - statistical tests to compare the results of classifiers

Question

Data Mining - statistical tests to compare the results of classifiers

Asked 6 years, 10 months ago

Viewed 47 times

0

I have to use a statistical test to compare the results of various classifiers. The problem is that I don’t know which statistical test to use, because I only have access to the mean and standard deviation of the result of each classifier.

Results of the algorithms:

Naive Bayes: - Average: 97,50 - Standard Deviation: 7,91

C4.5: - Average: 77,50 - Standard Deviation: 23,71

SVM: - Average: 98,00 - Standard Deviation: 6,32

1-NN: - Average: 75,50 - Standard Deviation: 21,27

3-NN: - Average: 77,00 - Standard Deviation: 17,51

5-NN: - Average: 75,00 - Standard Deviation: 23,69

7-NN: - Average: 73,00 - Standard Deviation: 18,74

Any idea of parametric testing that can compare the results of these algorithms only with this information ?

Welcome to Leonardo Amaral, although this has nothing to do with the focus of the site, I’ll leave you an answer. If the answer is useful, just mark as accepted, see https://i.stack.Imgur.com/jx7Ts.png and why https://pt.meta.stackoverflow.com/questions/1078/como-e-por-que-aceitar-uma-resposta/1079#1079. Please also know https://answall.com/help/mcve

– user60252

2018/10/11 at 00:52

1 answer

Browser other questions tagged artificial-intelligence

You are not signed in. Login or sign up in order to post.

by user60252 · Answer 1 · 2018-10-11T00:48:43+00:00

The concepts

The arithmetic mean outlet alone is a dangerous tool. In a data set with a high number of outliers* or a distorted distribution, the average simply does not provide the precision needed for a correct decision.

*Outliers are data that differ drastically from all others, are points outside the curve. In other words, an outlier is a value that escapes normality and can (and probably will) cause anomalies in the results obtained through algorithms and analysis systems.

The standard deviation translates the variation of a data set around the mean, that is, the greater or lesser variability of the results obtained. It allows identify the extent to which the results are concentrated or not around the trend of a set of observations. The greater the dispersion, the lower the concentration and viceversa.

A small standard deviation may be a goal in certain situations where results are limited, for example in product manufacturing and quality control. A particular type of part of the car that has to be 2 inches in diameter to fit properly should not have a very large standard deviation during the manufacturing process. A large standard deviation in this case would mean that many parts end up in the trash, because they wouldn’t fit right; either this or the cars will have problems on the way.

In a perfect normal distribution, 68.26% of the occurrences will be concentrated in the area of the graph marked by a standard deviation to the right and a standard deviation to the left of the midline

In your case, which I’m not quite sure about, we have:

M is the mean and SD is the standard deviation

 ------------------------------------------------------------
 classificador      DP       M-DP          M         M+DP
 ------------------------------------------------------------
     Naive         7,91     89,59        97,50      105,41
     C4.5          23,71    53,79        77,50      101,21
     SVM           6.32     91,68        98,00      104,32
     1-NN          21,27    54,23        75,50      96,77
     3-NN          17,51    59,49        77,00      94,51
     5-NN          23,69    51,31        75,00      98,69
     7-NN          18,74    54,26        73,00      91,74

What can we conclude?

Approximately 68% of the classifier Naive focus between 89,59 and 105,41

Approximately 68% of the classifier C4.5 focus between 53,79 and 101,21

Approximately 68% of the classifier SVM focus between 91,68 and 104,32

And so on and so forth .....

So if we want a classifier with 68% between 92 and 100, it certainly would be SVM

You can continue the table by placing the values of M-2DP and M+2DP

When we demarcated two standard deviations, to the right and left of the mean, we covered 95.44% of the occurrences and 99.72% when we demarcated three.