The goal of testing a classifier is to check how well it predicts the class of one or more new examples from the problem domain of interest (I don’t know how familiar you are with the subject, but if you find it interesting, read this other answer of mine for a teaching example introducing classification).
This test is usually performed as follows:
- The classifier is trained with a data set called training data, generating a model (also called a classifier or predictor).
- Then, the generated model is run on a dataset called test data, in which the correct classification of each example is already known (i.e., it is already known which class each feature vector belongs to).
- The model’s predictions are then compared with the correct results (the ones already known to be correct) to count hits and errors. From these counts, an indication of the quality of the generated model is extracted.
The problem is that you don’t always have a large enough amount of data to separate into training and test data. The "real world" is vast, and collecting a lot of diverse data can be costly and sometimes even unfeasible. And testing with the same data used in training is useless, since the model will trivially "predict" what it has already seen (after all, it was trained with that very data set).
The following code example demonstrates this type of test: it trains with all the data, then calculates the score (the percentage of hits) and displays the confusion matrix (which shows, on a visual scale, the amounts of hits and errors between the different classes). Like you, I also used the classic Iris Dataset (of iris flowers) for the tests.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
##################################################
# Helper function to build the plots
# with the confusion matrix.
##################################################
def plot_cm(cm, cm_norm):
    plt.figure()
    plt.title(u'Confusion Matrix')
    a = plt.subplot(121)
    a.set_title(u"Regular Confusion Matrix", fontsize=18)
    plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
    plt.colorbar(fraction=0.046, pad=0.04)
    tick_marks = np.arange(len(iris.target_names))
    plt.xticks(tick_marks, iris.target_names, rotation=45)
    plt.yticks(tick_marks, iris.target_names)
    plt.ylabel(u'True Class', fontsize=16)
    plt.xlabel(u'Predicted Class', fontsize=16)
    b = plt.subplot(122)
    b.set_title(u"Normalized Confusion Matrix", fontsize=18)
    plt.imshow(cm_norm, interpolation='nearest', cmap=plt.cm.Blues)
    plt.colorbar(fraction=0.046, pad=0.04)
    plt.xticks(tick_marks, iris.target_names, rotation=45)
    plt.yticks(tick_marks, iris.target_names)
    plt.ylabel(u'True Class', fontsize=16)
    plt.xlabel(u'Predicted Class', fontsize=16)
    plt.tight_layout()
    plt.show()

##################################################
# Loads the Iris database
iris = datasets.load_iris()
# Defines the data of interest for the problem
# X is the feature vector (what you called "features" in your example)
# Y is the class vector (what you called "labels" in your example)
X = iris.data
Y = iris.target
# Instantiates the desired algorithm (in this case, a Decision Tree)
model = DecisionTreeClassifier()
# Trains the model with ALL the available data
model.fit(X, Y)
# Checks the trained model (predicting the classes from the features)
Y_pred = model.predict(X)
# Prints the score
score = model.score(X, Y)
print(u"Score: {0:.2f}".format(score))
# Builds the regular and normalized confusion matrices
cm = confusion_matrix(Y, Y_pred)
cm_norm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
# Prints the confusion matrices
np.set_printoptions(precision=2)
print(u'Regular Confusion Matrix')
print(cm)
print(u'Normalized Confusion Matrix')
print(cm_norm)
# Plots the matrices in a chart
plot_cm(cm, cm_norm)
The result of this code is as follows:
Score: 1.00
Regular Confusion Matrix
[[50  0  0]
 [ 0 50  0]
 [ 0  0 50]]
Normalized Confusion Matrix
[[ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]]
And the following chart:
As expected, one observes a score equal to 1.00 (100% correct) and a perfectly diagonal confusion matrix (no confusion at all, in fact), since all examples of all classes were correctly predicted (all the counts fall on the cells where the true class equals the predicted class).
Since testing with training data is useless, part of the data should be used for training and another part for testing. When you have a small amount of data, arbitrarily dividing it into, say, 80% for training and 20% for testing is problematic because it can lead to significant model errors. After all, what if the most relevant examples of the classes are precisely in the 20% that were not used to train the classifier?
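Just to make such a split concrete, here is a minimal sketch using scikit-learn’s train_test_split helper (in the older versions used in this answer it lives in the cross_validation module; newer versions expose it in sklearn.model_selection). The 20% test size and the random_state value are arbitrary choices for the example:
# Minimal sketch of an arbitrary 80/20 split (assumed values, just for illustration)
from sklearn import datasets
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
X, Y = iris.data, iris.target

# Holds out 20% of the examples for testing; random_state fixes the shuffle for reproducibility
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier()
model.fit(X_train, Y_train)

# The score is now computed only on data the model has never seen
print(u"Score on the held-out 20%: {0:.2f}".format(model.score(X_test, Y_test)))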
Thus, there are some approaches that seek to help with this difficulty. Cross-validation with the K-Fold method is one of them. The idea is to divide the available data into k partitions (the "folds") and carry out k rounds of training and testing with those data combinations. Ideally, this minimizes the chance that some important data is left out of training.
You can split the data manually, but the Scikit-Learn library has functions to help with this. Splitting manually requires some care, as you cannot simply "cut" the data array into k contiguous parts. By doing that you may end up leaving all the data of one class out of training (in the Iris Dataset the flowers are of three different types; if you leave only type 1 and type 2 data in the training partition and use the type 3 data in the tests, the amount of errors will be large). The "Stratified" K-Fold takes care of this, ensuring that there is always an equivalent percentage of data from each class in each partition (both in training and in test), as the sketch below illustrates.
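As an illustration of that care (a sketch assuming the same older cross_validation API used in the code further down; newer versions use sklearn.model_selection), the snippet below prints how many examples of each class fall into each test partition. Since the Iris data is ordered by class, a plain, non-shuffled K-Fold with 3 folds puts an entire class in the test partition (so the training partition has none of it), while the stratified version keeps the class proportions balanced:
import numpy as np
from sklearn import datasets
from sklearn import cross_validation

iris = datasets.load_iris()
Y = iris.target  # 150 examples, ordered: 50 of class 0, 50 of class 1, 50 of class 2

# Naive contiguous split: each test fold holds a single class,
# so the matching training fold has no example of that class at all
print(u"Plain KFold (no shuffling):")
for train_index, test_index in cross_validation.KFold(len(Y), n_folds=3):
    print(u"  examples per class in the test fold: {0}".format(np.bincount(Y[test_index], minlength=3)))

# Stratified split: every fold keeps roughly the same proportion of each class
print(u"StratifiedKFold:")
for train_index, test_index in cross_validation.StratifiedKFold(Y, n_folds=3):
    print(u"  examples per class in the test fold: {0}".format(np.bincount(Y[test_index], minlength=3)))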
When you run the k tests, you get one result (score and confusion matrix) per test. From these results you can extract, for example, a mean score, which will be much closer to what can be expected when your model is used in a real scenario (with new data extracted from the "real world"). There are other partitioning methods, such as Leave-One-Out. This is essentially a K-Fold in which the test partition has size 1: you always train with n-1 examples and test with the remaining one, repeating this n times (which is why this type of test takes longer to run).
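Just to sketch how Leave-One-Out might be used (again assuming the older cross_validation module used in this answer; in newer scikit-learn versions the equivalents live in sklearn.model_selection), the cross_val_score helper can run all the rounds and return one score per round:
from sklearn import datasets, cross_validation
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
X, Y = iris.data, iris.target

model = DecisionTreeClassifier()

# One round per example: train with n-1 examples, test with the remaining one
loo = cross_validation.LeaveOneOut(len(Y))
scores = cross_validation.cross_val_score(model, X, Y, cv=loo)

# Each individual score is 0 or 1 (the single test example is either right or wrong),
# so the mean over all n rounds gives the overall hit rate
print(u"Mean score over {0} rounds: {1:.2f}".format(len(scores), scores.mean()))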
The following code, similar to the first example, demonstrates the Stratified K-Fold method:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn import cross_validation
from sklearn.metrics import confusion_matrix
##################################################
# Helper function to build the plots
# with the confusion matrix.
##################################################
def plot_cm(cm, cm_norm):
    plt.figure()
    plt.title(u'Confusion Matrix')
    a = plt.subplot(121)
    a.set_title(u"Regular Confusion Matrix", fontsize=18)
    plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
    plt.colorbar(fraction=0.046, pad=0.04)
    tick_marks = np.arange(len(iris.target_names))
    plt.xticks(tick_marks, iris.target_names, rotation=45)
    plt.yticks(tick_marks, iris.target_names)
    plt.ylabel(u'True Class', fontsize=16)
    plt.xlabel(u'Predicted Class', fontsize=16)
    b = plt.subplot(122)
    b.set_title(u"Normalized Confusion Matrix", fontsize=18)
    plt.imshow(cm_norm, interpolation='nearest', cmap=plt.cm.Blues)
    plt.colorbar(fraction=0.046, pad=0.04)
    plt.xticks(tick_marks, iris.target_names, rotation=45)
    plt.yticks(tick_marks, iris.target_names)
    plt.ylabel(u'True Class', fontsize=16)
    plt.xlabel(u'Predicted Class', fontsize=16)
    plt.tight_layout()

##################################################
# Loads the Iris database
iris = datasets.load_iris()
# Defines the data of interest for the problem
# X is the feature vector (what you called "features" in your example)
# Y is the class vector (what you called "labels" in your example)
X = iris.data
Y = iris.target
# Creates 5 partitions with the available data
kf = cross_validation.StratifiedKFold(Y, n_folds=5)
# Trains the model with the training data OF EACH PARTITION
# and calculates the scores
round_num = 1
scores = []
for train_index, test_index in kf:
    X_train, X_test = X[train_index], X[test_index]
    Y_train, Y_test = Y[train_index], Y[test_index]
    # Instantiates the desired algorithm (in this case, a Decision Tree)
    model = DecisionTreeClassifier()
    # Trains with the training partition
    model.fit(X_train, Y_train)
    # Checks the model with the test partition
    Y_pred = model.predict(X_test)
    score = model.score(X_test, Y_test)
    scores.append(score)
    # Builds the regular and normalized confusion matrices
    cm = confusion_matrix(Y_test, Y_pred)
    cm_norm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    # Prints the confusion matrices
    np.set_printoptions(precision=2)
    print(u"Round #{0} (score: {1:.2f})".format(round_num, score))
    round_num = round_num + 1
    print(u"Training partition: from index #{} to index #{}".format(train_index[0], train_index[-1]))
    print(u"Test partition: from index #{} to index #{}".format(test_index[0], test_index[-1]))
    print(u"----------------------------")
    print(u'Regular Confusion Matrix')
    print(cm)
    print(u'Normalized Confusion Matrix')
    print(cm_norm)
    plot_cm(cm, cm_norm)
# Prints the minimum, maximum and mean scores
scores = np.array(scores)
print(u"Minimum score: {0:.2f} Maximum score: {1:.2f} Mean score: {2:.2f}".format(scores.min(), scores.max(), scores.mean()))
# Displays all the figures
plt.show()
This code produces the following text output:
Round #1 (score: 0.93)
Training partition: from index #1 to index #149
Test partition: from index #0 to index #147
----------------------------
Regular Confusion Matrix
[[10  0  0]
 [ 0 10  0]
 [ 0  2  8]]
Normalized Confusion Matrix
[[ 1.   0.   0. ]
 [ 0.   1.   0. ]
 [ 0.   0.2  0.8]]
Round #2 (score: 0.97)
Training partition: from index #0 to index #149
Test partition: from index #4 to index #143
----------------------------
Regular Confusion Matrix
[[10  0  0]
 [ 0  9  1]
 [ 0  0 10]]
Normalized Confusion Matrix
[[ 1.   0.   0. ]
 [ 0.   0.9  0.1]
 [ 0.   0.   1. ]]
Round #3 (score: 0.87)
Training partition: from index #0 to index #149
Test partition: from index #5 to index #144
----------------------------
Regular Confusion Matrix
[[10  0  0]
 [ 0  9  1]
 [ 0  3  7]]
Normalized Confusion Matrix
[[ 1.   0.   0. ]
 [ 0.   0.9  0.1]
 [ 0.   0.3  0.7]]
Round #4 (score: 0.97)
Training partition: from index #0 to index #149
Test partition: from index #1 to index #148
----------------------------
Regular Confusion Matrix
[[10  0  0]
 [ 0  9  1]
 [ 0  0 10]]
Normalized Confusion Matrix
[[ 1.   0.   0. ]
 [ 0.   0.9  0.1]
 [ 0.   0.   1. ]]
Round #5 (score: 0.97)
Training partition: from index #0 to index #148
Test partition: from index #2 to index #149
----------------------------
Regular Confusion Matrix
[[10  0  0]
 [ 0  9  1]
 [ 0  0 10]]
Normalized Confusion Matrix
[[ 1.   0.   0. ]
 [ 0.   0.9  0.1]
 [ 0.   0.   1. ]]
Minimum score: 0.87 Maximum score: 0.97 Mean score: 0.94
And five charts like the following:
These results already show some classification errors (in the chart shown, from the test with partition 3, between the flowers of the Versicolor and Virginica classes). In fact, the minimum score (obtained in test #3) was 0.87 (87% correct), the maximum score was 0.97 (97% correct) and the mean over all tests was 0.94 (94% correct). This last value is probably the closest to the results expected with real-world data, and so it is a good indication of the quality of your model.
The example code has comments that should help you, but regarding your final question: what the K-Fold function of Scikit-Learn does is return two arrays with the indexes of the data to be used for training and testing. Note the following lines of the code:
for train_index, test_index in kf:
    X_train, X_test = X[train_index], X[test_index]
    Y_train, Y_test = Y[train_index], Y[test_index]
The variables train_index and test_index are arrays with the indexes that the cross_validation.StratifiedKFold function computed for you. So, when you do X[train_index] you "filter" only the data at those indexes of the array (this kind of use, in which an array is accessed with an array of indexes, is a very handy feature of NumPy arrays, known as fancy indexing). :)
Thank you so much for the answer, it helped me a lot, and I believe it will help other people too. Researching a little, I found that I can use cross_validation.StratifiedKFold indirectly through cross_validation.cross_val_score and get the scores directly, passing a model, the feature vector and the classes, etc., plus the cv parameter (in my case cv=10); that method calls _check_cv, which uses StratifiedKFold. I believe the result will be similar to your solution, since it returns the score of each fold. – Douglas Arantes
Exactly. This library is fantastic, it makes things so much easier. And you're welcome, I'm happy to help. :)
– Luiz Vieira