Code evaluation: Logistic regression with K fold validation. Is that correct?

The code below is an attempt at logistic regression with k-fold cross-validation. The idea is to take the confusion matrix generated in each fold and then compute an average confusion matrix, together with a 95% confidence interval for that average.

Does the code make sense? Any suggestions for improvements or corrections?

import numpy as np
from sklearn import model_selection
from sklearn import datasets
from sklearn import svm
import pandas as pd
from sklearn.linear_model import LogisticRegression
from scipy.stats import sem, t
from numpy import mean  # the NumPy aliases in the scipy namespace (scipy.mean) are deprecated

lista_matrizes = []


UNSW = pd.read_csv('/home/sec/Desktop/CEFET/UNSW_NB15_testing-set.csv')

previsores = UNSW.iloc[:,UNSW.columns.isin(('sload','dload',
                                                   'spkts','dpkts','swin','dwin','smean','dmean',
'sjit','djit','sinpkt','dinpkt','tcprtt','synack','ackdat','ct_srv_src','ct_srv_dst','ct_dst_ltm',
 'ct_src_ltm','ct_src_dport_ltm','ct_dst_sport_ltm','ct_dst_src_ltm')) ].values


classe= UNSW.iloc[:, -1].values

#iris = datasets.load_iris()
#print(iris.data.shape, iris.target.shape)

X_train, X_test, y_train, y_test = model_selection.train_test_split(
previsores, classe, test_size=0.4, random_state=0)

print(X_train.shape, y_train.shape)
#((90, 4), (90,))
print(X_test.shape, y_test.shape)
#((60, 4), (60,))

logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
print(previsores.shape)

#clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
print(logmodel.score(X_test, y_test) ) 


#Computing cross-validated metrics

logmodel = LogisticRegression()
scores = model_selection.cross_val_score(
    logmodel, previsores, classe, cv=30)

print(scores)                                             
#array([ 0.96...,  1.  ...,  0.96...,  0.96...,  1.        ])
#print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

########K FOLD
print('########K FOLD########K FOLD########K FOLD########K FOLD')
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix

kf = KFold(n_splits=3, random_state=None, shuffle=False)
kf.get_n_splits(previsores)
for train_index, test_index in kf.split(previsores):

    X_train, X_test = previsores[train_index], previsores[test_index]
    y_train, y_test = classe[train_index], classe[test_index]

    logmodel.fit(X_train, y_train)
    print (confusion_matrix(y_test, logmodel.predict(X_test)))

    lista_matrizes.append(confusion_matrix(y_test, logmodel.predict(X_test)))


#print(lista_matrizes)

final = np.mean(lista_matrizes, axis=0)
print(f"Mean confusion matrix\n{final}")


# the confidence interval
def mean_confidence_interval(data, confidence=0.95):

    #data = [1, 2, 3, 4, 5]

    n = len(data)
    m = mean(data)
    std_err = sem(data)
    h = std_err * t.ppf((1 + confidence) / 2, n - 1)

    start = m - h
    #print (start)


    end = m + h
    #print (end)

    return start, end




print()
print(f"Confidence interval: \n{mean_confidence_interval(final)}")

1 answer

Your code is correct, though a bit messy. Going by what you describe in the question, the first part, above the k-fold section, is not necessary. I would also import all libraries at the top of the file so the code is easier to follow.
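As a rough illustration of that cleanup, here is a minimal sketch that keeps only the k-fold part, with every import at the top. It makes several assumptions that are not in the question: `make_classification` stands in for the UNSW-NB15 CSV, the folds are shuffled with a fixed seed, `max_iter=1000` is set to avoid convergence warnings, and the helper names `kfold_confusion_matrices` and `mean_confidence_interval` are hypothetical. It also computes the interval cell by cell across the folds (passing the stack of per-fold matrices rather than the averaged matrix), which is one possible reading of "confidence interval for the average".

```python
import numpy as np
from scipy.stats import sem, t
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold

# Assumption: synthetic data standing in for the UNSW-NB15 features and labels.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)


def kfold_confusion_matrices(X, y, n_splits=5):
    """Fit one model per fold and return the stacked confusion matrices."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    labels = np.unique(y)
    matrices = []
    for train_idx, test_idx in kf.split(X):
        model = LogisticRegression(max_iter=1000)
        model.fit(X[train_idx], y[train_idx])
        y_pred = model.predict(X[test_idx])
        # labels= keeps every matrix the same shape even if a fold lacks a class
        matrices.append(confusion_matrix(y[test_idx], y_pred, labels=labels))
    return np.array(matrices)  # shape: (n_splits, n_classes, n_classes)


def mean_confidence_interval(matrices, confidence=0.95):
    """Per-cell mean and t-based confidence interval across the folds."""
    n = matrices.shape[0]  # number of folds, not number of classes
    mean_cm = matrices.mean(axis=0)
    half_width = sem(matrices, axis=0) * t.ppf((1 + confidence) / 2, n - 1)
    return mean_cm, mean_cm - half_width, mean_cm + half_width


matrices = kfold_confusion_matrices(X, y)
mean_cm, lower, upper = mean_confidence_interval(matrices)
print("Mean confusion matrix:", mean_cm, sep="\n")
print("Lower bound:", lower, sep="\n")
print("Upper bound:", upper, sep="\n")
```

With only a handful of folds the t-interval will be wide, so the bounds are best read as a rough indication of fold-to-fold variation.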

I usually turn these procedures into functions so that I can build a pipeline that generates reports for future models. Another thing worth evaluating in your code is:

sklearn.metrics.classification_report

This function returns a report containing precision, recall, and F1-score, which are extremely important metrics for classification problems. Function and variable names: it is important to choose them well, because as your code grows you may end up unable to reuse what you wrote.
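A small sketch of how `classification_report` might be used with a logistic regression. The synthetic data from `make_classification`, the 60/40 split, and `max_iter=1000` are assumptions made for illustration; with the real data you would pass your own `y_test` and predictions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Assumption: synthetic binary data instead of the UNSW-NB15 CSV.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Text table with per-class precision, recall, F1-score and support
print(classification_report(y_test, y_pred))

# The same report as a nested dict, handy when building automated reports
report = classification_report(y_test, y_pred, output_dict=True)
print(report["accuracy"])
```

The dictionary form is convenient inside a k-fold loop, since the per-fold values can be collected and averaged in the same way as the confusion matrices.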
