How to change the threshold of a classification model?


In addition to predicting classes directly, some machine learning models produce, for each observation in the sample, a vector of probabilities of belonging to each class. The class predicted for an observation is the one whose probability exceeds a parameter set by the researcher. This parameter is called the threshold and defaults to 0.5.
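To make the rule concrete, here is a minimal sketch of thresholding in the binary case. The probabilities and the helper name `classify` are made up for illustration and are not part of any library:

```python
import numpy as np

# Hypothetical positive-class probabilities for five observations.
proba_positive = np.array([0.10, 0.35, 0.50, 0.62, 0.90])

def classify(proba, threshold=0.5):
    """Assign class 1 when the positive-class probability exceeds the threshold."""
    return (proba > threshold).astype(int)

pred_default = classify(proba_positive)     # threshold 0.5
pred_lower = classify(proba_positive, 0.3)  # threshold 0.3

print(pred_default)
print(pred_lower)
```

Lowering the threshold can only move observations from class 0 to class 1, which is why it tends to raise sensitivity at the expense of specificity.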

As my sample is very unbalanced, I would like to change this threshold to obtain greater sensitivity (true positive rate), even if it costs a little specificity (true negative rate, i.e. 1 minus the false positive rate). This trade-off can be seen in the ROC curve. Here is an example:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split

sns.set_style("whitegrid")

data = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")

# First column is the target (admit); the remaining columns are predictors
X = data.iloc[:, 1:]
y = data['admit']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

logistic = LogisticRegression()
logistic.fit(X_train, y_train)
pred = logistic.predict(X_test)  # classes at the default 0.5 threshold

# Column 1 holds the probability of the positive class (admit = 1)
y_scores = logistic.predict_proba(X_test)
fpr, tpr, thresholds = roc_curve(y_test, y_scores[:, 1])
roc_auc = auc(fpr, tpr)

fig, ax=plt.subplots(figsize=(6,8))

ax.plot(fpr, tpr, 'k', label = 'AUC = %0.2f' % roc_auc)
ax.legend(loc = 'lower right')
ax.plot([0, 1], [0, 1],'k--')
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.set_ylabel('Sensitivity')
ax.set_xlabel('1 - Specificity')

plt.show()

[Image: ROC curve produced by the code above]

What I'd like to know is: how do I identify the combination of sensitivity and specificity associated with the 0.5 threshold? And how do I run a model with a threshold value that gives the desired combination of sensitivity and specificity?

Here is an example taken from the book "Applied Predictive Modeling":

[Image: figure from the book]

1 answer

The solution is simply to generate a new prediction vector from the probability vector of the reference class.

To get the probability vectors, just do:

y_scores=logistic.predict_proba(X_test)

Since the example data has two classes, y_scores has two columns. The second column holds the probability of belonging to the reference class (admit = 1). Now we can generate a new prediction vector:

pred2=pd.Series(y_scores[:,1]).map(lambda x: 1 if x > threshold else 0)

Here threshold is the value chosen by the researcher. To see the difference, compare the classification reports for the two prediction vectors:

from sklearn.metrics import classification_report

print(classification_report(y_test, pred))

Returns:

              precision    recall  f1-score   support

           0       0.72      0.97      0.82        90
           1       0.73      0.19      0.30        42

    accuracy                           0.72       132
   macro avg       0.72      0.58      0.56       132
weighted avg       0.72      0.72      0.66       132

And:

threshold=0.3
pred2=pd.Series(y_scores[:,1]).map(lambda x: 1 if x > threshold else 0)

print(classification_report(y_test, pred2))

Returns:

              precision    recall  f1-score   support

           0       0.86      0.56      0.68        90
           1       0.46      0.81      0.59        42

    accuracy                           0.64       132
   macro avg       0.66      0.68      0.63       132
weighted avg       0.73      0.64      0.65       132

Note that the change increased the sensitivity of the model (recall of class 1) from 0.19 to 0.81, while the specificity (recall of class 0) dropped from 0.97 to 0.56.
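The question also asks how to read off the sensitivity and specificity at a given threshold, and how to find a threshold for a desired sensitivity. One possible approach is sketched below on toy data. The labels, scores, the helper `sens_spec` and the target of 0.75 are all made up for illustration; with the example above you would pass y_test and y_scores[:, 1] instead. Note that roc_curve treats a score greater than or equal to the threshold as positive, so the sketch uses >= rather than the > used earlier.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_curve

# Toy labels and positive-class scores, made up for illustration.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.35, 0.6, 0.25, 0.4, 0.7, 0.9])

def sens_spec(y_true, scores, threshold):
    """Sensitivity and specificity when predicting 1 for scores >= threshold."""
    pred = (scores >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, pred, labels=[0, 1]).ravel()
    return tp / (tp + fn), tn / (tn + fp)

# 1) Operating point of the default 0.5 threshold.
sens_default, spec_default = sens_spec(y_true, scores, 0.5)

# 2) Highest threshold whose sensitivity reaches a target value.
target_sensitivity = 0.75  # hypothetical target
# drop_intermediate=False keeps every candidate threshold on the curve.
fpr, tpr, thresholds = roc_curve(y_true, scores, drop_intermediate=False)
idx = np.argmax(tpr >= target_sensitivity)  # first point meeting the target
chosen_threshold = thresholds[idx]
sens_chosen, spec_chosen = sens_spec(y_true, scores, chosen_threshold)

print(f"threshold 0.50: sensitivity={sens_default:.2f}, specificity={spec_default:.2f}")
print(f"threshold {chosen_threshold:.2f}: sensitivity={sens_chosen:.2f}, specificity={spec_chosen:.2f}")
```

Because roc_curve returns thresholds in decreasing order and tpr never decreases along the curve, the first index where tpr reaches the target corresponds to the largest threshold that achieves it, and 1 - fpr[idx] is the specificity paid for it.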
