Question about model predict: how to test a single user input

I think this is a simple question, but in every course I'm taking the instructor teaches how to split training and test data from a CSV or some other dataset. I want to test with user input instead, but when I try, I get an error saying the input must be the same size as the training data. Is there no way to test just one input?

Example: my DataFrame

I am using the tratamento_5 column with the following code:

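from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# regressao_logistica is created earlier (not shown); presumably a scikit-learn LogisticRegression
tfidf = TfidfVectorizer(lowercase=False)
vetor_tfidf = tfidf.fit_transform(resenha["tratamento_5"])
treino, teste, classe_treino, classe_teste = train_test_split(vetor_tfidf,
                                                              resenha["classificacao"],
                                                              random_state = 42)
regressao_logistica.fit(treino, classe_treino)
acuracia_tfidf = regressao_logistica.score(teste, classe_teste)
print(regressao_logistica.predict(teste).tolist())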

This code separates test and training data and predicts with test data.

But I want to do something with user interaction, i.e. classify a text entered by the user. I tried it this way:

vetor_tfidf2 = tfidf.fit_transform(["Esse filme foi muito bom, gostei dos movimentos de ação do inicio até o final do filme"])
treino, teste, classe_treino, classe_teste = train_test_split(vetor_tfidf2,
                                                              resenha["classificacao"],
                                                              random_state = 42)
regressao_logistica.fit(treino, classe_treino)
acuracia_tfidf = regressao_logistica.score(teste, classe_teste)
print(regressao_logistica.predict(teste).tolist())

print(vetor_tfidf2.shape)
print(resenha['classificacao'].shape)

But I get the following error:

ValueError: Found input variables with inconsistent numbers of samples: [1, 49459]

This seems to be because the training and test data have different sizes. How can I do this with just one sentence, without using the DataFrame as I tried?

1 answer

First, the error occurs in this call:

treino, teste, classe_treino, classe_teste = train_test_split(vetor_tfidf2,
                                                              resenha["classificacao"],
                                                              random_state = 42)

This is because vetor_tfidf2 has only 1 sample, while resenha["classificacao"] has 49459. To use train_test_split(X, y), X and y must have the same number of samples, because each entry in X is paired with one label in y. For example, X[0] has the label y[0].
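
To make the pairing rule concrete, here is a minimal sketch with toy data. The values in X and y are invented for illustration and have nothing to do with the resenha DataFrame.

```python
from sklearn.model_selection import train_test_split

# Toy data (made up): 4 samples and 4 labels, one label per sample.
X = [[0.1], [0.2], [0.3], [0.4]]
y = [0, 1, 0, 1]

# Same number of samples and labels: the split works.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 1 sample against 4 labels: scikit-learn cannot pair them and raises ValueError.
try:
    train_test_split(X[:1], y, random_state=42)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

The printed message should be the same kind of error as in the question, only with the toy sizes instead of [1, 49459].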

How to test only one input?

With the first part of your code:

tfidf = TfidfVectorizer(lowercase=False)
vetor_tfidf = tfidf.fit_transform(resenha["tratamento_5"])
treino, teste, classe_treino, classe_teste = train_test_split(vetor_tfidf,
                                                              resenha["classificacao"],
                                                              random_state = 42)
regressao_logistica.fit(treino, classe_treino)

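At this point we already have a fitted tfidf and a trained regressao_logistica, and we do not want to retrain either of them. That is why the line vetor_tfidf2 = tfidf.fit_transform(["Frase que quero testar"]) is wrong: fit_transform() refits the vectorizer on the new sentence alone, so the vocabulary (and therefore the feature columns) no longer matches what the model was trained on. The new sentence should only be transformed with the already fitted tfidf. This is done with the .transform() method: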

vetor_tfidf2 = tfidf.transform(["Frase que quero testar"])
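
As a rough illustration of the difference, the sketch below uses a small made-up corpus (not the original data) and compares the shape produced by transform() on the fitted vectorizer with the shape produced by refitting on the new sentence alone. The refit uses a separate vectorizer so the fitted one is left untouched.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up training corpus, standing in for resenha["tratamento_5"].
corpus = ["filme muito bom", "filme muito ruim", "gostei do filme"]

tfidf = TfidfVectorizer(lowercase=False)
train_matrix = tfidf.fit_transform(corpus)  # learns a 6-word vocabulary
print(train_matrix.shape)                   # (3, 6)

# transform() reuses the learned vocabulary: same 6 columns as training.
new_vec = tfidf.transform(["filme bom"])
print(new_vec.shape)                        # (1, 6)

# Refitting on the new sentence alone learns only its own 2 words,
# so the columns no longer line up with what a model was trained on.
refit = TfidfVectorizer(lowercase=False).fit_transform(["filme bom"])
print(refit.shape)                          # (1, 2)
```

A model trained on 6 columns cannot meaningfully score a 2-column vector, which is why only transform() is appropriate for new sentences.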

Since the model is already trained, it makes no sense to split the new sentences into training and test sets and retrain the model. All we need to do is take the sentences and feed them to the model. That is why the failing line, the second train_test_split(), can be dropped entirely.

The lines that follow it, regressao_logistica.fit(treino, classe_treino) and acuracia_tfidf = regressao_logistica.score(teste, classe_teste), should also be dropped. We do not want to retrain the model, and accuracy cannot be computed for a new sentence that has no known label.

Finally, your second code block (to test new sentences) should be:

# use the already fitted tfidf to transform the new sentence
vetor_tfidf2 = tfidf.transform(["Frase que quero testar"])

# pass the transformed sentence as input to the already trained model
print(regressao_logistica.predict(vetor_tfidf2).tolist())

One caveat: the model was trained on sentences that had already been preprocessed (the tratamento_5 column). New sentences should go through the same preprocessing before being used as model input.
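
Putting the pieces together, the sketch below shows one possible shape for this. The tratar() function is a hypothetical stand-in for whatever preprocessing produced tratamento_5 (the question does not show it), the training sentences and labels are invented, and regressao_logistica is assumed to be a scikit-learn LogisticRegression.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def tratar(texto):
    # Hypothetical preprocessing; replace with the real steps behind "tratamento_5".
    return " ".join(texto.lower().split())


# Invented training data, standing in for the resenha DataFrame.
frases = ["Filme muito bom", "Gostei muito do filme",
          "Filme muito ruim", "Não gostei do filme"]
classes = ["pos", "pos", "neg", "neg"]

# Fit the vectorizer and the model once, on preprocessed training text.
tfidf = TfidfVectorizer(lowercase=False)
regressao_logistica = LogisticRegression()
regressao_logistica.fit(tfidf.fit_transform([tratar(f) for f in frases]), classes)


def classificar(frase):
    # Same preprocessing, then transform() only (no refit), then predict.
    vetor = tfidf.transform([tratar(frase)])
    return regressao_logistica.predict(vetor)[0]


resultado = classificar("Esse filme foi muito bom")
print(resultado)
```

In an interactive script, the string passed to classificar() could come from input() instead of being hard-coded.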
