Sklearn - error in model training

I’m trying to classify text with sklearn, but I’m getting an error:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn import metrics

X = df['texto'].values  # text used as the basis for classification
Y = df['sentimento'].values  # sentiment is the target to be learned. Note: the sentimento column is already filled in with the appropriate sentiment for each text (safe, unsafe, or neutral)
split_test_size = 0.30  # 30% for testing and 70% for training

# splitting the data
X_treino, X_teste, Y_treino, Y_teste = train_test_split(X, Y, test_size = split_test_size, random_state = 42)

modelo_v1 = GaussianNB()

# training the model
modelo_v1.fit(X_treino, Y_treino.ravel())

It returns this error:

Traceback (most recent call last):
  File "C: Users USUARIO workspacePython tests for exampleClassificacaTwitter2.py", line 280, in <module>
    main()
  File "C: Users USUARIO workspacePython tests for exampleClassificacaTwitter2.py", line 65, in main
    classificar3(df, "I’m afraid of violence")
  File "C: Users USUARIO workspacePython tests for exampleClassificacaTwitter2.py", line 277, in classificar3
    modelo_v1.fit(X_treino, Y_treino.ravel())
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\naive_bayes.py", line 182, in fit
    X, y = check_X_y(X, y)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 521, in check_X_y
    ensure_min_features, warn_on_dtype, estimator)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 382, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: 'I only feel comfortable in a quiet place'

Does it not work with strings? Or would I have to use word frequency counts instead?

  • Good afternoon, @André Nascimento. I believe you are passing the original texts (raw data) to the model instead of features extracted from them with a feature extraction technique. CountVectorizer is imported in your code, but I don't see it being used. Using it should help.

1 answer

As explained in the sklearn documentation tutorial "Working With Text Data" (found at this link), the library ships several feature extractors that are ready to use.

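A simple example is CountVectorizer, which builds a bag-of-words representation of the texts. By default it lowercases and tokenizes the text and counts single words; n-grams of other sizes (ngram_range) and stop word removal (stop_words) are available as options, alongside other common preprocessing steps.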
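To make the bag of words concrete, here is a small sketch on two made-up sentences. The sentences and variable names are assumptions for illustration only, not data from the question. It compares the default settings with the optional n-gram and stop word settings.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, only for illustration
docs = ["I feel safe at home", "I do not feel safe outside"]

# Default settings: lowercase, word tokens of 2+ characters, single words only
unigram_vect = CountVectorizer()
X_uni = unigram_vect.fit_transform(docs)
vocab_uni = sorted(unigram_vect.vocabulary_)

# Optional settings: single words plus word pairs, English stop words removed
bigram_vect = CountVectorizer(ngram_range=(1, 2), stop_words="english")
X_bi = bigram_vect.fit_transform(docs)
vocab_bi = sorted(bigram_vect.vocabulary_)

print(vocab_uni)        # vocabulary learned with the defaults
print(X_uni.toarray())  # one row per document, one column per vocabulary term
print(vocab_bi)         # vocabulary with bigrams and stop word filtering
```

Each row of the resulting matrix is numeric, which is what the estimators expect in place of the raw strings.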

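The documentation gives this simple example:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

# twenty_train is the 20 newsgroups training set that the tutorial loads beforehand
twenty_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)  # sparse document-term count matrix
X_train_counts.shape  # (number of documents, vocabulary size)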
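Applied to the problem in the question, a minimal sketch could look like the one below. The two lists are made-up stand-ins for df['texto'].values and df['sentimento'].values, since the real df is not shown. Swapping GaussianNB for MultinomialNB is my own suggestion, not something taken from the documentation excerpt: MultinomialNB accepts the sparse count matrix directly.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-ins for df['texto'].values and df['sentimento'].values
X = [
    "I only feel comfortable in a quiet place",
    "I'm afraid of violence",
    "I feel safe at home",
    "The street is dangerous at night",
    "Today is Tuesday",
    "The bus arrives at noon",
]
Y = ["seguro", "inseguro", "seguro", "inseguro", "neutro", "neutro"]

X_treino, X_teste, Y_treino, Y_teste = train_test_split(
    X, Y, test_size=0.30, random_state=42
)

# Learn the vocabulary from the training texts only, then reuse it for the test texts
count_vect = CountVectorizer()
X_treino_counts = count_vect.fit_transform(X_treino)
X_teste_counts = count_vect.transform(X_teste)

# Train on word counts instead of raw strings
modelo_v1 = MultinomialNB()
modelo_v1.fit(X_treino_counts, Y_treino)

previsoes = modelo_v1.predict(X_teste_counts)
print(previsoes)
```

If GaussianNB is required, the sparse matrix would first have to be converted to a dense array with X_treino_counts.toarray(), which can use a lot of memory with a large vocabulary.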

I hope this helps.

P.S.: Stay well and stay at home (COVID-19 times).
