Text mining with Scikit-Learn


I’m doing some research in the area of sentiment analysis, so I’m running some tests on a text database to get results. Looking at tutorials and other material on the internet, I concluded that Python’s scikit-learn library is widely used for this. However, I am having trouble getting this library to work. Any help is welcome.

Class Init

import codecs
import baseline


def loadContent():
    # read the positive and negative opinion files (1000 lines each)
    positiveData = codecs.open('opinioesPositivas.txt', 'r', encoding='utf8').readlines()

    file = codecs.open('opinioesNegativas.txt', 'r', encoding='utf8')
    negativeData = file.readlines()

    data_set = [0 for i in range(2000)]
    label_set = [0 for i in range(2000)]

    data_set[:1000] = positiveData
    data_set[1000:] = negativeData

    for i in range(2000):
        if i < 1000:
            label_set[i] = "p"
        else:
            label_set[i] = "n"

    return data_set, label_set


def run_baseline():
    # getting the data #
    data_set, label_set = loadContent()
    baseline_classifier = baseline

    # pre-processing and setting the data to train and test the model #
    data_set = baseline_classifier.data_TFIDF_transform(data_set)
    # data_set = baseline_classifier.data_transform(data_set)

    folds = 10
    scores = baseline_classifier.runKFoldCrossValitation(data_set, label_set, folds)

    return scores


scores = run_baseline()

print(scores)
print("Baseline Accuracy: {} +/- {}".format(scores.mean(), scores.std() ** 2))

print(scores)
print("Stylometric Accuracy: {} +/- {}".format(scores.mean(), scores.std() ** 2))

Baseline class

from sklearn.cross_validation import StratifiedShuffleSplit, cross_val_score, train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

classifier = OneVsOneClassifier(SVC(kernel='linear', random_state=84, probability=True))


# training method #
def buildModel(train, labels):
    # train_transformed = tf_idf.fit_transform(train)
    classifier.fit(train, labels)


# prediction method #
def predict(test_data):
    # test_transformed = tf_idf.fit_transform(test_data)
    return classifier.predict(test_data)


# pre-processing and setting the data to train and test the model #
def data_transform(data_set):
    transform = CountVectorizer(ngram_range=(1, 1))
    data_set = transform.fit_transform(data_set)
    return data_set


def data_TFIDF_transform(data_set):
    tf_idf = TfidfVectorizer(ngram_range=(1, 1))
    data_set = tf_idf.fit_transform(data_set)

    return data_set


def runKFoldCrossValitation(data_set, label_set, folds):
    classifier = OneVsOneClassifier(SVC(kernel='linear', random_state=84, probability=True))

    # split data
    train_data, test_data, train_label, test_label = train_test_split(data_set, label_set, test_size=0.1,
                                                                      random_state=0)

    # stratified 10-fold cross-validation
    skf = StratifiedShuffleSplit(n_splits=folds)

    # cross-validation
    scores = cross_val_score(classifier, test_data, test_label, cv=skf)

    return scores

Error presented

Traceback (most recent call last):
File "C:/Users/Jeferson/PycharmProjects/NewProject/TestePython.py", line 40, in <module>
   scores = run_baseline()
File "C:/Users/Jeferson/PycharmProjects/NewProject/TestePython.py", line 36, in run_baseline
   scores = baseline_classifier.runKFoldCrossValitation(data_set, label_set, folds)
File "C:\Users\Jeferson\PycharmProjects\NewProject\baseline.py", line 42, in runKFoldCrossValitation
   skf = StratifiedShuffleSplit(n_splits=folds)
TypeError: __init__() got an unexpected keyword argument 'n_splits'

1 answer

Which version of scikit-learn are you using? In what you called the Baseline class, right at the start you make an import like this:

from sklearn.cross_validation import StratifiedShuffleSplit

But if you consult the documentation, you will see that this is the 0.17 version of the class, and it has no n_splits keyword parameter:

sklearn.cross_validation.StratifiedShuffleSplit(y, n_iter=10, test_size=0.1, 
train_size=None, random_state=None)
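For illustration, on 0.17 the splitter is built from the labels themselves and the number of splits is called n_iter, so a call would look roughly like this (a minimal sketch, assuming the label_set and folds from your code):

from sklearn.cross_validation import StratifiedShuffleSplit

# 0.17 API: the labels go to the constructor and the number of
# splits is named n_iter, not n_splits
skf = StratifiedShuffleSplit(label_set, n_iter=folds, test_size=0.1, random_state=0)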

And if you consult the changelog of version 0.18.1, you will find the following note:

All cross-validation utilities in sklearn.model_selection now allow one-time cross-validation splitters for the cv parameter. Also non-deterministic cross-validation splitters (where multiple calls to split produce dissimilar splits) can be used as cv parameter. The sklearn.model_selection.GridSearchCV will cross-validate each parameter setting on the split produced by the first split call to the cross-validation splitter.

So you have to update the library and use sklearn.model_selection.StratifiedShuffleSplit, which is a different class that takes different parameters, including n_splits:

sklearn.model_selection.StratifiedShuffleSplit(n_splits=10, test_size=0.1, 
train_size=None, random_state=None)
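Putting that together, the cross-validation in your baseline module could look roughly like this (a minimal sketch, assuming scikit-learn 0.18+ and the same classifier, data_set and label_set as in your code):

from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC


def runKFoldCrossValitation(data_set, label_set, folds):
    classifier = OneVsOneClassifier(SVC(kernel='linear', random_state=84, probability=True))

    # 0.18+ API: n_splits is a constructor argument; the labels are
    # only seen later, when the splitter is actually used
    skf = StratifiedShuffleSplit(n_splits=folds, test_size=0.1, random_state=0)

    # the splitter object can be passed directly as the cv parameter
    scores = cross_val_score(classifier, data_set, label_set, cv=skf)

    return scores

Note that in this sketch the whole data_set/label_set is passed to cross_val_score, instead of only the held-out test split as in your current code; cross_val_score does the splitting itself via the cv parameter.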
  • Exactly, I was using version 0.17, which came with the Conda setup I was using for Python. I have now upgraded to 0.18.1 with Python 3.5.

  • Now I get: "This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators is different from that of this module. This module will be removed in 0.20." (DeprecationWarning) — see the sketch after these comments for the interface difference.

  • Given my code, what should my import look like? Are the parameters I pass correct?

  • The message says exactly what I answered: in fact, what you posted here is not even an error message but a warning. What you have to do is use the new class, as I suggested in the last paragraph.

  • @Rivaldo did you get it working?

  • Yes, with a few changes. @AndréNascimento

  • Excellent answer. +1
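On the interface difference the deprecation warning mentions: in the old sklearn.cross_validation module the splitter object itself is iterable, while in sklearn.model_selection you call its split() method with the data. A rough sketch, assuming the data_set and label_set from the question:

from sklearn.model_selection import StratifiedShuffleSplit

skf = StratifiedShuffleSplit(n_splits=10, test_size=0.1, random_state=0)

# new interface: the fold indices come from split(X, y), rather than
# from iterating over the splitter object as in sklearn.cross_validation
for train_index, test_index in skf.split(data_set, label_set):
    train_data, test_data = data_set[train_index], data_set[test_index]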

