Text mining with Scikit-Learn

Question

Asked 7 years, 12 months ago

Viewed 449 times

I'm doing research in sentiment analysis, so I'm running some tests on a text database to get results. After looking through tutorials and other material online, I concluded that Python's scikit-learn library is widely used for this. However, I'm having trouble getting the library to work. Any help is welcome.

Class Init

import codecs
import baseline


def loadContent():
    positiveData = codecs.open('opinioesPositivas.txt', 'r', encoding='utf8').readlines()

    file = codecs.open('opinioesNegativas.txt', 'r', encoding='utf8')
    negativeData = file.readlines()

    data_set = [0 for i in range(2000)]
    label_set = [0 for i in range(2000)]

    data_set[:1000] = positiveData
    data_set[1000:] = negativeData

    for i in range(2000):
        if i < 1000:
            label_set[i] = "p"
        else:
            label_set[i] = "n"

    return data_set, label_set


def run_baseline():
    # getting the data#
    data_set, label_set = loadContent()
    baseline_classifier = baseline
    # Pre-processing and setting the data to train and test model#

    data_set = baseline_classifier.data_TFIDF_transform(data_set)
    # data_set = baseline_classifier.data_transform(data_set)

    folds = 10
    scores = baseline_classifier.runKFoldCrossValitation(data_set, label_set, folds)

    return scores

scores = run_baseline()

print(scores)
print("Baseline Accuracy: {} +/- {}".format(scores.mean(), scores.std() ** 2))

print(scores)
print("Stylometric Accuracy: {} +/- {}".format(scores.mean(), scores.std() ** 2))

Baseline class

from sklearn.cross_validation import StratifiedShuffleSplit, cross_val_score, train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

classifier = OneVsOneClassifier(SVC(kernel='linear', random_state=84, probability=True))


# training method #
def buildModel(train, labels):
    # train_transformed = tf_idf.fit_transform(train)
    classifier.fit(train, labels)


# predicted method #
def predict(test_data):
    # test_transformed = tf_idf.fit_transform(test_data)
    return classifier.predict(test_data)


# Pre-processing and setting the data to train and test model#
def data_transform(data_set):
    transform = CountVectorizer(ngram_range=(1, 1))
    data_set = transform.fit_transform(data_set)
    return data_set


def data_TFIDF_transform(data_set):
    tf_idf = TfidfVectorizer(ngram_range=(1, 1))
    data_set = tf_idf.fit_transform(data_set)

    return data_set


def runKFoldCrossValitation(data_set: object, label_set: object, folds: object) -> object:
    classifier = OneVsOneClassifier(SVC(kernel='linear', random_state=84, probability=True))
    # Split Data
    train_data, test_data, train_label, test_label = train_test_split(data_set, label_set, test_size=0.1,
                                                                      random_state=0)

    # Class Stratified 10-fold Cross Validation
    skf = StratifiedShuffleSplit(n_splits=folds)

    # Cross Validation
    scores = cross_val_score(classifier, test_data, test_label, cv=skf)

    return scores

Error presented

Traceback (most recent call last):
  File "C:/Users/Jeferson/PycharmProjects/NewProject/TestePython.py", line 40, in <module>
    scores = run_baseline()
  File "C:/Users/Jeferson/PycharmProjects/NewProject/TestePython.py", line 36, in run_baseline
    scores = baseline_classifier.runKFoldCrossValitation(data_set, label_set, folds)
  File "C:\Users\Jeferson\PycharmProjects\NewProject\baseline.py", line 42, in runKFoldCrossValitation
    skf = StratifiedShuffleSplit(n_splits=folds)
TypeError: __init__() got an unexpected keyword argument 'n_splits'

1 answer

by Sidon • **6,563** points · Answer 1 · 2017-07-30T14:34:06+00:00

Which version of scikit-learn are you using? In what you called the baseline class, right at the start you have an import like this:

from sklearn.cross_validation import StratifiedShuffleSplit

If you consult the documentation, you will see that this is the version 0.17 class and that it has no named parameter n_splits (the number of iterations is passed as n_iter, and the labels y go to the constructor):

sklearn.cross_validation.StratifiedShuffleSplit(y, n_iter=10, test_size=0.1, 
train_size=None, random_state=None)
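
To answer the version question, a small check like the one below may help. It is only a sketch: it prints the installed version and reports which of the two modules discussed here can be imported. The variable name `api` is illustrative and not part of the original code.

```python
import sklearn

# Show the installed scikit-learn version.
print(sklearn.__version__)

# Prefer the newer module; fall back to the old one only if the
# newer one is missing (i.e. on an installation older than 0.18).
try:
    from sklearn.model_selection import StratifiedShuffleSplit
    api = "model_selection"
except ImportError:
    from sklearn.cross_validation import StratifiedShuffleSplit
    api = "cross_validation"

print("StratifiedShuffleSplit imported from sklearn." + api)
```

If this reports `cross_validation`, the class in use is the old one, which would explain the TypeError on `n_splits`.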

And if you consult the changelog for version 0.18.1, you will find the following note:

All cross-validation utilities in sklearn.model_selection now permit one time cross-validation splitters for the cv parameter. Also non-deterministic cross-validation splitters (where multiple calls to split produce dissimilar splits) can be used as cv parameter. The sklearn.model_selection.GridSearchCV will cross-validate each parameter setting on the split produced by the first split call to the cross-validation splitter.

So you probably need to update the library and use sklearn.model_selection.StratifiedShuffleSplit, which is a different class that takes different parameters, including n_splits.

sklearn.model_selection.StratifiedShuffleSplit(n_splits=10, test_size=0.1, 
train_size=None, random_state=None)
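
As a rough illustration of the updated import, here is a minimal sketch assuming scikit-learn 0.18 or later. The toy corpus is a made-up stand-in for the two opinion files in the question, and the classifier is simplified (no `probability=True`); adapt it to your own data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

# Hypothetical toy corpus standing in for the real opinion files.
texts = ["good great excellent"] * 20 + ["bad awful terrible"] * 20
labels = ["p"] * 20 + ["n"] * 20

# Same TF-IDF pre-processing as data_TFIDF_transform in the question.
features = TfidfVectorizer(ngram_range=(1, 1)).fit_transform(texts)

# New-style splitter: the number of re-shuffles is given by n_splits,
# and the labels are no longer passed to the constructor.
folds = 10
splitter = StratifiedShuffleSplit(n_splits=folds, test_size=0.1, random_state=0)

classifier = OneVsOneClassifier(SVC(kernel="linear", random_state=84))
scores = cross_val_score(classifier, features, labels, cv=splitter)

print(scores)
print("Accuracy: {} +/- {}".format(scores.mean(), scores.std()))
```

The only changes relative to the question's baseline module are the import path (`sklearn.model_selection` instead of `sklearn.cross_validation`) and the explicit `test_size` and `random_state` passed to the splitter.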