Bag of words in Python

I have a news dataset and I want to classify the articles into two classes. I thought about using bag of words for this, but I can't get it to work with Sklearn. I tried the following:

#Bag of words
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
print(vectorizer.fit_transform(traindata).todense())
print(vectorizer.vocabulary_)

Any pointers on how to use bag of words with Pandas, Sklearn, etc.?

1 answer

Try using the Gensim library to preprocess your data. It helps produce the vectorization you need as input to the Sklearn model.

Try something like this:

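import json

from gensim.corpora import Dictionary
from gensim.parsing.preprocessing import preprocess_string, DEFAULT_FILTERS, stem_text

with open('./data/jsons/unprocessed_documents.json') as f:
    documents = json.load(f)

def concat_document(doc):
    # Repeat the title 5x and the topics 3x so they weigh more than the abstract.
    # Join the copies with spaces so they do not fuse into a single token.
    title = " ".join([doc['title']] * 5)
    topics = " ".join(list(doc['topics']) * 3)
    return title + " " + topics + " " + doc['abstract']

# Default filters minus stemming; a list keeps the filters in their original order
CUSTOM_FILTERS = [f for f in DEFAULT_FILTERS if f is not stem_text]

preprocessed_texts = []

for base_doc in documents:
    concat_doc = concat_document(base_doc)
    preprocessed_texts.append(preprocess_string(concat_doc, CUSTOM_FILTERS))

# Map each distinct token to an integer id
dic = Dictionary(preprocessed_texts)

# One list of (token_id, count) pairs per document
corpus_bow = [dic.doc2bow(doc) for doc in preprocessed_texts]

print(corpus_bow[0])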

> [(0, 1), (1, 1), (2, 2), (3, 1), (4, 3), (5, 19), (6, 4), (7, 2), (8, 1), (9, 5), (10, 6), (11, 2), (12, 9), (13, 6), (14, 1), (15, 5), (16, 3), (17, 1), (18, 1), (19, 16), (20, 2), (21, 1), (22, 1), (23, 1), (24, 3), (25, 18), (26, 1), (27, 3), (28, 1), (29, 1), (30, 3), (31, 1), (32, 3), (33, 1), (34, 8), (35, 24), (36, 3), (37, 6), (38, 3), (39, 1), (40, 1), (41, 1), (42, 2), (43, 3), (44, 5), (45, 4), (46, 1), (47, 2), (48, 5), (49, 6), (50, 9), (51, 6), (52, 1), (53, 15), (54, 31), (55, 6), (56, 5), (57, 15), (58, 1), (59, 1), (60, 2), (61, 14), (62, 1), (63, 1), (64, 5), (65, 6), (66, 1), (67, 2), (68, 1), (69, 6), (70, 3), (71, 15), (72, 1)]
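Each pair in that output is (token id, number of times the token occurs in the document). As a rough illustration of the idea, here is a minimal pure-Python sketch. The helper names `build_vocab` and `to_bow` and the toy documents are made up for this example, and the ids it assigns will not necessarily match the ones Gensim's Dictionary would assign.

```python
from collections import Counter

def build_vocab(tokenized_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for tokens in tokenized_docs:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

def to_bow(tokens, vocab):
    """Return sorted (token_id, count) pairs, skipping tokens not in the vocabulary."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())

# Toy, already-tokenized documents (hypothetical data)
docs = [["stocks", "fall", "stocks"], ["team", "wins", "match"]]
vocab = build_vocab(docs)
bows = [to_bow(d, vocab) for d in docs]

print(vocab)    # {'stocks': 0, 'fall': 1, 'team': 2, 'wins': 3, 'match': 4}
print(bows[0])  # [(0, 2), (1, 1)]
```

Only tokens that actually occur in a document show up in its list, which is why this representation is naturally sparse.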

With the BoW ready, you just need to put it in a format that your model accepts, for example a NumPy array or a SciPy sparse matrix.
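One possible way to do that conversion, sketched under the assumption that SciPy is installed: build a documents-by-terms CSR matrix directly from the (token id, count) pairs. The helper `bow_to_csr` and the small `corpus_bow` below are hypothetical stand-ins; with the real data you would pass your own `corpus_bow` and `len(dic)` as the number of terms.

```python
from scipy.sparse import csr_matrix

def bow_to_csr(corpus_bow, num_terms):
    """Build a (n_documents x num_terms) sparse count matrix from BoW pairs."""
    indptr = [0]   # where each document's row starts and ends in `indices`/`data`
    indices = []   # column index (token id) of each stored value
    data = []      # the counts themselves
    for doc in corpus_bow:
        for token_id, count in doc:
            indices.append(token_id)
            data.append(count)
        indptr.append(len(indices))
    return csr_matrix((data, indices, indptr),
                      shape=(len(corpus_bow), num_terms), dtype=int)

# Toy BoW corpus standing in for the real one (hypothetical data)
corpus_bow = [[(0, 1), (1, 2)], [(1, 1), (2, 3)], []]
X = bow_to_csr(corpus_bow, num_terms=3)

print(X.shape)      # (3, 3)
print(X.toarray())
```

The resulting matrix has one row per document, which is the layout Sklearn estimators expect for `fit(X, y)`. Gensim also ships `gensim.matutils.corpus2csc`, which does a similar conversion but returns a terms-by-documents matrix, so it would need a transpose first.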

  • 1

    Understood, Thomas. The question itself was quite broad ("Any pointers on how to use bag of words with Pandas, Sklearn, etc.?"). =[ I will try to improve my answer.
