Try using the Gensim library to preprocess your data. It helps produce the vectorization you need as input for a scikit-learn model.
Try something like this:
import json

from gensim.parsing.preprocessing import preprocess_string, DEFAULT_FILTERS, stem_text
from gensim.corpora import Dictionary

with open('./data/jsons/unprocessed_documents.json') as f:
    documents = json.load(f)

def concat_document(doc):
    # Weight the title 5x and the topics 3x relative to the abstract.
    # The trailing space inside the repeated string keeps the
    # repetitions from gluing words together.
    title = 5 * (doc['title'] + " ")
    topics = 3 * (" ".join(doc['topics']) + " ")
    return title + topics + doc['abstract']

# DEFAULT_FILTERS is an ordered list and filter order matters
# (e.g. lowercasing before stopword removal), so filter out the
# stemming step instead of converting to a set.
CUSTOM_FILTERS = [f for f in DEFAULT_FILTERS if f is not stem_text]

preprocessed_texts = []
for base_doc in documents:
    concat_doc = concat_document(base_doc)
    preprocessed_texts.append(preprocess_string(concat_doc, CUSTOM_FILTERS))

dic = Dictionary(preprocessed_texts)
corpus_bow = [dic.doc2bow(doc) for doc in preprocessed_texts]
print(corpus_bow[0])
> [(0, 1), (1, 1), (2, 2), (3, 1), (4, 3), (5, 19), (6, 4), (7, 2), (8, 1), (9, 5), (10, 6), (11, 2), (12, 9), (13, 6), (14, 1), (15, 5), (16, 3), (17, 1), (18, 1), (19, 16), (20, 2), (21, 1), (22, 1), (23, 1), (24, 3), (25, 18), (26, 1), (27, 3), (28, 1), (29, 1), (30, 3), (31, 1), (32, 3), (33, 1), (34, 8), (35, 24), (36, 3), (37, 6), (38, 3), (39, 1), (40, 1), (41, 1), (42, 2), (43, 3), (44, 5), (45, 4), (46, 1), (47, 2), (48, 5), (49, 6), (50, 9), (51, 6), (52, 1), (53, 15), (54, 31), (55, 6), (56, 5), (57, 15), (58, 1), (59, 1), (60, 2), (61, 14), (62, 1), (63, 1), (64, 5), (65, 6), (66, 1), (67, 2), (68, 1), (69, 6), (70, 3), (71, 15), (72, 1)]
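Each pair is (token_id, count) for that document. If you want to sanity-check which token an id maps to, the Dictionary can be indexed directly; a quick sketch using the dic and corpus_bow objects from above:

# Show the first few (token, count) pairs of document 0 in readable form
print([(dic[token_id], count) for token_id, count in corpus_bow[0][:5]])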
With the BoW corpus ready, you just need to convert it to a format your model accepts, for example a SciPy sparse matrix, which scikit-learn estimators handle natively.
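Gensim ships a converter for exactly this: gensim.matutils.corpus2csc builds a (num_terms, num_docs) SciPy CSC matrix, so transpose it to get the (num_docs, num_terms) layout scikit-learn expects. A minimal sketch (the commented TfidfTransformer step is just one illustrative option, not part of the original answer):

from gensim.matutils import corpus2csc

# corpus2csc returns a (num_terms, num_docs) sparse matrix;
# transpose and convert to CSR for the document-rows layout
# that scikit-learn estimators expect.
X = corpus2csc(corpus_bow, num_terms=len(dic)).T.tocsr()
print(X.shape)  # (number of documents, vocabulary size)

# X can now be fed to any scikit-learn estimator, e.g.:
# from sklearn.feature_extraction.text import TfidfTransformer
# X_tfidf = TfidfTransformer().fit_transform(X)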
Understood, Thomas. The question itself was quite comprehensive: "Any pointers on how to use Bag of Words with Pandas, Sklearn, etc.?" I will try to improve my answer. – Cassiano