Data structure, bag of words, questions about code structure

Hello! I am studying sentiment analysis to conclude an undergraduate research project (iniciação científica, IC) whose theme is "The influence of social networks on the financial market".

The code extracts Twitter data using snscrape.modules.twitter; after extracting the data, you just upload the file to the Colab runtime environment. Then the implementation of the bag-of-words model begins, where we use CountVectorizer to see the frequency of words. I am a complete beginner in Python, so I would like help processing the data: removing conjunctions, prepositions, definite and indefinite articles, and the users mentioned in the tweets right after the "@".
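
A minimal sketch of the cleaning step described above, assuming plain Python with only the `re` module. The function name `clean_tweet` and the short `STOPWORDS` set are hypothetical and only illustrative. A fuller Portuguese list is available from NLTK via `nltk.corpus.stopwords.words('portuguese')` after running `nltk.download('stopwords')`.

```python
import re

# Illustrative, incomplete set of Portuguese articles, prepositions
# and conjunctions (an assumption, not a complete list).
STOPWORDS = {
    "o", "a", "os", "as", "um", "uma", "uns", "umas",
    "de", "do", "da", "dos", "das", "em", "no", "na", "nos", "nas",
    "por", "para", "com", "sem", "e", "ou", "mas", "que", "se",
}

MENTION_RE = re.compile(r"@\w+")      # users mentioned after "@"
URL_RE = re.compile(r"https?://\S+")  # links, rarely useful as words
TOKEN_RE = re.compile(r"\w+")         # runs of word characters (Unicode-aware)


def clean_tweet(text):
    """Return the tweet lowercased, without mentions, links or stopwords."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    tokens = TOKEN_RE.findall(text.lower())
    return " ".join(t for t in tokens if t not in STOPWORDS)


example = "O dólar subiu e a bolsa caiu, segundo @cafecomferri https://t.co/abc"
cleaned = clean_tweet(example)
print(cleaned)
```

The cleaned strings could then be stored in a new column, for example with `dataset["tweet"].apply(clean_tweet)`, before fitting CountVectorizer.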

Finally, I need to put the text as a function of time in order to build a graph that shows, roughly, the frequency of these words over time.
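
One possible way to get word frequency as a function of time is to group the tweets by period (months here) and count the words inside each group. The sketch below assumes the `date` column holds ISO-like strings such as `2020-03-15 18:30:00+00:00`. The function `monthly_counts` and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def monthly_counts(rows):
    """Map 'YYYY-MM' -> Counter of words, given (date_string, text) pairs."""
    counts = defaultdict(Counter)
    for date_string, text in rows:
        month = date_string[:7]  # 'YYYY-MM' prefix of an ISO-like date
        counts[month].update(text.lower().split())
    return counts


# Made-up rows standing in for the (date, tweet) columns of the CSV.
rows = [
    ("2020-03-02 10:00:00+00:00", "bolsa caiu"),
    ("2020-03-15 18:30:00+00:00", "bolsa subiu dólar"),
    ("2020-04-01 09:00:00+00:00", "dólar subiu"),
]

counts = monthly_counts(rows)
months = sorted(counts)
# One list of counts per word, aligned with `months`.
series = {word: [counts[m][word] for m in months] for word in ("bolsa", "dólar")}
print(months)
print(series)
```

Each list in `series` lines up with `months`, so it could be drawn with matplotlib, for example `plt.plot(months, series["bolsa"], label="bolsa")`.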

The code is on Colab at the following link: https://colab.research.google.com/drive/1QGb8vsipDq8wq4uxn_jQzH75PHJzgYa4?usp=sharing

Or here:

# Extraction with snscrape
import snscrape.modules.twitter as sntwitter
import csv

# Open with 'w' so the header is not written again on every re-run;
# the with-block closes the file even if the scraper raises an error.
with open('place_result4.csv', 'w', newline='', encoding='utf8') as csvFile:
    csvWriter = csv.writer(csvFile)
    csvWriter.writerow(['id', 'date', 'tweet'])
    for tweet in sntwitter.TwitterSearchScraper('from:@cafecomferri + since:2020-02-01 until:2020-11-05').get_items():
        csvWriter.writerow([tweet.id, tweet.date, tweet.content])

# With the data already extracted and available in the working environment, start by importing the libraries.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
import nltk
# All the libraries are imported, including CountVectorizer.

dataset = pd.read_csv("place_result4.csv")
# At this point, the dataset import must match the format of the csv file extracted by snscrape

print(dataset)
# The imported dataset has the columns written by the extraction step: id, date and tweet

vectorizer = CountVectorizer()
# We assign CountVectorizer to a variable, instantiating it

vectorizer.fit(dataset["tweet"])
# The vectorizer applied to the dataset cleans up the vocabulary and counts the frequency of each word.

print(vectorizer.vocabulary_)
# At this point, the vectorizer seems to have taken the data only from the first row. I probably need to process each row separately


# What is left now is to clean this data (which I do not know how to do) and build a graph as a function of time (provided by snscrape in the csv file).


Please help me; this is essential for me to present my IC on February 24.

  • Hello. Did you try something and hit a specific error you need help with (which would be within the scope of the site), or do you just want to know some way to do it (which I believe would not be)?

  • Oops! Yes, I wanted some way to do it. Is this not within the scope of the site?

  • No, because it would be a customized solution for your scenario rather than a focused programming solution that can help other people in various scenarios, which is the purpose of the site. But if you find a way to do it and run into a more specific programming question, you can come and ask us.

  • Also, take the opportunity to do the [tour].

  • Okay, thank you. Do you know where I can get help with a customized answer?

  • Perhaps in some other online community for Python, data science or programming in general that allows broader questions. Some may be in Portuguese. For your specific problem you need to be familiar with object orientation in Python (if you are not already, although it is a broad concept) in order to understand the documentation of the libraries you will need and how to program with them, such as CountVectorizer (which should have some feature for omitting certain words from the set being processed) and the library that generates plots. Searching for examples of their use can also help.

No answers
