Create a frequency table based on another column in Python

Good afternoon, everyone.

I have a data set in a .csv file containing two columns, tweets and rating, where 'tweets' holds a tweet found by searching Twitter and 'rating' is either 'positive' or 'negative'.

I want to build a frequency table, word by word, in which each row contains a distinct word and the classification of that word in the sentence.

Does numpy or nltk have a function that does this?

I’m trying to write two loops, one to go through the lines and the other to go through each line word by word, but I’m not sure which data structure to use for this frequency table or what the algorithm would look like.

This is what I have so far:

import nltk
import pandas as pd
from nltk import word_tokenize
from nltk.corpus import stopwords
from string import punctuation
from nltk.probability import FreqDist
import numpy as np

# read the file
dataset = pd.read_csv('tweets.csv')

# get the Portuguese stop words and remove the word 'não' so it does not cause contradictions
stopwords = set(stopwords.words('portuguese') + list(punctuation))
stopwords = {x for i,x in enumerate(stopwords) if x != 'não'}

# separate the tweets from the classes
tweets = dataset['Text'].values
classes = dataset['Classificacao'].values

for tweet in tweets:
    for palavra in tweet:
        print(palavra)

As written, I expected the code to print word by word, but it prints letter by letter, and I don’t understand why.

I know it’s not what I want yet, but it’s the beginning.

Any help would be welcome, thank you.

  • Should your "frequency table" be calculated individually for each tweet, or should it be a single table for all tweets?

  • A single table for all tweets, containing all the words of all the tweets together.

2 answers

1

To complement @Lacobus's answer: to get the classification of each word, you can separate the words from positive and negative tweets as follows:

import csv
import string
from collections import Counter

palavras = []   # every word, regardless of class
positivo = []   # words from tweets labelled 'Positivo'
negativo = []   # words from tweets labelled 'Negativo'

with open('tweets.csv', newline='') as arqcsv:
    leitor = csv.reader(arqcsv, delimiter=';')
    for linha in leitor:
        # lowercase, split on whitespace, strip punctuation from each word
        plinha = [palavra.strip(string.punctuation) for palavra in linha[0].lower().split()]
        palavras += plinha
        if linha[1].lower() == 'positivo':
            positivo += plinha
        else:
            negativo += plinha

cntPalavras = Counter(palavras)
cntPositivo = Counter(positivo)
cntNegativo = Counter(negativo)

# most frequent words first
for palavra, frequencia in sorted(cntPalavras.items(), key=lambda i: i[1], reverse=True):
    pos = cntPositivo[palavra]
    neg = cntNegativo[palavra]
    print('{}: [f: {}, p: {}, n: {}]'.format(palavra, frequencia, pos, neg))

Using the same test CSV file, this produces the following output:

nec: [f: 4, p: 4, n: 0]
sed: [f: 4, p: 3, n: 1]
sit: [f: 3, p: 3, n: 0]
amet: [f: 3, p: 3, n: 0]
mauris: [f: 3, p: 1, n: 2]
vel: [f: 3, p: 1, n: 2]
dolor: [f: 2, p: 2, n: 0]
elit: [f: 2, p: 2, n: 0]
odio: [f: 2, p: 2, n: 0]
rutrum: [f: 2, p: 2, n: 0]
facilisis: [f: 2, p: 1, n: 1]
convallis: [f: 2, p: 2, n: 0]
luctus: [f: 2, p: 2, n: 0]
purus: [f: 2, p: 2, n: 0]
interdum: [f: 2, p: 2, n: 0]
id: [f: 2, p: 2, n: 0]
malesuada: [f: 2, p: 2, n: 0]
in: [f: 2, p: 0, n: 2]
faucibus: [f: 2, p: 1, n: 1]
et: [f: 2, p: 1, n: 1]
maximus: [f: 2, p: 0, n: 2]
justo: [f: 2, p: 1, n: 1]
morbi: [f: 2, p: 1, n: 1]
enim: [f: 2, p: 2, n: 0]
tristique: [f: 2, p: 2, n: 0]
felis: [f: 2, p: 1, n: 1]
risus: [f: 2, p: 1, n: 1]
etiam: [f: 2, p: 0, n: 2]
vitae: [f: 2, p: 1, n: 1]
pharetra: [f: 2, p: 0, n: 2]
lorem: [f: 1, p: 1, n: 0]
ipsum: [f: 1, p: 1, n: 0]
consectetur: [f: 1, p: 1, n: 0]
adipiscing: [f: 1, p: 1, n: 0]
pellentesque: [f: 1, p: 1, n: 0]
scelerisque: [f: 1, p: 1, n: 0]
nunc: [f: 1, p: 1, n: 0]
maecenas: [f: 1, p: 1, n: 0]
venenatis: [f: 1, p: 1, n: 0]
nulla: [f: 1, p: 1, n: 0]
elementum: [f: 1, p: 1, n: 0]
est: [f: 1, p: 1, n: 0]
vivamus: [f: 1, p: 0, n: 1]
non: [f: 1, p: 0, n: 1]
nullam: [f: 1, p: 0, n: 1]
lacinia: [f: 1, p: 0, n: 1]
massa: [f: 1, p: 0, n: 1]
libero: [f: 1, p: 0, n: 1]
vulputate: [f: 1, p: 0, n: 1]
nisi: [f: 1, p: 0, n: 1]
suscipit: [f: 1, p: 0, n: 1]
consequat: [f: 1, p: 0, n: 1]
neque: [f: 1, p: 1, n: 0]
semper: [f: 1, p: 1, n: 0]
ante: [f: 1, p: 1, n: 0]
aliquam: [f: 1, p: 1, n: 0]
egestas: [f: 1, p: 1, n: 0]
integer: [f: 1, p: 1, n: 0]
eget: [f: 1, p: 1, n: 0]
efficitur: [f: 1, p: 1, n: 0]
accumsan: [f: 1, p: 1, n: 0]
quis: [f: 1, p: 1, n: 0]
tempor: [f: 1, p: 1, n: 0]
ut: [f: 1, p: 1, n: 0]
magna: [f: 1, p: 0, n: 1]
augue: [f: 1, p: 0, n: 1]
quisque: [f: 1, p: 1, n: 0]
blandit: [f: 1, p: 1, n: 0]
sollicitudin: [f: 1, p: 1, n: 0]
rhoncus: [f: 1, p: 1, n: 0]
lectus: [f: 1, p: 1, n: 0]
congue: [f: 1, p: 1, n: 0]
lacus: [f: 1, p: 1, n: 0]
donec: [f: 1, p: 1, n: 0]
leo: [f: 1, p: 1, n: 0]
gravida: [f: 1, p: 1, n: 0]
tortor: [f: 1, p: 1, n: 0]
ex: [f: 1, p: 0, n: 1]
tellus: [f: 1, p: 0, n: 1]
orci: [f: 1, p: 1, n: 0]
varius: [f: 1, p: 1, n: 0]
natoque: [f: 1, p: 1, n: 0]
penatibus: [f: 1, p: 1, n: 0]
magnis: [f: 1, p: 1, n: 0]
dis: [f: 1, p: 1, n: 0]
parturient: [f: 1, p: 1, n: 0]
montes: [f: 1, p: 1, n: 0]
nascetur: [f: 1, p: 0, n: 1]
ridiculus: [f: 1, p: 0, n: 1]
mus: [f: 1, p: 0, n: 1]
at: [f: 1, p: 0, n: 1]
porta: [f: 1, p: 0, n: 1]
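
If more than two classes are ever needed, or a single lookup structure is preferred, the three lists can be replaced by one mapping from each word to a `Counter` of class labels. The following is only a sketch under a few assumptions: the function name `contar_por_classe` and its `ignorar` parameter (for stop words, as the question intends) are hypothetical, and the inline sample stands in for `tweets.csv` with the same `;` delimiter.

```python
import csv
import io
import string
from collections import Counter, defaultdict


def contar_por_classe(linhas, ignorar=frozenset()):
    """Map each word to a Counter of the class labels it appears under.

    `linhas` is any iterable of (text, label) rows; `ignorar` is an
    optional set of words to skip (hypothetical helper, not from the answer).
    """
    tabela = defaultdict(Counter)
    for linha in linhas:
        if len(linha) < 2:
            continue  # skip blank or malformed rows
        texto, classe = linha[0], linha[1].strip().lower()
        for palavra in texto.lower().split():
            palavra = palavra.strip(string.punctuation)
            if palavra and palavra not in ignorar:
                tabela[palavra][classe] += 1
    return tabela


# Inline sample standing in for tweets.csv (three rows of the test file).
amostra = io.StringIO(
    "Sed malesuada nec est id convallis.;Positivo\n"
    "Vivamus non facilisis mauris.;Negativo\n"
    "Nascetur ridiculus mus. Etiam at felis pharetra, porta risus sed.;Negativo\n"
)

tabela = contar_por_classe(csv.reader(amostra, delimiter=';'), ignorar={'non'})

# Most frequent words first; a Counter returns 0 for a label it never saw.
for palavra, classes in sorted(tabela.items(), key=lambda i: -sum(i[1].values())):
    print('{}: [f: {}, p: {}, n: {}]'.format(
        palavra, sum(classes.values()), classes['positivo'], classes['negativo']))
```

With this layout, `tabela['sed']` holds both counts for one word, which is close to the `lorem {positives: 2, negatives: 8}` shape asked for in the comments.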

1

The table you want to compute is called a histogram.

The following code computes a histogram from a .csv file:

import csv
import string
from collections import Counter

palavras = []

with open('tweets.csv' ) as arqcsv:
    leitor = csv.reader( arqcsv, delimiter=';')
    for linha in leitor:
        palavras += [ palavra.strip( string.punctuation ) for palavra in linha[0].lower().split() ]

cnt = Counter( palavras )

for palavra, frequencia in sorted(cnt.items(), key=lambda i: i[1], reverse=True):
    print( '{} : {}'.format(palavra,frequencia) )
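
A side note on the loop in the question: iterating directly over a string yields one character at a time, which is why it prints letter by letter. The `.split()` call above is what produces words. The short sketch below illustrates the difference and also uses `Counter.most_common()`, which returns the same frequency-sorted pairs as the `sorted(...)` call. The sample tweet is made up for illustration.

```python
import string
from collections import Counter

tweet = "Sed malesuada nec est id convallis, sed nec."  # hypothetical sample text

# Iterating over a str yields single characters, not words.
letras = [c for c in tweet]

# Splitting on whitespace yields words; strip punctuation from both ends.
palavras = [p.strip(string.punctuation) for p in tweet.lower().split()]

cnt = Counter(palavras)

# most_common() returns (word, count) pairs sorted by descending count.
for palavra, frequencia in cnt.most_common():
    print('{} : {}'.format(palavra, frequencia))
```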

Test file (tweets.csv):

Lorem ipsum dolor sit amet, consectetur adipiscing elit.;Positivo
Pellentesque scelerisque odio rutrum nunc facilisis convallis.;Positivo
Maecenas luctus luctus purus interdum venenatis.;Positivo
Nulla elementum id purus nec interdum.;Positivo
Sed malesuada nec est id convallis.;Positivo
Vivamus non facilisis mauris.;Negativo
Nullam lacinia massa libero, in vulputate nisi faucibus et.;Negativo
Mauris maximus justo vel suscipit consequat.;Negativo
Morbi sit amet neque rutrum, semper ante aliquam, egestas enim.;Positivo
Integer eget mauris faucibus, efficitur odio nec, accumsan justo.;Positivo
Sed tristique felis risus, quis tristique dolor tempor ut.;Positivo
Etiam vel magna augue.;Negativo
Quisque blandit, elit nec sollicitudin rhoncus, lectus congue lacus.;Positivo
Donec sit amet enim vel leo gravida malesuada vitae sed tortor.;Positivo
Morbi in maximus ex, vitae pharetra tellus.;Negativo
Orci varius natoque penatibus et magnis dis parturient montes.;Positivo
Nascetur ridiculus mus. Etiam at felis pharetra, porta risus sed.;Negativo

Output:

nec : 4
sed : 4
mauris : 3
vel : 3
sit : 3
amet : 3
risus : 2
interdum : 2
justo : 2
purus : 2
in : 2
dolor : 2
et : 2
etiam : 2
id : 2
felis : 2
facilisis : 2
pharetra : 2
rutrum : 2
elit : 2
tristique : 2
vitae : 2
malesuada : 2
maximus : 2
faucibus : 2
morbi : 2
enim : 2
odio : 2
convallis : 2
luctus : 2
ipsum : 1
leo : 1
efficitur : 1
augue : 1
vivamus : 1
orci : 1
maecenas : 1
ut : 1
donec : 1
semper : 1
nunc : 1
ante : 1
ex : 1
tellus : 1
egestas : 1
massa : 1
aliquam : 1
gravida : 1
porta : 1
magna : 1
pellentesque : 1
nulla : 1
quisque : 1
parturient : 1
mus : 1
rhoncus : 1
scelerisque : 1
consectetur : 1
sollicitudin : 1
at : 1
suscipit : 1
non : 1
blandit : 1
est : 1
accumsan : 1
nisi : 1
adipiscing : 1
magnis : 1
varius : 1
natoque : 1
consequat : 1
ridiculus : 1
eget : 1
elementum : 1
montes : 1
integer : 1
libero : 1
lacinia : 1
neque : 1
tempor : 1
nullam : 1
dis : 1
vulputate : 1
lectus : 1
nascetur : 1
venenatis : 1
tortor : 1
quis : 1
penatibus : 1
lorem : 1
lacus : 1
congue : 1
  • Thank you, @Lacobus. From what I’ve seen, the output counts each distinct word. What I would like is, for each distinct word, the number of times it appears in positive and in negative tweets. E.g.: lorem {positives: 2, negatives: 8}. Is there any way to do this? Is that still a histogram?
