Python - Count values within another count of values in a dataframe?

Question

Python - Count values within another count of values in a dataframe?

Asked 6 years, 6 months ago

Viewed 834 times

2

I have a df as follows and I need to make a top 10 of the most repeated words in problem and within these selected words do the same within the solution.

Obs: the dataframe has more than 20,000 lines, this is just a sample.

Code:

df.query((pd.Series('' .join(df['Problema']). split ()). value_counts ()[:10]).apply(lambda x: " ".join('' .join(df['Solucao']). split ()). value_counts () [:10]))

Attributeerror: str Object has no attribute 'value_counts'

Expected:
Return the 10 most frequent words in the column "Problem" and in this return identify which are the 110 most frequent words in the column "Solution"

TOP 10 Problems | TOP 10 Troubleshooting

      xlm  | [extrair, bug, c, d, e, f, g, g, i, j]
impressao  | [rede, parado, c, d, e, f, g, g, i, j]
        c  | [a, b, c, d, e, ..., j]
        d  | [a, b, c, d, e, ..., j]
        e  | [a, b, c, d, e, ..., j]
        f  | [a, b, c, d, e, ..., j]
        g  | [a, b, c, d, e, ..., j]
        h  | [a, b, c, d, e, ..., j]
        i  | [a, b, c, d, e, ..., j]
        j  | [a, b, c, d, e, ..., j]

1 answer

Browser other questions tagged python count value

You are not signed in. Login or sign up in order to post.

by AlexCiuffa • **2,402** points · Answer 1 · 2019-02-01T19:50:48+00:00

As each column has a list, the best solution I could think of was to make a dictionary of the problems and iterate the dataframe. Each problem has a dictionary with a counter of each solution that appears for that problem. Example:

dict_problema = {'problema1':{'solução1':2, 'solução2':1},
                 'problema2':{'solução2':2, 'solução3':1}}

dict_conta_problema = {'problema1':2,
                       'problema2':1}

Resalvas:

It’s not a performative way
There’s probably some python library I couldn’t remember (or don’t even know) that would make the job easier
Perhaps it would be better to use the pandas' own functions

dict_problema = {}
dict_conta_problema = {}

for index, row in df.iterrows(): #percorrer cada linha do dataframe

    for problema in row['Problema']: #para cada problema da lista de problemas daquela linha

        if problema not in dict_problema: #adiciona no dicionário de problemas caso não esteja
            dict_problema.update({problema:{}})
            dict_conta_problema.update({problema:0})

        dict_conta_problema[problema] +=1 #incrementa a contagem daquele problema

        for solucao in row['Solucao']: #agora percorremos cada solução da linha
            if solucao not in dict_problema[problema]: #adiciona no dicionário de soluções caso não esteja
                dict_problema[problema].update({solucao:0})

            dict_problema[problema][solucao] +=1 #incrementa a contagem daquele solução

Now we create another dataframe with the dictionaries we have:

from collections import Counter

problema = []
solucao = []

for key, value in Counter(dict_conta_problema).most_common(10):
    problema.append(key)
    solucao.append([key for key, values in Counter(dict_problema[key]).most_common(10)])

df_top = pd.DataFrame(data={'Top Problemas': problema, 'Top Solucoes': solucao})

Edit

It is possible to catch the Top 10 problems by just doing:

top_problemas = [key for key, val in Counter(reduce(lambda x,y: x+y, df['Problema'].values)).most_common(10)]

You could use this same logic to get the most recurring solutions if you start from a filtered dataframe to each more recurring problem.