Python - Count values within another count of values in a dataframe?

Asked

Viewed 834 times

2

I have a df as follows and I need to make a top 10 of the most repeated words in problem and within these selected words do the same within the solution.

Obs: the dataframe has more than 20,000 lines, this is just a sample.

inserir a descrição da imagem aqui

Code:

df.query((pd.Series('' .join(df['Problema']). split ()). value_counts ()[:10]).apply(lambda x: " ".join('' .join(df['Solucao']). split ()). value_counts () [:10]))

Attributeerror: str Object has no attribute 'value_counts'

Expected:
Return the 10 most frequent words in the column "Problem" and in this return identify which are the 110 most frequent words in the column "Solution"

TOP 10 Problems | TOP 10 Troubleshooting

      xlm  | [extrair, bug, c, d, e, f, g, g, i, j]
impressao  | [rede, parado, c, d, e, f, g, g, i, j]
        c  | [a, b, c, d, e, ..., j]
        d  | [a, b, c, d, e, ..., j]
        e  | [a, b, c, d, e, ..., j]
        f  | [a, b, c, d, e, ..., j]
        g  | [a, b, c, d, e, ..., j]
        h  | [a, b, c, d, e, ..., j]
        i  | [a, b, c, d, e, ..., j]
        j  | [a, b, c, d, e, ..., j]

1 answer

1

As each column has a list, the best solution I could think of was to make a dictionary of the problems and iterate the dataframe. Each problem has a dictionary with a counter of each solution that appears for that problem. Example:

dict_problema = {'problema1':{'solução1':2, 'solução2':1},
                 'problema2':{'solução2':2, 'solução3':1}}

dict_conta_problema = {'problema1':2,
                       'problema2':1}

Resalvas:

  • It’s not a performative way
  • There’s probably some python library I couldn’t remember (or don’t even know) that would make the job easier
  • Perhaps it would be better to use the pandas' own functions
dict_problema = {}
dict_conta_problema = {}

for index, row in df.iterrows(): #percorrer cada linha do dataframe

    for problema in row['Problema']: #para cada problema da lista de problemas daquela linha

        if problema not in dict_problema: #adiciona no dicionário de problemas caso não esteja
            dict_problema.update({problema:{}})
            dict_conta_problema.update({problema:0})

        dict_conta_problema[problema] +=1 #incrementa a contagem daquele problema

        for solucao in row['Solucao']: #agora percorremos cada solução da linha
            if solucao not in dict_problema[problema]: #adiciona no dicionário de soluções caso não esteja
                dict_problema[problema].update({solucao:0})

            dict_problema[problema][solucao] +=1 #incrementa a contagem daquele solução

Now we create another dataframe with the dictionaries we have:

from collections import Counter

problema = []
solucao = []

for key, value in Counter(dict_conta_problema).most_common(10):
    problema.append(key)
    solucao.append([key for key, values in Counter(dict_problema[key]).most_common(10)])

df_top = pd.DataFrame(data={'Top Problemas': problema, 'Top Solucoes': solucao})

Edit

It is possible to catch the Top 10 problems by just doing:

top_problemas = [key for key, val in Counter(reduce(lambda x,y: x+y, df['Problema'].values)).most_common(10)]

You could use this same logic to get the most recurring solutions if you start from a filtered dataframe to each more recurring problem.

  • Alex, thank you for answering me!! In the case of problems and solution I do not know the properties to mount the dictionary. I need the top 10 repeating problems, and when I do, I need to know which are the most repeating solutions. This is a dynamic and constantly changing data, for this reason I believe that it is not possible to create a dictionary, because it would need fixed words. In this case, I cannot feed a dictionary. Perchance, you know another solution?

  • It is very difficult to work with the df the way it is. I recommend doing a Pre-processing of the data to create the dataframe without lists. A suggestion is to use Isolated Problems as Index and each column as a solution. The values of df would be the number of occurrences of the solution in the problem.

Browser other questions tagged

You are not signed in. Login or sign up in order to post.