Analyzing strings in a text file and returning the string that appears most often

1

I need to analyze the strings in a text file, find the one that appears most often (if there is a tie, take all of the tied strings), and save the result to another text file. I can open the file and go through the lines, but I do not know how to check which one appears most often. Example:

file "A" - Input

1 #Brasilnacopa

2 #Operacaolavajato

3 #partiuad2

4 #partiuad2

5 #Operacaolavajato

6 #partiuad2

7 #Dietasegundafeira

"B" file - Output

1 #partiuad2

  • Welcome to [pt.so]. First, since you are a new user, take the [tour] to get a quick overview of the site. Then [Edit] the question and add the code you have written. It will be easier for us to point out where it failed than to recreate a solution from scratch. Other equivalent solutions may well appear, but knowing where you went wrong is key to learning. To add the code, paste it into the question editor, select it and press Ctrl+K to format it correctly. As a hint, Python's list type has a method called count.
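The count method mentioned in the comment can be sketched as follows. This is only an illustration: the list literal is a stand-in for the lines of file "A" from the question, and the variable names are made up for the example.

```python
# Hypothetical stand-in for the lines already read from file "A"
hashtags = [
    "#Brasilnacopa", "#Operacaolavajato", "#partiuad2", "#partiuad2",
    "#Operacaolavajato", "#partiuad2", "#Dietasegundafeira",
]

# list.count(x) returns how many times x occurs in the list
ocorrencias = {tag: hashtags.count(tag) for tag in set(hashtags)}

# Keep every hashtag whose count equals the highest count (handles ties)
maior = max(ocorrencias.values())
mais_frequentes = sorted(tag for tag, n in ocorrencias.items() if n == maior)
print(mais_frequentes)  # ['#partiuad2']
```

Calling count once per distinct item rescans the whole list each time, which is fine for a small file but slow for a large one.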

2 answers

2


I have little information about your case so I will consider the following:

  1. File "A" has one word per line
  2. You need to find which word appears most often, but you don't have a list of possible words; that is, you have to count whatever words are inside file "A"

Assuming file "A" has this content:

BrasilNaCopa
OperacaoLavaJato
PartiuAD2
PartiuAD2
OperacaoLavaJato
PartiuAD2
DietaSegundaFeira

What we need to do is this:

# Open the file "arquivoa.txt" for reading; "with" closes it automatically
with open('arquivoa.txt', 'r') as arquivoA:
    # Read the contents of the file into the variable "texto"
    # "texto" is a list in which each item is one line
    texto = arquivoA.readlines()

# === IMPORTANT NOTES ===
# Printing the variable texto:
#   print(texto)
#
# Would result in:
#   ['BrasilNaCopa\n', 'OperacaoLavaJato\n', 'PartiuAD2\n', 'PartiuAD2\n', 'OperacaoLavaJato\n', 'PartiuAD2\n', 'DietaSegundaFeira\n']
#
# Note the "\n" (line break) at the end of each string; we will have
# to clean that up later

# Create a dictionary to store the count for each word
contagem = dict()

for linha in texto:
    # Remove that line break (\n) with strip()
    palavra = linha.strip()

    if palavra not in contagem:
        # If the word is not in the count yet, add it with the value 1
        contagem[palavra] = 1
    else:
        # If the word is already in the count, add 1 to its current value
        contagem[palavra] += 1

# At this point the dictionary "contagem" holds the count for every word
# Printing contagem:
#   print(contagem)
#
# Would result in (Python 3.7+ keeps the insertion order):
#   {'BrasilNaCopa': 1, 'OperacaoLavaJato': 2, 'PartiuAD2': 3, 'DietaSegundaFeira': 1}

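# Now get the word with the highest count.
# max(contagem) alone would compare the keys alphabetically, so we pass
# key=contagem.get to compare by count instead
palavraMaisRepetida = max(contagem, key=contagem.get)

# Printing palavraMaisRepetida:
#   print(palavraMaisRepetida)
#
# Would result in:
#   PartiuAD2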

IMPORTANT

Note that here I only handled the case where a single word is at the top of the count. You said that in case of a tie you should pick all the words that share the top count.

I'll leave that part for you to finish. I think you've got the idea, and from here it's easy.

You also need to write the result to file "B".
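One possible way to finish those two remaining steps is sketched below. It is not part of the original answer: the helper names mais_repetidas and gravar_resultado are hypothetical, and the example counts are made up to show a tie.

```python
def mais_repetidas(contagem):
    """Return every word tied for the highest count, in insertion order."""
    if not contagem:
        return []
    maior = max(contagem.values())
    return [palavra for palavra, n in contagem.items() if n == maior]


def gravar_resultado(palavras, caminho):
    """Write one word per line to the output file (file "B")."""
    with open(caminho, 'w') as arquivoB:
        for palavra in palavras:
            arquivoB.write(palavra + '\n')


# Example with a tie at the top (counts invented for illustration)
contagem = {'BrasilNaCopa': 1, 'OperacaoLavaJato': 3, 'PartiuAD2': 3}
vencedoras = mais_repetidas(contagem)
print(vencedoras)  # ['OperacaoLavaJato', 'PartiuAD2']
```

Calling gravar_resultado(vencedoras, 'arquivob.txt') would then save the tied words, one per line; the file name 'arquivob.txt' is an assumption, chosen to mirror 'arquivoa.txt'.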

  • 1

    I'm glad this answer was so helpful to you. If you can, please upvote it too; it costs nothing and gives my reputation a little boost. Cheers.

0

It is simple :) But it is important that the words of both files are in separate lists (one for each)! This is how you do it:

palavras = {}
for x in a:    # a and b are the lists
    contador = 0    # start at zero so only actual matches are counted
    for y in b:
        if x == y:
            contador += 1
    palavras[x] = contador

Done! You will then have a dictionary whose keys are the words and whose values are the number of times each word appears.
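For comparison, the same kind of dictionary can be built with collections.Counter from the standard library. This is a different technique from the nested loops described in this answer, shown only as a sketch; the list literal is a hypothetical stand-in for the stripped lines of file "A".

```python
from collections import Counter

# Hypothetical stand-in for the stripped lines of file "A"
lista = ['BrasilNaCopa', 'OperacaoLavaJato', 'PartiuAD2', 'PartiuAD2',
         'OperacaoLavaJato', 'PartiuAD2', 'DietaSegundaFeira']

# Counter maps each word to the number of times it occurs in the list
contagem = Counter(lista)
print(contagem['OperacaoLavaJato'])  # 2
print(contagem.most_common(1))       # [('PartiuAD2', 3)]
```

Note that most_common(1) returns a single entry even when several words share the top count, so ties still have to be handled by comparing every count against the maximum.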
