Print which word(s) repeat in Python

Viewed 183 times

0

Hello, how can I find out which word or words are repeated?

For example, I have a .txt file; when I open it and call .read(), it returns:

Alface, Alecrim, Capim, Limão, Alface, Azeitona, Alface

I want to know how many words start with the letter A, so I did:

import re
arq = open('arquivo.txt', 'r', encoding="utf8")
manipulado = arq.read()
print(manipulado)
r1 = re.findall(r'A\w+', manipulado)
print(r1)

It returns:

['Alface', 'Alecrim', 'Alface', 'Azeitona', 'Alface']

How could I make it print:

['Alface', 'Alecrim', 'Alface', 'Azeitona', 'Alface']
Alface: 3
  • My list contains strings; in that other question the list contains integers. Would the process be the same?

  • Yes, it’s the same thing. You may need a small modification or two, but it should be simple. Test the code to see. I found that question with a quick search on the site; with a more thorough search you will probably find an even closer one. :-)

  • https://stackoverflow.com/questions/25798674/python-duplicate-words

2 answers

1

If I understand correctly, you want to read a .txt file and count how many words start with the letter 'a'.

For this you could use the readlines() method, assign the result to a variable and, for each line read, take advantage of the fact that a string can be indexed like a list.

I think the code makes it easier to understand:

contador = 0  # variable used for counting
with open('arquivo.txt', 'r', encoding='utf-8') as f:
    lines = f.readlines()
    for line in lines:
        # strings can be indexed and sliced like lists;
        # line[:1] is the first character (or '' for an empty line)
        if line[:1].lower() == 'a':
            contador += 1
print(contador)

Do not forget to specify the encoding when opening the file; otherwise the platform's default encoding is used and accented characters may be read incorrectly.

If you also want to count words that begin with an accented letter (for example "Á"), you can install the unidecode library:
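As a rough illustration of why the encoding matters, the sketch below writes a sample word to a temporary file as UTF-8 and reads it back twice, once with the matching encoding and once as Latin-1. The temporary file and the variable names are assumptions made only for this example.

```python
import os
import tempfile

texto = 'Limão'

# Write the sample text to a temporary file as UTF-8
with tempfile.NamedTemporaryFile('w', encoding='utf-8', suffix='.txt',
                                 delete=False) as tmp:
    tmp.write(texto)
    caminho = tmp.name

# Reading with the matching encoding gives back the original text
with open(caminho, encoding='utf-8') as f:
    lido_utf8 = f.read()

# Reading the same bytes as Latin-1 garbles the accented letter
with open(caminho, encoding='latin-1') as f:
    lido_latin1 = f.read()

os.remove(caminho)

print(lido_utf8)    # Limão
print(lido_latin1)  # LimÃ£o
```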

from unidecode import unidecode

lines = []
contador = 0
with open('arquivo.txt', 'r', encoding='utf-8') as f:
    for c in f.readlines():
        # unidecode replaces accented letters with plain ASCII ones
        lines.append(unidecode(c))
for line in lines:
    if line[:1].lower() == 'a':
        contador += 1
print(contador)

I did it this way because I do not know of a method that strips accents from a whole list at once, and this was the easiest approach for me.

If you’re having trouble installing the unidecode library, open a terminal, go to the Scripts directory of the interpreter your IDE uses and run: pip install unidecode

For example, with PyCharm on Windows:

cd C:\Users\<your user name>\<folder where you keep your programs>\venv\Scripts && pip install unidecode

Now, if you want to show the repeated words themselves, you can do it like this:
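For reference, the standard library can also strip accents without an extra install. The following is only a sketch of that alternative, not the approach used above: remover_acentos is a hypothetical helper name, and unlike unidecode it only drops combining accent marks rather than transliterating arbitrary characters.

```python
import unicodedata

def remover_acentos(texto):
    """Return texto without combining accent marks (hypothetical helper)."""
    # NFD splits a letter such as 'ã' into 'a' plus a combining tilde
    decomposto = unicodedata.normalize('NFD', texto)
    return ''.join(ch for ch in decomposto if not unicodedata.combining(ch))

palavras = ['Água', 'Limão', 'Alface']
sem_acentos = [remover_acentos(p) for p in palavras]
print(sem_acentos)  # ['Agua', 'Limao', 'Alface']
```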

# Read the lines, removing the trailing newline so that the last line
# (which may not end with one) still compares equal to earlier copies
with open('arquivo.txt', 'r', encoding='utf-8') as f:
    lines = [line.rstrip('\n') for line in f]

p = []  # indexes of the repeats that were already printed
for line in lines:
    t = 0  # becomes 1 once the first occurrence of this line is seen
    for c, v in enumerate(lines):
        if v == line:
            if t != 0:
                if c not in p:
                    print(v, c)
                    p.append(c)
            else:
                t += 1

I hope I’ve helped!

If you have any questions about a function or do not understand my procedure, just leave a comment and I will try to answer as soon as possible.

NOTE: This program treats uppercase and lowercase letters, and accented and unaccented letters, as different, and it works best when each word is on its own line.
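If the uppercase/lowercase distinction is not wanted, one option is to normalize each line before comparing. A minimal sketch, assuming one word per line; the list linhas is made-up sample data standing in for the lines of the file:

```python
from collections import Counter

# Sample data standing in for the lines of arquivo.txt (one word per line)
linhas = ['Alface\n', 'alface\n', 'ALFACE\n', 'Capim\n']

# strip() removes the newline; casefold() makes the comparison case-insensitive
contagem = Counter(linha.strip().casefold() for linha in linhas)

# Keep only the words that appear more than once
repetidas = {palavra: n for palavra, n in contagem.items() if n > 1}
print(repetidas)  # {'alface': 3}
```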

0


  • It is advisable to use the with statement to open files
  • Lists have the count method, which does exactly what you need

The code would look like this:

import re

with open('arquivo.txt', encoding='utf8') as f:
    conteudo = f.read()

print(conteudo)

r1 = re.findall(r'A\w+', conteudo)
qtde_alface = r1.count('Alface')
print('Alface:', qtde_alface)


In a comment you wrote:

But what if I didn’t know which words repeat? This was just an example I created; if it were a very long text, would there be a way to do it automatically, or would that be extremely complex?

It is not quite clear what you want to do. If the intention is to count all words starting with the letter "A", you can use the Counter class.

import re
from collections import Counter

with open('arquivo.txt', encoding='utf8') as f:
    conteudo = f.read()

print(conteudo)

r1 = re.findall(r'A\w+', conteudo)

contador = Counter(r1)

for palavra, quantidade in contador.items():
    print(f'A palavra {palavra} se repete {quantidade} vez(es)')
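If only the words that actually repeat are of interest, the Counter can be filtered by count. A small sketch of that idea, using the sample text from the question in place of the file contents (the name repetidas is just illustrative):

```python
import re
from collections import Counter

# Sample text from the question, standing in for the contents of arquivo.txt
conteudo = 'Alface, Alecrim, Capim, Limão, Alface, Azeitona, Alface'

r1 = re.findall(r'A\w+', conteudo)
contador = Counter(r1)

# Keep only the words that occur more than once
repetidas = {palavra: qtde for palavra, qtde in contador.items() if qtde > 1}

for palavra, qtde in repetidas.items():
    print(f'{palavra}: {qtde}')  # Alface: 3
```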

If you only want to find out which word starting with the letter "A" is repeated the most, use Counter's most_common method.

import re
from collections import Counter

with open('arquivo.txt', encoding='utf8') as f:
    conteudo = f.read()

print(conteudo)

r1 = re.findall(r'A\w+', conteudo)

contador = Counter(r1)

palavra, quantidade = contador.most_common(1)[0]

print(f'A palavra que mais se repete é {palavra}, aparecendo {quantidade} vez(es)')
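To make the last two lines easier to follow, here is a step-by-step sketch of what most_common(1)[0] evaluates to. It assumes the list of matches produced for the sample text in the question; the intermediate names mais_comuns and primeiro exist only for this illustration.

```python
from collections import Counter

# Matches the regex produces for the sample text in the question
r1 = ['Alface', 'Alecrim', 'Alface', 'Azeitona', 'Alface']
contador = Counter(r1)

# most_common(1) returns a list holding one (word, count) tuple
mais_comuns = contador.most_common(1)
print(mais_comuns)  # [('Alface', 3)]

# [0] takes that single tuple out of the list
primeiro = mais_comuns[0]
print(primeiro)  # ('Alface', 3)

# The tuple is unpacked into two variables; the f prefix marks an
# f-string, in which each {name} is replaced by the variable's value
palavra, quantidade = primeiro
print(f'{palavra}: {quantidade}')  # Alface: 3
```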
  • But what if I didn’t know which words repeat? This was just an example I created; if it were a very long text, would there be a way to do it automatically, or would that be extremely complex?

  • @Lexusrx I edited the answer with more information. It’s not very clear what exactly you want to do. Try to define precisely what the purpose of your program is.

  • Thanks, I’m just studying; it’s not a practical program. Just one question: what do the 1 and the 0 mean in "contador.most_common(1)[0]", and what does the "f" in the last print do?

  • @Lexusrx The most_common method takes as a parameter the number of words to return. Since I only wanted the most repeated one, I passed 1 (if I wanted the 3 most repeated, I would pass 3). The method returns a list, which in this case has only one element (because I asked for just 1), so I used [0] to get the first element of that list.

  • Got it, thank you very much, I’m going to study these methods now.
