Build list without repeated words from a file

1

I’m trying to create a program that reads a .txt text file and turns each line into a list. Then, from these lists, the program should build a new list containing the words of the previous lists, without repeated words and in alphabetical order.

arq = open('arquivo.txt')
count = 0
x = 0
for contador in arq:
    count = count + 1
else:
    while x <= count:
        linha = arq.readline()
        a = linha.split()
        x = x + 1
        print(a)

This is what I tried. But when I run the program, only four empty lists appear. The file I used has four lines.

2 answers

3

An alternative is to store the words in a set, which is a built-in structure that does not allow repeated elements. So just open the file, read its lines, split each line into words, and add each word to the set:

palavras = set()
with open('arquivo.txt') as arq:
    for linha in arq: # for each line of the file
        for palavra in linha.split(' '): # for each word of the line
            palavras.add(palavra) # add the word to the set

palavras_em_ordem = sorted(palavras)
print(palavras_em_ordem)

When a word is added, the set itself checks whether it already exists and will not add duplicates. Then just use sorted to get the list of words in order.
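As a small illustration of those two steps, here is a sketch that uses an in-memory list of lines instead of a real file. The sample lines are made up, and it calls split() with no argument (an assumption that differs slightly from the code above) so that the trailing newline does not stay attached to the last word of each line:

```python
# Hypothetical sample data standing in for the lines of arquivo.txt.
linhas = ["banana abacaxi\n", "uva banana\n", "abacaxi caju\n"]

palavras = set()
for linha in linhas:
    # split() with no argument splits on any whitespace, so the
    # trailing "\n" is dropped instead of becoming part of a word
    for palavra in linha.split():
        palavras.add(palavra)  # adding an existing word has no effect

palavras_em_ordem = sorted(palavras)  # sorted() returns a new list
print(palavras_em_ordem)  # ['abacaxi', 'banana', 'caju', 'uva']
```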

Notice that I opened the file inside a with block, which ensures that the file is closed at the end, even if an error occurs while reading or processing the lines. The for linha in arq loop reads one line at a time (using readlines, as suggested in another answer, loads the entire file contents into memory at once, which may consume resources unnecessarily if the file is very large).
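Reading line by line also means a file object can only be walked through once unless it is rewound, which is probably why the code in the question printed empty lists: its for loop had already consumed the file before readline was called. A minimal sketch, using io.StringIO with made-up content in place of a real file:

```python
import io

# io.StringIO stands in for a real file with four lines (made-up content).
arq = io.StringIO("um dois\ntres quatro\ncinco seis\nsete oito\n")

count = 0
for _ in arq:            # this loop consumes the whole file
    count += 1

depois = arq.readline()  # the read position is already at the end
print(count, repr(depois))  # 4 ''

arq.seek(0)              # rewind to the beginning
primeira = arq.readline().split()
print(primeira)          # ['um', 'dois']
```

Since ''.split() returns an empty list, every readline after the counting loop produces [].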

It is also worth remembering that the solution in the other answer may become much slower as the word list grows, since for each word a test checks whether it is already in the list, and this membership test is slower on lists than on sets (take the test here).
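If you want to measure this yourself, a rough sketch with timeit could look like the one below. The collection size and number of repetitions are arbitrary choices, and the actual timings depend on the machine, so no numbers are shown:

```python
import timeit

n = 10_000  # arbitrary size, chosen only for illustration
itens_lista = [str(i) for i in range(n)]
itens_set = set(itens_lista)
alvo = str(n - 1)  # last element: the slowest case for the list search

# "in" on a list scans element by element; on a set it uses a hash lookup
t_lista = timeit.timeit(lambda: alvo in itens_lista, number=200)
t_set = timeit.timeit(lambda: alvo in itens_set, number=200)

print(f"list: {t_lista:.6f}s  set: {t_set:.6f}s")
```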


Finally, it is worth remembering that splitting on spaces is a "naive" way to get the words. It was not clear what is in the file, but if it has a phrase like "Olá, tudo bem?", split(' ') will treat Olá, and bem? as words (the comma and the question mark become part of the "word", so they are considered different words from Olá and bem). If you want to handle more complex cases, strip commas and other punctuation marks, and also handle compound words (such as "beija-flor") or words with an apostrophe ("gota d'água"), there are a few examples here, here and here.
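One possible approach, shown here only as a sketch, is a regular expression. The pattern below is an assumption of mine and not the one from the linked examples: it matches runs of word characters, optionally joined by a hyphen or an apostrophe. Note that \w also matches digits and underscores, which may or may not be what you want:

```python
import re

# Assumed pattern: word characters, optionally joined by "-" or "'".
PADRAO = re.compile(r"\w+(?:[-']\w+)*")

def extrair_palavras(texto):
    """Return the words found in texto, without surrounding punctuation."""
    return PADRAO.findall(texto)

print(extrair_palavras("Olá, tudo bem?"))
# ['Olá', 'tudo', 'bem']
print(extrair_palavras("beija-flor e gota d'água"))
# ['beija-flor', 'e', 'gota', "d'água"]
```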

Also, a distinction is made between upper and lower case: oi and Oi are considered different words. If you want both to count as the same word, just change the line that adds to the set to palavras.add(palavra.casefold()).
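A quick sketch of the effect of casefold, using a made-up list of words instead of a file:

```python
palavras = set()
for palavra in ["Oi", "oi", "OI", "Tchau"]:
    # casefold() normalizes case, so all three spellings of "oi" collapse
    palavras.add(palavra.casefold())

print(sorted(palavras))  # ['oi', 'tchau']
```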

  • 1

    The observation about opening the file with the with statement was quite timely, because every time we work with files we must make sure they are opened as well as closed properly. Good observation. +1.

2

If I understand your problem, here’s what you’re trying to do:

arquivo = open('myfile.txt', 'r')
listaDeLinhas = arquivo.readlines()  # reads all lines into a list
arquivo.close()

palavras = []
for linha in listaDeLinhas:
    conteudoLinha = linha.split(" ")  # words of this line
    for palavra in conteudoLinha:
        if palavra not in palavras:  # skip words already seen
            palavras.append(palavra)

Something like that.

  • 1

    Yes, that’s what it was. Looking at it, it seems so simple, but I’ve tried so many ways. My difficulty with programming is seeing how the pieces of code connect. I have an idea of how the program should work, but when it comes to turning it into the language’s commands, it doesn’t work.
