Count how many times a word of a file appears in another file

Viewed 1,498 times

4

I would like to count how many times each word from a word list (archive1, lexico.txt) appears in a text (archive2, corpus.txt).

with open("corpus.txt", "r") as f1, open("lexico.txt", "r") as f2:
    file1 = f1.read()
    file2 = f2.read()

    corpus1 = file1.split(" ")

    for word in file2:
        print(word, corpus1.count(word))

corpus.txt file (archive2)

I am afraid to look for other options for the quality of light is quite recommendable very white light, but the duration of everything opposite I lasted less than months the two lamps and I put them in the assist light. The light I want is paler but strong enough to light the room.

lexico.txt file (archive1)

is

but

light

Output

is 2

0

m 0

to 2

s 0

0

l 0

u 0

z 0

0

  • 1

    In lexico.txt, are the words always separated by "\n" (a line break)? That is, one word per line?

  • Always separated by a line break! I tried corpus2 = file2.split("\n"), but it didn’t work.
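A minimal sketch of why the loop in the question prints single characters: f2.read() returns one string, and iterating over a string yields characters rather than lines. The sample text below is an assumption standing in for the contents of lexico.txt.

```python
# Hypothetical stand-in for the contents of lexico.txt (one word per line).
lexico_text = "is\nbut\nlight\n"

# Iterating over a string yields one character at a time, newlines included.
chars = [ch for ch in lexico_text]

# splitlines() yields one entry per line, without the trailing newline.
words = lexico_text.splitlines()

print(chars[:3])
print(words)
```

Iterating over the open file object, or over splitlines(), gives whole lines, which is what the word lookup needs.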

1 answer

7


Here’s what you can do:

count = {}

with open('corpus.txt') as f1, open('lexico.txt') as f2:
    corpus = f1.read().split() # the text, split into words
    for word in f2: # the words to count in the text
        w_strp = word.strip() # remove line breaks
        if w_strp != '' and w_strp not in count: # if we already added it, there is no point doing it again
            count[w_strp] = corpus.count(w_strp)
print(count) # {'mas': 2, 'é': 2, 'luz': 4}

Or in this case:

count = {}

with open('corpus.txt') as f1, open('lexico.txt') as f2:
    corpus = f1.read().split()
    lexico = set(word.strip() for word in f2 if word.strip() != '') # set() to avoid repeated words

count = {l_word: corpus.count(l_word) for l_word in lexico}
print(count) # {'mas': 2, 'é': 2, 'luz': 4}

If you’re sure there are no repeated words in lexico.txt, you can simply use a list:

...
lexico = [word.strip() for word in f2 if word.strip() != '']
...

Or even:

count = {}

with open('temp/corpus.txt') as f1, open('temp/lexico.txt') as f2:
    corpus = f1.read().split()
    count = {l_word: corpus.count(l_word) for l_word in (word.strip() for word in f2 if word.strip() != '')}

print(count) # {'mas': 2, 'é': 2, 'luz': 4}
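As a possible alternative, not part of the answer above, collections.Counter can tally every token of the corpus in a single pass, which avoids rescanning the corpus once per lexicon word. This is only a sketch: the function name count_words and the sample data are assumptions made for illustration.

```python
from collections import Counter


def count_words(corpus_text, lexicon_lines):
    """Count how often each lexicon word occurs as a whole token in corpus_text."""
    frequencies = Counter(corpus_text.split())  # one pass over the corpus
    # Strip line breaks, skip empty lines, and drop repeated words.
    wanted = {line.strip() for line in lexicon_lines if line.strip()}
    # A Counter returns 0 for tokens it has never seen.
    return {word: frequencies[word] for word in wanted}


# Hypothetical sample data standing in for corpus.txt and lexico.txt.
corpus_text = "the light is white but the light is pale"
lexicon_lines = ["is\n", "but\n", "light\n", "\n", "is\n"]

result = count_words(corpus_text, lexicon_lines)
print(result)
```

Like the versions above, this matches whole whitespace-separated tokens only, so a word followed by punctuation (for example "light,") is not counted.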
  • Thanks Miguel!!! It works perfectly! Sorry to bother you, but why "not in count"? Shouldn't we add all entries, even repeated ones, and then count?

  • 1

    @pitanga because we only need to count the number of occurrences once for each word. Remember that corpus.count(w_strp) already counts all occurrences the first time, so we don't need to count them again for the same word if there are repeated words in lexico.txt. If you are sure that there are no repeated words in lexico.txt, you can drop this condition and keep only: if w_strp != '':

  • 1

    @pitanga, I noticed a bug in my code yesterday. It failed, for example, when corpus.txt contained "mas é Adamastor": the result was {'é': 1, 'mas': 2}, because it also counted the "mas" inside "Adamastor". I added split() to corpus = f1.read() so that corpus.txt is divided into words as well. I edited the answer.
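The bug described in the last comment can be reproduced in a few lines. This sketch uses the Portuguese word "mas" (but), which also occurs inside "Adamastor"; the variable names are chosen for illustration.

```python
text = "mas é Adamastor"

# str.count looks for substrings, so the "mas" inside "Adamastor" is counted too.
substring_hits = text.count("mas")

# list.count compares whole items, so only the standalone word matches.
token_hits = text.split().count("mas")

print(substring_hits, token_hits)
```

Splitting the corpus first is what makes corpus.count(...) count whole words instead of substrings.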
