Extracting words from a long text and creating statistics on them. What’s wrong?

3

We have the book "Pride and Prejudice" by Jane Austen, from Project Gutenberg:

http://www.gutenberg.org/ebooks/1342

The goal is to extract all the words of the text and compute statistics such as: the frequency of each word, the total number of characters in the text, the average word length, the average sentence length and a "top 10" of the longest words.

Looking at the text, I found that many words come with stray characters attached, for example:

"the,"

So these characters need to be removed first. This is what I tried:

# -*- coding: UTF-8 -*-
from string import punctuation
from collections import Counter
with open("1342-0.txt",encoding='utf8') as f:
    texto = f.read()

words = texto.split()
n_words =[]
for word in words:
    for p in punctuation:
        if p in word:
            n_word = word.replace(p,"")
            n_words.append(n_word)
        n_words.append(word)

"1342-0.txt" is the book in question. The above code tries to delete unwanted characters but does not work. What is wrong? Any better idea?

  • I don’t understand what’s so strange about "the,". Is it the comma? If so, just remove all punctuation before processing the text.

  • @Augusto Vasques: that’s what I’m trying to do! Remove punctuation, accents... but it didn’t work.

2 answers

2


An initial idea is to split not only on spaces, but on any character that is not part of a word:

from collections import Counter
import re

r = re.compile(r'\W+')
c = Counter()
with open("1342-0.txt", encoding='utf8') as f:
    for linha in f:
        for word in r.split(linha):
            if word:  # split() yields empty strings at line boundaries; skip them
                c.update([word])

print(c)

The shortcut \W means "anything that is not a letter, a digit, or the character _". Since the text contains "words" such as _she_, these are counted as a different word from she. I am also treating numbers (such as 1) as "words", so they are counted too.

As I find the words, I update the Counter with its update method (if the key does not exist, it is created with the value 1; if it exists, 1 is added to its value, so in the end we have the total count of each word).
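A minimal sketch of how Counter.update behaves, using a made-up list of words rather than the book itself:

```python
from collections import Counter

c = Counter()
# update() takes an iterable of keys; each key's count goes up by one,
# and keys that are not present yet start from zero.
for word in ["she", "was", "she"]:
    c.update([word])

print(c["she"])  # 2
print(c["was"])  # 1
print(c["he"])   # 0, a missing key reads as zero instead of raising KeyError
```

Passing the bare string (c.update(word)) would count individual characters instead, which is why the word is wrapped in a list.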

Another detail is that read() loads the entire contents of the file into memory at once. Depending on the size of the file, this may be a problem. The code above, by contrast, reads one line at a time (I am assuming that no word starts on one line and ends on the next, although in that case read plus split would not treat it as a single word either).

If you don’t want to include the _ as part of a word, simply change the regex to:

r = re.compile(r'[\W_]+')
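To make the difference between the two regexes concrete, here is a small sketch on a made-up sample line (not taken from the book):

```python
import re

sample = "_she_ said she had 1 idea"  # made-up sample line

# \W+ keeps the underscores, because _ counts as a word character.
with_underscore = [w for w in re.split(r'\W+', sample) if w]
# [\W_]+ treats the underscore as a separator as well.
without_underscore = [w for w in re.split(r'[\W_]+', sample) if w]

print(with_underscore)     # ['_she_', 'said', 'she', 'had', '1', 'idea']
print(without_underscore)  # ['she', 'said', 'she', 'had', '1', 'idea']
```

The "if w" filter drops the empty strings that split returns when the line starts or ends with a separator.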

The problem is that there are also hyphenated words, such as over-scrupulous. The code above treats these as two different words ("over" and "scrupulous"). If you want them to count as one word, the regex has to change a little:

from collections import Counter
import re

r = re.compile(r'\b\w+(?:-\w+)*\b')
c = Counter()
with open("1342-0.txt", encoding='utf8') as f:
    for linha in f:
        for word in r.findall(linha):
            c.update([word])

print(c)

Now I use \w+ (one or more word characters) followed by a group containing a hyphen and \w+, and this whole group can repeat zero or more times. That way I also get words with one or more hyphens.

If you don’t want to include the _ as part of a word, use:

r = re.compile(r'\b[^\W_]+(?:-[^\W_]+)*\b')
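A quick way to compare the splitting approach with the hyphen-aware one is to run both on the same line. The sample sentence below is invented for illustration:

```python
import re

sample = "She was over-scrupulous, surely -- a well-to-do lady."  # made-up line

# Splitting on non-word characters breaks hyphenated words apart.
split_words = [w for w in re.split(r'\W+', sample) if w]
# Matching words with optional "-part" groups keeps them together.
hyphen_words = re.findall(r'\b\w+(?:-\w+)*\b', sample)

print(split_words)
# ['She', 'was', 'over', 'scrupulous', 'surely', 'a', 'well', 'to', 'do', 'lady']
print(hyphen_words)
# ['She', 'was', 'over-scrupulous', 'surely', 'a', 'well-to-do', 'lady']
```

A standalone "--" is not matched, since the pattern requires word characters on both sides of each hyphen.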

It is worth remembering that string.punctuation only contains the characters !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~. Any other character in the text that is not a letter, a digit or _ will not be removed.

An example is the character “ (present in the text), which is not the same thing as " (they are different quotes: the first is "LEFT DOUBLE QUOTATION MARK" and the second is "QUOTATION MARK"; if you use punctuation, only the second is removed).
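The difference between the quote characters can be checked directly. This is a small sketch; the sample string is made up, and the code points are written as escapes so they are unambiguous:

```python
import unicodedata
from string import punctuation

# Name of each quote character and whether string.punctuation contains it.
report = [(unicodedata.name(ch), ch in punctuation)
          for ch in ['"', '\u201c', '\u201d']]
for name, found in report:
    print(name, found)

# Curly quotes survive a punctuation-based cleanup; the ASCII ones do not.
sample = '\u201cMy dear Mr. Bennet,\u201d'  # made-up sample
cleaned = sample.translate(str.maketrans('', '', punctuation))
print(cleaned)
```

Only QUOTATION MARK is reported as present in punctuation, and the cleaned sample still carries both curly quotes.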

  • Thank you! I would like to understand what I am doing wrong when it comes to removing unwanted characters from words. Any ideas?

  • @Laurindasouza I don’t quite follow your logic, but the problem seems to be that you do n_word = word.replace(p,"") (so the word without the punctuation is in n_word), and then at the end you do n_words.append(word) (that is, you add the original word, punctuation included). It’s a somewhat confusing algorithm, and I think it also has extra appends. A simpler version of your loop would be: https://ideone.com/LnJf3v

  • if you can help https://answall.com/questions/443768/utilizando-recursos-de-program%C3%a3o-functional-to-remove-a-list-of-words-d
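The problem pointed out in the comments above can be sketched as follows. In the question's loop, every replace starts again from the original word, and the append of the original word sits inside the inner loop, so it runs once per punctuation character. The version below is not the code behind the ideone link, only one possible simpler loop, run on a made-up sample string instead of the book:

```python
from string import punctuation

texto = 'It is a truth, "well fixed" -- so they say.'  # made-up sample

n_words = []
for word in texto.split():
    # Strip every punctuation character, accumulating the result in the
    # same variable so that earlier removals are not lost.
    for p in punctuation:
        word = word.replace(p, "")
    # Append exactly once per word, after all replacements are done,
    # and skip tokens that consisted only of punctuation.
    if word:
        n_words.append(word)

print(n_words)
# ['It', 'is', 'a', 'truth', 'well', 'fixed', 'so', 'they', 'say']
```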

2

To remove punctuation characters from a text in Python, a single line is enough:

from string import punctuation

texto = '''It is a truth universally acknowledged, that a single man in
      possession of a good fortune, must be in want of a wife.

      However little known the feelings or views of such a man may be
      on his first entering a neighbourhood, this truth is so well
      fixed in the minds of the surrounding families, that he is
      considered the rightful property of some one or other of their
      daughters.      
'''

# Remove the punctuation characters
print(texto.translate(str.maketrans('', '', punctuation)))

Result:

  It is a truth universally acknowledged that a single man in
  possession of a good fortune must be in want of a wife

  However little known the feelings or views of such a man may be
  on his first entering a neighbourhood this truth is so well
  fixed in the minds of the surrounding families that he is
  considered the rightful property of some one or other of their
  daughters     

Running on Repl.it: https://repl.it/repls/DarkcyanPointlessChord

Example 2: https://repl.it/repls/ResponsibleVariableTheories

The logic is as follows: str.translate() returns a copy of the string in which each character has been mapped through the translation table built by str.maketrans(). With three arguments, as used here, every character in the third argument is mapped to None, which means it is deleted.
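To see what the translation table looks like, it can be inspected directly. A minimal sketch, with a made-up sample string:

```python
from string import punctuation

table = str.maketrans('', '', punctuation)

# The table is a plain dict keyed by code point; characters that should
# be deleted are mapped to None.
print(table[ord(',')])  # None
print(len(table))       # 32, one entry per character in string.punctuation

cleaned = "the, truth!".translate(table)
print(cleaned)          # the truth
```

Building the table once and reusing it avoids rebuilding the mapping for every string that has to be cleaned.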

  • I didn’t know about translate!

  • I didn’t want to post this as an answer, because something this simple hardly makes for one. I was going to leave it as a comment on your question.

  • @Laurindasouza The problem is that punctuation only covers some ASCII characters, but the text contains, for example, the character “, which is not the same thing as " (the first is not removed if you use punctuation, only the second is).

  • @hkotsubo just add the character to the translation table. There is no mystery.

  • Yes, Augusto, but then she will have to know all the non-ASCII characters that appear in the text and include them in the list. Depending on the text, that can be more laborious...

  • @Augusto Vasques: will I have to use str.maketrans('', '', punctuation) for each character? I don’t quite understand... Here you are only removing the commas...

  • @Laurindasouza it removed everything that is defined in string.punctuation. What we are discussing is the character “, which is not defined in punctuation, and I said: just concatenate that character to string.punctuation so it is removed as well.

  • @Augusto Vasques: got it! I’m a beginner in programming! Thank you!

  • @Laurindasouza look at this other example https://repl.it/repls/ResponsibleVariableTheories which removes the characters “ and ”. This is what we were discussing: if you want to remove something that is not defined in string.punctuation, you have to add it, and I maintain that adding it is simple.

  • @Laurindasouza The addition is simple if you know which characters are present in the text. But if a new text with an unanticipated character appears, you will have to change the code and add it as well. My answer does not have this problem, since it already eliminates everything that is not part of a word. Admittedly it has other problems, such as relying on regex, which is not as simple and not as fast :-)

  • @Augusto Vasques Why call str.maketrans and not texto.maketrans?

  • @Laurindasouza Read the documentation: str.maketrans() is a static method, and static methods should be called on the class rather than on instances.

  • @Laurindasouza str.translate(), on the other hand, is an instance method, which means it should be invoked on an instance: texto.translate(...)
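The two positions in the comments above can be compared side by side. The first table follows the suggestion of concatenating extra characters to string.punctuation (the choice of the two curly quotes here is an assumption). The second is an alternative that does not come from the thread: it deletes every code point whose Unicode category starts with "P", so no manual list is needed. The sample string is made up.

```python
import sys
import unicodedata
from string import punctuation

# Suggestion from the comments: add the known extra characters by hand.
extra = '\u201c\u201d'  # curly double quotes (assumed choice of extras)
manual_table = str.maketrans('', '', punctuation + extra)

# Alternative, not from the thread: delete everything Unicode classifies
# as punctuation (categories Pc, Pd, Ps, Pe, Pi, Pf, Po).
unicode_table = {
    cp: None
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith('P')
}

sample = '\u201cMy dear Mr. Bennet,\u201d said his lady \u2014 one day.'
manual_result = sample.translate(manual_table)    # the em dash survives
unicode_result = sample.translate(unicode_table)  # the em dash is removed too
print(manual_result)
print(unicode_result)
```

One caveat with the second table: characters such as $, +, < or ~ belong to the Unicode symbol categories rather than punctuation, so they are in string.punctuation but are not removed by unicode_table.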
