How to extract only words from any text, ignoring punctuation and uppercase letters?

Consider the text below:

This is the most favourable period for travelling in Russia.  They fly
quickly over the snow in their sledges; the motion is pleasant, and,
in my opinion, far more agreeable than that of an English stagecoach.
The cold is not excessive, if you are wrapped in furs--a dress which
I have already adopted, for there is a great difference between walking
the deck and remaining seated motionless for hours, when no exercise
prevents the blood from actually freezing in your veins.  I have no
ambition to lose my life on the post-road between St. Petersburgh
and Archangel. I shall depart for the latter town in a fortnight
or three weeks; and my intention is to hire a ship there, which can
easily be done by paying the insurance for the owner, and to engage
as many sailors as I think necessary among those who are accustomed
to the whale-fishing.  I do not intend to sail until the month of June;
and when shall I return?  Ah, dear sister, how can I answer this question?
If I succeed, many, many months, perhaps years, will pass before you
and I may meet.  If I fail, you will see me again soon, or never.
Farewell, my dear, excellent Margaret.  Heaven shower down blessings
on you, and save me, that I may again and again testify my gratitude
for all your love and kindness.

I would like to create a list of the words in the text and display the frequency of each one. I found the code below on the Internet; it works, but I did not understand it:

def PegaPalavras(texto):
    return ''.join((c if c.isalnum() else ' ') for c in texto).split()

I tried to rewrite it more simply but I couldn’t:

def PegaPalavras(texto):
    palavras = []

    for c in texto:
        if c.isalnum():
           c=c
           palavras.append("".join(c).split())
        else:
           c =" "
           palavras.append("".join(c).split())


    return palavras

How can I rewrite the first snippet more explicitly (not as a single line), to make it easier to understand? Is there another solution?

2 answers

2

You can use the function re.finditer from re, Python's built-in regular expression module.

On Unicode strings (str), the pattern \w+ matches one or more consecutive word characters: a-z, A-Z, 0-9, _, plus the many letters and digits from other scripts that Unicode defines. See the documentation.
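As a small illustration of what \w+ does and does not match, here is a sketch using re.findall. The sample strings are made up for the demonstration, and the re.ASCII flag is shown only for contrast:

```python
import re

# Default (Unicode) matching on str patterns:
# accented letters, digits and underscores all count as word characters
unicode_words = re.findall(r'\w+', "coração, furs--a dress_1")
print(unicode_words)  # ['coração', 'furs', 'a', 'dress_1']

# With re.ASCII, \w is limited to [a-zA-Z0-9_],
# so the accented letters split the word in two
ascii_words = re.findall(r'\w+', "coração", flags=re.ASCII)
print(ascii_words)  # ['cora', 'o']
```

Note that \w also accepts the underscore and digits, so "dress_1" counts as a single word here.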

With re.finditer you can create an iterator that yields the words one at a time as the string is scanned.

For example:

def get_palavras(string):
    yield from re.finditer(r'\w+', string)

A complete example would be:

import re


def get_palavras(string):
    yield from re.finditer(r'\w+', string)


texto = """This is the most favourable period for travelling in Russia.  They fly
quickly over the snow in their sledges; the motion is pleasant, and,
in my opinion, far more agreeable than that of an English stagecoach.
The cold is not excessive, if you are wrapped in furs--a dress which
I have already adopted, for there is a great difference between walking
the deck and remaining seated motionless for hours, when no exercise
prevents the blood from actually freezing in your veins.  I have no
ambition to lose my life on the post-road between St. Petersburgh
and Archangel. I shall depart for the latter town in a fortnight
or three weeks; and my intention is to hire a ship there, which can
easily be done by paying the insurance for the owner, and to engage
as many sailors as I think necessary among those who are accustomed
to the whale-fishing.  I do not intend to sail until the month of June;
and when shall I return?  Ah, dear sister, how can I answer this question?
If I succeed, many, many months, perhaps years, will pass before you
and I may meet.  If I fail, you will see me again soon, or never.
Farewell, my dear, excellent Margaret.  Heaven shower down blessings
on you, and save me, that I may again and again testify my gratitude
for all your love and kindness."""

for match in get_palavras(texto):
    print(f"texto[{match.start()}:{match.end()}]: {match[0]}")

And the output:

texto[0:4]: This
texto[5:7]: is
texto[8:11]: the
texto[12:16]: most
texto[17:27]: favourable
...
texto[1247:1250]: all
texto[1251:1255]: your
texto[1256:1260]: love
texto[1261:1264]: and
texto[1265:1273]: kindness

See it working on Repl.it

The documentation for Match objects lists the methods and attributes you can use on each iteration (in addition to start, end and __getitem__, demonstrated in the example).
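The example above lists the words but does not yet count them or ignore case, which the question also asks for. One possible sketch, assuming collections.Counter is acceptable for the counting step (the function name contar_palavras and the short sample sentence are hypothetical, chosen only for illustration):

```python
import re
from collections import Counter


def contar_palavras(texto):
    # Lowercase each match so "If" and "if" are counted as the same word
    palavras = (match[0].lower() for match in re.finditer(r'\w+', texto))
    # Counter builds a word -> frequency mapping from the iterator
    return Counter(palavras)


contagem = contar_palavras("If I succeed, many, many months. If I fail, again and again.")

# Print each word with its frequency, most frequent first
for palavra, frequencia in contagem.most_common():
    print(palavra, ":", frequencia)
```

Counter is a dict subclass, so contagem["many"] can be read like any dictionary entry, and a word that never appears returns 0 instead of raising KeyError.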

1

Take a look at this link:

# Create an empty dictionary
d = dict()

# Open the file in read mode; the with block closes it automatically
with open("sample.txt", "r") as text:
    # Loop through each line of the file
    for line in text:
        # Remove leading/trailing whitespace, including the newline
        line = line.strip()

        # Convert the characters in line to
        # lowercase to avoid case mismatch
        line = line.lower()

        # Split the line into words on any run of whitespace
        # (split(" ") would yield empty strings for double spaces)
        words = line.split()

        # Iterate over each word in line
        for word in words:
            # Check if the word is already in dictionary
            if word in d:
                # Increment count of word by 1
                d[word] = d[word] + 1
            else:
                # Add the word to dictionary with count 1
                d[word] = 1

# Print the contents of dictionary
for key in d:
    print(key, ":", d[key])

The code above is general and does not ignore punctuation. To ignore punctuation, you can add a replacement for each punctuation mark, for example word.replace("!", ""), or use something more generic, such as:

import re
# In Python 3 a str is already Unicode, so no .decode('utf-8') is needed
string_nova = re.sub(r'[^a-zA-Z0-9áéíóúÁÉÍÓÚâêîôÂÊÎÔãõÃÕçÇ: ]', '', string_velha)
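One caveat when substituting with an empty string: words joined only by punctuation end up glued together. The sketch below contrasts deleting the punctuation with replacing it by a space, on a fragment adapted from the question's text. The pattern [^\w\s] is an assumption swapped in here for brevity; it is not the explicit character list used above.

```python
import re

# Fragment adapted from the question's sample text
amostra = "wrapped in furs--a dress; the post-road"

# Deleting punctuation glues the neighbouring words together
coladas = re.sub(r'[^\w\s]', '', amostra).split()
print(coladas)  # ['wrapped', 'in', 'fursa', 'dress', 'the', 'postroad']

# Replacing punctuation with a space keeps the words apart
separadas = re.sub(r'[^\w\s]', ' ', amostra).split()
print(separadas)  # ['wrapped', 'in', 'furs', 'a', 'dress', 'the', 'post', 'road']
```

Whether "post-road" should count as one word or two is a choice that depends on the text being analysed.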
  • Thank you! Could you explain the code -> return ''.join((c if c.isalnum() else ' ') for c in texto).split()

  • Your code does not ignore punctuation...

  • I haven't had a chance to look at that code yet.
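Regarding the one-liner from the question, here is one way to read it step by step, as a sketch rather than the original author's own expansion (the name pega_palavras_explicito and the intermediate variable names are hypothetical). Each character is kept if it is a letter or digit and replaced by a space otherwise; the characters are then joined back into one string, and split() cuts that string on runs of whitespace:

```python
def pega_palavras_explicito(texto):
    caracteres = []

    for c in texto:
        if c.isalnum():
            # Letters and digits are kept unchanged
            caracteres.append(c)
        else:
            # Punctuation, spaces and newlines all become a single space
            caracteres.append(' ')

    # Join the characters back into ONE string...
    texto_limpo = ''.join(caracteres)

    # ...and only then split it into words, once, on runs of whitespace
    return texto_limpo.split()


palavras = pega_palavras_explicito("wrapped in furs--a dress, St. Petersburgh")
print(palavras)  # ['wrapped', 'in', 'furs', 'a', 'dress', 'St', 'Petersburgh']
```

The attempt in the question calls join and split inside the loop, on one character at a time, so it appends a small list per character instead of building the whole cleaned string first. To ignore case as well, texto.lower() can be applied before the loop.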
