Remove punctuation from a file in Python

I have this code:

dataClean = ''.join(data).lower()
dataClean = re.sub(r'["-,.:@#?!&$]', ' ', dataClean)
print(dataClean)

Here data is a list with the lines of a text file. My goal is to remove punctuation, such as exclamation marks, commas, and so on. The code above runs, but it is not removing quotes or dashes. Does anyone know why?


And what about hyphenated words, such as "disse-lhe"? Is there a way to make an exception so that this hyphen is not removed?

  • Do you know what the code does? Your regex was not written to remove single quotes or dashes.

  • @walt057, which punctuation marks exactly are you trying to remove? List all of them so people can see what you have done and help with an answer. I also suggest taking a closer look at regex, because what you have written does not seem close to what you want the code to do.

  • @fernandosavio Actually, this regex does remove the quote (you mean the single quote, right?). It does so somewhat "unintentionally", true, but it removes it :-) I explain this in detail in my answer below.

  • @hkotsubo Man, I'm blind. I couldn't find the single quote (or apostrophe) in the regex, hahaha. Never mind, I have read your answer now.

1 answer

Inside brackets, the hyphen has a special meaning: it defines character ranges, for example [a-z], which means "a letter from a to z (lowercase)".

The catch is that these ranges are not limited to letters; you can use any characters you want. In your case, "-, inside the brackets is interpreted as "any character between " and ,", based on the Unicode code points (which, for these characters, are the same as the values in the ASCII table).

In other words, ["-,] matches any character from this list: ", #, $, %, &, ', (, ), *, + and ,. See:

import re

# find the characters in the string that match the regex ["-,]
for m in re.finditer('["-,]', 'abc"#$%&*+,()\'def'):
    print(m.group(), end=" ")

The output of this code is:

" # $ % & * + , ( ) '

That is, the expression you are using already replaces these characters, in addition to the others you listed after them: ., :, @, #, ?, !, & and $ (yes, some are redundant because they are already covered by the range "-,).
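To check which characters a range such as "-, covers, you can list the code points between its two endpoints. A minimal sketch (the variable names are only illustrative):

```python
import re

# A range inside brackets covers every code point from its start to its end
start, end = ord('"'), ord(',')          # 34 and 44
covered = [chr(cp) for cp in range(start, end + 1)]
print(' '.join(covered))                 # " # $ % & ' ( ) * + ,

# Every listed character matches the class, but the hyphen itself (45) does not
pattern = re.compile('["-,]')
all_match = all(pattern.fullmatch(ch) for ch in covered)
hyphen_match = pattern.fullmatch('-')
print(all_match, hyphen_match)           # True None
```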


Anyway, so that the hyphen no longer means "range between what comes before and what comes after it" and is treated as a literal hyphen, just escape it with \. That is, write \- instead of a plain -.
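The difference between the range and the escaped hyphen can be seen in a small sketch like the one below (the sample text is made up for illustration). It also shows that a hyphen placed last inside the brackets is taken literally, which is an alternative to escaping:

```python
import re

text = 'well-known, "quoted"'

# Unescaped: "-, is a range (" through ,), so the hyphen itself survives
as_range = re.sub(r'["-,]', '', text)
print(as_range)      # well-known quoted

# Escaped: \- is a literal hyphen, so ", - and , are each removed
escaped = re.sub(r'["\-,]', '', text)
print(escaped)       # wellknown quoted

# A hyphen at the end of the brackets is also treated as a literal hyphen
at_end = re.sub(r'[",-]', '', text)
print(at_end)        # wellknown quoted
```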

But since we are using ranges anyway, why not take advantage of that and use ones that already cover the characters you want? You could, for example, use [!-.:-@], which contains 2 ranges:

  • !-. matches every character between ! and .
  • :-@ matches every character between : and @

This already includes the hyphen, the double and single quotes (" and ') and all the other characters in your original expression. Check the ASCII table again to see every character that is covered.

The second range matches some extra characters that you had not listed before (such as = and >). If you do not want them replaced, simply drop the ranges and list the characters you want, one by one. For example, to add the hyphen and the single quote to your original regex, use ["\'\-,.:@#?!&$].

Either way, whichever you choose, the regex now removes the hyphen and the quotes:
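If you are unsure what the ranges add, one way to check is to test every ASCII punctuation character against both expressions. This is only a sketch; it uses string.punctuation from the standard library as the list of candidates, and the variable names are illustrative:

```python
import re
import string

original = re.compile(r'["-,.:@#?!&$]')   # expression from the question
ranged = re.compile(r'[!-.:-@]')          # expression built from two ranges

# Test each ASCII punctuation character against both expressions
in_original = {ch for ch in string.punctuation if original.fullmatch(ch)}
in_ranged = {ch for ch in string.punctuation if ranged.fullmatch(ch)}

extra = ''.join(sorted(in_ranged - in_original))
missing = ''.join(sorted(set(string.punctuation) - in_ranged))
print('only in the ranged version:', extra)    # -;<=>
print('not matched by the ranges: ', missing)  # /[\]^_`{|}~
```

Note that / sits between . and : in the ASCII table, so neither range includes it.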

s = 'a"-b\':!?@#.c'
print('before', s)
s = re.sub('[!-.:-@]', ' ', s)
print('after ', s)

Output:

before a"-b':!?@#.c
after  a  b       c

The code above replaces the characters with a space. If you want to remove them instead, replace them with '' (with no space between the quotes):

s = re.sub('[!-.:-@]', '', s)

The result then becomes abc (without the spaces).
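Applied to the code from the question, this could look like the sketch below. The contents of data are a made-up stand-in for the lines read from the text file, and the names data_clean, spaced and removed are only illustrative:

```python
import re

# Hypothetical stand-in for the lines read from the text file
data = ['Hello, "World"!\n', "It's a test: yes?\n"]

data_clean = ''.join(data).lower()
# Replace each punctuation character with a space, as in the question
spaced = re.sub(r'[!-.:-@]', ' ', data_clean)
# Or delete the punctuation outright
removed = re.sub(r'[!-.:-@]', '', data_clean)

print(spaced)
print(removed)
```

Replacing with a space keeps the text the same length, while replacing with '' joins what was around the punctuation (it's becomes its).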


Do not replace the hyphen between words

In this case I would split the work into 2 steps: first remove the hyphens that are not between words, then remove the other characters.

For the first step, I only want to replace the hyphen if it meets at least one of the two conditions below:

  • there is no letter before it, or
  • there is no letter after it

If neither condition is met, the hyphen should not be removed. For this I will use negative lookaheads and lookbehinds, which are ways of making the regex "look at what comes before and after". The expression looks like this:

s = re.sub(r'(?<![a-z])-|-(?![a-z])', '', s, flags=re.IGNORECASE)

I’m using [a-z] to detect any letter from a to z. Then I check 2 conditions:

  • (?<![a-z])-: hyphen that does not have a letter before (the syntax (?<!...) checks if something is not before the current position), or
  • -(?![a-z]): hyphen that does not have a letter after (the syntax (?!...) checks if something is not after the current position)

I also use the IGNORECASE flag so that the regex also matches uppercase letters. With this, hyphens that have a letter both before and after are not replaced.
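To see the two conditions at work, here is a small sketch that runs the expression over a few made-up samples (the sample strings and the name stray_hyphen are only illustrative):

```python
import re

# Hyphen with no letter before it, or with no letter after it
stray_hyphen = re.compile(r'(?<![a-z])-|-(?![a-z])', flags=re.IGNORECASE)

samples = ['disse-lhe', 'Guarda-Chuva', '-start', 'end-', 'a - b', '"-b', '2-3']
results = {sample: stray_hyphen.sub('', sample) for sample in samples}
for sample, cleaned in results.items():
    print(repr(sample), '->', repr(cleaned))
```

Only the first two samples keep their hyphen. In 2-3 the hyphen is removed as well, because digits are not in [a-z].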

If you want to include accented characters, you can use, for example, [a-záéíóúâêôãõç] instead of [a-z] (add more characters inside the brackets if you need to). Another option is \w, but the problem is that \w also matches digits and the _ character (so you decide whether it is a good option or not). Another detail is that in Python 2 \w only matches accented characters if the UNICODE flag is enabled, whereas in Python 3 it already matches them by default. In the end, choose the one that best fits your use cases.

Then I do the substitution the same way as before, but leaving out the hyphen (because the ones that had to be replaced already were). The code then looks like this:
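The behavior of \w described above can be checked with a short sketch like this one, assuming Python 3 and str patterns (the sample strings are made up):

```python
import re

sample = 'aá9_-'
# In Python 3, \w in a str pattern is Unicode-aware by default
unicode_word = re.findall(r'\w', sample)
# With re.ASCII, \w is limited to [a-zA-Z0-9_], so the accented letter is skipped
ascii_word = re.findall(r'\w', sample, flags=re.ASCII)
print(unicode_word)   # ['a', 'á', '9', '_']
print(ascii_word)     # ['a', '9', '_']

# With \w in the lookarounds, hyphens next to digits or _ are kept as well
kept = re.sub(r'(?<!\w)-|-(?!\w)', '', 'amá-la 2-3 -x')
print(kept)           # amá-la 2-3 x
```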

s = 'a"-b\':!?@#.c disse-lhe amá-la'
print('before', s)
# replace hyphens, as long as they are not between words
s = re.sub(r'(?<![a-záéíóúâêôãõç])-|-(?![a-záéíóúâêôãõç])', '', s, flags=re.IGNORECASE)
# replace the special characters, except the hyphen
s = re.sub('[!-,.:-@]', '', s)
print('after ', s)

The output is:

before a"-b':!?@#.c disse-lhe amá-la
after  abc disse-lhe amá-la

Note that only the first hyphen was replaced, since it was not between two letters (it had a b after it, but a " before it). The contents of the brackets have also changed to [!-,.:-@] so that the hyphen is no longer included (it is now the range !-,, the period (.) and the range :-@).

Of course, if you want, you can put everything in a single regex:

s = re.sub(r'[!-,.:-@]|(?<![a-záéíóúâêôãõç])-|-(?![a-záéíóúâêôãõç])', '', s, flags=re.IGNORECASE)

I think this version is harder to understand and, above all, harder to maintain in the future. But again, you decide which option to use.
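If you do go with a single expression, one way to make it easier to read is the re.VERBOSE flag, which lets you spread the pattern over several lines with comments. This is just a sketch of that idea; the names LETTER and cleanup are my own:

```python
import re

LETTER = '[a-záéíóúâêôãõç]'

# The same single expression, spread over several lines with re.VERBOSE
cleanup = re.compile(rf'''
      [!-,.:-@]           # special characters, hyphen excluded
    | (?<!{LETTER})-      # hyphen with no letter before it
    | -(?!{LETTER})       # hyphen with no letter after it
''', flags=re.IGNORECASE | re.VERBOSE)

s = 'a"-b\':!?@#.c disse-lhe amá-la'
result = cleanup.sub('', s)
print(result)   # abc disse-lhe amá-la
```

In verbose mode, whitespace and # comments outside the brackets are ignored, so the pattern itself is unchanged.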


Quotation marks

Your first version of the regex should already remove the quotes. If it is not removing them, I suspect your strings do not contain exactly the character ".

This is because " is not the only quote character that exists. The Unicode categories "Punctuation, Open" and "Punctuation, Close" contain several other quotation marks, such as 〝 and 〞, in addition to other, less "obvious" quotes, such as 「 and 《.

If this is the case, simply add the respective characters inside the brackets.
For example, [!-,.:-@〝〞「《] would include all four quote characters mentioned above.

The same goes for the hyphen, because there are also several different characters that are called hyphens. Check whether you need to include them too.
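To find out which character your text really contains, the standard unicodedata module can show its code point, category and name. A minimal sketch, with a few quote and dash variants chosen by me as examples (written as escapes so they are unambiguous):

```python
import unicodedata

# The ASCII quote, four quote-like characters, and a few dash variants
candidates = ['"', '\u301d', '\u301e', '\u300c', '\u300a',
              '-', '\u2010', '\u2013', '\u2014']

categories = {}
for ch in candidates:
    # category() returns the two-letter Unicode general category
    categories[ch] = unicodedata.category(ch)
    print(f'U+{ord(ch):04X}', categories[ch], unicodedata.name(ch))
```

The dash variants fall in the category Pd ("Punctuation, Dash"), while the ASCII " is in Po ("Punctuation, Other"), not in Ps or Pe.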


The regex module

If you want, you can install the regex module, which supports some features that the re module currently does not. One of them is the ability to use Unicode properties in expressions, with the \p syntax:

import regex

s = 'a"-b\':!?@#.c disse-lhe 〝〞「《amá-la'
print('before', s)
s = regex.sub(r'[!-,.:-@\p{Ps}\p{Pe}]|(?<!\p{L})-|-(?!\p{L})', '', s)
print('after ', s)

The output is:

before a"-b':!?@#.c disse-lhe 〝〞「《amá-la
after  abc disse-lhe amá-la

The expressions \p{Ps} and \p{Pe} mean, respectively, any character in the categories Ps ("Punctuation, Open") and Pe ("Punctuation, Close"), which include all the different quotation marks I mentioned above.

Notice that I also replaced the letters ([a-z...]) with \p{L}, which matches any letter defined by Unicode, both uppercase and lowercase (so I removed the IGNORECASE flag). This option may be too broad, because it includes letters from other alphabets/languages (the Unicode categories that start with "L" list all the characters covered; the lowercase letters alone number more than 2 thousand), so if you want to restrict yourself to the Latin alphabet, you can keep the expression [a-z...].

Unfortunately, \p is not yet supported by Python's native re module, so installing this module is a good way to simplify your expressions. Whether to use it is up to you and depends on your use cases (if you do not need to include every kind of quotation mark and will only work with the Latin alphabet, for example, you do not need \p).
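If installing a module is not an option, a rough approximation with only the standard library is to build the bracket contents yourself from unicodedata. This is not what the regex module does internally, just a sketch under that assumption; the names open_close and special are my own, and the hyphen step reuses the letter list from earlier:

```python
import re
import sys
import unicodedata

# Every code point whose category is "Punctuation, Open" or "Punctuation, Close"
open_close = ''.join(
    chr(cp) for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) in ('Ps', 'Pe')
)
# re.escape keeps characters such as ( ) [ ] literal inside the brackets
special = re.compile('[!-,.:-@' + re.escape(open_close) + ']')

s = 'a"-b\':!?@#.c disse-lhe \u301d\u301e\u300c\u300aamá-la'
# Hyphens that are not between letters, as in the two-step version
s = re.sub(r'(?<![a-záéíóúâêôãõç])-|-(?![a-záéíóúâêôãõç])', '', s, flags=re.IGNORECASE)
s = special.sub('', s)
print(s)   # abc disse-lhe amá-la
```

Scanning all code points takes a moment, so it is better to build the expression once and reuse it.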
