Case-insensitive regular expression in a dataframe

I have a dataframe with some tweets that were collected by keyword.

How can I extract, in one go, only the rows containing #flamengo and all its variations, such as #Flamengo, #FLAMENGO, etc.?

I used:

data['text'].str.extract('(#flamengo)')

But this returns only the tweets where the hashtag is written in lowercase.

1 answer

According to the documentation, extract accepts a second parameter, flags, which alters the behaviour of the regular expression.

In this case, just use the flag re.I (short for re.IGNORECASE), which makes the regex case insensitive (it does not differentiate between upper and lower case):

import re

data['text'].str.extract('(#flamengo)', flags=re.I)

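It is also possible to use the inline modifier (?i) in the expression itself, which has the same effect as the flag. The global form must appear at the very start of the pattern (since Python 3.11, placing it anywhere else raises an error), while the scoped form (?i:...) can appear anywhere:

data['text'].str.extract('(?i)(#flamengo)')

# or
data['text'].str.extract('((?i:#flamengo))')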
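
As a minimal sketch, the flag and the inline modifiers can be compared on a small dataframe. The sample tweets below are invented for illustration, and the final filtering step with notna() is just one possible way to keep only the matching rows:

```python
import re

import pandas as pd

# Hypothetical sample data, invented for illustration only.
data = pd.DataFrame({"text": [
    "Vamos #flamengo",
    "Hoje tem #Flamengo",
    "#FLAMENGO campeao",
    "sem hashtag aqui",
]})

# Option 1: pass the flag as a keyword argument.
with_flag = data["text"].str.extract(r"(#flamengo)", flags=re.I)

# Option 2: global inline modifier, placed at the very start of the
# pattern (newer Python versions reject it anywhere else).
with_inline = data["text"].str.extract(r"(?i)(#flamengo)")

# Option 3: scoped inline modifier, which may appear inside the group.
with_scoped = data["text"].str.extract(r"((?i:#flamengo))")

# extract returns NaN for rows without a match, so notna() on the
# captured column can be used to keep only the matching rows.
matched_rows = data[with_flag[0].notna()]
print(matched_rows)
```

All three calls capture the hashtag exactly as it was written in each tweet, so the original casing is preserved in the result.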

Another answer suggested using [f|F] to match both a lowercase and an uppercase "f". However, inside a character class the | is a literal, so this expression also matches the character |. If you are going to follow that idea, the right form would be [fF][lL].... But using the flags is simpler.
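
The character-class pitfall can be checked with the standard re module alone. This is only a sketch: the test strings are made up, and the per-letter pattern varies just the first two letters for brevity.

```python
import re

# [f|F] is a character class: it matches ONE character that is
# "f", "|" or "F". The "|" is not an alternation here.
wrong = re.compile(r"#[f|F]lamengo")

# One class per letter; only the first two letters are varied here.
per_letter = re.compile(r"#[fF][lL]amengo")

# The flag-based version covers every casing of every letter.
flagged = re.compile(r"#flamengo", re.I)

for sample in ["#flamengo", "#Flamengo", "#FLAMENGO", "#|lamengo"]:
    print(
        sample,
        bool(wrong.search(sample)),
        bool(per_letter.search(sample)),
        bool(flagged.search(sample)),
    )
```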

  • 1

    The other answer is wrong. It is a typical answer from someone who only knows a hammer and treats everything like a nail (and does not even know the hammer well, since you can set the case-insensitive flag, as you did).

  • 1

    If the problem could not be solved with regular expression parameters alone (suppose you decide that accents must be ignored too, so that the word for "tree" is found both with and without its accent), the correct approach is to create another column in the dataframe, using the "apply" method, that holds the normalized text (that is: accents removed, converted to lowercase, spaces and uninteresting characters converted to "_"), and then search in that other column.
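
    The normalized-column idea from the comment above could look roughly like the sketch below. The normalize helper, the text_normalized column name, and the sample sentences are all hypothetical, and keeping "#" as an interesting character is an assumption made so that hashtags survive normalization.

```python
import re
import unicodedata

import pandas as pd


def normalize(text: str) -> str:
    """Hypothetical helper: strip accents, lowercase, replace the rest by "_"."""
    # NFKD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    lowered = without_accents.lower()
    # Assumption: only ASCII letters, digits and "#" are interesting;
    # every run of other characters (spaces included) becomes one "_".
    return re.sub(r"[^a-z0-9#]+", "_", lowered)


# Invented sample data for illustration only.
data = pd.DataFrame({"text": [
    "Plantei uma Árvore",
    "plantei uma arvore",
    "outro texto",
]})

# Build the extra column once, then search in it with a plain substring.
data["text_normalized"] = data["text"].apply(normalize)
found = data[data["text_normalized"].str.contains("arvore", regex=False)]
print(found["text"])
```

    The original text column is left untouched, so the matching rows can still be displayed as they were written.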
