Regular expression case insensitive in a dataframe

Question

Asked 5 years, 2 months ago

I have a dataframe with tweets that were collected by keyword.

How can I extract, in one go, only the rows containing #flamengo and all its case variations, such as #Flamengo, #FLAMENGO, etc.?

I used:

data['text'].str.extract('(#flamengo)')

But it only returns matches for tweets where the hashtag is written in lowercase.

1 answer

by hkotsubo • **55,826** points · Answer 1 · 2020-04-19T09:30:46+00:00

According to the documentation, extract accepts a second parameter, flags, which takes flags from the re module that change how the regular expression behaves.

In this case, just use the flag re.I, which makes the regex case-insensitive (it does not differentiate between upper and lower case):

import re

data['text'].str.extract('(#flamengo)', flags=re.I)
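
As a minimal sketch of the call above, the following runs it on a small made-up dataframe (the sample tweets are hypothetical, not the asker's data). It also shows str.contains with case=False, on the assumption that the goal is to keep only the matching rows rather than just the captured text:

```python
import re
import pandas as pd

# Hypothetical sample data standing in for the asker's tweets.
data = pd.DataFrame({"text": [
    "Vamos #flamengo!",
    "Jogo do #Flamengo hoje",
    "#FLAMENGO campeao",
    "sem hashtag aqui",
]})

# flags=re.I makes the match case-insensitive.
# Rows without a match get NaN in the resulting column 0.
matches = data["text"].str.extract(r"(#flamengo)", flags=re.I)

# extract returns the captured text, not a filtered dataframe.
# To keep only the rows that contain the hashtag, filter with str.contains.
rows = data[data["text"].str.contains("#flamengo", case=False)]

print(matches)
print(rows)
```

Here extract keeps every row and fills non-matching ones with NaN, while the str.contains mask drops them.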

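It is also possible to use the inline modifier (?i) in the expression itself, which has the same effect as the flag. Note that a global (?i) must appear at the very start of the pattern; since Python 3.11, placing it anywhere else raises an error:

data['text'].str.extract('(?i)(#flamengo)')

# or, scoping the modifier to a single group
data['text'].str.extract('((?i:#flamengo))')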
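
To check the inline modifiers outside pandas, here is a small sketch using only the re module. The sample strings and the helper first_match are made up for illustration; the patterns place a global (?i) at the start of the expression and use the scoped (?i:...) form inside a group:

```python
import re

# Hypothetical sample tweets.
tweets = [
    "Vamos #flamengo!",
    "Jogo do #Flamengo hoje",
    "#FLAMENGO campeao",
    "sem hashtag aqui",
]

# Global inline flag: applies to the whole pattern, placed at the very start.
global_flag = re.compile(r"(?i)(#flamengo)")

# Scoped inline flag: applies only inside the (?i:...) group.
scoped_flag = re.compile(r"((?i:#flamengo))")

def first_match(pattern, text):
    """Return the first captured group, or None if there is no match."""
    m = pattern.search(text)
    return m.group(1) if m else None

found_global = [first_match(global_flag, t) for t in tweets]
found_scoped = [first_match(scoped_flag, t) for t in tweets]

print(found_global)
print(found_scoped)
```

Both patterns find the hashtag regardless of case; the scoped form is useful when only part of a longer expression should ignore case.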

Another answer suggested using [f|F] to match both a lowercase and an uppercase "f". However, inside a character class the | is a literal character, so that expression also matches |. If you want to follow that idea, the correct form would be [fF][lL].... But using the flags is simpler.
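
The difference can be seen with a short sketch using the re module. The test strings are made up, and the character-class pattern is spelled out in full here only for illustration:

```python
import re

# Inside [...] the | is a literal character, not an "or" operator.
with_pipe = re.compile(r"#[f|F]lamengo")

# One character class per letter, listing both cases explicitly.
per_letter = re.compile(r"#[fF][lL][aA][mM][eE][nN][gG][oO]")

# [f|F] wrongly accepts a literal | in place of the "f" ...
pipe_accepts_bar = with_pipe.search("#|lamengo") is not None

# ... and still misses variations beyond the first letter.
pipe_finds_upper = with_pipe.search("#FLAMENGO") is not None

# The per-letter version handles every case variation and rejects the |.
letters_find_upper = per_letter.search("#FLAMENGO") is not None
letters_accept_bar = per_letter.search("#|lamengo") is not None

print(pipe_accepts_bar, pipe_finds_upper, letters_find_upper, letters_accept_bar)
```

The per-letter pattern works but is long and easy to mistype, which is why the flag or inline modifier is the simpler choice.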