Regex: how to get only the first occurrence of a word in Python?

Viewed 248 times

-5

A string contains several occurrences of a word, but I want to get only the first one. How can I do that?

Below, 'primeiro' appears twice, but I want only the first occurrence.

re.findall(r'primeiro',' o primeiro o segundo primeiro novamente')

Thanks in advance.

  • Use re.search() instead of findall. See documentation here

  • 1

    That doesn’t make much sense: if you search for "primeiro", the result will be the word "primeiro". It would make sense if the regex were a pattern rather than a fixed word. What exactly do you want to do?

  • The question makes no sense. If you want the index of the first occurrence of a given word, you don’t need regex; use the method str.find(). Example: print(' o primeiro o segundo primeiro novamente'.find('primeiro'))

  • Thanks @Paulomarques, Ruan’s answer below complemented this well.

  • @hkotsubo the word "primeiro" has to be the search term, but it appears twice in the string and I want only its first occurrence. Ruan’s answer below covered this well, thanks.

  • @I beg you, pardon my ignorance of the subject and my failure to state the question more clearly. Ruan’s answer below complemented this question well, thanks.

  • 1

    But if you search for the regex "primeiro", the result is the word "primeiro", so in practice it would be enough to know whether the word "primeiro" is in the string: if 'primeiro' in texto, or something like that. Do you see that using regex is rather pointless in this case? It’s as if I wanted to find the letter "a" in the word "banana" and wanted the letter "a" itself as the result. I don’t need regex for that, I just need to know whether the word contains the letter "a"... The answers below are overkill, like using a cannon to kill a fly, and it is a pity that no one has even mentioned it...

  • Got it, @hkotsubo. But as this work continues I will need more from the first occurrence, such as the 30 characters that follow it, and then I will have to identify a pattern within that range. I didn’t put all these details in the initial question because it would have become too long.

  • Did you see what I mentioned in your other question? Here. Maybe it will help...

  • Yes, I was already using something similar, thanks.

3 answers

0

Due to the lack of clarity of the question, it can be interpreted in the following ways:

  • How to know if a word is present in a text?
  • How to get the index of the first occurrence of a word in a text?
  • How to get the slice of a string where the first occurrence of a word is found?

Regardless of which one the question actually is, none of the three cases calls for regex. Regular expressions are a comparatively expensive and cumbersome tool and should be used to find patterns of characters or bytes, not a fixed word, because cheaper tools exist for that.

Don’t get me wrong, I love working with regex. But despite the nice name, regular expressions correspond to regular (Type-3) grammars in the Chomsky hierarchy, i.e. by using regex you are loading a pattern-matching engine (conceptually a finite automaton) into your program, which costs memory and processing time. So do use regex, but only when necessary, for example:

  • find words in a text with special or unusual spelling.
  • find words in a text containing certain vowel sequences.
  • special alphanumeric sequences.
  • break text using variable separators.
  • separate text into lexical symbols.
  • anything related to searching based on patterns of character repetition.
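
As a rough sketch of one case from the list above where regex does earn its cost, here is splitting on variable separators. The sample text and the set of separators are assumptions made up for illustration:

```python
import re

# Hypothetical sample: fields separated by commas, semicolons or runs of whitespace.
text = 'primeiro, segundo;terceiro   quarto'

# One or more separator characters in a row count as a single break.
parts = re.split(r'[,;\s]+', text)
print(parts)  # ['primeiro', 'segundo', 'terceiro', 'quarto']
```

With a single fixed separator, str.split() would do the same job without regex.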

After this introduction, let’s move on to the question(s).

How to know if a word is present in a text?

To know whether a word is present in a text, use the Python operator in. The operators in and not in test for membership: x in s returns True if x is contained in s and False otherwise; x not in s returns the negation of x in s.

s = ' o primeiro o segundo primeiro novamente'

print('primeiro' in s)                          #True
print('segundo' in s)                           #True
print('terceiro' in s)                          #False

How to get the index of the first occurrence of a word in a text?

To get the index of the first occurrence of a word in a text, use the built-in method str.find().

str.find(sub[, start[, end]])
Returns the lowest index in the string where the substring sub is found within the slice s[start:end]. The optional arguments start and end are interpreted as in slice notation. Returns -1 if sub is not found.

s = ' o primeiro o segundo primeiro novamente'

print(s.find('primeiro'))                       #3
print(s.find('segundo'))                        #14
print(s.find('terceiro'))                       #-1
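
The optional start argument described above can also be used to continue the search after the first hit. A minimal sketch, where the variable names are only illustrative:

```python
s = ' o primeiro o segundo primeiro novamente'
word = 'primeiro'

first = s.find(word)                         # lowest index in the whole string
second = s.find(word, first + len(word))     # search again, starting after the first hit
missing = s.find(word, second + len(word))   # there is no third occurrence

print(first, second, missing)                # 3 22 -1
```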

How to get the slice of a string where the first occurrence of a word is found?

To get the slice of a string where the first occurrence of a word is found, take the index of the first occurrence as the start and add the word’s length, obtained with the built-in function len(), to get the end. Keep in mind that the length of a string is not always the same as its visual length.

s = ' o primeiro o segundo primeiro novamente'

for p in ("primeiro", "segundo", "terceiro"):
  i = s.find(p)                      # -1 when p is not in s
  if i == -1:
    print(f"Word \"{p}\" not found.")
  else:
    print((p, i, i + len(p)))        # (word, start index, end index)

#('primeiro', 3, 11)
#('segundo', 14, 21)
#Word "terceiro" not found.
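
The remark about length versus visual length can be illustrated with combining characters. This is only a sketch: the word 'não' is an assumed example, and normalizing to NFC is one possible way to make the two forms comparable:

```python
import unicodedata

composed = 'n\u00e3o'      # 'não' with a precomposed 'ã' (U+00E3)
decomposed = 'na\u0303o'   # 'não' as 'a' followed by a combining tilde (U+0303)

# Both render as the same three visible characters, but len() counts code points.
print(len(composed), len(decomposed))            # 3 4

# NFC normalization composes 'a' + combining tilde back into a single 'ã'.
normalized = unicodedata.normalize('NFC', decomposed)
print(normalized == composed, len(normalized))   # True 3
```

If the text may contain decomposed characters, indices computed with find() and len() refer to code points, not to what is seen on screen.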

-3

Use re.search(), which works a little differently. Here is an example:

import re

print(re.findall(r'primeiro',' o primeiro o segundo primeiro novamente'))
print("\n")
print(re.search(r'primeiro',' o primeiro o segundo primeiro novamente'))

Returns:

['primeiro', 'primeiro']

<re.Match object; span=(3, 11), match='primeiro'>
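
One detail worth keeping in mind: re.search() returns None when the pattern is not found, so the result should be checked before it is used. A small sketch, in which first_match is a hypothetical helper name and not part of the re module:

```python
import re

texto = ' o primeiro o segundo primeiro novamente'

def first_match(pattern, text):
    """Return the text of the first match, or None when there is none."""
    m = re.search(pattern, text)
    return m.group() if m is not None else None

print(first_match(r'primeiro', texto))  # primeiro
print(first_match(r'terceiro', texto))  # None
```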

-3


The functions re.findall and re.search return the occurrences as a list and as a Match object, respectively:

['primeiro', 'primeiro']
<re.Match object; span=(3, 11), match='primeiro'>

The Match object returned by re.search has a span() method that returns a tuple. Its items are the start and end indices of the matched substring within the original string, respectively. You can access the matched text as in the example:

import re

string = ' o primeiro o segundo primeiro novamente'
resultado = re.search(r'primeiro', string)
comeco, fim = resultado.span() # (3, 11)
print(string[comeco:fim]) # primeiro

If you prefer, the method Match.group() can be used to return the full text of the first match held by the Match object:

resultado.group() # primeiro
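
For the follow-up mentioned in the question's comments (taking the 30 characters after the first occurrence and looking for a pattern inside that range), a possible sketch is shown here. The inner pattern is purely hypothetical, since the thread does not say what the asker is looking for:

```python
import re

string = ' o primeiro o segundo primeiro novamente'

resultado = re.search(r'primeiro', string)
if resultado is not None:
    fim = resultado.end()             # index just past the first occurrence
    janela = string[fim:fim + 30]     # at most 30 characters after it
    # Hypothetical follow-up pattern: a whole word starting with 'seg'.
    interno = re.search(r'\bseg\w*', janela)
    print(repr(janela))                            # ' o segundo primeiro novamente'
    print(interno.group() if interno else None)    # segundo
```

Slicing never raises when fewer than 30 characters remain; the window is simply shorter.
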
  • Very interesting answer, I will use it here. Now I have to figure out how to do this with the text in a DataFrame column.
