Make different combinations of words within a sentence


As an example I have the following sentence:

texto = "O gato subiu no telhado de 10m, um homem jogou-lhe uma pedra e ele caiu de uma altura de 20m."

I want to extract the following information:

(O gato subiu 10m, O gato caiu 20m)

I tried:

(gato).*(subiu|caiu).*(?=m)

But it only returned:

gato subiu 10m.

I can also use:

>>search_1=re.findall(re.compile('gato.*subiu.*(?=m)'),texto)

>>search_1=[gato subiu 10]

>>search_2=re.findall(re.compile('gato.*caiu.*(?=m)'),texto)

>>search_2=[gato caiu 20]

and then I put the two lists together.

But I still believe there must be a more optimized way to write this in just one line of code.

Note: the sentences always follow this order: [gato / word / number followed by "m"]

  • But the sentence does not contain "o gato caiu"; it has "ele caiu". Must the expression understand that "ele" refers to "gato"?

  • If the word "gato" appeared twice it would be easy, but it appears only once.

  • If I use '(o gato).+caiu.+(?=m)' it returns what I need, but then I would have to repeat it for every verb: subiu, caiu, etc.

  • But first answer our question: how do you expect it to return "gato caiu" when that expression does not appear in the text? Shouldn't it return "ele caiu 20m"?

  • The only possibility I can see is https://ideone.com/rqHi3v. But if you really need it to return "gato caiu", a little more code is needed.

  • @Andersoncarloswoss if I write '(o gato).+caiu.+(?=m)', it returns "o gato caiu 20m". I would then have to write another pattern, '(o gato).+subiu.+(?=m)', which returns "o gato subiu 10m", and together they give my result. But that takes two searches, and I wanted to know if there is an optimized way that does not require a second pattern.


1 answer



This cannot be done with a single expression using Python's re module
(although it is possible with the regex module, created by Matthew Barnett, using \G).

Addendum (Guilherme Lautert):

The reason you can't use the same regex for both cases is that a regex finds (or replaces) text that is actually present, and that is where your logic runs into a problem.

Look at the sentences you want:

  1. O gato subiu 10m
  2. O gato caiu 20m

You want to capture "O gato" twice, but it only shows up once; the second time its role is played by "ele". See it on regex101. In other words, "O gato" has already been consumed by the first match, so it cannot be captured again.
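The point above can be checked with a small sketch. The pattern below is a hypothetical single-expression attempt (it is not taken from the question or the answer) that requires the subject, a verb, and a measurement in one match. Because "gato" occurs only once in the text, only one match is found:

```python
import re

texto = "O gato subiu no telhado de 10m, um homem jogou-lhe uma pedra e ele caiu de uma altura de 20m."

# Hypothetical single-pattern attempt: subject, verb and measurement in one regex.
tentativa_re = re.compile(r"\b(gato)\b.*?\b(subiu|caiu)\b.*?(\d+m\b)")

# The scan resumes after "10m", where no further "gato" exists,
# so "caiu ... 20m" is never paired with a subject.
encontrados = tentativa_re.findall(texto)
print(encontrados)  # [('gato', 'subiu', '10m')]
```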

Use two expressions: one to match the subject, and one to match the verb and the number, starting each match where the previous one ended.

Subject:

\b(gato|coelho)\b

Sentence:

[^\n.]*?\b(subiu|caiu)\b[^\n.,]*?(\d+m\b)

  • [^\n.]*? - negated character class that matches any character except newlines and periods (i.e. it stays within the same sentence), with a non-greedy quantifier to get the shortest possible match.
  • \b(subiu|caiu)\b - group 1, to keep the verb.
  • [^\n.,]*? - more characters, except newlines, periods, and commas.
  • (\d+m\b) - group 2, to keep the number followed by "m".
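As a rough illustration of how the sentence pattern behaves when anchored at a given position, the sketch below calls the compiled pattern's match(string, pos) three times in a row. The names inicio, primeiro, segundo, and terceiro are illustrative only and do not come from the answer:

```python
import re

sentenca_re = re.compile(r"[^\n.]*?\b(subiu|caiu)\b[^\n.,]*?(\d+m\b)", re.IGNORECASE)

texto = "O gato subiu no telhado de 10m, um homem jogou-lhe uma pedra e ele caiu de uma altura de 20m."

# Start right after the subject "gato".
inicio = texto.index("gato") + len("gato")

primeiro = sentenca_re.match(texto, inicio)
print(primeiro.groups())  # ('subiu', '10m')

# Resume exactly where the previous match ended.
segundo = sentenca_re.match(texto, primeiro.end())
print(segundo.groups())   # ('caiu', '20m')

# Only the final period is left, which the pattern cannot cross.
terceiro = sentenca_re.match(texto, segundo.end())
print(terceiro)           # None
```

Each call is anchored at the given position, so the second verb is found without the subject having to appear again.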


Code

import re

# One pattern finds the subject; the other finds a verb and a measurement after it.
sujeito_re  = re.compile(r"\b(gato|coelho)\b", re.IGNORECASE)
sentenca_re = re.compile(r"[^\n.]*?\b(subiu|caiu)\b[^\n.,]*?(\d+m\b)", re.IGNORECASE)
resultado = ()

texto = "O gato subiu no telhado de 10m, um homem jogou-lhe uma pedra e ele caiu de uma altura de 20m."


for sujeito in sujeito_re.finditer(texto):
    pos = sujeito.end()  # start searching right after the subject
    while True:
        # match() is anchored at pos, so each search resumes where the last one ended
        sentenca = sentenca_re.match(texto, pos)
        if not sentenca:
            break
        resultado += (sujeito.group(1) + " " + sentenca.group(1) + " " + sentenca.group(2),)
        pos = sentenca.end()

print(resultado)

Result:

('gato subiu 10m', 'gato caiu 20m')

You can test here: http://ideone.com/PuQPGH

  • I think this is the closest so far. I have used search1 = re.findall(re.compile('gato.*subiu.*(?=m)'), texto) and search2 = re.findall(re.compile('gato.*caiu.*(?=m)'), texto) and joined the two lists into one. I also thought of walking through the text, using search/group to get the position of the word if it exists and checking whether there is another occurrence before the end of the text; that would take fewer lines of code. Note: text = "The cat climbed on the roof of 10m, a man threw him a stone and he fell from a height of 20m."

  • @Mueladavc Fewer lines of code does not mean more optimized. This code is more efficient than using two findall calls. As I said in the answer, it is impossible to do with a single expression unless you use the regex module.
