In this case, you need a regular expression that matches the entire sentence, not just the desired word. What is a sentence?
- Something that does not contain ".", "?" or "!"
and:
- Something that ends with ".", "?" or "!".
So the regular expression that matches any sentence is:
[^.!?]*[.!?]
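To see what this pattern actually captures, here is a minimal sketch. The sample text is made up for illustration and is not the one from the question.

```python
import re

# Pattern from the answer: a run of non-terminators followed by one terminator.
SENTENCE = re.compile(r'[^.!?]*[.!?]')

# Hypothetical sample text, not the one from the question.
sample = "olha o batman. eu sou batman! nao sei?"

sentences = [m.group(0) for m in SENTENCE.finditer(sample)]
print(sentences)
# ['olha o batman.', ' eu sou batman!', ' nao sei?']
```

Note that every sentence after the first keeps its leading space, because a space is not one of the excluded characters. This matters when you later compute positions relative to the sentence.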
And to find a sentence containing the word "batman" you would use:
[^.!?]*?(batman)[^.!?]*[.!?]
The parentheses around "batman" form a capture group, so that you can tell later where the matched word appeared. To do that, just pass the number of the group you are interested in (1) as the argument to start and end:
import re

# words and text are the variables from the question
for x in words:
    for m in re.finditer('[^.!?]*?(' + x + ')[^.!?]*[.!?]', text):
        print('%02d-%02d: %s' % (m.start(1), m.end(1), m.group(0)))
Output:
07-13: olha o batman.
22-28: eu sou batman.
33-36: nao sei.
40-43: eu sei.
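For reference, the same loop can be wrapped in a self-contained function. This is only a sketch: the function name, the sample text, and the use of re.escape (so that words containing regex metacharacters are matched literally) are my assumptions and not part of the original code.

```python
import re

def find_word_in_sentences(words, text):
    """Yield (word_start, word_end, sentence) for each sentence containing a word.

    Positions are relative to the whole text, as with m.start(1) and m.end(1).
    """
    for word in words:
        # re.escape is an added precaution: it treats the word as literal text.
        pattern = re.compile(r'[^.!?]*?(' + re.escape(word) + r')[^.!?]*[.!?]')
        for m in pattern.finditer(text):
            yield m.start(1), m.end(1), m.group(0)

# Hypothetical sample text, not the one from the question.
sample_text = "olha o batman. eu sou batman. nao sei."
results = list(find_word_in_sentences(["batman", "sei"], sample_text))
for start, end, sentence in results:
    print('%02d-%02d: %s' % (start, end, sentence))
```

The end position follows Python's slicing convention, so the last character of the word is at end - 1.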
Note: if what you want is the start and end position of the word relative to the sentence (and not relative to the whole string), just subtract the start position of the whole match from the positions of the capture group:
print('%02d-%02d: %s' % (m.start(1)-m.start(), m.end(1)-m.start(), m.group(0)))
Output:
07-13: olha o batman.
08-14: eu sou batman.
04-07: nao sei.
03-06: eu sei.
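Because the match includes any whitespace before the sentence, the relative position counts that whitespace too (which is why the second sentence above reports 08-14 rather than 07-13). A minimal sketch of both variants follows; the helper name, its skip_leading_whitespace flag, and the sample text are hypothetical.

```python
import re

def relative_span(m, skip_leading_whitespace=False):
    """Return the span of group 1 relative to the start of the whole match."""
    base = m.start()
    if skip_leading_whitespace:
        # Move the base past any whitespace at the start of the matched sentence.
        sentence = m.group(0)
        base += len(sentence) - len(sentence.lstrip())
    return m.start(1) - base, m.end(1) - base

# Hypothetical sample text, not the one from the question.
sample_text = "olha o batman. eu sou batman."
pattern = re.compile(r'[^.!?]*?(batman)[^.!?]*[.!?]')
matches = list(pattern.finditer(sample_text))

spans = [relative_span(m) for m in matches]
trimmed = [relative_span(m, skip_leading_whitespace=True) for m in matches]
print(spans)    # [(7, 13), (8, 14)]
print(trimmed)  # [(7, 13), (7, 13)]
```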
Assuming there are 3 blanks at the beginning of the file, 10-16 refers to the position of the word "Batman" relative to the sentence (or to the entire string, it is not clear). That is, b is at position 10 and n at position 15 (16-1).
– mgibsonbr
It makes sense. Maybe there are some spaces or line breaks at the beginning of his file. I’ll adapt my code to account for this.
– Michael Siegwarth
Although yours looked much better, hahah...
– Michael Siegwarth
Thanks for the answers. Yes, 10-16 refers to the position of the beginning and end of the word in relation to the whole string. Thank you again for your reply, Michael.
– João Pontes