How to do data mining in a txt file with re.finditer

5

This code tells me the positions of the words batman and sei throughout the text file:

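import re

# Read the whole file; 'r' is enough because nothing is written back.
with open('C:/pah.txt', 'r') as f:
    text = f.read()

words = ['batman', 'sei']
for x in words:
    for m in re.finditer(x, text):
        # Start and end offsets of the match, followed by the matched word.
        print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))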

How can I make the result also include the sentence in which each word was found?

The file pah.txt contains:

olha o batman. eu sou batman.nao sei.eu sei.Mas e qual será a razão para eu saber? nÃO sei

And the intended result should be:

10-16 olha o batman.

2 answers

5

To get the full sentences, you can do this:

import re

with open('C:/pah.txt', 'r') as f:
    text = f.read()

words = ['batman', 'sei']

for x in words:
    # Split the text on sentence terminators and keep the pieces that contain the word.
    sentences = [sentence for sentence in re.split(r'[.?!]', text) if x in sentence]

    for sentence in sentences:
        print(sentence)

The output looks like this:

olha o batman
 eu sou batman
nao sei
eu sei
Mas e qual será a razão para eu saber
nÃO sei

(I don’t understand what the "10-16" positions in your example mean.)

  • Assuming there are 3 blanks at the beginning of the file, 10-16 refers to the position of the word "batman" relative to the sentence (or to the entire string; it is not clear). That is, b is at position 10 and n at position 15 (16-1).

  • That makes sense. Maybe there are some spaces or line breaks at the beginning of his file. I’ll adapt my code to take this into account.

  • Although yours looked much better, hahah...

  • Thanks for the answers. Yes, 10-16 refers to the position of the beginning and end of the word in relation to the whole string. Thank you again for your reply, Michael.
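The arithmetic in the first comment can be checked with a short sketch. The sample string is an assumption: it takes the first sentence of the expected output and prepends the three blanks the comment hypothesizes.

```python
import re

# Assumed sample: three leading blanks (as hypothesized above) plus the first sentence.
text = "   olha o batman."

m = re.search('batman', text)
span = (m.start(), m.end())

# 3 blanks + len("olha o ") == 10, and "batman" is 6 characters long.
print('%02d-%02d' % span)
```

Under that assumption the printed span is 10-16, which agrees with the expected result in the question.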

3


In this case, you need a regular expression that matches the entire sentence, not just the desired word. What is a sentence?

  • Something that does not contain ., ? or !, and
  • Something that ends with ., ? or !.

So the regular expression that matches any sentence is:

[^.!?]*[.!?]
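As a quick check of what this pattern matches on its own, here is a minimal sketch. The in-memory string is an assumption standing in for the contents of pah.txt.

```python
import re

# Assumed stand-in for the first part of pah.txt.
text = "olha o batman. eu sou batman.nao sei.eu sei."

# Each match is a run of non-terminator characters followed by one terminator.
sentences = re.findall(r'[^.!?]*[.!?]', text)

for s in sentences:
    print(repr(s))
```

Each match keeps its closing punctuation and any leading space, so the second sentence comes back as ' eu sou batman.'.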

And to find a sentence containing the word "batman" you would use:

[^.!?]*?(batman)[^.!?]*[.!?]

The parentheses around "batman" form a capture group, so that you can later tell where in the sentence the word was found. To do this, pass the number of the group you are interested in (1) as the argument to start and end:

for x in words:
    for m in re.finditer('[^.!?]*?(' + x + ')[^.!?]*[.!?]', text):
        # Group 1 is the word; group 0 is the whole sentence.
        print('%02d-%02d: %s' % (m.start(1), m.end(1), m.group(0)))

Output:

07-13: olha o batman.
22-28:  eu sou batman.
33-36: nao sei.
40-43: eu sei.

Note: if what you want is the start and end position of the word relative to the sentence (rather than to the whole string), just subtract the start position of the whole match from the positions of the capture group:

        print('%02d-%02d: %s' % (m.start(1)-m.start(), m.end(1)-m.start(), m.group(0)))

Output:

07-13: olha o batman.
08-14:  eu sou batman.
04-07: nao sei.
03-06: eu sei.
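
Putting both variants together, here is a self-contained sketch that runs without the file. The in-memory string and the helper name find_in_sentences are assumptions made for illustration. The call to re.escape is also an addition, not part of the code above: it guards against search words that happen to contain regex metacharacters.

```python
import re

# Assumed stand-in for the first part of pah.txt.
text = "olha o batman. eu sou batman.nao sei.eu sei."

def find_in_sentences(word, text):
    """Yield (absolute_span, relative_span, sentence) for each sentence containing word."""
    # re.escape keeps special characters in `word` from being read as regex syntax.
    pattern = r'[^.!?]*?(' + re.escape(word) + r')[^.!?]*[.!?]'
    for m in re.finditer(pattern, text):
        absolute = (m.start(1), m.end(1))                       # relative to the whole string
        relative = (m.start(1) - m.start(), m.end(1) - m.start())  # relative to the sentence
        yield absolute, relative, m.group(0)

for word in ['batman', 'sei']:
    for absolute, relative, sentence in find_in_sentences(word, text):
        print('%02d-%02d (in sentence: %02d-%02d): %s' % (absolute + relative + (sentence,)))
```

Note that a final sentence with no closing ., ? or ! would not be matched, because the pattern requires a terminator.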
  • Perfect, I finally understand why [^.!?]*[.!?] is used.
