How to do data mining in a text file with re.finditer

Question

Asked 10 years, 3 months ago

This code tells me the positions of the words batman and sei throughout the text file:

import re

# Read the whole file; the with block closes it afterwards.
with open('C:/pah.txt', 'r') as f:
    text = f.read()

words = ['batman', 'sei']
for x in words:
    for m in re.finditer(x, text):
        print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))

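How do I get the result to include the sentence in which each word was found?

The file pah.txt contains the following (translated here; the original is in Portuguese, which is why the outputs below quote Portuguese text):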

look at the Batman. I’m Batman.I don’t know.But and what will be the reason so I know? I don’t know

And the intended result should be:

10-16 olha o batman.

2 answers

In this case, you need a regular expression that matches the entire sentence, not just the desired word. What is a sentence?

Something that contains no ., ? or !, and
that ends with ., ? or !.

So the regular expression that matches any sentence is:

[^.!?]*[.!?]
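
To see what this pattern captures on its own, here is a minimal sketch. The sample string is an assumption, reconstructed from the outputs quoted in this thread rather than copied from the real pah.txt.

```python
import re

# Assumed sample text (reconstructed from the outputs in this thread,
# not the actual contents of pah.txt).
text = 'olha o batman. eu sou batman.nao sei.eu sei.'

# Zero or more non-terminators, followed by exactly one terminator.
sentence_re = re.compile(r'[^.!?]*[.!?]')

sentences = sentence_re.findall(text)
for s in sentences:
    print(repr(s))
```

Each match keeps its leading whitespace and its terminator. Trailing text that does not end in ., ? or ! is not matched, because the pattern requires a terminator.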

And to find a sentence containing the word "batman", you would use:

[^.!?]*?(batman)[^.!?]*[.!?]

The parentheses around "batman" form a capture group, which lets you find out later where in the sentence the word appeared. To do that, pass the number of the group you are interested in (1) as the argument to start and end:

for x in words:
    # Group 1 is the word; group 0 is the whole sentence around it.
    for m in re.finditer(r'[^.!?]*?(' + x + r')[^.!?]*[.!?]', text):
        print('%02d-%02d: %s' % (m.start(1), m.end(1), m.group(0)))

Output:

07-13: olha o batman.
22-28:  eu sou batman.
33-36: nao sei.
40-43: eu sei.

Note: if what you want is the start and end position of the word relative to the sentence (and not relative to the whole string), just subtract the start of the whole match from the positions of the capture group:

        print('%02d-%02d: %s' % (m.start(1) - m.start(), m.end(1) - m.start(), m.group(0)))

Output:

07-13: olha o batman.
08-14:  eu sou batman.
04-07: nao sei.
03-06: eu sei.
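
The two variants above can be combined into one helper. This is only a sketch under a few assumptions: the name find_in_sentences is hypothetical, the sample string is reconstructed from the outputs above rather than read from pah.txt, and re.escape plus an optional flags argument are additions that the original code does not use.

```python
import re

def find_in_sentences(words, text, flags=0):
    """Yield (word, abs_start, abs_end, rel_start, rel_end, sentence)."""
    for word in words:
        # re.escape keeps characters such as '.' or '+' in a word literal.
        pattern = r'[^.!?]*?(' + re.escape(word) + r')[^.!?]*[.!?]'
        for m in re.finditer(pattern, text, flags):
            yield (word,
                   m.start(1), m.end(1),                           # relative to the whole string
                   m.start(1) - m.start(), m.end(1) - m.start(),   # relative to the sentence
                   m.group(0))

# Assumed sample text, not the actual contents of pah.txt.
text = 'olha o batman. eu sou batman.nao sei.eu sei.'
results = list(find_in_sentences(['batman', 'sei'], text))
for word, a0, a1, r0, r1, sentence in results:
    print('%02d-%02d (in sentence: %02d-%02d): %s' % (a0, a1, r0, r1, sentence))
```

Passing flags=re.IGNORECASE would also match a capitalized "Batman", which the plain pattern does not.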

Perfect, I finally understand why [^.!?]*[.!?] is used.

– João Pontes

2014/03/30 at 15:23

by Michael Siegwarth • **3,427** points · Answer 1 · 2014-03-28T20:58:22+00:00

To get the full sentences, you can do this:

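import re

# Read the whole file; the with block closes it afterwards.
with open('C:/pah.txt', 'r') as f:
    text = f.read()

words = ['batman', 'sei']

for x in words:
    # Split on sentence terminators and keep the pieces that contain the word.
    sentences = [sentence for sentence in re.split(r'[.?!]', text) if x in sentence]

    for sentence in sentences:
        print(sentence)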

The output looks like this:

olha o batman
 eu sou batman
nao sei
eu sei
Mas e qual será a razão para eu saber
nÃO sei

(I don’t understand what the "10-16" positions in your example mean.)
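
One limitation of the substring test (x in sentence) is that it is case-sensitive and also matches inside longer words. A possible variation, sketched here with a hypothetical helper name and an assumed sample string, filters the split sentences with a whole-word, case-insensitive regular expression instead:

```python
import re

def sentences_containing(word, text):
    # Split on sentence terminators; the terminators themselves are dropped.
    sentences = re.split(r'[.?!]', text)
    # \b restricts the match to whole words; re.escape keeps the word literal.
    word_re = re.compile(r'\b' + re.escape(word) + r'\b', re.IGNORECASE)
    return [s.strip() for s in sentences if word_re.search(s)]

# Assumed sample text, not the actual contents of pah.txt.
text = 'olha o Batman. eu sou batman.nao sei.eu sei.Sao seis horas.'
batman_sentences = sentences_containing('batman', text)
sei_sentences = sentences_containing('sei', text)
print(batman_sentences)
print(sei_sentences)
```

With this sample, "Sao seis horas" is skipped for the word sei because seis is a different word, while the plain substring test would have included it.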