How do I search for a word in a file and return the sentence that contains it along with the result?

Viewed 1,097 times

2

I need to search for words in a book file (Vidas Secas.txt), but I want the program to also return the sentence where each word appears. Is there a way to do that?

Here is my program:

arquivo =open('Vidas Secas.txt','r')
quantidade = 0
palavras =[]
y = 0
for linha in arquivo :
    linha = linha.rstrip()
    linha = linha.lstrip()
    #print('Linha original ', linha)
    p = linha.split(' ')
    #print('linha processada ', p)
    palavras = palavras + p

#print(palavras)

arquivo.close()
x = input ('Digite a palavra desejada: ')
for palavra in palavras:
    if x == palavra:
        quantidade = quantidade + 1
    if palavra[0:len(x)] == x:
        quantidade = quantidade + 1
print(quantidade)

arquivo.close()
  • In order not to have to read the whole file all the time, you would have to save all the sentences - from the beginning to the end of each one - and all the words contained in that sentence. Many words would be repeated between sentences. So when the user typed the word you would search every word of every sentence and show every sentence where that word is.

  • Could you explain it better, please? I don't understand it very well; I'm a beginner in Python and I have to present a project tomorrow. My part is to write a program that, when searching for a word, returns the sentence where the word is. As a test I did this: x='This is a beautiful house'. words=x.split(' ') To test words[3] (beautiful): i=3; j=i+1 (j advances through the characters until it finds a .); k=i-1 (k goes backwards until it finds a capital letter). Then I used len: n=len(words[j]) and words[j][n+1]. I stopped there and couldn't go any further. It gives an error message.

2 answers

1

You need to divide the text into sentences. The example below uses the period (.) as the sentence delimiter:

with open('Vidas Secas.txt') as arquivo:
    texto = arquivo.read()

frases = [frase.strip() for frase in texto.split('.')]
pesquisa = input('Digite a palavra desejada: ')

for frase in frases:
    if pesquisa in frase.split():
        print(frase)

0


The ideal solution depends a lot on your definition of "phrase" and "word".

The simplest (and naive, and probably wrong) approach is to consider that each line is a sentence and all words are separated by spaces:

# word to search for
busca = input('Digite a palavra desejada: ')

with open('Vidas secas.txt', 'r', encoding = 'utf-8') as arquivo:
    for linha in arquivo: # assumes each line of the file is a sentence
        # if the word is in the line, print the whole line ("sentence")
        if busca in linha.strip().split():
            print(linha)

The method strip() removes spaces and line breaks from the beginning and end of the line, and split() breaks the line into parts wherever there is whitespace. The result is a list of words, so it is enough to check whether the searched word is in this list.

Another detail is the use of with, which closes the file automatically, so you don't have to worry about it. I also passed the encoding as a parameter, which you can change according to the encoding the file is in (see the documentation for more details).


But this approach, as I said, is oversimplified and quite error-prone. See for example this excerpt from the book Vidas Secas:

– Anda, excomungado.

This sentence has 2 words: "Anda" and "excomungado". But if I use split():

print('– Anda, excomungado.'.split())

The result is the following list:

['–', 'Anda,', 'excomungado.']

The dash was considered a word, and the other two words are "Anda," (with the comma at the end) and "excomungado." (with the period at the end). That is, if you search for just "Anda" (without the comma), this line will not be found.

So the solution is to use more sophisticated criteria, such as regular expressions (regex). For this you can use the module re. A first attempt would be to use \b (which marks a "word boundary", that is, the beginning or end of a word) and \w+ (one or more alphanumeric characters) to represent the "word". (\w is not the best option, but for now we will use it to keep the expressions shorter; later I explain its problems and give an alternative.)

I also put \w+ in parentheses to form a capture group, because then the word found is available through the method group, as in the example below:

import re

regex_palavras = re.compile(r'\b(\w+)\b')
# word to search for
busca = input('Digite a palavra desejada: ')

with open('Vidas secas.txt', 'r', encoding = 'utf-8') as arquivo:
    for linha in arquivo:  # assumes each line of the file is a sentence
        # find the words in the line, using the regex
        for match in regex_palavras.finditer(linha.strip()):
            if busca == match.group(1): # group(1) holds the word
                print(linha)
                break

Now you can search for "Anda" (without the comma) and the sentence "– Anda, excomungado." will be found, because the comma is disregarded.
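To make the difference concrete, here is a small sketch that compares split() with the \b(\w+)\b regex on the example line from the book. The variable names are only illustrative.

```python
import re

linha = '– Anda, excomungado.'

# split() keeps the punctuation attached to the words
palavras_split = linha.split()

# the regex extracts only the runs of \w characters
palavras_regex = re.findall(r'\b(\w+)\b', linha)

print(palavras_split)
print(palavras_regex)

# the plain word is only found in the regex result
print('Anda' in palavras_split, 'Anda' in palavras_regex)
```

With a single capture group in the pattern, re.findall returns the contents of that group for each match, which is why the result is a plain list of words.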

I also added a break to interrupt the for match ... loop, because if the word occurs twice in the same sentence I don't need to print it twice. Once I've found the word, I print the sentence and no longer need to check the rest of it.


The problem with this approach is that \w also accepts digits * and the character _. That is, 123 and a_b will be considered words. Of course, if you are "absolutely certain" that these cases do not occur (or if this is only an exercise, or there is any other reason why these cases do not "need" to be handled), we could even stop here (see the alternative to \w further down).

* In Python 3, the digits accepted by \w are any character in the Unicode category "Number, Decimal Digit", which includes characters such as ٠١٢٣٤٥٦٧٨٩, among others. In Python 2, by default only the digits 0 to 9 are recognized; to recognize the other digits, you need to enable the UNICODE option.
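A quick way to check this behavior yourself is the sketch below. The sample strings are made up for illustration, and the last line uses the re.ASCII flag only to show the contrast, not as a recommendation.

```python
import re

regex_palavras = re.compile(r'\b(\w+)\b')

# digits and underscores are matched by \w, so they count as "words"
com_digitos = regex_palavras.findall('abc 123 a_b')
print(com_digitos)

# with str patterns, Python 3 also treats non-ASCII decimal digits as \w
# ('\u0661\u0662\u0663' are the Arabic-Indic digits 1, 2 and 3)
digitos_arabes = regex_palavras.findall('\u0661\u0662\u0663')
print(digitos_arabes)

# re.ASCII restricts \w to [a-zA-Z0-9_]; note that this also excludes
# accented letters, so it is not a fix for Portuguese text
so_ascii = re.findall(r'\w+', 'água 123', flags=re.ASCII)
print(so_ascii)
```

The last result shows why simply switching to ASCII mode does not help here: the accented letter is dropped from the word.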

But there are other details you may need to take into account. For example, does a hyphenated form such as "arrastou-se" ("dragged on") have one word or two? The code above considers it to be two ("arrastou" and "se").

What about "guarda-roupa" ("wardrobe"): do you consider it one word or two? Maybe the hyphen should also be considered part of a word, as long as it has at least one letter before and after it, right? And do not forget the apostrophe, as in "gota-d'água" ("drop of water"), which also has to be included in the list of "characters that form a word". With the code above, these are treated as more than one word ("guarda" and "roupa"; "gota", "d" and "água").

(The book probably does not contain these exact words, but there may be other compound words and/or words with apostrophes, and you should decide whether to treat them as a single word or not.)

A first attempt to cover these cases, considering that compound words count as just one word:

regex_palavras = re.compile(r"\b((?:\w+(?:'\w+)?)(?:-(?:\w+(?:'\w+)?))*)\b")

Basically, (?: defines a non-capturing group: its parentheses are not made available through the method group. This avoids creating groups I'm not interested in (I only need the first, which contains the whole word).

Then I use '\w+ to define "an apostrophe followed by one or more alphanumeric characters", and place everything in parentheses, followed by ?, which makes this part optional. I do the same for the hyphen, which may be followed by more alphanumeric characters (which may in turn contain an apostrophe). The difference is that the hyphenated part is followed by *, which means "zero or more occurrences": the part after the hyphen may be absent, and there may be more than one hyphen (as in "água-de-colônia").

With this, we consider simple and compound words, with or without apostrophe, as if they were one thing.
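As an illustration, the sketch below runs the simple regex and the compound-word regex over the same made-up sentence (the sentence is an assumption, not taken from the book).

```python
import re

regex_simples = re.compile(r"\b(\w+)\b")
regex_palavras = re.compile(r"\b((?:\w+(?:'\w+)?)(?:-(?:\w+(?:'\w+)?))*)\b")

frase = "O guarda-roupa e a gota-d'água"

# the simple regex breaks compound words at the hyphen and the apostrophe
palavras_simples = [m.group(1) for m in regex_simples.finditer(frase)]

# the extended regex keeps each compound word in one piece
palavras_compostas = [m.group(1) for m in regex_palavras.finditer(frase)]

print(palavras_simples)
print(palavras_compostas)
```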

Despite this, this regex still has the problem of accepting digits and _. If you want to be more specific about which characters a word can contain, replace \w with something like [a-záéíóúâêîôûãõç], which accepts only the letters "a" to "z", those accented characters and the cedilla. Add more characters inside the brackets if you need to (for Spanish texts, for example, you would need to add ñ).

Another detail is that \w matches both uppercase and lowercase letters, but the alternative [a-záéíóúâêîôûãõç] only matches lowercase. So be sure to set the IGNORECASE option so that capital letters are also matched:

regex_palavras = re.compile(.... , flags = re.IGNORECASE)
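One possible way to write the full expression with the explicit letter class is sketched below. The names LETRA and padrao are hypothetical helpers introduced only to avoid repeating the class four times, and the sample text is made up.

```python
import re

# assumption: this class covers the letters you need; extend it as required
LETRA = "[a-záéíóúâêîôûãõç]"

# same structure as the compound-word regex, with \w replaced by LETRA
padrao = rf"\b((?:{LETRA}+(?:'{LETRA}+)?)(?:-(?:{LETRA}+(?:'{LETRA}+)?))*)\b"
regex_palavras = re.compile(padrao, flags=re.IGNORECASE)

texto = "Água 123 a_b guarda-roupa"
encontradas = [m.group(1) for m in regex_palavras.finditer(texto)]
print(encontradas)
```

Note that \b is still defined in terms of \w, so in a_b there is no word boundary next to the underscore and neither letter is matched on its own.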

If you want, you can also replace the + that comes after \w with a specific quantity. + means "one or more occurrences", which means that the articles "a" and "o" are also considered words. If you only want words with two letters or more, for example, replace the + with {2,}. You can also set a maximum size if you want: {2,20}, for example, accepts between 2 and 20 characters.

Remember that these limits apply to each specific section. For example:

\w{2,20}(?:'\w{2,20})?

This means "between 2 and 20 alphanumeric characters", optionally followed by an apostrophe and more alphanumeric characters (again between 2 and 20). That is, the word can have up to 40 alphanumeric characters (20 before the apostrophe and 20 after).
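The effect of the quantifiers can be seen in this short sketch, which uses a made-up sentence and the simple \w version of the regex to keep it readable.

```python
import re

texto = "a casa e o rio"

# + accepts one-letter words such as the articles "a" and "o"
uma_ou_mais = re.findall(r"\b\w+\b", texto)

# {2,} requires at least two characters
duas_ou_mais = re.findall(r"\b\w{2,}\b", texto)

# {2,3} also sets a maximum, so the four-letter word is rejected
duas_a_tres = re.findall(r"\b\w{2,3}\b", texto)

print(uma_ou_mais)
print(duas_ou_mais)
print(duas_a_tres)
```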

Finally, choose your definition of "word" and change the regex according to what you need.


Another point to consider is that there may be more than one sentence on the same line. We can consider that sentences are delimited by punctuation marks (period, exclamation mark and question mark), for example. Then, for each line, it is enough to make this separation before checking the words:

regex_frases = re.compile(r"[.!?]+")

with open('Vidas secas.txt', 'r', encoding = 'utf-8') as arquivo:
    for linha in arquivo:
        # a line can contain several sentences
        for frase in regex_frases.split(linha.strip()):
            for match in regex_palavras.finditer(frase):
                if busca == match.group(1):
                    print(frase)
                    break

As the criterion for separating sentences I used the regex [.!?]+: one or more occurrences of any of the punctuation marks (period, exclamation mark or question mark). This already covers sentences that end with ... and ?!, for example.
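Here is a minimal sketch of that split on a made-up line (the sample text is an assumption). Since split() returns an empty string after the final punctuation mark, the sketch also filters out empty pieces.

```python
import re

regex_frases = re.compile(r"[.!?]+")

linha = "Que foi?! Nada... Vamos embora."

# strip each piece and drop the empty string left after the last period
frases = [f.strip() for f in regex_frases.split(linha) if f.strip()]
print(frases)
```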

But of course that doesn’t solve everything. In this passage:

The drought appeared to him as a necessary fact – and the child's obstinacy irritated him.

Is this two sentences ("The drought appeared to him as a necessary fact" and "and the child's obstinacy irritated him") or just one? The solution above treats it as a single sentence.

If you want to consider that there are two, we cannot simply include the hyphen as a separator, because then hyphenated words (in the Portuguese original, "appeared to him" is a single hyphenated form) would also be broken into two "sentences". We can then use the criterion "punctuation marks, or a dash with a space before and after", for example:

regex_frases = re.compile(r"(?: \– )|[.!?]+")

with open('Vidas secas.txt', 'r', encoding = 'utf-8') as arquivo:
    for linha in arquivo:
        # a line can contain several sentences
        for frase in regex_frases.split(linha.strip()):
            for match in regex_palavras.finditer(frase):
                if busca == match.group(1):
                    print(frase)
                    break

In this case, the passage above is split into two sentences. If I search for "fact", for example, the result is the sentence "The drought appeared to him as a necessary fact" (the content after the dash is considered another sentence).
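The sketch below compares the two criteria on a made-up line that contains both a hyphenated word and a spaced dash. The sample sentence and the helper separar are assumptions for illustration, and the pattern is written without the backslash before the dash, which is not needed.

```python
import re

so_pontuacao = re.compile(r"[.!?]+")
com_travessao = re.compile(r" – |[.!?]+")

linha = "Ele calou-se – e ninguém respondeu."


def separar(regex, texto):
    """Split texto with regex, dropping empty pieces."""
    return [p.strip() for p in regex.split(texto) if p.strip()]


# only punctuation: the whole line is a single sentence
print(separar(so_pontuacao, linha))

# spaced dash as separator: two sentences, and "calou-se" stays intact
print(separar(com_travessao, linha))
```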

In the end, the more different cases appear, the more complex the regex becomes. Decide what will count as a "sentence" and change the code according to your decisions.


The previous solution assumes that no sentence extends over more than one line. But that can happen (a sentence starts on one line and ends on another; who knows how the file was created), so the way out is to read the entire contents of the file into a single variable and then apply the regular expressions to that content:

with open('Vidas secas.txt', 'r', encoding = 'utf-8') as arquivo:
    # the entire contents of the file in a single variable
    conteudo = arquivo.read()

# split the contents into sentences
for frase in regex_frases.split(conteudo):
    for match in regex_palavras.finditer(frase):
        if busca == match.group(1):
            print(frase)
            break

There are other possible improvements, such as storing the list of sentences in a variable, so you don't have to read the file every time (assuming several searches are made in the same run):

# store the list of sentences, so the file does not have to be read again
frases = regex_frases.split(conteudo)

for frase in frases:
    for match in regex_palavras.finditer(frase):
        if busca == match.group(1):
            print(frase)
            break

And if you want a case-insensitive search (no difference between upper and lower case letters), you can use the method casefold(), which returns a version of the string suited to this kind of comparison. In this case, just change the if in the previous examples to:

if busca.casefold() == match.group(1).casefold():
    # word found (ignoring upper/lower case)

The method casefold() was introduced in Python 3.3. If your version is older, the alternative is to use the method lower().
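Putting the pieces together, a minimal sketch of the whole search could look like the following. The function name buscar_frases and the sample text are assumptions made for illustration (only the first sentence comes from the book); in a real program conteudo would come from arquivo.read().

```python
import re

regex_frases = re.compile(r"[.!?]+")
regex_palavras = re.compile(r"\b((?:\w+(?:'\w+)?)(?:-(?:\w+(?:'\w+)?))*)\b")


def buscar_frases(conteudo, busca):
    """Return the sentences of conteudo that contain the word busca."""
    alvo = busca.casefold()
    encontradas = []
    for frase in regex_frases.split(conteudo):
        # compare every word of the sentence case-insensitively
        palavras = (m.group(1).casefold() for m in regex_palavras.finditer(frase))
        if alvo in palavras:
            encontradas.append(frase.strip())
    return encontradas


# sample text with a sentence that spans two lines
conteudo = "– Anda, excomungado. O menino\nnão se mexeu. Anda logo!"

for frase in buscar_frases(conteudo, "anda"):
    print(frase)
```

Returning a list instead of printing inside the function makes it easy to run several searches over the same conteudo without reading the file again.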

  • Hello, thank you so much. I didn't completely understand what you meant, but I'll ask my teacher. I have been learning Python since the beginning of the year, because I'm in the first year of a computer technician course integrated with high school. Thank you! But I copied your code, and when I run the file in cmd with Python 3.7 I get the message: File "Activity 1.py", line 6 for frases = regex_frases.split(conteudo) SyntaxError: invalid syntax. What do I do?

  • @Arthurvidal I suggest that you edit your question and add the full code that generates this message, because a syntax error usually means some detail is missing in the program (which is hard to see with "loose" code in the comments)

  • @Arthurvidal I think I understand: you wrote for frases = regex_... but it should be either frases = regex_... (followed by a separate for) or for frases in regex_...

  • I don't understand. Where exactly do I put this command? On line 6, where the for frases = regex_frases.split(conteudo) is?

  • My code looks like this: with open('Vidas secas.txt', 'r', encoding = 'utf-8') as arquivo: # all the contents of the file in a single variable conteudo = arquivo.read() # separating the contents into sentences for frases = regex_frases.split(conteudo) for frase in frases: for match in ... if busca == match.group(1): print(frase) break

  • @Arthurvidal That would be: https://ideone.com/1lUEeR
