Count DataFrame rows containing a string according to its position in the text (Python)

I have a dataframe with a text column, as follows:

import pandas as pd

df = pd.DataFrame([["1", "texto com PALAVRA frAse PARAGRAFO", True, "foo"],
                   ["2", "texto com palavra Paragrafo", False, "foo"],
                   ["3", "texto com Frase paragrafo", True, "foo"],
                   ["4", "texto com FRASE", True, "foo"],
                   ["5", "texto com frase", True, "foo"],
                   ["6", "frase", True, "foo"],
                   ["7", "texto", False, "foo"],
                   ["8", "texto com paRAgrafo", False, "foo"]],
                  columns=["id", "texto", "col1", "col2"])

I want to count the rows in which, for example, "frase" (phrase) occurs before "palavra" (word) or "paragrafo" (paragraph). In other words, whichever keyword appears first is the most relevant in context; if these were sets, the intersections would be disregarded, giving:

  • rows with "palavra" first: 2

  • rows with "frase" first: 4

  • rows with "paragrafo" first: 1

There may be other text before the keywords (for example, "texto" and "com" are irrelevant), and of course the count must ignore upper and lower case.

str.contains does not take the position into account:

df_filter = df[df['texto'].str.contains('frase',case=False)]
len(df_filter)

5

How can I do this, then?

1 answer

From what I understand of your question, you want to check the order in which the words occur in the text. I will assume you already know which words you want to check.

I made an example that can help you build the logic that best fits your code:

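frase = 'Frase para testar'

def verificar_qual_palavra_vem_antes(frase, p1, p2):
    # Lowercase the text and both words so the comparison ignores case
    frase = frase.lower()
    p1, p2 = p1.lower(), p2.lower()

    # Split the text into a list of words
    frase_partida = frase.split()

    # Position of each word in the list
    # (list.index raises ValueError if the word is not in the text)
    posição_palavra_um = frase_partida.index(p1)
    posição_palavra_dois = frase_partida.index(p2)

    if posição_palavra_um < posição_palavra_dois:
        return 'a palavra 1 veio primeiro'  # "word 1 came first"

    else:
        return 'a palavra 2 veio primeiro'  # "word 2 came first"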

Explaining the code

It works in a simple way: the function receives the text and two words (word 1 and word 2) and compares their positions.

First, the function calls the built-in string method lower(), which converts the entire string to lowercase.

Next, split() breaks the text into a list of words.

Finally, index() looks up the position of each word in that list. Note that index() raises ValueError when a word is missing from the text.

If the index of word 1 is lower than the index of word 2, word 1 occurs first in the text; otherwise word 2 does.
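To get the counts asked for in the question, the same idea can be extended to any number of keywords. The following is only a sketch and not part of the original answer: it swaps split() and index() for a regular expression, since a left-to-right search returns the keyword that occurs earliest and simply finds nothing when no keyword is present. The names PALAVRAS_CHAVE, PADRAO, primeira_palavra_chave, textos and contagem are hypothetical, and the texts are copied from the "texto" column of the question.

```python
import re
from collections import Counter

# Hypothetical names; the keywords are the ones mentioned in the question.
PALAVRAS_CHAVE = ["palavra", "frase", "paragrafo"]

# One pattern matching any keyword as a whole word, ignoring case.
PADRAO = re.compile(
    r"\b(" + "|".join(map(re.escape, PALAVRAS_CHAVE)) + r")\b",
    re.IGNORECASE,
)

def primeira_palavra_chave(texto):
    """Return the keyword that occurs first in texto (lowercased), or None."""
    m = PADRAO.search(texto)  # search() returns the leftmost match
    return m.group(1).lower() if m else None

# Values of the "texto" column from the question.
textos = [
    "texto com PALAVRA frAse PARAGRAFO",
    "texto com palavra Paragrafo",
    "texto com Frase paragrafo",
    "texto com FRASE",
    "texto com frase",
    "frase",
    "texto",
    "texto com paRAgrafo",
]

# Count each row once, under the keyword that appears first in it.
contagem = Counter(
    k for k in map(primeira_palavra_chave, textos) if k is not None
)
print(contagem)  # frase: 4, palavra: 2, paragrafo: 1
```

With the DataFrame from the question, applying the function to the column, for example df["texto"].map(primeira_palavra_chave).value_counts(), should give the same counts, because value_counts() skips missing values by default.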
