I have a dataframe with a text column, as follows:
import pandas as pd
df = pd.DataFrame([["1", "texto com PALAVRA frAse PARAGRAFO", True, "foo"],
                   ["2", "texto com palavra Paragrafo", False, "foo"],
                   ["3", "texto com Frase paragrafo", True, "foo"],
                   ["4", "texto com FRASE", True, "foo"],
                   ["5", "texto com frase", True, "foo"],
                   ["6", "frase", True, "foo"],
                   ["7", "texto", False, "foo"],
                   ["8", "texto com paRAgrafo", False, "foo"]],
                  columns=["id", "texto", "col1", "col2"])
I want to count the rows according to which keyword appears first in the text, for example the rows in which "frase" occurs before "palavra" or "paragrafo". The keyword that appears first is the most relevant in context, so each row should be counted only once: treating the matches as sets, the intersections are disregarded. The expected result is:
rows where "palavra" comes first: 2
rows where "frase" comes first: 4
rows where "paragrafo" comes first: 1
There may be other text before the keyword (for example, "texto" and "com" do not matter), and the count must be case-insensitive.
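To make the counting rule concrete, here is a minimal sketch of one way it could be expressed. It assumes a single case-insensitive regex alternation is acceptable, since a regex search returns the leftmost match. The names `keywords`, `pattern`, `first`, and `counts` are illustrative only, and the frame is reduced to the "texto" column for brevity.

```python
import re
import pandas as pd

# Same texts as in the dataframe above, reduced to the relevant column.
df = pd.DataFrame({"texto": ["texto com PALAVRA frAse PARAGRAFO",
                             "texto com palavra Paragrafo",
                             "texto com Frase paragrafo",
                             "texto com FRASE",
                             "texto com frase",
                             "frase",
                             "texto",
                             "texto com paRAgrafo"]})

# Hypothetical keyword list; escaped so each one is matched literally.
keywords = ["palavra", "frase", "paragrafo"]
pattern = "(" + "|".join(re.escape(k) for k in keywords) + ")"

# str.extract keeps only the first (leftmost) match in each row;
# rows without any keyword become NaN.
first = df["texto"].str.extract(pattern, flags=re.IGNORECASE, expand=False)
first = first.str.lower()

# value_counts ignores NaN, so rows without a keyword are not counted.
counts = first.value_counts()
print(counts)
```

Note that this sketch matches substrings, so a keyword embedded in a longer word would also count; adding `\b` word boundaries around the group would be one way to restrict it to whole words.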
str.contains does not take the position into account, so it also counts the row with id "1", where "palavra" comes first:
df_filter = df[df['texto'].str.contains('frase',case=False)]
len(df_filter)
5
How can I count the rows by the keyword that appears first?