Problem with extractall not extracting the exact occurrence

I have the following Dataframe:

df = pd.DataFrame({'Texto' : ['é importante o sucesso', 'o dia está lindo']})

I have two .txt files, palavras_positivas.txt and palavras_negativas.txt. These two files are read into DataFrames as follows:

positive = pd.read_csv('palavras_positivas.txt', encoding='utf-8')
negative = pd.read_csv('palavras_negativas.txt', encoding='utf-8')

Each .txt file has a single column, and each row holds a single word. Each DataFrame is then converted to a list of strings and joined into a search pattern:

lista_positiva = positive['Positivas'].to_list()
pattern_positivo = '|'.join(lista_positiva)

The same process is applied to the negative words.
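To make the resulting pattern concrete, here is a minimal sketch of the steps above. It assumes the file has a header line named Positivas (as the column lookup suggests) and uses an in-memory stand-in with two hypothetical words instead of the real palavras_positivas.txt:

```python
import io
import pandas as pd

# Hypothetical stand-in for palavras_positivas.txt:
# a header line followed by one word per row.
fake_file = io.StringIO("Positivas\nimportante\nsucesso\n")
positive = pd.read_csv(fake_file)

# Drop empty rows and stray whitespace before building the pattern.
lista_positiva = positive['Positivas'].dropna().astype(str).str.strip().to_list()

# The pattern is a plain alternation with no whole-word constraint.
pattern_positivo = '|'.join(lista_positiva)
print(pattern_positivo)  # importante|sucesso
```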

I then use pattern_positivo and pattern_negativo in the following code:

df3['Positivos'] = df3['Texto'].str.extractall(f"({pattern_positivo})").groupby(level=0).agg(','.join)

df3['Negativos'] = df3['Texto'].str.extractall(f"({pattern_negativo})").groupby(level=0).agg(','.join)

My problem is this:

In the text "é importante o sucesso", I can get the words "sucesso" and "importante" because they are in the positive list. In the negative column, however, I am extracting the word "impor", because that word exists in the negative list: for some reason the code reads "importante" and returns "impor".
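A minimal reproduction of this behavior, assuming the negative list contains only 'mal' and 'impor' (hypothetical stand-ins for the real file contents):

```python
import pandas as pd

df = pd.DataFrame({'Texto': ['é importante o sucesso', 'o dia está lindo']})

# Hypothetical stand-in for the contents of palavras_negativas.txt
lista_negativa = ['mal', 'impor']
pattern_negativo = '|'.join(lista_negativa)

# With no word boundaries, the alternation can match inside a longer word,
# so "impor" is found at the start of "importante".
matches = df['Texto'].str.extractall(f"({pattern_negativo})")
found = matches[0].tolist()
print(found)  # ['impor']
```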

My goal is to extract only exact (whole-word) occurrences in both extractall calls.

Can someone help me?

  • Good evening! Could I ask what format your txt files are in?

  • Txts are single words arranged in a single column.

  • Which delimiter?

  • After converting the txt files to a string, the delimiter is a pipe: '|'.join(lista_positiva)

  • Are you using regex? Which one?

  • In this case it is returning whatever it finds; the pattern does not state that it should match only whole words.

  • That's right. I need to change the code so it returns only exact occurrences, comparing the text against the list. This is where I need help.
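The whole-word requirement raised in these comments can be sketched with regex word boundaries (\b). This is only one possible approach, not the asker's code: it swaps extractall for Series.str.findall, which returns one list per row and leaves an empty string for rows with no match. The word lists and the helper name extract_whole_words are hypothetical.

```python
import re
import pandas as pd

df = pd.DataFrame({'Texto': ['é importante o sucesso', 'o dia está lindo']})

# Hypothetical stand-ins for the two word files
lista_positiva = ['importante', 'sucesso']
lista_negativa = ['mal', 'impor']


def extract_whole_words(series, words):
    """Return the comma-joined whole-word matches for each row (hypothetical helper)."""
    # re.escape guards against words containing regex metacharacters.
    escaped = '|'.join(re.escape(w) for w in words)
    # \b on both sides requires the match to be a complete word.
    pattern = rf"\b(?:{escaped})\b"
    return series.str.findall(pattern).str.join(',')


df['Positivos'] = extract_whole_words(df['Texto'], lista_positiva)
df['Negativos'] = extract_whole_words(df['Texto'], lista_negativa)
print(df)
```

With these lists, "impor" no longer matches inside "importante", so the Negativos column stays empty. Matching is case-sensitive here; passing flags=re.IGNORECASE to findall would relax that.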


1 answer

So what you want is to compare whole words rather than substrings. Try it like this:

import pandas as pd


data = {'Frases': ['Bom dia', 'Ok, vai lá', 'o sucesso é importante']}
df = pd.DataFrame(data)

negativ = ['mal', 'impor']
positiv = ['bom', 'ok', 'importante', 'sucesso']


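def getByWordlist(row, wordlist):
    # Collect every word of the sentence that appears in wordlist
    results = []
    for word in row['Frases'].split(' '):
        # Normalize: lowercase and drop commas before comparing
        word = word.lower().replace(',', '')
        # Exact membership test, so substrings such as 'impor' do not match
        if word in wordlist:
            results.append(word)
    return results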

df['Positivas'] = df.apply(getByWordlist, args=(positiv,), axis=1)
df['Negativas'] = df.apply(getByWordlist, args=(negativ,), axis=1)

print(df)

output:

                   Frases              Positivas Negativas
0                 Bom dia                  [bom]        []
1              Ok, vai lá                   [ok]        []
2  o sucesso é importante  [sucesso, importante]        []

  • It worked perfectly! I just need to tweak it so that [ ] doesn't show up in the DataFrame. Thank you very much.

  • Just use return ','.join(results) in getByWordlist to return a string instead of a list.
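The suggestion in the last comment can be sketched as follows, reusing the answer's example data. The only change to getByWordlist is the return line:

```python
import pandas as pd

data = {'Frases': ['Bom dia', 'Ok, vai lá', 'o sucesso é importante']}
df = pd.DataFrame(data)

negativ = ['mal', 'impor']
positiv = ['bom', 'ok', 'importante', 'sucesso']


def getByWordlist(row, wordlist):
    results = []
    for word in row['Frases'].split(' '):
        word = word.lower().replace(',', '')
        if word in wordlist:
            results.append(word)
    # Join into a single string so no brackets appear in the DataFrame
    return ','.join(results)


df['Positivas'] = df.apply(getByWordlist, args=(positiv,), axis=1)
df['Negativas'] = df.apply(getByWordlist, args=(negativ,), axis=1)
print(df)
```

Rows with no match now hold an empty string rather than an empty list.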
