Problem with Extractall not extracting exact occurrence

Question

Asked 4 years, 10 months ago

I have the following DataFrame:

df = pd.DataFrame({'Texto' : ['é importante o sucesso', 'o dia está lindo']})

I have two .txt files, palavras_positivas.txt and palavras_negativas.txt. Both files are read into DataFrames as follows:

positive = pd.read_csv('palavras_positivas.txt', encoding='utf-8')
negative = pd.read_csv('palavras_negativas.txt', encoding='utf-8')

Each .txt file consists of a single column with one word per row. Each column is then converted to a list and joined into a single pattern string used for the search:

lista_positiva = positive['Positivas'].to_list()
pattern_positivo = '|'.join(lista_positiva)

The same process is applied to the negative words.

I then use pattern_positivo and pattern_negativo in the following code:

df3['Positivos'] = df3['Texto'].str.extractall(f"({pattern_positivo})").groupby(level=0).agg(','.join)

df3['Negativos'] = df3['Texto'].str.extractall(f"({pattern_negativo})").groupby(level=0).agg(','.join)

My problem is this:

In the text "é importante o sucesso" ("success is important"), I correctly get the words "sucesso" and "importante" because they are on the positive list. In the negative column, however, I am getting the word "impor" ("impose"), because that word exists in the negative list: the code reads "importante" and returns "impor".
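
A minimal sketch that reproduces this behaviour, assuming the negative list contains the Portuguese word 'impor' (the real contents of palavras_negativas.txt are not shown, so the list below is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'Texto': ['é importante o sucesso', 'o dia está lindo']})

# Hypothetical list contents; the real palavras_negativas.txt is not shown.
lista_negativa = ['mal', 'impor']
pattern_negativo = '|'.join(lista_negativa)

# The plain alternation matches anywhere in the text, including inside
# a longer word such as 'importante'.
matches = df['Texto'].str.extractall(f"({pattern_negativo})")
negativos = matches[0].groupby(level=0).agg(','.join)
print(negativos)
```

Only the first row produces a match, and that match is the substring 'impor' found inside 'importante'.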

My goal is to extract only exact (whole-word) occurrences in both extractall calls.

Can someone help me?

Good evening! Could I ask what the format of your txt files is?

– lmonferrari

2020/09/29 at 01:22
Txts are single words arranged in a single column.

– StatsPy

2020/09/29 at 01:27
Which delimiter?

– Ewerton Belo

2020/09/29 at 13:42
After converting the txt contents to a string, the delimiter is a pipe: '|'.join(lista_positiva)

– StatsPy

2020/09/29 at 13:47
Are you using regex? Which one?

– Ewerton Belo

2020/09/29 at 13:48
In this case it returns whatever it finds; nothing in the pattern says it should match only when the whole word is equal.

– Ewerton Belo

2020/09/29 at 13:50
Exactly. I need to change the code so that it returns only exact occurrences when comparing the text with the list. That is where I need help.

– StatsPy

2020/09/29 at 13:54

1 answer

by Ewerton Belo • **392** points · Answer 1 · 2020-09-29T16:34:20+00:00

So what you want is to compare whole words. Try it like this:

import pandas as pd


data = {'Frases': ['Bom dia', 'Ok, vai lá', 'o sucesso é importante']}
df = pd.DataFrame(data)

negativ = ['mal', 'impor']
positiv = ['bom', 'ok', 'importante', 'sucesso']


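def getByWordlist(row, wordlist):
    # Collect the words of this row that appear in wordlist (whole-word match).
    results = []
    for word in row['Frases'].split(' '):
        # Normalize: lowercase and drop commas so that 'Ok,' matches 'ok'.
        word = word.lower().replace(',', '')
        if word in wordlist:
            results.append(word)
    return results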

df['Positivas'] = df.apply(getByWordlist, args=(positiv,), axis=1)
df['Negativas'] = df.apply(getByWordlist, args=(negativ,), axis=1)

print(df)

output:

                   Frases              Positivas Negativas
0                 Bom dia                  [bom]        []
1              Ok, vai lá                   [ok]        []
2  o sucesso é importante  [sucesso, importante]        []
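
If you would rather stay with a regex, one possible sketch is to wrap the alternation in word boundaries (`\b`) and escape each word with `re.escape`, so a list entry matches only when it is a complete word. Note that this swaps `extractall` plus `groupby` for `Series.str.findall` plus `Series.str.join`, which also returns an empty string for rows without matches. The word lists and the helper names `build_pattern` and `extract_words` are hypothetical stand-ins, not part of the original code.

```python
import re

import pandas as pd

df = pd.DataFrame({'Texto': ['é importante o sucesso', 'o dia está lindo']})

# Hypothetical word lists standing in for the contents of the two .txt files.
lista_positiva = ['importante', 'sucesso', 'lindo']
lista_negativa = ['mal', 'impor']


def build_pattern(words):
    """Return a regex that matches any listed word only as a whole word."""
    escaped = [re.escape(w) for w in words]
    # \b on both sides rejects matches inside longer words.
    return r"\b(?:" + "|".join(escaped) + r")\b"


def extract_words(series, words):
    """Comma-join the whole-word matches of each row ('' when there are none)."""
    found = series.str.findall(build_pattern(words), flags=re.IGNORECASE)
    return found.str.join(",")


df['Positivos'] = extract_words(df['Texto'], lista_positiva)
df['Negativos'] = extract_words(df['Texto'], lista_negativa)
print(df)
```

With these lists, 'impor' is no longer reported for 'importante', because there is no word boundary between 'impor' and the following 't'. The same `\b...\b` pattern, written with a capturing group, should also work inside the original `extractall` call.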