How to create a stopword list in R

5

Hi,

I need to do a task and I can't work out the logic for it.

My scenario: I have a data frame (DF) with several columns. I need to "read" column 3, identify the words in it, and classify each row.

Example:

DF 

nome      rua    funcao
alberto   assis  programador
elisa     cons   enfermeira
pedro     assis  prog.

I want to "read" column 3 and, whenever I find "programador", "prog." or something similar, put "Python" in a new column "classificacao" (and "outros" otherwise). The DF would then look like this:

DF

nome      rua    funcao        classificacao
alberto   assis  programador   Python
elisa     cons   enfermeira    outros
pedro     assis  programador.  Python

Could someone tell me whether creating a stopword list is the best way to solve this?

2 answers

6

To complement @Thiago Fernandes's answer, you can find similar patterns with the function grep():

dataset[grep('prog', dataset$funcao), 'funcao']
# [1] programador prog.

grep() returns the positions of the matching elements, while grepl() returns a logical vector with TRUE or FALSE for each element:

dplyr::mutate(dataset, classificacao = ifelse(grepl('prog', funcao), "Python", "Outros"))
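The difference between the two functions can be seen in a small base-R sketch on the data from the question. The names posicoes and encontrado are only illustrative, and ignore.case = TRUE is an addition here (not part of the answer above) so that a value such as "Prog" would also match:

```r
# Sample data from the question
dataset <- data.frame(
  nome   = c("alberto", "elisa", "pedro"),
  rua    = c("assis", "cons", "assis"),
  funcao = c("programador", "enfermeira", "prog."),
  stringsAsFactors = FALSE
)

# grep() gives the positions of the matching elements
posicoes <- grep("prog", dataset$funcao, ignore.case = TRUE)
posicoes
# [1] 1 3

# grepl() gives one TRUE/FALSE per element, which is what ifelse() needs
encontrado <- grepl("prog", dataset$funcao, ignore.case = TRUE)
encontrado
# [1]  TRUE FALSE  TRUE

# Build the new column from the logical vector
dataset$classificacao <- ifelse(encontrado, "Python", "Outros")
```

Because the pattern is a substring match, it would also flag any other role that happens to contain "prog", so it is worth checking the matches before relying on it.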

4


Here is one way to do it, though probably not the most efficient.

dataset = read.table(text = 'nome      rua    funcao
                             alberto   assis  programador
                             elisa     cons   enfermeira
                             pedro     assis  prog.', header = TRUE)


palavras_similares = c("prog.", "Prog", "programador", "Programador", "programador.", "Programador.")

# Positions of the rows whose funcao is in the list
# (which() with %in% finds every matching row; match() would only return
# the first occurrence of each word)
indice = which(dataset$funcao %in% palavras_similares)

# Auxiliary vector
classificacao = rep("outros", nrow(dataset))

# Replace the values at the positions found
classificacao[indice] = "Python"

# Assign the vector to the data frame
dataset$classificacao = classificacao

dataset
#     nome   rua      funcao classificacao
#1 alberto assis programador        Python
#2   elisa  cons  enfermeira        outros
#3   pedro assis       prog.        Python
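A detail worth checking when the same role appears in more than one row: match(x, table) returns only the first position in table for each element of x, while %in% tests every row. A minimal sketch, assuming a hypothetical fourth row that repeats "programador" (the names primeiro and todos are illustrative):

```r
# Hypothetical data: the role "programador" appears twice
funcao <- c("programador", "enfermeira", "prog.", "programador")
palavras_similares <- c("prog.", "programador")

# match() keeps only the first position of each word, so row 4 is missed
primeiro <- match(palavras_similares, funcao, nomatch = 0)
primeiro
# [1] 3 1

# %in% tests every row, so which() returns all matching positions
todos <- which(funcao %in% palavras_similares)
todos
# [1] 1 3 4
```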

Another way, using the dplyr package:

library(dplyr)
(dataset <- mutate(dataset, classificacao = ifelse(funcao %in% palavras_similares, "Python", "Outros")))

#     nome   rua      funcao classificacao
#1 alberto assis programador        Python
#2   elisa  cons  enfermeira        Outros
#3   pedro assis       prog.        Python
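Since palavras_similares lists every capitalization and punctuation variant by hand, one possible alternative is to normalize the column first and compare against a shorter list. This is only a sketch: the helper normalizar, the vector palavras_chave, and the fourth row are hypothetical and not part of the original answer.

```r
# Data from the question plus one hypothetical variant ("Programador.")
dataset <- data.frame(
  funcao = c("programador", "enfermeira", "prog.", "Programador."),
  stringsAsFactors = FALSE
)

# Hypothetical helper: strip punctuation and lower-case the text
normalizar <- function(x) tolower(gsub("[[:punct:]]", "", x))

funcao_norm <- normalizar(dataset$funcao)
funcao_norm
# [1] "programador" "enfermeira"  "prog"        "programador"

# After normalizing, a short keyword list covers all the variants
palavras_chave <- c("prog", "programador")
dataset$classificacao <- ifelse(funcao_norm %in% palavras_chave, "Python", "Outros")
```

The trade-off is that exact matching after normalization stays strict about which words count, unlike a substring pattern, at the cost of maintaining the keyword list.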
