Stopword removal does not work

Hello,

I wrote a program that does basic text preprocessing for use with text mining tools. The problem must be something very trivial, but I can't see what is wrong.

#####################################################
##
## Text mining example - preprocessing
##
#####################################################
# clean workspace
rm(list=ls())
# change dir
my.d <- dirname(rstudioapi::getActiveDocumentContext()$path)
setwd(my.d)
# close all graphics
graphics.off()
#####################################################
library(tm) # Text Mining

text <- c("Isto nao é um têste.",
          "São Paulo está <*> sob os atáques de histéricos: 1280 pesoas já chegaram!",
          "Goiânia é capital de Goias - localizada no centro-oeste;",
          "Três catacrismas deverão // acontecer em menos de duas horas.")
text

library(dplyr)
dados <- data_frame(line = 1:4, Texto = text)
dados
##########################################
docs_corpus = Corpus(VectorSource(dados$Texto))
inspect(docs_corpus[1])
docs = docs_corpus

# Preprocessing
# Remove special characters
for(j in seq(docs)){
  docs[[j]] <- gsub("/", " ", docs[[j]])
  docs[[j]] <- gsub("@", " ", docs[[j]])
  docs[[j]] <- gsub("\\|", " ", docs[[j]])
  docs[[j]] <- gsub("<>", " ", docs[[j]])
  docs[[j]] <- gsub("<*>", " ", docs[[j]], fixed = TRUE) # fixed = TRUE matches the literal "<*>"
}
# Show content
for (i in seq_along(docs))
  print(docs$content[i])

# Convert everything to lowercase
docs <- tm_map(docs, tolower) # continue from docs so the cleanup above is kept
# Remove unwanted elements
docs = tm_map(docs,removeNumbers)
docs = tm_map(docs,removePunctuation)

# Show content
for (i in seq_along(docs))
  print(docs$content[i])

# Handle misspelled words and compound words that should not be split,
# plural and singular forms, verbs in the infinitive
for (j in seq(docs)){
  docs[[j]] <- gsub("são paulo", "são_paulo", docs[[j]])
  docs[[j]] <- gsub("têste", "teste", docs[[j]])
  docs[[j]] <- gsub("atáque", "ataque", docs[[j]])
  docs[[j]] <- gsub("catacrismas", "cataclismas", docs[[j]])
  docs[[j]] <- gsub("pesoas", "pessoas", docs[[j]])
  docs[[j]] <- gsub("nao", "não", docs[[j]])
  docs[[j]] <- gsub("goias", "goiás", docs[[j]])
  docs[[j]] <- gsub("aaa*", "aaaaa", docs[[j]])
  docs[[j]] <- gsub('^[ha]+$', "aaaaa", docs[[j]])
}
docs = tm_map(docs,stripWhitespace) # Remove extra whitespace between words

# Show content
for (i in seq_along(docs))
  print(docs$content[i])

# Build the stopword list for Portuguese
stw = c(stopwords("portuguese"),"que","como","são","até","ser","é") # add a word here to remove it from the analysis
stw <- iconv(stw, "UTF-8", "latin1") # Change the encoding so it works with Portuguese
str(stw)
docs = tm_map(docs,removeWords,c(stw)) # Remove the stopwords

docs = tm_map(docs,stripWhitespace) # Remove extra whitespace between words
inspect(docs[1])

# Show content
for (i in seq_along(docs))
  print(docs$content[i])

My stopword list contains "é", among others, yet this is the only term that is not removed from my corpus.

Does anyone know what might be going on?

  • If you inspect the object stw after running the command iconv(stw, "UTF-8", "latin1"), you will see that the words são, até and é became NA, which is why é is not being removed.

  • Thanks Rafael, that's right. You could post it as an answer so I can accept it.

1 answer

The words são, até and é turn into NA when iconv is applied to them, because iconv returns NA for any element it cannot convert. If you omit that line from your code, the end result will be what you expect.

## Build the stopword list for Portuguese
stw = c(stopwords("portuguese"),"que","como","são","até","ser","é") # add a word here to remove it from the analysis
str(stw)
docs = tm_map(docs,removeWords,c(stw)) # Remove the stopwords
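
To see the failure in isolation, here is a minimal base R sketch. It assumes the accented stopwords are stored as latin1 bytes in the session (something that can happen on Windows), so declaring them as UTF-8 makes the conversion fail. That is an assumption about the environment, not something confirmed in the question. The name safe_iconv is a hypothetical helper introduced only for illustration; it is not part of tm or base R.

```r
# Build latin1-encoded bytes by hand so the example does not depend on the
# encoding of this file: 0xE3 is "ã" and 0xE9 is "é" in latin1.
sao     <- rawToChar(as.raw(c(0x73, 0xe3, 0x6f)))  # "são" as latin1 bytes
e_acute <- rawToChar(as.raw(0xe9))                 # "é" as a latin1 byte
stw_latin1 <- c("que", "como", sao, e_acute)

# Declaring the wrong source encoding: the accented bytes are not valid
# UTF-8, so iconv() cannot convert those elements, while the pure ASCII
# words pass through unchanged.
converted <- iconv(stw_latin1, from = "UTF-8", to = "latin1")
na_flags <- is.na(converted)  # TRUE marks stopwords lost in the conversion

# Hypothetical helper (an assumption, not a tm function): convert, but fall
# back to the original string whenever the conversion fails, so that no
# stopword is silently turned into NA.
safe_iconv <- function(x, from, to) {
  out <- iconv(x, from = from, to = to)
  failed <- is.na(out) & !is.na(x)
  out[failed] <- x[failed]
  out
}

stw_safe <- safe_iconv(stw_latin1, "UTF-8", "latin1")
```

Checking anyNA(stw) right after any conversion step is a cheap way to catch this kind of silent loss before the list reaches removeWords.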
