Stemming Twitter text

Dear colleagues, I am trying to analyze the tweets in a Twitter timeline and need to stem the texts for the analysis. I am trying the following procedure:

library(twitteR)  # setup_twitter_oauth, userTimeline, twListToDF
library(tm)       # Corpus, VectorSource, tm_map, content_transformer
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
tweets <- userTimeline("Pragmatismo_", n = 3000)
tweets.df <- twListToDF(tweets)
myCorpus <- Corpus(VectorSource(tweets.df$text))
removeURL <- function(x) gsub("http[[:alnum:][:punct:]]*", "", x)
removeNumPunct <- function(x) gsub('[[:punct:]]', '', x)
myCorpus <- tm_map(myCorpus, content_transformer(removeNumPunct))
myCorpus <- tm_map(myCorpus, content_transformer(removeURL))
myCorpus <- tm_map(myCorpus, ptstem)

The problem is that even after the last command, myCorpus <- tm_map(myCorpus, ptstem), the text does not appear to be stemmed.

Any tips? Thank you very much!

  • Does that help you? https://github.com/dfalbel/ptstem

  • I’ll try. Thank you!

  • The function ptstem used in tm_map is not defined in the question, nor does it come from the most common packages for this purpose. Could you indicate which package you took it from? Or include the function, if you wrote it yourself.

1 answer

The main stemming function used with tm_map is stemDocument, but as explained in the answer to that question, it cannot be used for Portuguese because of a bug.

What I did to get around this was to use the quanteda package:

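library(quanteda)
# dfm_wordstem() expects a dfm, so tokenize the tweet text and build one first
toks <- tokens(tweets.df$text, remove_punct = TRUE, remove_url = TRUE)
my_dfm <- dfm_wordstem(dfm(toks), language = "pt")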

Another option would be to adapt the function ptstem::ptstem_words to your context, if possible (I have not tested this).
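As a rough sketch of that second option, the stemmer from the ptstem package can be wrapped in content_transformer so that tm_map hands it the text of each document. This assumes the tm and ptstem packages are installed; the two example sentences, the helper name stemPt, and the choice of the "rslp" algorithm with complete = FALSE are illustrative assumptions, not something taken from the question. It uses ptstem::ptstem, which works on whole texts, rather than ptstem_words, which expects individual words.

```r
library(tm)
library(ptstem)  # assumed to be installed (see github.com/dfalbel/ptstem)

# Small stand-in corpus (hypothetical data); in the question this is myCorpus
docs <- c("Os meninos estavam correndo", "As meninas correram ontem")
myCorpus <- VCorpus(VectorSource(docs))

# ptstem() takes a character vector of texts, so wrap it with
# content_transformer() to apply it to the content of each document.
# stemPt is a hypothetical helper name.
stemPt <- content_transformer(
  function(x) ptstem(x, algorithm = "rslp", complete = FALSE)
)
myCorpus <- tm_map(myCorpus, stemPt)

# Inspect the transformed text of each document
stemmed <- sapply(myCorpus, content)
print(stemmed)
```

With complete = FALSE the raw stems are kept; leaving the argument at its default should instead map each stem back to a representative full word, which may be easier to read in word clouds or frequency tables.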
