The idea is to split each line of your text into words (for example, using strsplit), concatenate all the words, and count the occurrences of each word (for example, using table). The code below shows a possible implementation:
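contaPalavras <- function(linhas) {
  # Split each line on runs of non-word characters
  palavras <- strsplit(linhas, "\\W+")
  # Flatten the per-line list into a single character vector
  todas <- unlist(palavras)
  # Count the occurrences of each distinct word
  contagem <- table(todas)
  # Return the counts sorted from most to least frequent
  contagem[order(-contagem)]
}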
linhas <- c(
"Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Maecenas porttitor congue massa. Fusce posuere, magna sed pulvinar ultricies, purus lectus malesuada libero, sit amet commodo magna eros quis urna.",
"Nunc viverra imperdiet enim. Fusce est. Vivamus a tellus.",
"Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Proin pharetra nonummy pede. Mauris et orci.",
"Aenean nec lorem. In porttitor. Donec laoreet nonummy augue.",
"Suspendisse dui purus, scelerisque at, vulputate vitae, pretium mattis, nunc. Mauris eget neque at sem venenatis eleifend. Ut nonummy.",
"Fusce aliquet pede non pede. Suspendisse dapibus lorem pellentesque magna. Integer nulla.",
"Donec blandit feugiat ligula. Donec hendrerit, felis et imperdiet euismod, purus ipsum pretium metus, in lacinia nulla nisl eget sapien. Donec ut est in lectus consequat consequat.",
"Etiam eget dui. Aliquam erat volutpat. Sed at lorem in nunc porta tristique.",
"Proin nec augue. Quisque aliquam tempor magna. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas.",
"Nunc ac magna. Maecenas odio dolor, vulputate vel, auctor ac, accumsan id, felis. Pellentesque cursus sagittis felis.")
contaPalavras(linhas)
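If your text has accented characters, the "\\W+" pattern may, depending on platform and locale, split in the middle of a word. Below is a minimal sketch of a variant that splits on anything that is not a Unicode letter, assuming the input can be converted to UTF-8. The name contaPalavrasUnicode and the sample sentences are only illustrative.

```r
# Hypothetical Unicode-aware variant of contaPalavras (illustrative name).
# Assumption: the input text can be represented as UTF-8.
contaPalavrasUnicode <- function(linhas) {
  # Make sure the strings are UTF-8 encoded
  linhas <- enc2utf8(linhas)
  # Split on runs of characters that are not Unicode letters (PCRE \p{L})
  palavras <- strsplit(linhas, "[^\\p{L}]+", perl = TRUE)
  todas <- unlist(palavras)
  # Drop empty strings, which appear when a line starts with a separator
  todas <- todas[nzchar(todas)]
  contagem <- table(todas)
  # Sort from most to least frequent
  contagem[order(-contagem)]
}

# Made-up sample with accented words, written with \u escapes
exemplo <- c("N\u00e3o, j\u00e1 n\u00e3o.", "J\u00e1 foi; n\u00e3o volta.")
resultado <- contaPalavrasUnicode(exemplo)
print(resultado)
```

Note that this version still treats "Já" and "já" as different words; whether to normalize case is up to you.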
Note that you will probably want to remove words you don’t want to count, such as articles, conjunctions, and prepositions, but which words to drop depends on your own requirements.
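One way to do that filtering is sketched below. It assumes you supply your own list of words to ignore; the stopwords vector here is just a made-up sample, and contaPalavrasFiltradas is an illustrative name. It also lowercases the words first, so that "Fusce" and "fusce" are counted together, which may or may not be what you want.

```r
# Illustrative stop-word list; a real one depends on the language and the use case.
stopwords <- c("a", "et", "in", "at", "ac", "ut", "sed")

# Hypothetical variant of contaPalavras that skips the words in `ignorar`.
contaPalavrasFiltradas <- function(linhas, ignorar = character(0)) {
  # Split into words and normalize to lower case
  todas <- tolower(unlist(strsplit(linhas, "\\W+")))
  # Drop empty strings and any word found in the ignore list
  todas <- todas[nzchar(todas) & !(todas %in% ignorar)]
  contagem <- table(todas)
  # Sort from most to least frequent
  contagem[order(-contagem)]
}

exemplo <- c("Fusce est. Vivamus a tellus.", "Mauris et orci. Fusce et magna.")
resultado <- contaPalavrasFiltradas(exemplo, ignorar = stopwords)
print(resultado)
```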
Thank you very much :D
– user20273
Luíz, it's nothing serious, but it seems to me that your solution is producing some strange words (no, already). Could it be a UTF-8 problem?
– JJoao
@Jjoao Wow, you’re right. Thanks for noticing and letting me know. :) I’ll fix it to use Unicode in the processing.
– Luiz Vieira
Nice, +1 for The Lusiads.
– JJoao