How to find the frequency of words

Viewed 2,670 times

7

Hello, is there a function or command in R that tells me which words are the most frequent in a text and how often each one appears?

For example, I have a very large database in which each line is a piece of text.

I'd like to know which words occur most often in that database and how many times they appear.

3 answers

6

I don't know if there's a ready-made function for it, but I once used the code suggested in this tutorial to do this count. The final code is below (not exactly the tutorial's code, but the adapted version I used):

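# Read the file line by line as UTF-8 and convert everything to lowercase
texto <- scan("oslusiadas.txt", what="char", sep="\n", encoding = "UTF-8")
texto <- tolower(texto)

# Split each line at runs of non-word characters and flatten into one vector
lista_palavras <- strsplit(texto, "\\W+")
vetor_palavras <- unlist(lista_palavras)

# Count how often each word occurs and sort from most to least frequent
frequencia_palavras <- table(vetor_palavras)
frequencia_ordenada_palavras <- sort(frequencia_palavras, decreasing=TRUE)

# Build "word;count" rows and write them, under a header row, to a CSV file
palavras <- paste(names(frequencia_ordenada_palavras), frequencia_ordenada_palavras, sep=";")

cat("Palavra;Frequencia", palavras, file="frequencias.csv", sep="\n")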

In this test I counted the words of the poem "The Lusiads", available on the Project Gutenberg site. In the text file I removed the license clauses and other English text, leaving only the poem. The first two lines of the code read the file (as Unicode, since the text contains accented characters) and normalize the text by converting everything to lowercase. The next two lines split the text into a vector of words. The two after that count how many times each word appears and sort the counts in descending order, so the most frequent words come first. Finally, the last two lines format the result and save it to a text file (I used the semicolon as the separator).

It is important not to use Perl-compatible regular expressions in the strsplit function, because that mode does not handle the accented words correctly (that is, use perl = FALSE or omit the parameter, since FALSE is the default value).

The result is something like this, and the file can be imported into Excel (for example):

Palavra;Frequencia
que;2741
e;2221
o;1953
a;1858
de;1438
se;981
os;750
;742
do;627
não;585
com;574
por;538
em;519
as;516
da;487
lhe;401
no;326
já;309
mais;283
mas;283
na;252
um;239
quem;232
ao;231
gente;230
dos;227
terra;222
tão;210
para;205
rei;204
como;195
mar;188
onde;177
the;176
é;160
seu;155
[...]
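One detail in the output above: the row with no word before the semicolon (;742) is most likely an empty string. As far as I know, strsplit returns an empty first token whenever a line starts with a separator, such as leading spaces or an opening quote. A minimal sketch of dropping those tokens before counting, using made-up ASCII sample lines rather than the real poem:

```r
# Hypothetical sample lines (ASCII only); the leading spaces and the opening
# quote imitate indented or quoted verses.
texto <- tolower(c("  E os mares", "\"Que da praia", "e os ventos"))

vetor_palavras <- unlist(strsplit(texto, "\\W+"))

# A separator at the very start of a line yields an empty first token
n_vazias <- sum(!nzchar(vetor_palavras))

# Keep only the non-empty tokens before counting
vetor_palavras <- vetor_palavras[nzchar(vetor_palavras)]
frequencia <- sort(table(vetor_palavras), decreasing = TRUE)
print(frequencia)
```

With the filter in place, the empty entry no longer shows up in the table or in the exported file.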
  • 1

    Thank you very much :D

  • 1

    Luíz, it's nothing serious, but it seems to me that your solution is producing some strange words (não, já). UTF-8 problems, perhaps?

  • @Jjoao Wow, you're right. Thanks for noticing and letting me know. :) I'll fix it to use Unicode in the processing.

  • 1

    Nice, +1 for the Lusiads.

3

The idea is to split every line of your text into words (for example, using strsplit), combine all the words into a single vector, and count the occurrences of each word (for example, using table). The code below shows one possible implementation:

contaPalavras <- function(linhas) {
    palavras <- strsplit(linhas, "\\W+")
    todas <- unlist(palavras)
    contagem <- table(todas)
    contagem[order(-contagem)]
}
linhas <- c(
    "Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Maecenas porttitor congue massa. Fusce posuere, magna sed pulvinar ultricies, purus lectus malesuada libero, sit amet commodo magna eros quis urna.",
    "Nunc viverra imperdiet enim. Fusce est. Vivamus a tellus.",
    "Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Proin pharetra nonummy pede. Mauris et orci.",
    "Aenean nec lorem. In porttitor. Donec laoreet nonummy augue.",
    "Suspendisse dui purus, scelerisque at, vulputate vitae, pretium mattis, nunc. Mauris eget neque at sem venenatis eleifend. Ut nonummy.",
    "Fusce aliquet pede non pede. Suspendisse dapibus lorem pellentesque magna. Integer nulla.",
    "Donec blandit feugiat ligula. Donec hendrerit, felis et imperdiet euismod, purus ipsum pretium metus, in lacinia nulla nisl eget sapien. Donec ut est in lectus consequat consequat.",
    "Etiam eget dui. Aliquam erat volutpat. Sed at lorem in nunc porta tristique.",
    "Proin nec augue. Quisque aliquam tempor magna. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas.",
    "Nunc ac magna. Maecenas odio dolor, vulputate vel, auctor ac, accumsan id, felis. Pellentesque cursus sagittis felis.")
contaPalavras(linhas)

Note that you will probably want to exclude words you don't want to count, such as articles, conjunctions and prepositions, but which words to exclude depends on your own requirements.
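A rough sketch of that filtering, in case it helps. The function name contaPalavrasSemStop, the ignorar argument and the sample stop-word list are my own illustrative choices, not part of any package; this variant also lowercases the text and drops empty tokens, which the function above does not do:

```r
# Hypothetical variant of the counting function that skips unwanted words
contaPalavrasSemStop <- function(linhas, ignorar = character(0)) {
    todas <- unlist(strsplit(tolower(linhas), "\\W+"))
    # Drop empty tokens and every word listed in `ignorar`
    todas <- todas[nzchar(todas) & !(todas %in% ignorar)]
    contagem <- table(todas)
    contagem[order(-contagem)]
}

# Made-up example input and stop-word list
exemplo <- c("Fusce est. Vivamus a tellus.", "Mauris et orci. Fusce et magna.")
resultado <- contaPalavrasSemStop(exemplo, ignorar = c("a", "et", "est"))
print(resultado)
```

Passing the list as an argument keeps the decision about which words to ignore outside the counting logic, so you can swap in a proper stop-word list for your language later.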

  • Thank you very much :D

1

The tokenizers package makes this very easy!

Example:

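library(tokenizers)
library(magrittr)  # provides the %>% pipe used below

# linhas is a character vector with one text per element (as in the answer above)
tokenize_words(linhas, lowercase = FALSE) %>%
  unlist() %>%
  table() %>%
  sort(decreasing = TRUE)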

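The nice thing about tokenizers is that it already does some of the processing for you, such as removing punctuation and, by default, converting everything to lowercase (the example above turns that off with lowercase = FALSE).

In addition, it has other functions such as tokenize_ngrams, which splits the text into combinations of consecutive words (n-grams) instead of single words, so you can count those combinations in the same way.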
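To illustrate the n-gram idea, here is a small sketch that counts two-word combinations. It assumes the tokenizers package is installed and that the n-gram function is exported as tokenize_ngrams with an n argument, which is how I remember it; the sample sentences are made up:

```r
library(tokenizers)

# Made-up sample sentences
frases <- c("the quick brown fox", "the quick red fox")

# n = 2 asks for pairs of consecutive words (bigrams) within each sentence
bigramas <- unlist(tokenize_ngrams(frases, n = 2))

contagem_bigramas <- sort(table(bigramas), decreasing = TRUE)
print(contagem_bigramas)
```

The counting step is the same table and sort combination as before; only the tokenizer changes.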
