Example of a practical use for tidytext

2

Today, in a lecture at my university, I saw a package called tidytext. I understood how it works, but I can't think of any use for it. Could anyone give me an example of how we would take advantage of it in everyday problems? Thank you!

  • 2

    This link, made available by the package’s own creators, gives an idea of what it is capable of doing.

  • 1

    Welcome to Stack Overflow, @Gabrieloliveiraguimarães. Here you can find some tips on how to improve your future questions.

1 answer

6


tidytext is a package that aims to provide general-purpose tools for text analysis, so it has countless uses (the most important ones can be found in the package's main vignette, as pointed out by @Macusnunes). Among the text-analysis techniques implemented in tidytext, I would highlight:

  1. Term frequency
  2. Term-document matrix (TDM)
  3. Term frequency-inverse document frequency (tf-idf)
  4. Sentiment analysis
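
To make item 3 more concrete, here is a minimal sketch of tf-idf using tidytext's bind_tf_idf(). The three-sentence corpus and the column names doc and text are made up for illustration; they are not the news data used in the rest of this answer.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy corpus (invented sentences, one row per document)
docs <- tibble(
  doc  = c("a", "b", "c"),
  text = c("diesel price rises again",
           "diesel strike blocks roads",
           "government debates diesel tax")
)

tf_idf <- docs %>%
  unnest_tokens(word, text) %>%   # one row per word
  count(doc, word) %>%            # how often each word occurs in each document
  bind_tf_idf(word, doc, n) %>%   # adds the columns tf, idf and tf_idf
  arrange(desc(tf_idf))

tf_idf
```

A word that occurs in every document (here "diesel") gets an idf of zero, so tf-idf pushes it to the bottom and brings up the words that distinguish one document from the others.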

Usage example - Word frequency

Step 1 - Get some text to analyze

# install.packages("devtools")
# devtools::install_github("tomasbarcellos/valorrr")

library(rvest)    # html_session() is an rvest function (renamed session() in rvest 1.0.0)
library(valorrr)
sessao <- html_session("http://www.valor.com.br/")
links <- links_pagina(sessao)
# First 20 news articles
noticias <- ler_noticia(sessao, links[1:20])

We now have the text of the first 20 news articles currently on the website of the newspaper Valor Econômico.
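
If the valorrr package or the site is unavailable, a hand-made data frame is enough to follow along. The sketch below assumes only that the data has the two columns used in the next step, titulo and texto. The sentences are invented placeholders, not real articles.

```r
library(dplyr)

# Hypothetical stand-in for the scraped data: one row per article,
# with the two columns the following steps rely on (titulo, texto).
noticias <- tibble(
  titulo = c("Notícia 1", "Notícia 2"),
  texto  = c("A greve dos caminhoneiros afeta o preço do diesel.",
             "O governo discute o preço do diesel com a Petrobras.")
)

noticias
```

The counts will differ from the ones shown in this answer, but the same steps can be applied to this data frame.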

Step 2 - Use tidytext to analyze texts

library(tidytext)
library(dplyr)
library(stringr)

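# One row per word (token), keeping the article title
noticias_tidy <- noticias %>% 
  select(titulo, texto) %>% 
  unnest_tokens(word, texto)

# Portuguese stop words (requires the stopwords package)
stop_port <- get_stopwords(language = "pt")

# Drop the stop words and count the remaining words
noticias_tidy %>% 
  anti_join(stop_port) %>%
  count(word, sort = TRUE)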

Joining, by = "word"
# A tibble: 2,069 x 2
   word              n
   <chr>         <int>
 1 r                55
 2 é                52
 3 bilhões          44
 4 governo          37
 5 caminhoneiros    28
 6 paulo            27
 7 ônibus           26
 8 diesel           24
 9 petrobras        24
10 presidente       23
# ... with 2,059 more rows

Without reading a single article, we can already tell that today's coverage focuses on the truck drivers' strike and fuel policy.

Note: the words r and é appear because, to keep this example simple, we did no data cleanup.
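
One possible cleanup, sketched here on an invented one-row example rather than the real news data, is to add your own stop words and drop tokens that are pure numbers. The names extra_stop and noticias_limpo are hypothetical, and which tokens count as noise is a judgment call.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Hypothetical one-row example; in the answer this would be `noticias`
noticias <- tibble(
  titulo = "Exemplo",
  texto  = "O diesel custa R$ 4 e o preço é alto."
)

stop_port  <- get_stopwords(language = "pt")
extra_stop <- tibble(word = c("r", "é"))   # assumed noise words, taken from the note above

noticias_limpo <- noticias %>%
  unnest_tokens(word, texto) %>%
  anti_join(stop_port, by = "word") %>%    # standard Portuguese stop words
  anti_join(extra_stop, by = "word") %>%   # our own additions
  filter(!str_detect(word, "^[0-9]+$"))    # drop tokens made only of digits

noticias_limpo %>% count(word, sort = TRUE)
```

Passing by = "word" explicitly also silences the "Joining, by" message.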

Using bigrams makes this conclusion even more obvious:

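# Regex that matches any Portuguese stop word as a whole word
regex_stop <- paste0("\\b", stop_port$word, "\\b", collapse = "|")

noticias_bigram <- noticias %>% 
  select(titulo, texto) %>% 
  mutate(texto = str_remove_all(texto, regex_stop)) %>%  # strip stop words first
  unnest_tokens(word, texto, token = "ngrams", n = 2)    # then split into bigrams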

noticias_bigram %>% count(word, sort = TRUE)

# A tibble: 4,365 x 2
   word                    n
   <chr>               <int>
 1 são paulo              26
 2 quinta feira           13
 3 pis cofins             10
 4 greve caminhoneiros     8
 5 preço diesel            8
 6 15 dias                 7
 7 desta quinta            7
 8 nesta quinta            7
 9 além disso              6
10 capital paulista        6
# ... with 4,355 more rows

Step 3 - Choose your next goal

Once the text is structured in tidy format, the sky is the limit. We could, for example, build term-document matrices to feed a model that predicts the author of a text, or visualize word usage in a word cloud.
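
As a rough sketch of the first idea, tidytext's cast_sparse() turns tidy word counts into a sparse matrix (it relies on the Matrix package). The two documents below are invented, and the names docs, contagens and dtm are hypothetical. Note that the result has one row per document and one column per word, i.e. a document-term layout.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data with the same column names used in this answer
docs <- tibble(
  titulo = c("n1", "n2"),
  texto  = c("greve dos caminhoneiros", "preço do diesel e greve")
)

# Tidy counts: one row per (document, word) pair
contagens <- docs %>%
  unnest_tokens(word, texto) %>%
  count(titulo, word)

# Sparse matrix: rows are documents, columns are words, cells are counts
dtm <- cast_sparse(contagens, titulo, word, n)

dim(dtm)
```

A matrix like this can be passed to modeling packages that accept sparse input. Transposing it with t(dtm) gives the term-document orientation.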
