2
Hi, in a lecture at my university I saw a package called tidytext. I understood how it works, but I can't think of any use for it. Could anyone give me an example of how we would take advantage of it in everyday problems? Thank you!
6
tidytext is a package that aims to provide general-purpose tools for text analysis, so it has countless uses (the most important ones can be found in the package's main vignette, as pointed out by @MarcusNunes). Among the text analysis possibilities implemented in tidytext, I would highlight:
# install.packages("devtools")
# devtools::install_github("tomasbarcellos/valorrr")
library(rvest)   # html_session() comes from rvest (called session() in rvest >= 1.0)
library(valorrr)
sessao <- html_session("http://www.valor.com.br/")
links <- links_pagina(sessao)
# First 20 news articles
noticias <- ler_noticia(sessao, links[1:20])
We now have the text of the 20 most recent news articles from the newspaper Valor Econômico.
library(tidytext)
library(dplyr)
library(stringr)
# One row per word (token), keeping the article title
noticias_tidy <- noticias %>%
  select(titulo, texto) %>%
  unnest_tokens(word, texto)
# Drop Portuguese stopwords and count the remaining words
stop_port <- get_stopwords(language = "pt")
noticias_tidy %>%
  anti_join(stop_port) %>%
  count(word, sort = TRUE)
Joining, by = "word"
# A tibble: 2,069 x 2
   word              n
   <chr>         <int>
 1 r                55
 2 é                52
 3 bilhões          44
 4 governo          37
 5 caminhoneiros    28
 6 paulo            27
 7 ônibus           26
 8 diesel           24
 9 petrobras        24
10 presidente       23
# ... with 2,059 more rows
Without reading a single article, we can already see that today the newspaper is focused on the truck drivers' strike and on fuel policy.
Note: the words r and é appear because, to keep this example simple, we did not do any data cleanup.
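For anyone who wants that cleanup, here is a minimal sketch of one way to do it. It is not part of the original analysis: the `toy` data frame and the two filtering rules (drop single-character tokens, drop purely numeric tokens) are assumptions for illustration only.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Toy data standing in for `noticias` (hypothetical, not the scraped articles)
toy <- tibble(
  titulo = c("noticia 1", "noticia 2"),
  texto  = c("O governo liberou R$ 55 bilhões",
             "A greve dos caminhoneiros é longa")
)

clean <- toy %>%
  unnest_tokens(word, texto) %>%
  # Assumed cleanup rules: keep tokens longer than one character
  # and discard tokens made only of digits
  filter(str_length(word) > 1,
         !str_detect(word, "^[0-9]+$"))

clean %>% count(word, sort = TRUE)
```

On the real data the same `filter()` call would go right after `unnest_tokens()`, before the `anti_join()` with the stopwords.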
Using bigrams makes this conclusion even more obvious:
# Regex matching any stopword as a whole word
regex_stop <- paste0("\\b", stop_port$word, "\\b", collapse = "|")
# Remove the stopwords from the text, then split it into pairs of words
noticias_bigram <- noticias %>%
  select(titulo, texto) %>%
  mutate(texto = str_remove_all(texto, regex_stop)) %>%
  unnest_tokens(word, texto, token = "ngrams", n = 2)
noticias_bigram %>% count(word, sort = TRUE)
# A tibble: 4,365 x 2
   word                    n
   <chr>               <int>
 1 são paulo              26
 2 quinta feira           13
 3 pis cofins             10
 4 greve caminhoneiros     8
 5 preço diesel            8
 6 15 dias                 7
 7 desta quinta            7
 8 nesta quinta            7
 9 além disso              6
10 capital paulista        6
# ... with 4,355 more rows
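Because the stopwords are removed before tokenizing, pairs such as "greve caminhoneiros" join words that were not adjacent in the original sentence. A different approach, sketched below, tokenizes first and then discards any bigram that contains a stopword, so only truly adjacent pairs remain. This is an alternative to the code above, not what produced the table; `toy` and `toy_stop` are made-up stand-ins for `noticias` and `stop_port$word`.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Toy data and a tiny hand-made stopword list (both hypothetical)
toy <- tibble(
  titulo = c("noticia 1", "noticia 2"),
  texto  = c("a greve geral dos caminhoneiros",
             "o preço do diesel subiu")
)
toy_stop <- c("a", "o", "dos", "do")

bigrams <- toy %>%
  # Split the text into overlapping pairs of consecutive words
  unnest_tokens(bigram, texto, token = "ngrams", n = 2) %>%
  # Put each word of the pair in its own column
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  # Keep only pairs in which neither word is a stopword
  filter(!word1 %in% toy_stop, !word2 %in% toy_stop) %>%
  # Glue the pair back together for counting
  unite(bigram, word1, word2, sep = " ")

bigrams %>% count(bigram, sort = TRUE)
```

Which variant is preferable depends on the goal: removing stopwords first surfaces topic pairs, while filtering afterwards keeps only expressions that actually occur in the text.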
Once the text is structured in the tidy format, the sky is the limit. We could, for example, build term-document matrices to feed a model that predicts the author of a text, or visualize word usage in a word cloud, etc.
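As a rough sketch of the term-document matrix idea, tidytext's `cast_sparse()` turns a table of per-document word counts into a sparse matrix with one row per document and one column per word. The `toy` data below is invented for illustration; with the real data, `noticias_tidy` would take its place.

```r
library(dplyr)
library(tidytext)

# Toy data standing in for `noticias` (hypothetical)
toy <- tibble(
  titulo = c("noticia 1", "noticia 2"),
  texto  = c("greve dos caminhoneiros e preço do diesel",
             "governo discute preço do diesel")
)

# Count how many times each word appears in each document
word_counts <- toy %>%
  unnest_tokens(word, texto) %>%
  count(titulo, word)

# Rows = documents, columns = words, cells = counts
dtm <- word_counts %>%
  cast_sparse(titulo, word, n)

dim(dtm)
```

If a model expects a `DocumentTermMatrix` from the tm package instead, `cast_dtm()` takes the same arguments, provided tm is installed.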
This link, made available by the package’s own creators, gives an idea of what it is capable of doing.
– Marcus Nunes
Welcome to Stackoverflow, @Gabrieloliveiraguimarães. Here you can find some tips on how to improve your next questions.
– Tomás Barcellos