Keeping accented letters with str_extract()

I need to do an analysis of books in Brazilian Portuguese. To organize a frequency list of words per book I am using the commands:

GS.tidy <- GS %>%
  unnest_tokens(word, text)
MM.tidy <- MM %>%
  unnest_tokens(word, text)
NS.tidy <- NS %>%
  unnest_tokens(word, text)
Sa.tidy <- Sa %>%
  unnest_tokens(word, text)
frequencia.guimaraes <- bind_rows(mutate(MM.tidy, livro = "MM"),
                                 mutate(GS.tidy, livro = "GS"),
                                 mutate(NS.tidy, livro = "NS"),
                                 mutate(Sa.tidy, livro = "Sa")) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(livro, word) %>%
  group_by(livro)

However, I noticed that the accented letters are being dropped from the words, and I need to keep them. Any suggestions?

Thank you very much!

1 answer

Regular expressions do not behave the same everywhere: what a character range matches depends on the language or country, that is, on the system locale.

From R's regex help page (?regex):

The only portable way to specify all ASCII letters is to list them all as the character class [ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz].

This means that if we want a class covering the letters of Portuguese, we have to include every accented letter one by one. Brazilian Portuguese texts may also contain letters with a diaeresis (trema), as in "freqüência", a mark abolished in Portugal in 1945.
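As a rough sketch of what listing the letters one by one could look like, the class below enumerates the lowercase accented letters commonly found in Portuguese. The name letras_pt and the exact letter list are assumptions for illustration, not an exhaustive inventory, and the sketch assumes the text is stored with precomposed (NFC) characters.

```r
library(stringr)

# Explicit lowercase class for Portuguese. The letter list is an assumption
# and may need extending for other texts (for example, uppercase letters).
letras_pt <- "[a-záàâãéêíóôõúüç']+"

s <- c("ate", "até", "freqüência", "mão", "coração")

# Each word should now be extracted whole instead of stopping at the accent
extraidas <- str_extract(s, letras_pt)
print(extraidas)
```

Only lowercase letters are listed because unnest_tokens() converts text to lowercase by default; a class for untokenized text would also need the uppercase forms.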

Solution.

However, the solution is simple: the class [:alpha:] works in R, although it is not guaranteed to be portable.

library(stringr)

s <- c("ate", "até", "freqüencia", "mão")

str_extract(s, "[A-Za-z''`~^]+")
#[1] "ate"  "at"   "freq" "m"  

str_extract(s, "[:alpha:]+")
#[1] "ate"        "até"        "freqüencia" "mão"
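Applied to the pipeline from the question, the change might look like the sketch below. The small palavras data frame is a hypothetical stand-in for the already tokenized books, and the apostrophe is kept in the class only because the original pattern allowed it.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the tokenized books (one word per row)
palavras <- tibble(
  livro = c("GS", "GS", "MM", "MM"),
  word  = c("até", "até", "mão", "freqüencia")
)

# Same steps as in the question, with [a-z'] replaced by a class
# that also matches accented letters
frequencia <- palavras %>%
  mutate(word = str_extract(word, "[[:alpha:]']+")) %>%
  count(livro, word)

print(frequencia)
```

Here [:alpha:] is placed inside a bracket expression so that the apostrophe can be added next to it.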
  • Thank you very much!
