How do I turn a column with one sentence per row into a column with one word per row?

3

I have the following structure:

structure(list(frases = c("agricultura pecuária e serviços relacionados", 
"produção de lavouras temporárias", "cultivo de cereais", 
"cultivo de arroz", "cultivo de milho", "cultivo de trigo", "cultivo de outros cereais não especificados anteriormente", 
"cultivo de algodão herbáceo e de outras fibras de lavoura temporária", 
"cultivo de algodão herbáceo", "cultivo de juta", "cultivo de outras fibras de lavoura temporária não especificadas anteriormente", 
"cultivo de canadeaçúcar", "cultivo de canadeaçúcar", "cultivo de fumo", 
"cultivo de fumo")), row.names = c(NA, 15L), class = "data.frame")

I need to put all the words from every sentence (row) into a single column, one word per row, more or less like this:

structure(list(palavras = c("agricultura", "pecuária", "e", "serviços", "relacionados", 
"produção", "de", "lavouras", "temporárias", "cultivo", "de", "cereais", 
"cultivo", "de", "arroz", "cultivo", "de", "milho", "cultivo", "de", "trigo", 
"cultivo", "de", "outros", "cereais", "não", "especificados", "anteriormente")), 
row.names = c(NA, -28L), class = "data.frame")

What I’ve done so far:

library(stringr)

frases <- str_split(palavras_cnae_agro$palavras, fixed(" "))

palavras <- data.frame(matrix(unlist(frases), nrow = 590, byrow = TRUE), stringsAsFactors = FALSE)

But this resulted in a data.frame with 590 rows (the number of sentences) and 7 columns, with one word in each cell.

I can't work out how to combine these 7 columns into a single column containing all the words.

2 answers

7


If I understand correctly the result you want:

data.frame(palavras = unlist(strsplit(palavras_cnae_agro$palavras, ' ')))

EDITED

Instead of splitting on a literal ' ', it is better to use the regex \\s+, which matches any run of whitespace characters (spaces, tabs, line breaks):

data.frame(palavras = unlist(strsplit(palavras_cnae_agro$palavras, '\\s+')))

Source: https://stackoverflow.com/a/39279181/6532002
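To make the difference concrete, here is a minimal sketch on made-up data. The vector `frases` below is hypothetical and only stands in for `palavras_cnae_agro$palavras`; it includes a line break and a tab to show where a literal-space split falls short. The call to `trimws()` is an addition of mine, since leading whitespace would otherwise produce an empty first element.

```r
# Hypothetical sample data: the second phrase contains a line break,
# the third a tab
frases <- c("cultivo de arroz", "fabricação de\nmóveis", "cultivo\tde milho")

# Splitting on a literal space leaves "de\nmóveis" and "cultivo\tde" joined
por_espaco <- unlist(strsplit(frases, " ", fixed = TRUE))

# Trim the ends first, then split on any run of whitespace
por_regex <- unlist(strsplit(trimws(frases), "\\s+"))

# One word per row
palavras <- data.frame(palavras = por_regex, stringsAsFactors = FALSE)
```

Comparing `length(por_espaco)` with `length(por_regex)` on your own data is a quick way to check whether any words were left glued together.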

  • Perfect! Thank you very much!

  • I noticed that some words stayed joined, like "except furniture"; when I turned the result into a vector, they showed up as "except nmóveis". Do you know what could be causing this?

  • 1

    Take a look at this answer: https://stackoverflow.com/a/39279181/6532002. It handles all kinds of whitespace.

3

Although there is already an accepted answer, here is a solution with scan.

data.frame(palavras = scan(what = character(), 
                           text = palavras_cnae_agro[[1]]))
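As a rough illustration of how this behaves, here is a small sketch on made-up data. The vector `frases` is hypothetical and stands in for `palavras_cnae_agro[[1]]`; `quiet = TRUE` is an optional extra that only suppresses the "Read N items" message.

```r
# Hypothetical sample data with a line break and irregular spacing
frases <- c("cultivo de arroz", "cultivo de\nmilho", "  cultivo   de trigo ")

# scan() reads whitespace-separated tokens, so spaces, tabs and
# line breaks all act as separators
palavras <- data.frame(
  palavras = scan(text = frases, what = character(), quiet = TRUE),
  stringsAsFactors = FALSE
)

nrow(palavras)
```

One caveat: with the default separator, scan() treats quote characters specially, so if the phrases may contain apostrophes or double quotes, passing `quote = ""` is probably needed.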

  • Very good too! Thank you!
