Separate contents from one column in other columns

Asked

Viewed 7,816 times

8

I have a file in table form. I need to separate the contents of a specific column in other columns, the data are separated by ;.

Well I managed to do this, but the problem is that the content will result for each row a different number of columns. And by this code it replicates up to the content until it reaches the number of columns. I wanted it to display IN.

By my code I get this:

  v1      v2      v3    p/sep  sep sep  sep sep  
dados1  dados2  dados3  a;b;c   a   b   c   a  
dados1  dados2  dados3  a;c     a   c   a   c  
dados1  dados2  dados3  a       a   a   a   a  
dados1  dados2  dados3  a;b     a   b   a   b  
dados1  dados2  dados3  a;b;c;d a   b   c   d  
dados1  dados2  dados3  a;b     a   b   a   b  

But the result I hope to have is:

  v1      v2      v3 separar  sep sep   sep sep  
dados1  dados2  dados3  a;b;c   a   b   c   NA  
dados1  dados2  dados3  a;c     a   c   a   c  
dados1  dados2  dados3  a       a   NA  NA  NA  
dados1  dados2  dados3  a;b     a   b   NA  NA  
dados1  dados2  dados3  a;b;c;d a   b   c   d  
dados1  dados2  dados3  a;b     a   b   NA  NA  

library(reshape)        
file_split = data.frame(file,colsplit(file$separar,split=";",names="buffer",))
  • 1

    In the second line of the expected response is actually "a c a c" or "a c NA"

  • There’s another error in the code, there’s no comma after "buffer", and also this argument does not make much sense because it defines a column name, being that 4 are created. Not that it’s a big problem in this case, but the ideal is to put the exact code and the exact output generated, apparently you edited some things by hand.

  • 'buffer" was a syntax error, I thought it would repeat the buffer name as header of all new columns @Molx

  • was typo @Marcosbanik

  • 1

    @Joycemaia if you want to use "buffer", you need to differentiate the columns somehow, because columns with equal names are not a good idea. One way to do this is to use paste as I did in my reply.

3 answers

4

You were very close to the solution. If instead of the reshape use the reshape2, using the same function (switching the argument split for pattern), you come close to what you would like:

> file <- read.table(text="  v1      v2      v3    separar  sep sep  sep sep  
dados1  dados2  dados3  a;b;c   a   b   c   a  
dados1  dados2  dados3  a;c     a   c   a   c  
dados1  dados2  dados3  a       a   a   a   a  
dados1  dados2  dados3  a;b     a   b   a   b  
dados1  dados2  dados3  a;b;c;d a   b   c   d  
dados1  dados2  dados3  a;b     a   b   a   b", header=T, as.is=T)[,1:4]
> library(reshape2)        
> file_split = data.frame(file, colsplit(file$separar, pattern=";", names=paste0("sep", 1:4)))

Resulting in

> file_split
      v1     v2     v3 separar sep1 sep2 sep3 sep4
1 dados1 dados2 dados3   a;b;c    a    b    c     
2 dados1 dados2 dados3     a;c    a    c          
3 dados1 dados2 dados3       a    a               
4 dados1 dados2 dados3     a;b    a    b          
5 dados1 dados2 dados3 a;b;c;d    a    b    c    d
6 dados1 dados2 dados3     a;b    a    b          

This is a matter of default function definition. In both packages, it is not possible to control the behavior for number of different columns, but the reshape uses recycling, while the reshape2complete with empty strings. To replace empty spaces with NA, you can do, for example:

> file_split[file_split == ""] <- NA
> file_split
      v1     v2     v3 separar sep1 sep2 sep3 sep4
1 dados1 dados2 dados3   a;b;c    a    b    c <NA>
2 dados1 dados2 dados3     a;c    a    c <NA> <NA>
3 dados1 dados2 dados3       a    a <NA> <NA> <NA>
4 dados1 dados2 dados3     a;b    a    b <NA> <NA>
5 dados1 dados2 dados3 a;b;c;d    a    b    c    d
6 dados1 dados2 dados3     a;b    a    b <NA> <NA>

It is only important to be careful here because this substitution changes all values "" from the table, it may be safer to use the column names/indexes to avoid the risk of changing other data that should remain empty and not NA (if any).


Before you find that difference in reshape2 I made an alternative using base and following that tip. It’s not so succinct, but since it’s ready, it’s below:

> file.split <- strsplit(file$separar, ";")
> n.obs <- sapply(file.split, length)
> seq.max <- seq_len(max(n.obs))
> resultado <- cbind(file, t(sapply(file.split, "[", i = seq.max)))
> resultado
      v1     v2     v3 separar 1    2    3    4
1 dados1 dados2 dados3   a;b;c a    b    c <NA>
2 dados1 dados2 dados3     a;c a    c <NA> <NA>
3 dados1 dados2 dados3       a a <NA> <NA> <NA>
4 dados1 dados2 dados3     a;b a    b <NA> <NA>
5 dados1 dados2 dados3 a;b;c;d a    b    c    d
6 dados1 dados2 dados3     a;b a    b <NA> <NA>

3


The biggest difficulty is providing names for the new columns. I will adopt the same variable file and the same scheme for column names that Molx used.

If you know beforehand the number of columns contained in separar. (4, in that case)

colunas <- paste0("sep", 1:4)

The number of columns can be obtained by means of the function str_count package stringr.

library(stringr);
colunas <- paste0("sep", 1:(max(str_count(file$separar, ';')) + 1));

Now it is possible to split the column separar using the function separate package tidry.

library(tidyr);
separate(file, separar, into = colunas, sep = ';', remove = FALSE,
         extra = "merge");

-1

library(reshape)        
file_split = data.frame(file,colsplit(file$separar,split=";",names="buffer",))

To name both columns you need to change the end names = c("Buffer", "Buffer2"))

I mean, it stays that way

library(reshape)        
file_split = data.frame(file,colsplit(file$separar,split=";",names= c("buffer","Buffer2")))

Browser other questions tagged

You are not signed in. Login or sign up in order to post.