How to separate a string from a certain line of a data.frame and at the same time create more rows?

Asked

Viewed 1,163 times

9

I have a data.frame with a column with strings like "123-235-203".

For example, the line:

string column1 column2
123-235-203 x y

I want to separate this string so that the row that contains it multiplies preserving the values of the other columns:

strsplit(string, "-")

Turning the line into 3 lines.

string column1 column2
123 x y
235 x y
203 x y

There is a way to do this using dplyr or tidyr?

  • You want the string column to turn three columns?

  • @Carloscinelli Sorry, there it is. I want to separate the string, separate by the strokes and turn into 3 lines.

3 answers

7


Hadley himself offers us the answer in the tidyr Vignette.

library(tidyr)
library(dplyr)
df <- data.frame(
x = 1:3,
y = c("a", "d,e,f", "g,h"),
stringsAsFactors = FALSE
)
df %>%
transform(y = strsplit(y, ",")) %>%
unnest(y)
  • in fact your solution is more elegant :)

5

Create a test basis:

b <- data_frame(x = 1:10,y = 1:10, string = rep("123-235-203", 10))

With the following code you can get what you wanted:

b %>% 
  separate(string, into = c('s1', 's2', 's3'), sep = '-') %>%
  gather(string, valor, -x, -y) %>%
  select(-string) %>%
  arrange(x)
  • The function separate of tidyr turns your string into three columns (S1, s2 and S3).
  • The function gather of tidyr multiplies the lines.
  • The function selectof dplyr remove the new column stringthat is no longer needed.
  • The arrange is just to make it easier to understand.

The results are below:

# Source: local data frame [30 x 3]
# 
#    x y valor
# 1  1 1   123
# 2  1 1   235
# 3  1 1   203
# 4  2 2   123
# 5  2 2   235
# 6  2 2   203
# 7  3 3   123
# 8  3 3   235
# 9  3 3   203
# 10 4 4   123
# .. . .   ...

In the case that strings can have variable sizes, but there is a maximum of "-"’s you can do so:

b <- data_frame(x = 1:10,y = 1:10, string = rep(c("123-203", "123-203-555"), length.out = 10))

b %>% 
  separate(string, sep = '-', into = c("s1", "s2", "s3"), extra = "merge") %>%
  gather(string, valor, -x, -y, na.rm = T) %>%
  select(-string) %>%
  arrange(x) 
  • added the argument extra = "merge" in the separate so that he does not return error.
  • added the argument na.rm = T so that it does not create lines with NA.
  • I don’t think that would solve the problem if the strigs are different. Let’s assume one is "123-345", and the other is "123-12-345". The first yields 2 strings and the second, 3. Anyway, I found a shape in the tidyr itself. I’ll post the answer below. Hadley is really great.

  • you are right, had not understood that this could be variable. There is a maximum at least?

  • I just posted an answer...

3

Here is a solution using the data.table package

> library(data.table)
> dTbl <- data.table(x = 1:3, y = c("a", "d,e,f", "g,h"))
> dTbl[, .(y=unlist(strsplit(y, ','))), by=x]
   x y
1: 1 a
2: 2 d
3: 2 e
4: 2 f
5: 3 g
6: 3 h
>

Browser other questions tagged

You are not signed in. Login or sign up in order to post.