Efficiently create a new matrix from a fairly large one

Question

Asked 11 years, 10 months ago

6

In R, I have a very large dataset and want to create new columns. I will try to explain my problem with a very small matrix. Below, "1" means private school and "2" means public school. For example, I have this dataset:

>Data
Casa Escola 
 1     1
 1     1
 1     2
 1     2
 2     1
 2     2
 2     1
 3     1
 3     1
 3     1
 3     1

In this case, house 1 has 4 residents who are in school, 2 in private and 2 in public school. Similarly, house 2 has 3 residents in school, 2 in private and 1 in public. Finally, house 3 has 4 people in school, all of them in private school.

I want a new matrix in which the first column indicates the house, the second the number of children in the house who are in school, the third the number of those in private school, and the fourth the number of those in public school. Something like this:

> matrix1
  Casa  em_escola  part  publ
     1          4     2     2
     2          3     2     1
     3          4     4     0

I wrote the code shown below. The problem is that my original matrix is very large and the code takes hours to run. I also need to do the same thing for other matrices. Here is my code:

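# Unique houses and how many there are
lista1 <- unique(Data$Casa)
n <- length(lista1)

lista_aux <- 1:n

# One row per house; the count columns are filled in by the loop
matrix1 <- data.frame(lista_aux, lista1)

for (i in 1:n) {
    # Rows belonging to house i (column names follow the example data above)
    casa_i <- subset(Data, Casa == lista1[i])
    matrix1$em_escola[i] <- nrow(casa_i)

    # Residents in private school (Escola == 1)
    mat1 <- subset(casa_i, Escola == 1)
    matrix1$part[i] <- nrow(mat1)

    # Residents in public school (Escola == 2)
    mat2 <- subset(casa_i, Escola == 2)
    matrix1$publ[i] <- nrow(mat2)
}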

matrix1 is the matrix I want, but as I said, I need code that is much faster than this, because it takes a long time to run on a very large dataset.

2 answers

5

You can use the dplyr package to make your code both simpler and more efficient:

library(dplyr)

Data <- data.frame(Casa=c(1,1,1,1,2,2,2,3,3,3,3),
    Escola=c(1,1,2,2,1,2,1,1,1,1,1))

matrix1 <- Data %>%
    group_by(Casa) %>%
    summarise(em_escola = n(),
        part = sum(Escola == 1),
        publ = sum(Escola == 2))

matrix1

What if my Data has another column called "peso" (weight)? Suppose each house has a weight. How do I keep this column in my new matrix?

– orrillo

2014/08/27 at 03:36
1

In that case you can do group_by(Casa, peso).

– rodrigorgs

2014/08/27 at 12:38
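
To illustrate the group_by(Casa, peso) suggestion, here is a minimal sketch. It assumes that peso is constant within each house; the peso values below are invented for the example and are not from the original data.

```r
library(dplyr)

# Hypothetical data: the peso values are made up for illustration,
# with one constant weight per house.
Data <- data.frame(Casa   = c(1,1,1,1,2,2,2,3,3,3,3),
                   Escola = c(1,1,2,2,1,2,1,1,1,1,1),
                   peso   = c(10,10,10,10,20,20,20,30,30,30,30))

# Grouping by both Casa and peso keeps peso as a column in the result.
matrix1 <- Data %>%
    group_by(Casa, peso) %>%
    summarise(em_escola = n(),
              part = sum(Escola == 1),
              publ = sum(Escola == 2)) %>%
    ungroup()

matrix1
```

If peso varied within a house, this grouping would produce one row per distinct (Casa, peso) pair rather than one row per house.
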
Just out of curiosity, what does "em_escola = n()" mean?

– orrillo

2014/08/27 at 14:17
1

Actually, I want to know what the role of "n()" is.

– orrillo

2014/08/27 at 20:14
1

n() counts the number of rows in each group. It is dplyr-specific.

– rodrigorgs

2014/08/28 at 00:42
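
As a small check of what n() does, the sketch below compares it with a base R row count per house. It uses the example data from the answer; the names counts_n and counts_base are only illustrative.

```r
library(dplyr)

Data <- data.frame(Casa   = c(1,1,1,1,2,2,2,3,3,3,3),
                   Escola = c(1,1,2,2,1,2,1,1,1,1,1))

# n() returns the number of rows in the current group (here, per house).
counts_n <- Data %>%
    group_by(Casa) %>%
    summarise(em_escola = n())

# Base R comparison: table() tallies how many rows have each value of Casa.
counts_base <- table(Data$Casa)

counts_n
counts_base
```

Both give the same counts per house, which is why em_escola = n() is the number of residents in school.
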

by Carlos Cinelli • **16,826** points · Answer 1 · 2014-08-16T04:09:14+00:00

5

To complement the other answer, here is a solution using data.table. Both dplyr and data.table are extremely fast on large datasets. dplyr is, in my opinion, more intuitive, and data.table is more flexible.

library(data.table)
Data <- data.table(Data)
matrix1 <- Data[,list(em_escola = length(Escola),
           part=sum(Escola==1),
           publ = sum(Escola==2)), by=Casa]

What if my Data has another column called "peso" (weight)? Suppose each house has a weight. How do I keep this column in my new matrix?

– orrillo

2014/08/27 at 03:37
1

If weight is a variable you want to group by, put it at the end, in the by argument: by=list(Casa, peso). If weight is a variable you want to summarize per house, for example as an average weight, add it inside the list: list(em_escola = length(Escola), part = sum(Escola==1), publ = sum(Escola==2), peso_medio = mean(peso))

– Carlos Cinelli

2014/08/27 at 18:38
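
The two options in this comment can be sketched as follows. The peso values are invented for illustration, and the result names por_casa_peso and com_media are hypothetical.

```r
library(data.table)

# Hypothetical data: peso values are made up, one constant weight per house.
Data <- data.table(Casa   = c(1,1,1,1,2,2,2,3,3,3,3),
                   Escola = c(1,1,2,2,1,2,1,1,1,1,1),
                   peso   = c(10,10,10,10,20,20,20,30,30,30,30))

# Option 1: group by Casa and peso, so peso is kept as a grouping column.
por_casa_peso <- Data[, list(em_escola = length(Escola),
                             part = sum(Escola == 1),
                             publ = sum(Escola == 2)),
                      by = list(Casa, peso)]

# Option 2: group by Casa only and summarize peso, here as its mean.
com_media <- Data[, list(em_escola = length(Escola),
                         part = sum(Escola == 1),
                         publ = sum(Escola == 2),
                         peso_medio = mean(peso)),
                  by = Casa]

por_casa_peso
com_media
```

With a constant weight per house the two results carry the same information; they differ only when peso varies within a house.
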
Still on this question: using data.table, how do I create a variable that depends on a subset of another? For example, with dplyr, when summarizing, we can do something like novacol = fun(variable1[variable2 == "A"]), that is, apply the function fun to variable1 restricted to the elements where variable2 == "A".

– orrillo

2015/03/19 at 21:57
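
One possible way to do this, offered as a sketch rather than an answer from the thread: the same subsetting expression can be written inside the list in data.table, since the columns are available there as ordinary vectors within each group. The peso values and the name peso_medio_part are invented for the example, with mean standing in for fun.

```r
library(data.table)

# Hypothetical data: peso values are made up and vary within each house.
Data <- data.table(Casa   = c(1,1,1,1,2,2,2,3,3,3,3),
                   Escola = c(1,1,2,2,1,2,1,1,1,1,1),
                   peso   = c(10,20,30,40,5,15,25,1,2,3,4))

# Mean of peso per house, restricted to rows where Escola == 1.
res <- Data[, list(peso_medio_part = mean(peso[Escola == 1])), by = Casa]

res
```
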