Apply a function by groups or factors in R

2

Dear all,

I need to calculate the relative population growth of Brazilian municipalities between two years (final population minus initial population, divided by initial population). For that, I wrote a simple function:

var_rel <- function(x,y){
  ((x-y)/y)*100
}

Here x is the 2019 population and y is the 2014 population.

However, I cannot find a way to run the function by municipality. I need to ensure that x (the 2019 population) and y (the 2014 population) come from the same municipality.

Currently, I apply the function as follows:

var_rel(pop[pop$ano=="2019","qtd_pop_mun"], pop[pop$ano=="2014","qtd_pop_mun"])

However, this could pair the 2019 population of one municipality (as x) with the 2014 population of another municipality (as y). I need to ensure that this does not happen.

I tried to use tapply, but since my function var_rel takes two arguments, x and y, I don't know how to pass both to tapply.

My object has one column for the year (2014 to 2019), another for the population size, and another for the IBGE municipality code (the 5570 municipalities of Brazil).

I need to solve this using the function I wrote for the relative variation, because later I will have more customized calculations that will each need their own function, as I did here.

Thank you!

    ano key_cd7_ibge_mun qtd_pop_mun qtd_pop_est qtd_pop_pais
1  2014          1100015       25652     1748531    202768562
2  2015          1100015       25578     1768204    204450049
3  2016          1100015       25506     1787279    206081432
4  2017          1100015       25437     1805788    207660929
5  2018          1100015       23167     1757589    208494900
6  2019          1100015       22945     1777225    210147125
7  2014          1100023      102860     1748531    202768562
8  2015          1100023      104401     1768204    204450049
9  2016          1100023      105896     1787279    206081432
10 2017          1100023      107345     1805788    207660929
11 2018          1100023      106168     1757589    208494900
12 2019          1100023      107863     1777225    210147125
13 2014          1100031        6424     1748531    202768562
14 2015          1100031        6355     1768204    204450049
15 2016          1100031        6289     1787279    206081432
16 2017          1100031        6224     1805788    207660929
17 2018          1100031        5438     1757589    208494900
18 2019          1100031        5312     1777225    210147125
  • Can you please edit the question to include the output of dput(pop[c("pop", "qtd_pop_mun")]) or, if the data set is too large, dput(head(pop[c("pop", "qtd_pop_mun")], 20))?

  • "pop" is just the object name, there is no column with that name. I edited with the first 20 rows of the object and all columns.

  • You’re absolutely right, I meant pop[c("ano", etc)].

  • 1) Is the column key_cd7_ibge_mun the municipality code? 2) There are only rows with ano == 2014; can you redo the example data so that it includes both years for the same municipalities?
  • Yes, this column is the municipality code. I have data from 2014 to 2019. Each municipality appears 6 times, once for each year.

2 answers

2

Using data.table:

library(data.table)

dados <- fread(text =
'ano,key_cd7_ibge_mun,qtd_pop_mun,qtd_pop_est,qtd_pop_pais
2014,1100015,25652,1748531,202768562
2015,1100015,25578,1768204,204450049
2016,1100015,25506,1787279,206081432
2017,1100015,25437,1805788,207660929
2018,1100015,23167,1757589,208494900
2019,1100015,22945,1777225,210147125
2014,1100023,102860,1748531,202768562
2015,1100023,104401,1768204,204450049
2016,1100023,105896,1787279,206081432
2017,1100023,107345,1805788,207660929
2018,1100023,106168,1757589,208494900
2019,1100023,107863,1777225,210147125
2014,1100031,6424,1748531,202768562
2015,1100031,6355,1768204,204450049
2016,1100031,6289,1787279,206081432
2017,1100031,6224,1805788,207660929
2018,1100031,5438,1757589,208494900
2019,1100031,5312,1777225,210147125')

> dados[, .(crescimento = (qtd_pop_mun[.N]-qtd_pop_mun[1])/qtd_pop_mun[1]*100), by = key_cd7_ibge_mun]
   key_cd7_ibge_mun crescimento
1:          1100015  -10.552783
2:          1100023    4.863893
3:          1100031  -17.310087

data.table is great for performing grouped operations on large tables. The syntax is similar to that of SQL databases; see the package's introductory vignette for more information: vignette('datatable-intro', 'data.table')

A data.frame can be converted to a data.table using setDT(seu.data.frame), but it is better to use fread to read your file directly.
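Since the question asks to keep the two-argument var_rel, here is a minimal sketch of how it could be called inside the grouped expression. It assumes every municipality has exactly one row for 2014 and one for 2019, and it selects the values by year instead of by position, so it does not depend on the row order. The small table below is only an illustrative subset of the data in the question.

```r
library(data.table)

# Two-argument function from the question: x = final, y = initial population
var_rel <- function(x, y) {
  ((x - y) / y) * 100
}

# Illustrative subset of the question's data (only the 2014 and 2019 rows)
dados <- data.table(
  ano = rep(c(2014, 2019), times = 3),
  key_cd7_ibge_mun = rep(c(1100015, 1100023, 1100031), each = 2),
  qtd_pop_mun = c(25652, 22945, 102860, 107863, 6424, 5312)
)

# Within each municipality, pick the populations by year and pass both to var_rel
res <- dados[, .(crescimento = var_rel(qtd_pop_mun[ano == 2019],
                                       qtd_pop_mun[ano == 2014])),
             by = key_cd7_ibge_mun]
print(res)
```

Because the grouping is done by key_cd7_ibge_mun, x and y always come from the same municipality, and any other custom function with the same signature could be swapped in for var_rel.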

1

This is a base R solution.
It adopts Hadley Wickham's split/apply/combine strategy.

The function var_rel is rewritten to have a single argument.

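# x holds the two populations of one municipality: x[1] = 2014, x[2] = 2019
# (this assumes the rows are ordered by year within each municipality)
var_rel <- function(x){
  100*(x[2] - x[1])/x[1]
}

i1 <- pop$ano == 2014
i2 <- pop$ano == 2019
# keep only the 2014 and 2019 rows (columns ano and qtd_pop_mun), split by municipality
sp <- split(pop[i1 | i2, c(1, 3)], pop$key_cd7_ibge_mun[i1 | i2])
sapply(sp, function(DF){
  var_rel(DF[['qtd_pop_mun']])
})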
#   1100015    1100023    1100031 
#-10.552783   4.863893 -17.310087 
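If you prefer to keep the original two-argument var_rel, one possible alternative in base R is to join the 2014 and 2019 rows on the municipality code with merge and then call the function on the paired columns. This is only a sketch: the names ini, fim and pares are made up for the example, and the small data frame is an illustrative subset of the data in the question.

```r
# Two-argument function from the question: x = final, y = initial population
var_rel <- function(x, y) {
  ((x - y) / y) * 100
}

# Illustrative subset of the question's data (only the 2014 and 2019 rows)
pop <- data.frame(
  ano = rep(c(2014, 2019), times = 3),
  key_cd7_ibge_mun = rep(c(1100015, 1100023, 1100031), each = 2),
  qtd_pop_mun = c(25652, 22945, 102860, 107863, 6424, 5312)
)

cols <- c("key_cd7_ibge_mun", "qtd_pop_mun")
ini <- pop[pop$ano == 2014, cols]  # initial populations
fim <- pop[pop$ano == 2019, cols]  # final populations

# merge() matches rows on the municipality code, so each row of the result
# pairs the 2019 and 2014 values of the same municipality
pares <- merge(fim, ini, by = "key_cd7_ibge_mun", suffixes = c("_2019", "_2014"))
pares$crescimento <- var_rel(pares$qtd_pop_mun_2019, pares$qtd_pop_mun_2014)
print(pares[, c("key_cd7_ibge_mun", "crescimento")])
```

With the default inner join, a municipality that is missing one of the two years is dropped from the result instead of being paired with the wrong municipality.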
