2
I have a dataframe with several variables of the type factors
that I need to recode. An example of my data is the following:
structure(list(ID = 1:100, faixa_etaria = structure(c(2L, 7L,
3L, 4L, 5L, 6L, 2L, 4L, 8L, 3L, 4L, 5L, 6L, 1L, 2L, 8L, 3L, 4L,
5L, 6L, 4L, 2L, 8L, 4L, 3L, 5L, 4L, 2L, 5L, 3L, 4L, 5L, 3L, 5L,
2L, 3L, 3L, 4L, 7L, 6L, 2L, 7L, 3L, 4L, 2L, 6L, 1L, 2L, 8L, 3L,
6L, 5L, 6L, 3L, 2L, 8L, 3L, 4L, 5L, 6L, 2L, 2L, 3L, 4L, 3L, 2L,
3L, 2L, 5L, 3L, 5L, 5L, 6L, 5L, 2L, 6L, 3L, 4L, 5L, 3L, 2L, 2L,
3L, 4L, 5L, 6L, 6L, 4L, 8L, 3L, 4L, 5L, 6L, 8L, 2L, 8L, 3L, 6L,
4L, 5L), .Label = c("menos de 18", "18 a 24", "25 a 31", "32 a 38",
"39 a 45", "46 a 52", "53 a 59", "60 ou mais"), class = "factor"),
escolaridade = structure(c(5L, 2L, 6L, 6L, 4L, 3L, 4L, 2L,
1L, 6L, 4L, 3L, 4L, 3L, 1L, 6L, 4L, 3L, 4L, 3L, 5L, 2L, 1L,
6L, 4L, 3L, 5L, 2L, 6L, 6L, 4L, 6L, 4L, 5L, 4L, 6L, 4L, 6L,
4L, 4L, 5L, 2L, 6L, 6L, 4L, 4L, 3L, 4L, 1L, 6L, 4L, 4L, 4L,
5L, 6L, 6L, 4L, 4L, 5L, 2L, 5L, 2L, 4L, 6L, 4L, 3L, 6L, 2L,
4L, 6L, 4L, 3L, 4L, 6L, 3L, 6L, 4L, 6L, 4L, 3L, 6L, 4L, 4L,
6L, 4L, 6L, 6L, 4L, 6L, 4L, 4L, 1L, 6L, 4L, 5L, 3L, 6L, 4L,
4L, 4L), .Label = c("fundamental_incompleto", "fundamental_completo",
"medio_incompleto", "medio_completo", "sup_incompleto", "superior_completo"
), class = "factor"), renda = c(850L, 1750L, 4300L, 5600L,
3015L, 1540L, 1590L, 3100L, 1340L, 6780L, 4743L, 1050L, 4300L,
640L, 1050L, 13052L, 2015L, 1050L, 2030L, 1434L, 2180L, 1050L,
1980L, 4300L, 2240L, 1430L, 2912L, 1050L, 5413L, 2849L, 3753L,
6210L, 3602L, 3892L, 1050L, 4236L, 2813L, 3450L, 4109L, 3895L,
2112L, 2712L, 2445L, 4453L, 1350L, 3789L, 1050L, 1210L, 2500L,
2180L, 2312L, 1890L, 4299L, 3015L, 1050L, 5155L, 2015L, 3944L,
3127L, 1248L, 2180L, 1050L, 2389L, 7221L, 3218L, 1050L, 2912L,
1050L, 2424L, 2663L, 4103L, 1345L, 4117L, 3339L, 1050L, 9345L,
2356L, 3568L, 4123L, 3567L, 1956L, 1050L, 1050L, 3983L, 3431L,
5131L, 2912L, 1050L, 6721L, 2183L, 1731L, 1050L, 4236L, 4789L,
2891L, 5013L, 3562L, 3012L, 2873L, 3983L)), class = "data.frame", row.names = c(NA,
-100L))
In the example above, I left only two variables factor
, to be the reproducible minimum. But there is more. These variables are "age group" and "schooling". In addition to them there is also "id" and "income".
The recoding I want to do is this:
- When a group/level (
level
) of a variablefactor
account for less than 10% of cases, i want to aggregate this group to another, of the same variable, that has the level (level
) nearest, immediately before or after. - When the level is initial or final, there is only one aggregation option (the second, or penultimate, respectively).
- When it is an intermediate level, I would like to aggregate it with the one who has the most similar average income to the level that will be aggregated.
Sorry, but I don’t have much idea how to start. From what I read, I imagine with the package dplyr
and with the function group_by
and mutate
, but I don’t know how to calculate relative frequencies and refer to the nearest groups and use this information in recoding.
Note that the question may be misspelled:
faixa_etaria
only has a level with 2 occurrences (exactly 10% of the cases), all the others have 3. How to make the grouping?– Rui Barradas
Dear Rui, thank you very much for your attempt and comment. I think the problem was that it had included only 20 cases. Now includes a larger sample of data (100 lines).
– Nayla