Recoding factors from the relative frequency of levels and the mean value of a continuous variable

Asked

Viewed 32 times

2

I have a dataframe with several variables of the type factors that I need to recode. An example of my data is the following:

structure(list(ID = 1:100, faixa_etaria = structure(c(2L, 7L, 
3L, 4L, 5L, 6L, 2L, 4L, 8L, 3L, 4L, 5L, 6L, 1L, 2L, 8L, 3L, 4L, 
5L, 6L, 4L, 2L, 8L, 4L, 3L, 5L, 4L, 2L, 5L, 3L, 4L, 5L, 3L, 5L, 
2L, 3L, 3L, 4L, 7L, 6L, 2L, 7L, 3L, 4L, 2L, 6L, 1L, 2L, 8L, 3L, 
6L, 5L, 6L, 3L, 2L, 8L, 3L, 4L, 5L, 6L, 2L, 2L, 3L, 4L, 3L, 2L, 
3L, 2L, 5L, 3L, 5L, 5L, 6L, 5L, 2L, 6L, 3L, 4L, 5L, 3L, 2L, 2L, 
3L, 4L, 5L, 6L, 6L, 4L, 8L, 3L, 4L, 5L, 6L, 8L, 2L, 8L, 3L, 6L, 
4L, 5L), .Label = c("menos de 18", "18 a 24", "25 a 31", "32 a 38", 
"39 a 45", "46 a 52", "53 a 59", "60 ou mais"), class = "factor"), 
    escolaridade = structure(c(5L, 2L, 6L, 6L, 4L, 3L, 4L, 2L, 
    1L, 6L, 4L, 3L, 4L, 3L, 1L, 6L, 4L, 3L, 4L, 3L, 5L, 2L, 1L, 
    6L, 4L, 3L, 5L, 2L, 6L, 6L, 4L, 6L, 4L, 5L, 4L, 6L, 4L, 6L, 
    4L, 4L, 5L, 2L, 6L, 6L, 4L, 4L, 3L, 4L, 1L, 6L, 4L, 4L, 4L, 
    5L, 6L, 6L, 4L, 4L, 5L, 2L, 5L, 2L, 4L, 6L, 4L, 3L, 6L, 2L, 
    4L, 6L, 4L, 3L, 4L, 6L, 3L, 6L, 4L, 6L, 4L, 3L, 6L, 4L, 4L, 
    6L, 4L, 6L, 6L, 4L, 6L, 4L, 4L, 1L, 6L, 4L, 5L, 3L, 6L, 4L, 
    4L, 4L), .Label = c("fundamental_incompleto", "fundamental_completo", 
    "medio_incompleto", "medio_completo", "sup_incompleto", "superior_completo"
    ), class = "factor"), renda = c(850L, 1750L, 4300L, 5600L, 
    3015L, 1540L, 1590L, 3100L, 1340L, 6780L, 4743L, 1050L, 4300L, 
    640L, 1050L, 13052L, 2015L, 1050L, 2030L, 1434L, 2180L, 1050L, 
    1980L, 4300L, 2240L, 1430L, 2912L, 1050L, 5413L, 2849L, 3753L, 
    6210L, 3602L, 3892L, 1050L, 4236L, 2813L, 3450L, 4109L, 3895L, 
    2112L, 2712L, 2445L, 4453L, 1350L, 3789L, 1050L, 1210L, 2500L, 
    2180L, 2312L, 1890L, 4299L, 3015L, 1050L, 5155L, 2015L, 3944L, 
    3127L, 1248L, 2180L, 1050L, 2389L, 7221L, 3218L, 1050L, 2912L, 
    1050L, 2424L, 2663L, 4103L, 1345L, 4117L, 3339L, 1050L, 9345L, 
    2356L, 3568L, 4123L, 3567L, 1956L, 1050L, 1050L, 3983L, 3431L, 
    5131L, 2912L, 1050L, 6721L, 2183L, 1731L, 1050L, 4236L, 4789L, 
    2891L, 5013L, 3562L, 3012L, 2873L, 3983L)), class = "data.frame", row.names = c(NA, 
-100L))

In the example above, I left only two variables factor, to be the reproducible minimum. But there is more. These variables are "age group" and "schooling". In addition to them there is also "id" and "income".

The recoding I want to do is this:

  1. When a group/level (level) of a variable factor account for less than 10% of cases, i want to aggregate this group to another, of the same variable, that has the level (level) nearest, immediately before or after.
  2. When the level is initial or final, there is only one aggregation option (the second, or penultimate, respectively).
  3. When it is an intermediate level, I would like to aggregate it with the one who has the most similar average income to the level that will be aggregated.

Sorry, but I don’t have much idea how to start. From what I read, I imagine with the package dplyr and with the function group_by and mutate, but I don’t know how to calculate relative frequencies and refer to the nearest groups and use this information in recoding.

  • 2

    Note that the question may be misspelled: faixa_etaria only has a level with 2 occurrences (exactly 10% of the cases), all the others have 3. How to make the grouping?

  • Dear Rui, thank you very much for your attempt and comment. I think the problem was that it had included only 20 cases. Now includes a larger sample of data (100 lines).

No answers

Browser other questions tagged

You are not signed in. Login or sign up in order to post.