Huuum, I’ve been on your side. I’d say you have two options:
based on an analysis of the terms used, create a function that normalizes them, in the billiard style CASE WHEN
of SQL
, or if_else
of R
or dplyr::case_when
.
based on similarity of strings I would try to put those that have more than a certain number of characters and more than x% similarity to the same term. In R
has the package stringdist
that does this.
Option 1 seems to me more "correct" in terms of criteria for classification of the reasons. So I’ll talk a little more about how she would do.
How would I make option 1
As it would be an analysis count of terms, I would use the package resources tidytext
to count these words and get the 2-Rams and 3-Rams that are most common and try to put the most alike (and equal depending on your domain knowledge). With that I think you could kill a lot of the problem. The very granular cases, I would put in a category 'other' without weight in consciousness.
What the code would look like
I’m going to leave a snipet of how I would write this code, using the packets stringr
, dplyr
and purrr
.
library(dplyr)
library(purrr)
library(stringr)
## dados para teste
toy_df <- tibble(motivo = c('danos morais', 'dano moral', 'danos materiais', 'dano x'))
## dados para facilitar a criação de vetores
make_vector <- function(string) stringr::str_split(string, pattern = ', ') %>% purrr::as_vector()
## a transformação em si
toy_df %>%
mutate(classe = case_when(
motivo %in% make_vector('dano moral, danos morais, danos whatever') ~ 'dano_moral',
motivo %in% make_vector('danos materiais, dano material') ~ 'dano_material',
TRUE ~ 'outros'
))
I prefer the use of dplyr::case_when
because it is vectorized and ends up processing voluminous data a little faster. The syntax, although strange, is more functional than the if_else
(where you’d be repeating if_else to everything that’s side).
The function make_vector
this one just to facilitate the creation of string vectors with a cleaner syntax. In a perfect world these vectors that served to put the motives in the same classes would come out of an analysis of their, or something asism.
We need a more complete example of the data. If the database is called
dados
Please put the output ofdput(head(dados, 30))
. Or something representative of the data.– Rui Barradas
Create another column by reading your base column with the grep function ("moral damage|moral damage", column, ignore.case = TRUE)
– Daniel Ikenaga