Compare rows from a column

Asked

Viewed 148 times

2

I have a spreadsheet with about 15k lines, my problem is that in one of the most important columns for my analysis, the data are arranged in a very confusing way.
The column is about the complaint that motivated a process, for example: moral damage.

It turns out that in all lines the same type of complaint is written in different ways (e.g., moral harm, moral harm), someone can give me a light on how to normalize that column?

inserir a descrição da imagem aqui

In this third part of the image are the motivations of the process, and as you can see there are thousands of variables.

  • We need a more complete example of the data. If the database is called dados Please put the output of dput(head(dados, 30)). Or something representative of the data.

  • Create another column by reading your base column with the grep function ("moral damage|moral damage", column, ignore.case = TRUE)

1 answer

4


Huuum, I’ve been on your side. I’d say you have two options:

  • based on an analysis of the terms used, create a function that normalizes them, in the billiard style CASE WHEN of SQL, or if_else of R or dplyr::case_when.

  • based on similarity of strings I would try to put those that have more than a certain number of characters and more than x% similarity to the same term. In R has the package stringdist that does this.

Option 1 seems to me more "correct" in terms of criteria for classification of the reasons. So I’ll talk a little more about how she would do.

How would I make option 1

As it would be an analysis count of terms, I would use the package resources tidytext to count these words and get the 2-Rams and 3-Rams that are most common and try to put the most alike (and equal depending on your domain knowledge). With that I think you could kill a lot of the problem. The very granular cases, I would put in a category 'other' without weight in consciousness.

What the code would look like

I’m going to leave a snipet of how I would write this code, using the packets stringr, dplyr and purrr.

library(dplyr)
library(purrr)
library(stringr)

## dados para teste
toy_df <- tibble(motivo = c('danos morais', 'dano moral', 'danos materiais', 'dano x'))

## dados para facilitar a criação de vetores
make_vector <- function(string) stringr::str_split(string, pattern = ', ') %>% purrr::as_vector()


## a transformação em si

toy_df %>% 
mutate(classe = case_when(
  motivo %in% make_vector('dano moral, danos morais, danos whatever') ~ 'dano_moral',
  motivo %in% make_vector('danos materiais, dano material') ~ 'dano_material',       
  TRUE ~ 'outros'     
))

I prefer the use of dplyr::case_when because it is vectorized and ends up processing voluminous data a little faster. The syntax, although strange, is more functional than the if_else (where you’d be repeating if_else to everything that’s side).

The function make_vector this one just to facilitate the creation of string vectors with a cleaner syntax. In a perfect world these vectors that served to put the motives in the same classes would come out of an analysis of their, or something asism.

Browser other questions tagged

You are not signed in. Login or sign up in order to post.