Create filter to select data without duplicity, from rules applied to variables

Asked

Viewed 72 times

0

I have constantly uploaded a data frame with more than 6 million records and 40 variables (columns). From these records, I need to select only one record per patient and that this record is as significant as possible. This significance is based on the variables: "Paciente", "ResultadodoExame" and "ResultadoComplementar", following a priority criterion (1, 2, 3 and 4) for sorting. The lower the number (order - before data start) the higher the priority.

For example: the priority of variable data "Resultados" is in the following order:

1-Pos, 2-Neg, 3-Inter and 9-Empty

and the variable "ResultadoComplementar" is in order:

1-DN1, 2-DN2, 3-DN3, 4-DN4 and 9-Empty.

From this variable, I need to create a filter labeling with "0" or "1", being "1" for the significant record.

I need to perform other tasks, which depends on the entire dataset, so it is important to have the filter (variable).

The data frame below has the simulation of the data. need to create a routine that manages the Filter ("0" and "1") field, remembering that "1" is always the significant)

####Data frame
a=c("Matheus Fulano da Silva","Matheus Fulano da Silva","Fernandes Fulano da Silva","Fernandes Fulano da Silva","Fernandes Fulano da Silva","Fernandes Fulano da Silva","Manuel Fulano da Silva","Manuel Fulano da Silva","Manuel Fulano da Silva","Manuel Fulano da Silva","Manuel Fulano da Silva", "Carlos Fulano da Silva","Carlos Fulano da Silva","Carlos Fulano da Silva","Carlos Fulano da Silva","Joao Fulano da Silva","Joao Fulano da Silva","Joao Fulano da Silva","Joao Fulano da Silva")

b=c("9-Vazio","3-Inter","2-Neg","1-Pos","1-Pos","3-Inter","1-Pos","1-Pos","9-Vazio","2-Neg","3-Inter","1-Pos","2-Neg","9-Vazio","1-Pos","2-Neg","3-Inter","1-Pos","2-Neg")

c=c("9-Vazio","9-Vazio","9-Vazio","2-DN2","1-DN1","9-Vazio","1-DN1","1-DN1","9-Vazio","9-Vazio","9-Vazio","9-Vazio","9-Vazio","9-Vazio","1-DN1","2-DN2","3-DN3","9-Vazio","9-Vazio")

#meu data frame
d=data.frame(a,b,c)
names(d)<-c("Paciente","ResultadoExame","ResultadoComplementar")

# preciso criar uma função para gerar o filtro
d=c("0","1","0","0","1","0","1","0","0","0","0","0","0","0","1","0","0","1","0")
#Objetivo final
d=data.frame(a,b,c,d)
names(d)<-c("Paciente","ResultadoExame","ResultadoComplementar","filtro")

From now on, thank you very much.

Hug

  • 1

    Note that you actually want to create an indicator, to later apply a filter. As this future indicator depends on 3 variables, do they have the same weight? The sum of the 3 can be a good indicator of significance?

  • Hello Mouradev. Initially I thought about this indicator (sum), but it didn’t work because there are cases I can have in the same patient (two lines of results) that the total of the sum will be the same. For example, a 2-Neg in the first line of the patient, and in the second line of patient 3-In...

1 answer

0

Note that the name line 18 and the name line 12 have the same characteristics, but different results for the filter you provided as an example, this can create bias to your goal.

inserir a descrição da imagem aqui

Anyway, realize that with data manipulation it is possible to create an indicator for the result and filter patients.

Initially I modified the ResultadoExame and ResultadoComplementar for numbers. I created a kind of weight for empty results (vazior1 and vazior2) and then I summed it up.

I filtered the smallest sums, then removed duplicate results per patient. I obtained the same results, however the patient of line 15 of the data frame d is the same as line 12.

  • Manipulating the data from data frame d to generate the results.
# manipulacao para gerar indice/filtro
library(dplyr)
library(data.table)
df <- d %>% 
  dplyr::mutate(ResultadoExame = ifelse(ResultadoExame == "9-Vazio", 9, ResultadoExame),
                ResultadoComplementar = ifelse(ResultadoComplementar ==  "9-Vazio", 9, ResultadoComplementar),
                vazio = ifelse(ResultadoExame == 9 & ResultadoComplementar == 9, 10, 1),
                vazior1 = ifelse(ResultadoExame == 9, 0, 1),
                vazior2 = ifelse(ResultadoComplementar == 9, 0, 1),
                Resultado = ifelse(ResultadoExame == 9 & ResultadoComplementar == 9, 100, # 100 => valor qualquer considerado alto
                                   (ResultadoExame*vazior1 + ResultadoComplementar*vazior2)*vazio)) %>% 
  dplyr::select(Paciente, ResultadoExame, ResultadoComplementar, Resultado) %>% 
  dplyr::mutate(linha_d = 1:nrow(d)) %>% 
  dplyr::filter(Resultado < 10) %>% 
  dplyr::arrange(ResultadoExame, ResultadoComplementar, Resultado) %>% 
  as.data.table()
df <- df[, .SD[which.min(Resultado)], by = Paciente]
df

The data frame with the results of patients who should be treated primarily.

> df
                    Paciente ResultadoExame ResultadoComplementar Resultado linha_d
1: Fernandes Fulano da Silva              1                     1         2       5
2:    Manuel Fulano da Silva              1                     1         2       7
3:    Carlos Fulano da Silva              1                     9         1      12
4:      Joao Fulano da Silva              1                     9         1      18
5:   Matheus Fulano da Silva              3                     9         3       2
  • Returning the answer to the data frame d.
d <- d %>% 
  dplyr::mutate(linha_d = 1:nrow(d)) %>% 
  dplyr::left_join(df, by = "linha_d") %>% # inserindo dados a partir do df
  dplyr::select(Paciente.x, ResultadoExame.x, ResultadoComplementar.x, Resultado) %>% 
  dplyr::mutate(Resultado = ifelse(is.na(Resultado), 0 , 1)) %>% 
  dplyr::rename("Paciente" = Paciente.x, 
                "ResultadoExame" = ResultadoExame.x, 
                "ResultadoComplementar" = ResultadoComplementar.x)
dplyr::sample_n(tbl = d, 5)
> dplyr::sample_n(tbl = d, 5)
                   Paciente ResultadoExame ResultadoComplementar Resultado
1 Fernandes Fulano da Silva          1-Pos                 1-DN1         1
2      Joao Fulano da Silva        3-Inter                 3-DN3         0
3   Matheus Fulano da Silva        3-Inter               9-Vazio         1
4      Joao Fulano da Silva          1-Pos               9-Vazio         1
5      Joao Fulano da Silva          2-Neg                 2-DN2         0

  • Data used.
# seus dados
a = c("Matheus Fulano da Silva","Matheus Fulano da Silva","Fernandes Fulano da Silva",
    "Fernandes Fulano da Silva","Fernandes Fulano da Silva","Fernandes Fulano da Silva",
    "Manuel Fulano da Silva","Manuel Fulano da Silva","Manuel Fulano da Silva",
    "Manuel Fulano da Silva","Manuel Fulano da Silva", "Carlos Fulano da Silva",
    "Carlos Fulano da Silva","Carlos Fulano da Silva","Carlos Fulano da Silva",
    "Joao Fulano da Silva","Joao Fulano da Silva","Joao Fulano da Silva","Joao Fulano da Silva")
b = c("9-Vazio","3-Inter","2-Neg","1-Pos","1-Pos","3-Inter","1-Pos","1-Pos","9-Vazio",
    "2-Neg","3-Inter","1-Pos","2-Neg","9-Vazio","1-Pos","2-Neg","3-Inter","1-Pos","2-Neg")
c = c("9-Vazio","9-Vazio","9-Vazio","2-DN2","1-DN1","9-Vazio","1-DN1","1-DN1","9-Vazio",
    "9-Vazio","9-Vazio","9-Vazio","9-Vazio","9-Vazio","1-DN1","2-DN2","3-DN3","9-Vazio","9-Vazio")
d = data.frame(Paciente = a, ResultadoExame = b, ResultadoComplementar = c)
  • Thanks for the reply, @bbiasi.

  • I currently use a procedure to create this filter, but I have to create 2 data frame. I use the code below. #my data frame d=data.frame(a,b,c) #creating Indice (sequence) d=d[order(d$a,d$b,d$c),] d$e=(1:nrow(d)) Names(d)<-c("Patient","Resultsproasurement","Resultadd-on","Sequence") View(d) #selecting patients without duplicity library(dplyr) e=d %>% distinct(Patient, . keep_all = TRUE) e$filter='1' View(e) #including filter in data frame d <-d %>% left_join(e) #treating the values "NA" d$filter=ifelse(is.na(d$filter), "0", d$filter) View(d)

  • I edited the answer.

  • **Thank you, @bbiasi **

Browser other questions tagged

You are not signed in. Login or sign up in order to post.