Create a new dataframe column from partial string matches, without repeats

Question

Asked 5 years, 9 months ago

I have a dataframe with two columns, GL and GLDESC, and I want to add a third column called KIND based on the data in GLDESC.

Dataframe:

      GL                             GLDESC
1 515100                        Payroll-ISL
2 515900                        Payroll-ICA
3 532300                           Bulk Gas
4 551000                          Supply AB
5 551000                        Supply XPTO
6 551100                          Supply AB
7 551300                   Material Interno

Where:

If GLDESC contains the word Payroll anywhere in the string, KIND should be Payroll.
If GLDESC contains the word Supply anywhere in the string, KIND should be Supply.
In all other cases, KIND is Other.

This is solved without problems with:


DF$KIND <- ifelse(grepl("supply", DF$GLDESC, ignore.case = TRUE), "Supply",
           ifelse(grepl("payroll", DF$GLDESC, ignore.case = TRUE), "Payroll", "Other"))

But with that, every row that mentions Supply, for example, gets classified. However, as in rows 4 and 5 of DF, the same GL has two Supply entries, which is unnecessary for me. What I actually need is for only one GLDESC to be classified when the string repeats within the same GL.

How can I do this?

Edited: Deleting duplicates is not an output I can accept. I need to keep everything where it is, and just classify the first occurrence and skip the second.

1 answer

by Rui Barradas • **15,422** points · Answer 1 · 2019-09-25T17:13:03+00:00

You can use grepl to get logical indices and then compute each row's position in a vector of the intended results.

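# Logical vectors: TRUE where GLDESC contains the word
i <- grepl("Payroll", dados$GLDESC)
j <- grepl("Supply", dados$GLDESC)
# 1 + i + 2*j is 1 (neither), 2 (Payroll only) or 3 (Supply only)
dados$KIND <- c("Other", "Payroll", "Supply")[1 + i + 2*j]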

dados
#      GL           GLDESC    KIND
#1 515100      Payroll-ISL Payroll
#2 515900      Payroll-ICA Payroll
#3 532300         Bulk Gas   Other
#4 551000        Supply AB  Supply
#5 551000      Supply XPTO  Supply
#6 551100        Supply AB  Supply
#7 551300 Material Interno   Other
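
To make the index arithmetic easier to follow, here is a small sketch on a standalone vector. The vector `desc` and its last element, "Payroll Supply", are hypothetical and not part of the question's data; they are only there to show what happens when both words match.

```r
# Hypothetical descriptions; the last one matches both words.
desc <- c("Payroll-ISL", "Bulk Gas", "Supply AB", "Payroll Supply")

i <- grepl("Payroll", desc)  # TRUE where "Payroll" occurs
j <- grepl("Supply", desc)   # TRUE where "Supply" occurs

# Logicals are coerced to 0/1, so the index is
# 1 (neither), 2 (Payroll only), 3 (Supply only) or 4 (both).
idx <- 1 + i + 2 * j

kinds <- c("Other", "Payroll", "Supply")
kind <- kinds[idx]  # index 4 is past the end of kinds, so that element is NA

# One possible guard, assuming Supply should win when both words match
# (as in the nested ifelse from the question): cap the index at 3.
kind_capped <- kinds[pmin(idx, 3)]
```

With the data from the question no description contains both words, so the index never goes past 3 there.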

Data.

dados <- read.table(text = "
      GL                             GLDESC
1 515100                        Payroll-ISL
2 515900                        Payroll-ICA
3 532300                           'Bulk Gas'
4 551000                          'Supply AB'
5 551000                        'Supply XPTO'
6 551100                          'Supply AB'
7 551300                   'Material Interno'
", header = TRUE)
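
The code above classifies every row, including both Supply rows of GL 551000. For the "classify the first and skip the second" requirement in the question, one possible extension is to flag rows whose GL and KIND pair has already appeared, using duplicated. This is only a sketch under two assumptions that the question does not confirm: that a skipped row should get NA in KIND, and that rows classified as Other are left alone.

```r
# Rebuild the example data so the sketch runs on its own.
dados <- data.frame(
  GL = c(515100, 515900, 532300, 551000, 551000, 551100, 551300),
  GLDESC = c("Payroll-ISL", "Payroll-ICA", "Bulk Gas", "Supply AB",
             "Supply XPTO", "Supply AB", "Material Interno"),
  stringsAsFactors = FALSE
)

# Same classification as in the answer.
i <- grepl("Payroll", dados$GLDESC)
j <- grepl("Supply", dados$GLDESC)
dados$KIND <- c("Other", "Payroll", "Supply")[1 + i + 2 * j]

# TRUE for a row whose (GL, KIND) pair already appeared in an earlier row.
repeated <- duplicated(dados[c("GL", "KIND")])

# Assumption: a repeated Payroll/Supply row is "skipped" by setting KIND to NA.
dados$KIND[repeated & dados$KIND != "Other"] <- NA
```

No rows are deleted or reordered. Only row 5 (the second Supply entry of GL 551000) loses its classification, while row 6 keeps Supply because it belongs to a different GL.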