Comparing names from two different lists - ideas on ways to compare

I’m trying to compare company names between two different lists. The case is as follows: both lists may contain different names for the same company, for example:
Name1: Escalar
Name2: Escalar Group
Name3: Escalar LTDA

I need the algorithm to tell me that these are all the same company. A few things I’ve tried:

  • Checking whether Name1 is contained in Name2. This works for the case above, but for some companies the second word of the name differs (for example, "Escalar Participações" and "Escalar Empreendimentos"), so containment does not catch them. Below is the comparison code I’m using:

    import re

    for j in range(len(bancodados["COMPANY NAME"])):
        print("j = " + str(j))
        # re.escape keeps characters such as "." or "(" in a name from being read as regex syntax
        p1 = re.compile(r"\b" + re.escape(str(bancodados["COMPANY NAME"][j])) + r"\b", re.IGNORECASE)
        for i in range(len(arquivo2["Nome do Cliente"])):
            if p1.search(str(arquivo2["Nome do Cliente"][i])):
                bancodados["Contém"][j] = arquivo2["Nome do Cliente"][i]
    
  • As the code above shows, I also used regex word boundaries to match only whole words rather than parts of a word; i.e., comparing the name "Escalar" matches "Escalar LTDA" but not "Escalares".

  • In other tests, also using regex, I made the algorithm ignore terms such as LTDA, SA, etc., and do a two-way comparison (check whether Name1 is contained in Name2, and vice versa), in an attempt to catch more matches.

    The objective of this project is to verify which companies in a previously defined list also appear in a second list of companies. The final list must not contain any company that exists in the second list. However, there are still some names present in both lists that the algorithm cannot detect, basically companies with compound names that differ, such as "Amireia Pajoara" and "Pajoara Industria e Comercio".
    So I was wondering if someone could give me some insight into what I could test to improve the algorithm's detection rate.
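
    One thing that could be tested for the compound-name cases is comparing the names by shared distinctive words instead of by containment. The sketch below is only an illustration under assumptions: the stop list `GENERIC_TOKENS` and the helpers `tokens` and `maybe_same_company` are hypothetical names, not part of the code above, and the stop list would need to be tuned to the real data.

```python
import re
import unicodedata

# Hypothetical stop list: legal suffixes and generic words that should not,
# on their own, make two names count as the same company.
GENERIC_TOKENS = {
    "ltda", "sa", "me", "group", "grupo", "e", "de", "do", "da",
    "industria", "comercio", "participacoes", "empreendimentos",
}


def tokens(name):
    """Lowercase the name, strip accents and punctuation, drop generic words."""
    # NFKD splits accented letters into base letter + combining mark;
    # encoding to ASCII with "ignore" then discards the marks.
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Remove periods first so that "S.A." becomes the single token "sa".
    words = re.findall(r"[a-z0-9]+", ascii_name.lower().replace(".", ""))
    return {w for w in words if w not in GENERIC_TOKENS}


def maybe_same_company(name_a, name_b):
    """Return True when the two names share at least one distinctive word."""
    return bool(tokens(name_a) & tokens(name_b))


print(maybe_same_company("Amireia Pajoara", "Pajoara Industria e Comercio"))
print(maybe_same_company("Escalar", "Escalares"))
```

    This check is deliberately loose: any two names that share a single non-generic word are flagged, so unrelated companies with a common word in their names would be flagged too. That trade-off only makes sense when excluding too much is preferable to missing a match.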

  • Look into NLP; it is the approach best suited to your case.

  • @Augustovasques I was looking for something a little more practical and fast, because I don’t have very deep knowledge of NLP. But I appreciate it anyway, and I’ll look into it.

  • I don’t understand: are "Amireia Pajoara" and "Pajoara Industria e Comercio" the same company or not? Where do these names come from, and how do you know that "Escalar", "Escalar Group" and "Escalar LTDA" are the same company? Somewhere you have this information, right? It may be easier to keep, for each company, a list of all its possible names, and compare each name against the whole list (and then you wouldn’t even need regex).

  • @hkotsubo, yes, those names would be from the same company. The problem is that the comparison list might come with a version of the company name that we haven’t mapped yet, and then that approach would not be as effective as using regex either. We cannot say for certain that "Amireia Pajoara" and "Pajoara Industria e Comercio" are the same company, but for my process it is preferable to exclude a name from my final list rather than run the risk of it being the same company.

  • When comparing the two sets of data, have you tried the in operator? I don’t know if I understood exactly, but I had something similar: with two lists, I would loop through one and use in to check whether each value exists in the other. Does that make sense?
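
The alias-list idea and the in check suggested in the comments could be combined roughly as follows. This is a minimal sketch, not a tested solution: the `KNOWN_ALIASES` table and the helpers `build_lookup` and `split_by_known` are hypothetical, and the table contents are taken only from the example names in the question.

```python
# Hypothetical alias table: canonical company -> every name variant mapped so far.
KNOWN_ALIASES = {
    "Escalar": {"Escalar", "Escalar Group", "Escalar LTDA"},
}


def build_lookup(aliases):
    """Invert the alias table into a casefolded name -> canonical company dict."""
    lookup = {}
    for canonical, names in aliases.items():
        for name in names:
            lookup[name.casefold().strip()] = canonical
    return lookup


def split_by_known(names, lookup):
    """Separate names already in the alias table from names not mapped yet."""
    matched, unmapped = {}, []
    for name in names:
        key = name.casefold().strip()
        if key in lookup:  # plain dict membership, no regex involved
            matched[name] = lookup[key]
        else:
            unmapped.append(name)
    return matched, unmapped


lookup = build_lookup(KNOWN_ALIASES)
matched, unmapped = split_by_known(["escalar group", "Escalar Holding"], lookup)
print(matched)
print(unmapped)
```

Names that end up in `unmapped` are exactly the not-yet-mapped variants mentioned in the comments, so they would still need a looser comparison or a manual review before the alias table can be trusted on its own.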

No answers
