NLP text classification using Python




I think my problem is a text classification problem: I receive a string as input and need to match it to the correct option from a list of candidates. I need an accuracy above 98.3%. Which algorithms should I study to solve this? I looked at Bag of Words and word embeddings, but I'm not sure they solve the problem.

Example 1:
    onix 1.4 mpfi ltz 8v
    Possible matches:
    onix hatch ltz 1.4 8v flexpower 5p mec. (correct match)
    onix hatch lt 1.4 8v flexpower 5p mec.
    onix hatch effect 1.4 8v f.power 5p mec.
    onix hatch activ 1.4 8v flex 5p mec.

Example 2:
    gol 1.0 i 8v
    Possible matches:
    gol city (trend)/titan 1.0 t. flex 8v 4p
    gol (novo) 1.0 mi total flex 8v 4p (correct match)

Example 3:
    aircross 1.6 shine 16v
    Possible matches:
    aircross shine 1.6 flex 16v 5p aut. (correct match)
    aircross live 1.6 flex 16v 5p aut.
    aircross feel 1.6 flex 16v 5p aut.

I’m trying to match model names of cars from one site with model names of cars from another site.

  • As I read it, this is just a lexical search. I understood the problem like this: there is a text file describing vehicle models by their attributes, separated by spaces, one model per line. The user types a search text and the search engine should return the line or lines with the smallest edit distance to that text. If that is the problem, and you provide a sample of the data, I can put together an example.

  • The Naive Bayes algorithm should suit you

1 answer


You can use a library for fuzzy string matching, which finds similar strings even when they contain typos or small variations. FuzzyWuzzy uses the Levenshtein distance to measure the difference between sequences.
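To make the idea of Levenshtein distance concrete, here is a minimal pure-Python sketch of the classic dynamic-programming formulation. It is for illustration only and is not how the library implements it; the function name levenshtein is chosen here.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


print(levenshtein("ltz", "lt"))             # 1
print(levenshtein("flexpower", "f.power"))  # 3
```

Small distances between tokens such as "flexpower" and "f.power" are what allow abbreviated model names to still be matched.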

Installing the necessary packages

!pip install python-Levenshtein
!pip install fuzzywuzzy

Importing the libraries

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

Example 1

opcoes = ["onix hatch ltz 1.4 8v flexpower 5p mec.", 
          "onix hatch lt 1.4 8v flexpower 5p mec.", 
          "onix hatch effect 1.4 8v f.power 5p mec.", 
          "onix hatch activ 1.4 8v flex 5p mec."]

process.extractOne("onix 1.4 mpfi ltz 8v", opcoes, scorer=fuzz.token_sort_ratio)


('onix hatch ltz 1.4 8v flexpower 5p mec.', 59)

Example 2

opcoes = ["gol city (trend)/titan 1.0 t. flex 8v 4p", 
          "gol (novo) 1.0 mi total flex 8v 4p"]

process.extractOne("gol 1.0 i 8v", opcoes, scorer=fuzz.token_sort_ratio)


('gol (novo) 1.0 mi total flex 8v 4p', 55)

Example 3

opcoes = ["aircross shine 1.6 flex 16v 5p aut.", 
          "aircross live 1.6 flex 16v 5p aut.",
          "aircross feel 1.6 flex 16v 5p aut."]

process.extractOne("aircross 1.6 shine 16v", opcoes, scorer=fuzz.token_sort_ratio)


('aircross shine 1.6 flex 16v 5p aut.', 79)

Note that in Example 2 the query gol 1.0 i 8v is very generic: its tokens appear in both candidate strings, so either one could plausibly be returned.
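One way to make this kind of ambiguity visible is to look at every candidate's score rather than only the top one, and to refuse to answer when the top two are too close. The sketch below uses only the standard library. token_sort_score is a rough stand-in for a token-sort ratio built on difflib.SequenceMatcher, so its numbers will not match FuzzyWuzzy's; rank, best_or_none and min_margin are hypothetical names chosen for illustration.

```python
from difflib import SequenceMatcher


def token_sort_score(a, b):
    """Approximate token-sort ratio (0-100): sort the tokens of each
    string alphabetically, then compare the re-joined strings."""
    norm_a = " ".join(sorted(a.lower().split()))
    norm_b = " ".join(sorted(b.lower().split()))
    return round(100 * SequenceMatcher(None, norm_a, norm_b).ratio())


def rank(query, options):
    """Return (score, option) pairs, best score first."""
    return sorted(((token_sort_score(query, o), o) for o in options), reverse=True)


def best_or_none(query, options, min_margin=5):
    """Return the best option, or None when there are no options or the
    top two scores are closer than min_margin (assumed threshold)."""
    ranked = rank(query, options)
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][0] - ranked[1][0] < min_margin:
        return None
    return ranked[0][1]


opcoes = ["gol city (trend)/titan 1.0 t. flex 8v 4p",
          "gol (novo) 1.0 mi total flex 8v 4p"]
for score, option in rank("gol 1.0 i 8v", opcoes):
    print(score, option)
```

Queries for which best_or_none returns None can be sent to manual review, which may matter when a high accuracy target has to be met.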

See the library's documentation to learn more.

Another approach

Using CountVectorizer and cosine similarity

Example 1

Importing the libs

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

Defining the options. Here the first element of the list is the query, and it is compared with the remaining elements. I set token_pattern so that tokenization also keeps numbers such as 1 and 1.4, which the default pattern would drop or split.

opcoes = [
          "gol 1.0 i 8v",
          "gol city (trend)/titan 1.0 t. flex 8v 4p", 
          "gol (novo) 1.0 mi total flex 8v 4p"
]

count_vectorizer_gol = CountVectorizer(token_pattern=r'(?u)\b[a-zA-Z0-9_.]+')
count_matrix_gol = count_vectorizer_gol.fit_transform(opcoes)

Checking the first element against the other elements

cosine_similarity(count_matrix_gol[0:1], count_matrix_gol[1:]) 


array([[0.5       , 0.53033009]])

The higher this number, the better. In this case "gol (novo) 1.0 mi total flex 8v 4p" had the highest score.
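To show where these numbers come from, the same calculation can be sketched without scikit-learn: count the tokens of each string and divide the dot product of the two count vectors by the product of their lengths. This is a minimal illustration that reuses the token pattern from above; count_vector and cosine are names chosen here, not library functions.

```python
import math
import re
from collections import Counter

# Same character class as the token_pattern used above (keeps "1.0", "8v", "t.")
TOKEN = re.compile(r"\b[a-zA-Z0-9_.]+")


def count_vector(text):
    """Map each token to the number of times it occurs in the text."""
    return Counter(TOKEN.findall(text.lower()))


def cosine(u, v):
    """Cosine similarity between two token-count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


query = count_vector("gol 1.0 i 8v")
for option in ["gol city (trend)/titan 1.0 t. flex 8v 4p",
               "gol (novo) 1.0 mi total flex 8v 4p"]:
    print(round(cosine(query, count_vector(option)), 4), option)
# 0.5 gol city (trend)/titan 1.0 t. flex 8v 4p
# 0.5303 gol (novo) 1.0 mi total flex 8v 4p
```

Both candidates share the same three tokens with the query (gol, 1.0, 8v); the second one scores higher only because it has fewer tokens overall, which is why the margin is so small.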

Example 2

opcoes = [
          "aircross 1.6 shine 16v",
          "aircross shine 1.6 flex 16v 5p aut.", 
          "aircross live 1.6 flex 16v 5p aut.",
          "aircross feel 1.6 flex 16v 5p aut."
]

count_vectorizer_aircross = CountVectorizer(token_pattern=r'(?u)\b[a-zA-Z0-9_.]+')
count_matrix_aircross = count_vectorizer_aircross.fit_transform(opcoes)

cosine_similarity(count_matrix_aircross[0:1], count_matrix_aircross[1:])


array([[0.75592895, 0.56694671, 0.56694671]])

"aircross shine 1.6 flex 16v 5p aut." got the highest score.

Example 3

opcoes = [
          "onix 1.4 mpfi ltz 8v",
          "onix hatch ltz 1.4 8v flexpower 5p mec.", 
          "onix hatch lt 1.4 8v flexpower 5p mec.", 
          "onix hatch effect 1.4 8v f.power 5p mec.", 
          "onix hatch activ 1.4 8v flex 5p mec.",
          "onix hatch ltz 1.2 8v flex 5p mec."
]

count_vectorizer_onix = CountVectorizer(token_pattern=r'(?u)\b[a-zA-Z0-9_.]+')
count_matrix_onix = count_vectorizer_onix.fit_transform(opcoes)

cosine_similarity(count_matrix_onix[0:1], count_matrix_onix[1:])


array([[0.63245553, 0.47434165, 0.47434165, 0.47434165, 0.47434165]])

"onix hatch ltz 1.4 8v flexpower 5p mec." got the highest score.
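Since the question sets an accuracy target, whichever scorer is chosen should be measured against a set of labeled pairs. Below is a minimal sketch of such a check using only the standard library. The labeled data is just the three examples from the question, which is far too small to estimate real accuracy; cosine_score and accuracy are hypothetical helper names, and the scorer is a plain token-count cosine rather than the scikit-learn pipeline above.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"\b[a-zA-Z0-9_.]+")


def cosine_score(a, b):
    """Cosine similarity between the token counts of two strings."""
    u = Counter(TOKEN.findall(a.lower()))
    v = Counter(TOKEN.findall(b.lower()))
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = math.sqrt(sum(c * c for c in u.values()) * sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def accuracy(labeled, score=cosine_score):
    """Fraction of queries whose highest-scoring option is the expected one."""
    hits = 0
    for query, options, expected in labeled:
        predicted = max(options, key=lambda option: score(query, option))
        hits += predicted == expected
    return hits / len(labeled)


# (query, candidate options, expected match), taken from the question
labeled = [
    ("onix 1.4 mpfi ltz 8v",
     ["onix hatch ltz 1.4 8v flexpower 5p mec.",
      "onix hatch lt 1.4 8v flexpower 5p mec.",
      "onix hatch effect 1.4 8v f.power 5p mec.",
      "onix hatch activ 1.4 8v flex 5p mec."],
     "onix hatch ltz 1.4 8v flexpower 5p mec."),
    ("gol 1.0 i 8v",
     ["gol city (trend)/titan 1.0 t. flex 8v 4p",
      "gol (novo) 1.0 mi total flex 8v 4p"],
     "gol (novo) 1.0 mi total flex 8v 4p"),
    ("aircross 1.6 shine 16v",
     ["aircross shine 1.6 flex 16v 5p aut.",
      "aircross live 1.6 flex 16v 5p aut.",
      "aircross feel 1.6 flex 16v 5p aut."],
     "aircross shine 1.6 flex 16v 5p aut."),
]

print(accuracy(labeled))  # 1.0
```

Passing a different function as score makes it possible to compare scorers on the same labeled set before committing to one.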
