You can work with a library for fuzzy string matching.
Fuzzy string matching finds similarities between strings even when there are typing errors. FuzzyWuzzy uses the Levenshtein distance to calculate the differences between sequences.
Installing the necessary packages
!pip install python-Levenshtein
!pip install fuzzywuzzy
Importing the libraries
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
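As a quick illustration of the Levenshtein-based scoring (a minimal sketch; the strings are invented for this example), fuzz.ratio returns a similarity score from 0 to 100 even when one string contains a typo:

fuzz.ratio("onix hatch ltz", "onix hacth ltz")  # transposed letters in "hatch"; still scores about 93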
Example 1
opcoes = ["onix hatch ltz 1.4 8v flexpower 5p mec.",
"onix hatch lt 1.4 8v flexpower 5p mec.",
"onix hatch effect 1.4 8v f.power 5p mec.",
"onix hatch activ 1.4 8v flex 5p mec."]
process.extractOne("onix 1.4 mpfi ltz 8v", opcoes, scorer = fuzz.token_sort_ratio)
Output
('onix hatch ltz 1.4 8v flexpower 5p mec.', 59)
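Note the scorer: fuzz.token_sort_ratio splits each string into tokens, sorts them alphabetically, and only then compares, so differences in word order do not lower the score. A minimal sketch of the difference, using the same pair as above:

a = "onix 1.4 mpfi ltz 8v"
b = "onix hatch ltz 1.4 8v flexpower 5p mec."
fuzz.ratio(a, b)             # order-sensitive comparison, typically a lower score here
fuzz.token_sort_ratio(a, b)  # tokens sorted before comparing, gives the 59 seen above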
Example 2
opcoes = ["gol city (trend)/titan 1.0 t. flex 8v 4p",
"gol (novo) 1.0 mi total flex 8v 4p"]
process.extractOne("gol 1.0 i 8v", opcoes, scorer = fuzz.token_sort_ratio)
Output
('gol (novo) 1.0 mi total flex 8v 4p', 55)
Example 3
opcoes = ["aircross shine 1.6 flex 16v 5p aut.",
"aircross live 1.6 flex 16v 5p aut.",
"aircross feel 1.6 flex 16v 5p aut."]
process.extractOne("aircross 1.6 shine 16v", opcoes ,scorer = fuzz.token_sort_ratio)
Output
('aircross shine 1.6 flex 16v 5p aut.', 79)
It is worth noting that in Example 2 the query "gol 1.0 i 8v" could plausibly match either string, since it is a very generic query.
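To see how close the candidates actually are, process.extract (instead of extractOne) returns every option together with its score, highest first. A minimal sketch:

from fuzzywuzzy import fuzz, process

opcoes = ["gol city (trend)/titan 1.0 t. flex 8v 4p",
          "gol (novo) 1.0 mi total flex 8v 4p"]
process.extract("gol 1.0 i 8v", opcoes, scorer=fuzz.token_sort_ratio)
# returns a list of (option, score) tuples; here both options come back close to the 55 above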
See the library's documentation to learn more.
Another approach
Using CountVectorizer and cosine similarity
Example 1
Importing the libraries
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
Defining the options; in this case we compare the first element of the list against the rest. I set the token_pattern so that the tokenization also keeps numeric tokens such as 1.0, 1.4, and so on (you can verify this by inspecting the vocabulary, as shown after the output below).
opcoes = [
    "gol 1.0 i 8v",
    "gol city (trend)/titan 1.0 t. flex 8v 4p",
    "gol (novo) 1.0 mi total flex 8v 4p"
]
count_vectorizer_gol = CountVectorizer(token_pattern=r'(?u)\b[a-zA-Z0-9_.]+')
count_matrix_gol = count_vectorizer_gol.fit_transform(opcoes)
Checking the first element against the other elements
cosine_similarity(count_matrix_gol[0:1], count_matrix_gol[1:])
Output
array([[0.5 , 0.53033009]])
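To confirm that the token_pattern really kept the numeric tokens, you can inspect the learned vocabulary (get_feature_names_out requires scikit-learn 1.0 or newer; on older versions use get_feature_names):

count_vectorizer_gol.get_feature_names_out()
# array(['1.0', '4p', '8v', 'city', 'flex', 'gol', 'i', 'mi', 'novo',
#        't.', 'titan', 'total', 'trend'], dtype=object)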
The higher this number, the better the match. In this case "gol (novo) 1.0 mi total flex 8v 4p" had the highest score.
Example 2
opcoes = [
    "aircross 1.6 shine 16v",
    "aircross shine 1.6 flex 16v 5p aut.",
    "aircross live 1.6 flex 16v 5p aut.",
    "aircross feel 1.6 flex 16v 5p aut."
]
count_vectorizer_aircross = CountVectorizer(token_pattern=r'(?u)\b[a-zA-Z0-9_.]+')
count_matrix_aircross = count_vectorizer_aircross.fit_transform(opcoes)
cosine_similarity(count_matrix_aircross[0:1], count_matrix_aircross[1:])
Output
array([[0.75592895, 0.56694671, 0.56694671]])
"aircross shine 1.6 flex 16v 5p aut." got the highest score.
Example 3
opcoes = [
    "onix 1.4 mpfi ltz 8v",
    "onix hatch ltz 1.4 8v flexpower 5p mec.",
    "onix hatch lt 1.4 8v flexpower 5p mec.",
    "onix hatch effect 1.4 8v f.power 5p mec.",
    "onix hatch activ 1.4 8v flex 5p mec.",
    "onix hatch ltz 1.2 8v flex 5p mec."
]
count_vectorizer_onix = CountVectorizer(token_pattern=r'(?u)\b[a-zA-Z0-9_.]+')
count_matrix_onix = count_vectorizer_onix.fit_transform(opcoes)
cosine_similarity(count_matrix_onix[0:1], count_matrix_onix[1:])
Output
array([[0.63245553, 0.47434165, 0.47434165, 0.47434165, 0.47434165]])
"onix hatch ltz 1.4 8v flexpower 5p mec." got the highest score.
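Putting it all together, here is a small helper that vectorizes a query alongside its options and returns the best match. This is a sketch; the function name melhor_opcao is my own, not from any library:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def melhor_opcao(consulta, opcoes):
    # fit the query together with the options so they share one vocabulary
    vectorizer = CountVectorizer(token_pattern=r'(?u)\b[a-zA-Z0-9_.]+')
    matriz = vectorizer.fit_transform([consulta] + opcoes)
    # compare the query (first row) against every option and keep the best
    scores = cosine_similarity(matriz[0:1], matriz[1:])[0]
    return opcoes[int(np.argmax(scores))], float(scores.max())

melhor_opcao("onix 1.4 mpfi ltz 8v",
             ["onix hatch ltz 1.4 8v flexpower 5p mec.",
              "onix hatch activ 1.4 8v flex 5p mec."])
# ('onix hatch ltz 1.4 8v flexpower 5p mec.', 0.6324...)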
From my reading, the question is just a lexical search, because I understood the problem like this: there is a text file that describes vehicle models by their attributes, separated by spaces, with one vehicle model per line; the user types a search text and the search engine should find the line or lines whose edit distance to the searched text is smallest. If that is the problem, and if you post a sample of the data, I can put together an example.
– Augusto Vasques
The Naive Bayes algorithm should suit you.
– Natan Fernandes