Data similarity with various pandas values

I have the following pandas DataFrame:

[image: list]

The goal of the program is to compute a degree of similarity against data entered by the user. In this case it is a program for querying houses. I have already implemented the lookup of houses that exactly match the data the user provides, but when no house in the database matches, the goal is to show similar houses instead.

To look up the houses with the same values, I used this code:

lista = ListaCompleta[ListaCompleta.Concelho.isin(concelhos) & (ListaCompleta['Tipo Imovel'] == tipo_imv) & 
                                      (ListaCompleta['Estado'] == estad) & (ListaCompleta['Quartos'] == quar) & 
                                      ListaCompleta.Preco.notnull()]

But if what the user enters does not exist in the database, I want to create a new column holding a value between 0 and 1 for each row, where 1 means exactly the same and 0 means not similar at all.

To calculate the similarity of each column for each row, I thought of using this code (I do not know if it is the best approach):

(the variables quart, casa_banh, area, garag and ano are entered by the user)

similar_quartos = (quart-ListaCompleta['Quartos'])/5
similar_casa_banho = (casa_banh-ListaCompleta['Casa Banho'])/3
similar_area = (area-ListaCompleta['Area'])/200
similar_garagem = (garag-ListaCompleta['Garagem'])/3
similar_ano = (ano-ListaCompleta['Ano'])/30

But then I need to add the result to the list. I tried this code, but it does not work:

lista['similiariedade'] = lista[(similar_quartos+similar_casa_banho+similar_area+similar_garagem+similar_ano)/5]

The idea is to create a column with a value from 0 to 1 in each row of the list, to know which house is most similar to the one the user entered.

1 answer


One approach is to measure how far the new data point is from each row of the DataFrame by computing a dissimilarity. For this, I suggest the Gower distance. It works as follows:

Gower distance

Given a new data point and the rows of df, we first calculate the distance for each attribute (df column), where 0 means equal (the shortest possible distance) and 1 means the maximum distance. Then we average the distances over all attributes.

For categorical data, such as your columns Concelho, Estado and Mobilada, the distance is 0 if the category is the same and 1 if it is different. Example:

   Concelho  Estado  Mobilada
0   Caminha  Usado   nao
1   Caminha  Usado   sim

The distances between these rows are [0, 0, 1], which is 0.33 on average, because Concelho and Estado are equal but Mobilada is different.
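The categorical example above can be checked with a short sketch. It only rebuilds the two rows shown in the text; the variable names (rows, categorical_distances, mean_distance) are illustrative and not part of the original answer.

```python
import pandas as pd

# The two example rows from the text above.
rows = pd.DataFrame({
    'Concelho': ['Caminha', 'Caminha'],
    'Estado': ['Usado', 'Usado'],
    'Mobilada': ['nao', 'sim'],
})

# 0 where the categories match, 1 where they differ.
categorical_distances = (rows.iloc[0] != rows.iloc[1]).astype(int).to_numpy()
mean_distance = categorical_distances.mean()

print(categorical_distances)    # [0 0 1]
print(round(mean_distance, 2))  # 0.33
```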

For numerical data, whether continuous or ordinal, we calculate the distance with the formula distance = |Xi - Xj| / max_observed_range, where i is one row, j is another row, and max_observed_range is the largest value minus the smallest value observed in the column. Example:

     Ano   Area    Preco
0   1995     80    75000
1   1937    132   105000
2   2007    252   775000
3   2009    225   697000
4   1995    234   385000

The largest distance observed in the column Ano is 2009 - 1937 = 72, in the column Area it is 252 - 80 = 172, and in the column Preco it is 775000 - 75000 = 700000. Calculating the distance of each column for rows 0 and 1 gives [|1995-1937|/72, |80-132|/172, |75000-105000|/700000], which is 0.38 on average.
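The arithmetic above can be reproduced with a few lines of pandas. This is only a sketch over the five example rows; the names df_num, observed_range and numeric_distances are chosen here for illustration.

```python
import pandas as pd

# The numeric example table from the text above.
df_num = pd.DataFrame({
    'Ano': [1995, 1937, 2007, 2009, 1995],
    'Area': [80, 132, 252, 225, 234],
    'Preco': [75000, 105000, 775000, 697000, 385000],
})

# Observed range (max - min) of each column.
observed_range = df_num.max() - df_num.min()

# Per-column distance between rows 0 and 1, scaled to [0, 1].
numeric_distances = (df_num.iloc[0] - df_num.iloc[1]).abs() / observed_range
mean_numeric_distance = numeric_distances.mean()

print(observed_range.tolist())          # [72, 172, 700000]
print(round(mean_numeric_distance, 2))  # 0.38
```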

Applying it in Python

The following function performs the calculations described above:

import numpy as np
import pandas as pd

def gower_distance(new_data, df):
    distances = []
    for column in df.columns:  # For each column of df
        # np.object was removed in recent NumPy versions; compare against the builtin object instead.
        if df[column].dtype == object:  # object dtype means the column is categorical
            columns_distance = np.where(df[column] == new_data[column].values[0], 0, 1)
        else:  # Otherwise the column is numeric
            max_range_observed = df[column].values.max() - df[column].values.min()
            columns_distance = ((df[column] - new_data[column].values[0]).abs() / max_range_observed).fillna(0).values

        distances.append(columns_distance)

    # Average over columns: one distance per row of df
    return np.array(distances).mean(axis=0)

Let the new data point be:

new_data = pd.DataFrame(data={
    'Tipo Imovel':['Moradia'],
    'Estado':['Usado'],
    'Concelho':['Caminha'],
    'Quartos':[4],
    'Casa Banho':[1],
    'Mobilada':['sim'],
    'Area':[225],
    'Garagem':[1.0],
    'Ano':[2009],
    'Preco':[697000]})

We can add the column with the calculated distance like this:

df['gower_dist'] = gower_distance(new_data, df[df.columns.difference(['gower_dist'])])

However, this is an average dissimilarity. To get the similarity, just do: df['similaridade'] = 1 - df['gower_dist']

The column similaridade will then be 1 where a row is identical to the new data and 0 where it is as different as possible.

      Ano  Area  Casa Banho  Concelho  Estado  Garagem  Mobilada   Preco  Quartos  Tipo Imovel  gower_dist  similaridade
1  1937.0   132           1   Caminha   Usado      1.0       sim  105000        2      Moradia    0.288641      0.711359
2  2007.0   252           1   Caminha   Usado      1.0       nao  775000        3      Moradia    0.154618      0.845382
3  2009.0   225           1   Caminha   Usado      1.0       sim  697000        4      Moradia    0.000000      1.000000
4  1995.0   234           3   Caminha   Usado      1.0       sim  385000        5      Moradia    0.194248      0.805752
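Since the goal in the question is to show the houses most similar to the user's input, one possible follow-up is to sort by the similarity column. The sketch below uses a hypothetical result frame that holds only the index and similaridade values from the table above; top_matches is an illustrative name.

```python
import pandas as pd

# Hypothetical subset of the result table above (index and similarity only).
result = pd.DataFrame(
    {'similaridade': [0.711359, 0.845382, 1.000000, 0.805752]},
    index=[1, 2, 3, 4],
)

# Most similar houses first; keep the top 3.
top_matches = result.sort_values('similaridade', ascending=False).head(3)
print(top_matches.index.tolist())  # [3, 2, 4]
```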

One caveat is that the code above cannot calculate the Gower distance for data containing NaN. I suggest handling missing values beforehand or using .dropna().
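As a rough sketch of the two options mentioned above, the snippet below either drops incomplete rows or fills the gaps before computing any distance. The houses frame, the median fill and the placeholder label 'desconhecido' are assumptions made for illustration, not part of the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values in a numeric and a categorical column.
houses = pd.DataFrame({
    'Ano': [1995.0, np.nan, 2007.0],
    'Area': [80, 132, 252],
    'Mobilada': ['nao', 'sim', None],
})

# Option 1: drop every row that has at least one missing value.
complete_rows = houses.dropna()

# Option 2: fill the gaps instead (column median for numbers,
# an assumed placeholder label for text).
filled = houses.copy()
filled['Ano'] = filled['Ano'].fillna(filled['Ano'].median())
filled['Mobilada'] = filled['Mobilada'].fillna('desconhecido')
```

Dropping rows is the simplest choice but discards houses; filling keeps them at the cost of treating the filled value as if it were observed.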

  • It is giving me the following error: ufunc 'subtract' did not contain a loop with signature matching types dtype('<U21') dtype('<U21') dtype('<U21')

  • This error occurs when trying to perform arithmetic (+ or -) between numbers and strings, such as pd.Series(['abc', 'def']) - pd.Series([1, 2]). Can you give me more information about when the error occurs?

  • I managed to solve it. For some reason, inside the for loop the column was being treated as a string, which caused the error. I also used pd.to_numeric() in the else branch to make everything numeric, so it no longer fails.
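For reference, a minimal sketch of the pd.to_numeric() conversion mentioned in the last comment. The raw series is a hypothetical column that was read as text; errors='coerce' is one possible setting, used here so that unparseable entries become NaN instead of raising.

```python
import pandas as pd

# Hypothetical column that was read as text, with one unparseable entry.
raw = pd.Series(['80', '132', 'n/a'])

# errors='coerce' turns anything that cannot be parsed into NaN.
numeric = pd.to_numeric(raw, errors='coerce')
print(numeric.tolist())  # [80.0, 132.0, nan]
```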
