I am trying to parallelize a function that calculates cosine similarity.
Here is my code:
import numpy as np

def cos_sim(a, b):
    # Cosine similarity of two vectors; defined as 0 if either vector is all zeros
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    if norm_a == 0 or norm_b == 0:
        return 0
    else:
        return dot_product / (norm_a * norm_b)

def newsimilarityitem(matriz):
    # Start with an n x n matrix of zeros (n = number of items)
    cs = []
    for i in range(0, len(matriz)):
        cs.append([0] * len(matriz))
    # Visit each pair (i, l) with i < l once and fill both symmetric cells
    for i in range(0, len(matriz) - 1):  # HERE
        for l in range(i + 1, len(matriz)):  # HERE
            a = np.array(matriz[i])
            b = np.array(matriz[l])
            r = cos_sim(a, b)
            cs[i][l] = r
            cs[l][i] = r
    return cs
What the code does:
matriz = [[4,3,0,0,5,0],
[5,0,4,0,4,0],
[4,0,5,3,4,0],
[0,3,0,0,0,5],
[0,4,0,0,0,4],
[0,0,2,4,0,5]]
Given a matrix (not necessarily square) where the rows represent items, the columns represent users, and the cells are ratings, I calculate the cosine similarity between the items. The function is called like this:
matriz_simi = newsimilarityitem(matriz)
Inside the function, the matrix cs (necessarily square) holds the similarities: given the index i of one item and the index l of another, their similarity is cs[i][l] or, equivalently, cs[l][i]. The function cos_sim(a,b) takes two NumPy arrays and calculates their similarity.
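For comparison, here is a minimal sketch of the same computation written as a single matrix product, assuming the whole matrix fits in memory as a dense NumPy array. The name similarity_vectorized is hypothetical and not part of the code above. Unlike newsimilarityitem, it returns a NumPy array rather than a list of lists.

```python
import numpy as np


def similarity_vectorized(matriz):
    """Hypothetical vectorized counterpart of newsimilarityitem (a sketch)."""
    data = np.asarray(matriz, dtype=float)
    # Euclidean norm of every row (one row per item).
    norms = np.linalg.norm(data, axis=1)
    # Replace zero norms by 1 to avoid dividing by zero; an all-zero row
    # stays all-zero, so its similarities come out as 0, as in cos_sim.
    safe_norms = np.where(norms == 0, 1.0, norms)
    unit = data / safe_norms[:, None]
    # Entry (i, l) is the dot product of the normalized rows i and l.
    cs = unit @ unit.T
    # The loops above never write cs[i][i], so keep the diagonal at 0.
    np.fill_diagonal(cs, 0.0)
    return cs
```

Calling similarity_vectorized(matriz) on the example matrix gives a 6 x 6 symmetric array with a zero diagonal. Whether this is fast enough depends on the matrix size, so it is worth timing before adding any parallelism.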
I'm trying to parallelize the two loops marked above. Currently the function makes about n²/2 calls to cos_sim, so the complexity is O(n²) (I suppose), and parallelizing should save a lot of time, since for recommendations I can have thousands of users and products.
My machine has 4 cores and I am currently using multiprocessing, but I am open to any library that makes this task easier.
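One possible way to split the outer loop over 4 processes with multiprocessing.Pool is sketched below. It is an illustration under assumptions, not a tuned solution: the names parallel_similarity, _init_worker and _similarity_row are hypothetical, each worker receives one copy of the matrix through the pool initializer, and each task computes the similarities between row i and all later rows.

```python
import numpy as np
from multiprocessing import Pool

_DATA = None  # per-process copy of the matrix, set by _init_worker


def cos_sim(a, b):
    """Cosine similarity of two 1-D arrays; 0 if either has zero norm."""
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return float(np.dot(a, b) / (norm_a * norm_b))


def _init_worker(data):
    """Store the matrix once per process instead of sending it with each task."""
    global _DATA
    _DATA = data


def _similarity_row(i):
    """Similarities between row i and every later row l > i."""
    return [cos_sim(_DATA[i], _DATA[l]) for l in range(i + 1, len(_DATA))]


def parallel_similarity(matriz, workers=4):
    """Hypothetical parallel variant of newsimilarityitem (a sketch)."""
    data = np.asarray(matriz, dtype=float)
    n = len(data)
    if workers > 1:
        with Pool(workers, initializer=_init_worker, initargs=(data,)) as pool:
            rows = pool.map(_similarity_row, range(n - 1))
    else:
        # Serial fallback: same code path without starting any process.
        _init_worker(data)
        rows = [_similarity_row(i) for i in range(n - 1)]
    # Assemble the symmetric result in the parent process.
    cs = [[0] * n for _ in range(n)]
    for i, row in enumerate(rows):
        for offset, r in enumerate(row):
            l = i + 1 + offset
            cs[i][l] = r
            cs[l][i] = r
    return cs
```

On platforms where multiprocessing starts workers with the spawn method (Windows and macOS by default), the call parallel_similarity(matriz) has to sit under an `if __name__ == "__main__":` guard. Early rows carry more work than late ones, so the load is uneven, and for small matrices the cost of starting processes can outweigh the gain.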