Computing cosine similarity with vectorized operations

小编典典

I am trying to compute the cosine similarity between two-dimensional arrays.

Suppose I have a DataFrame of shape (5, 4):

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(20).reshape(5,4), columns=["ref_x", "ref_y", "alt_x", "alt_y"])
df


       ref_x      ref_y      alt_x      alt_y
0   2.523641   1.270625   0.127030   0.680601
1  -0.992681  -0.021022   0.461249   0.183311
2  -0.865873  -0.117191  -1.521882  -0.388608
3  -0.081354  -1.852463  -0.086464   0.249440
4  -0.057760   0.023642   0.002147  -1.009961

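I know how to compute cosine similarity with scipy. The following gives me the cosine similarity between (ref_x, ref_y) and (alt_x, alt_y) for each row: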

from scipy import spatial
df['sim'] = df.apply(lambda row: 1 - spatial.distance.cosine(row[['ref_x', 'ref_y']], row[['alt_x', 'alt_y']]), axis=1)

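But it is slow (in practice I have a large DataFrame on which I want to compute the similarity).

I would like to do something like the following, but it raises "ValueError: Input vector should be 1-D.":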

df['sim'] = 1 - spatial.distance.cosine(df[['ref_x', 'ref_y']], df[['alt_x', 'alt_y']])

Does anyone have any suggestions or comments?

2022-07-28

1 answer

小编典典

Use cosine_similarity from sklearn. Called with a single array, it returns the matrix of pairwise similarities between all rows:

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

df = pd.DataFrame(np.random.randn(20).reshape(5,4), columns=["ref_x", "ref_y", "alt_x", "alt_y"])
co_sim = cosine_similarity(df.to_numpy())
pd.DataFrame(co_sim)

Output:

           0          1          2          3          4
0   1.000000   0.085483  -0.126060  -0.137558  -0.411323
1   0.085483   1.000000  -0.447271  -0.277837   0.440389
2  -0.126060  -0.447271   1.000000   0.309562  -0.306372
3  -0.137558  -0.277837   0.309562   1.000000  -0.811515
4  -0.411323   0.440389  -0.306372  -0.811515   1.000000
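
The 5x5 matrix above compares every row with every other row across all four columns. If what is wanted is the per-row similarity between (ref_x, ref_y) and (alt_x, alt_y), as in the question's apply call, one possible vectorized version can be sketched with plain NumPy. This is only a sketch: the seeded random data and the names ref, alt, dot and norms are illustrative assumptions, and rows where either vector has zero length would produce NaN.

```python
import numpy as np
import pandas as pd

# Illustrative data with the same column layout as in the question (assumed, seeded).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((5, 4)),
                  columns=["ref_x", "ref_y", "alt_x", "alt_y"])

# Split each row into its two 2-D vectors, as arrays of shape (n_rows, 2).
ref = df[["ref_x", "ref_y"]].to_numpy()
alt = df[["alt_x", "alt_y"]].to_numpy()

# Row-wise dot product: sum over columns of the element-wise product.
dot = np.einsum("ij,ij->i", ref, alt)

# Product of the Euclidean norms of the two vectors in each row.
norms = np.linalg.norm(ref, axis=1) * np.linalg.norm(alt, axis=1)

# Cosine similarity per row; a zero-length vector would give NaN here.
df["sim"] = dot / norms
```

This avoids the Python-level loop that apply with axis=1 performs, so it should scale better on a large DataFrame, and it should agree with 1 - spatial.distance.cosine(...) computed row by row.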

2022-07-28