What are some good ways of estimating "approximate" semantic similarity between sentences?

小编典典

python

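I have been looking through the nlp tag on SO for the past couple of hours and am confident I did not miss anything, but if I did, please point me to the relevant question.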

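In the meantime, I will describe what I am trying to do. A common notion I have observed in many posts is that semantic similarity is difficult. For instance, the accepted solution in this post suggests the following: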

First of all, neither from the perspective of computational 
linguistics nor of theoretical linguistics is it clear what 
the term 'semantic similarity' means exactly. .... 
Consider these examples:

1. Pete and Rob have found a dog near the station.
2. Pete and Rob have never found a dog near the station.
3. Pete and Rob both like programming a lot.
4. Patricia found a dog near the station.
5. It was a dog who found Pete and Rob under the snow.

Which of the sentences 2-4 are similar to 1? 2 is the exact 
opposite of 1, still it is about Pete and Rob (not) finding a 
dog.

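My high-level requirement is to use k-means clustering to categorize text based on semantic similarity, so all I need to know is whether sentences are an approximate match. For instance, in the example above, I am fine with putting 1, 2, 4, 5 into one category and 3 into another (of course, 3 would be backed up by more similar sentences). The goal is something like finding related articles, which do not have to be 100% related.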
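To make the clustering requirement concrete, here is a minimal sketch of k-means over the five quoted sentences. It assumes plain word counts as the sentence vector and hand-picked starting centroids (sentences 1 and 3); both choices, and the helper names `bag_of_words`, `sq_distance`, `mean`, and `kmeans`, are illustrative assumptions and not anything prescribed in the posts above.

```python
import re
from collections import Counter

def bag_of_words(sentence):
    """Lowercased word counts; a deliberately crude sentence vector (assumption)."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))

def sq_distance(a, b):
    """Squared Euclidean distance between two sparse vectors stored as dicts."""
    return sum((a.get(w, 0.0) - b.get(w, 0.0)) ** 2 for w in set(a) | set(b))

def mean(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {w: count / len(vectors) for w, count in total.items()}

def kmeans(vectors, seeds, iterations=10):
    """Plain k-means; `seeds` are the indices of the initial centroids."""
    centroids = [dict(vectors[i]) for i in seeds]
    labels = []
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: sq_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = mean(members)
    return labels

sentences = [
    "Pete and Rob have found a dog near the station.",
    "Pete and Rob have never found a dog near the station.",
    "Pete and Rob both like programming a lot.",
    "Patricia found a dog near the station.",
    "It was a dog who found Pete and Rob under the snow.",
]
vectors = [bag_of_words(s) for s in sentences]
labels = kmeans(vectors, seeds=[0, 2])  # seeds picked by hand for this toy example
print(labels)
```

With these seeds the sketch puts sentences 1, 2, 4, 5 in one cluster and sentence 3 in the other, but only because they share surface words. Word counts cannot tell that 2 negates 1, which is acceptable for an "approximate match" and nothing more.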

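I think I ultimately need to construct a vector representation of each sentence, sort of like its fingerprint, but exactly what this vector should contain is still an open question for me. Is it n-grams, something from WordNet, just the individual stemmed words, or something else altogether?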
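As one possible answer to "what goes in the vector", the sketch below builds a fingerprint from word n-gram counts (unigrams plus bigrams) and compares two fingerprints with cosine similarity. The function names `fingerprint` and `cosine` and the choice of `max_n=2` are assumptions made for illustration; stemming and WordNet lookups are left out to keep the example dependency-free.

```python
import math
import re
from collections import Counter

def fingerprint(sentence, max_n=2):
    """Count word n-grams (n = 1..max_n) of a lowercased sentence."""
    words = re.findall(r"[a-z]+", sentence.lower())
    grams = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            grams[" ".join(words[i:i + n])] += 1
    return grams

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

s1 = "Pete and Rob have found a dog near the station."
s3 = "Pete and Rob both like programming a lot."
s4 = "Patricia found a dog near the station."

sim_dog = cosine(fingerprint(s1), fingerprint(s4))    # both about finding a dog
sim_other = cosine(fingerprint(s1), fingerprint(s3))  # same people, different topic
print(sim_dog, sim_other)
```

On these sentences the dog-related pair scores higher than the pair that only shares the names, which is the kind of rough signal a clustering step could use.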

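This thread does a fantastic job of enumerating all the related techniques, but unfortunately it stops just when the post gets to what I want. Any suggestions on the current state of the art in this area?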

2021-01-20

1 answer

小编典典

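Latent semantic modeling could be useful. It is basically just another application of singular value decomposition. SVDLIBC is a pretty nice C implementation of this approach; it is an oldie but a goodie, and there is even a Python binding in the form of sparsesvd.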
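For a feel of what the decomposition does, here is a small sketch of latent semantic analysis on the five sentences from the question. It swaps in NumPy's dense `numpy.linalg.svd` instead of SVDLIBC or sparsesvd, which is fine for a toy corpus but not how you would handle a large sparse matrix. The raw-count weighting and the choice of `k = 2` latent dimensions are arbitrary assumptions.

```python
import re
import numpy as np

sentences = [
    "Pete and Rob have found a dog near the station.",
    "Pete and Rob have never found a dog near the station.",
    "Pete and Rob both like programming a lot.",
    "Patricia found a dog near the station.",
    "It was a dog who found Pete and Rob under the snow.",
]

# Term-document matrix: one row per word, one column per sentence.
tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
vocab = sorted({w for words in tokenized for w in words})
index = {w: i for i, w in enumerate(vocab)}
A = np.zeros((len(vocab), len(sentences)))
for j, words in enumerate(tokenized):
    for w in words:
        A[index[w], j] += 1

# Thin SVD: A = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # number of latent dimensions kept (arbitrary for this toy corpus)
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T  # one k-dimensional row per sentence

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Similarity of sentence 1 to every sentence in the reduced space.
for j in range(len(sentences)):
    print(j + 1, round(cosine(doc_vectors[0], doc_vectors[j]), 3))
```

The rows of `doc_vectors` are the kind of low-dimensional fingerprints that could be handed to k-means in place of raw word counts.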

2021-01-20