First of all, neither from the perspective of computational linguistics nor of theoretical linguistics is it clear what the term 'semantic similarity' means exactly. .... Consider these examples:

1. Pete and Rob have found a dog near the station.
2. Pete and Rob have never found a dog near the station.
3. Pete and Rob both like programming a lot.
4. Patricia found a dog near the station.
5. It was a dog who found Pete and Rob under the snow.

Which of the sentences 2-4 are similar to 1? 2 is the exact opposite of 1, still it is about Pete and Rob (not) finding a dog.
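My high-level requirement is to use k-means clustering to categorize text based on semantic similarity, so all I need to know is whether two sentences are an approximate match. For instance, in the example above I would be fine with grouping 1, 2, 4 and 5 into one category and 3 into another (of course, 3 would be backed up by more sentences like it). Think of it as finding related articles, where the articles do not have to be 100% related.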
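I think I ultimately need to build a vector representation of each sentence, something like its fingerprint, but exactly what this vector should contain is still an open question for me. Is it n-grams, something from WordNet, just the individual stemmed words, or something else altogether?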
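To make the "fingerprint" idea concrete, here is a minimal sketch of one possible answer: each sentence becomes a TF-IDF weighted bag-of-words vector, and a plain k-means pass groups the vectors. Everything in it is an assumption made for illustration, not a recommendation: the helper names (`tokenize`, `tfidf_vectors`, `cosine`, `kmeans`) are made up, there is no stemming or WordNet lookup, and the initial centroids are seeded by hand with sentences 1 and 3 so the run is deterministic. Note that this measures word overlap, not meaning, which is why sentences 1 and 2 come out as close despite being opposites.

```python
import math
import re
from collections import Counter

SENTENCES = [
    "Pete and Rob have found a dog near the station.",
    "Pete and Rob have never found a dog near the station.",
    "Pete and Rob both like programming a lot.",
    "Patricia found a dog near the station.",
    "It was a dog who found Pete and Rob under the snow.",
]


def tokenize(text):
    """Lowercase and split into alphabetic tokens (no stemming, no stop words)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(sentences):
    """Return (vocab, vectors); each vector is an L2-normalised TF-IDF list."""
    docs = [Counter(tokenize(s)) for s in sentences]
    vocab = sorted(set().union(*docs))
    n = len(docs)
    # Document frequency: the number of sentences each word occurs in.
    df = {w: sum(1 for d in docs if w in d) for w in vocab}
    vectors = []
    for d in docs:
        # Words present in every sentence get weight log(1) = 0.
        vec = [d[w] * math.log(n / df[w]) for w in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0  # guard all-zero rows
        vectors.append([x / norm for x in vec])
    return vocab, vectors


def cosine(u, v):
    """Cosine similarity; a plain dot product because the rows are unit length."""
    return sum(a * b for a, b in zip(u, v))


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(vectors, seeds, max_iter=20):
    """Lloyd's algorithm; `seeds` are the row indices of the initial centroids."""
    centroids = [list(vectors[i]) for i in seeds]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each vector goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


vocab, vectors = tfidf_vectors(SENTENCES)
labels = kmeans(vectors, seeds=[0, 2])  # hand-picked seeds: sentences 1 and 3
print(labels)  # [0, 0, 1, 0, 0]
```

With these seeds the run reproduces the grouping asked for above (1, 2, 4, 5 together, 3 on its own), but that depends on the initialization: sentence 5 shares very little weighted vocabulary with the others, so a different seed could easily put it elsewhere. Swapping the token list for stems, n-grams, or WordNet synsets only changes `tokenize`; the rest of the pipeline stays the same.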