Fuzzy group by: grouping similar words

小编典典


algorithm

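No clear answer has been given, however, on how to actually "group" items. The difflib-based solution is essentially a search: for a given item, difflib can return the most similar word from a list. But how can that be used for grouping?

I would like to reduce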

['ape', 'appel', 'apple', 'peach', 'puppy']

to

['ape', 'appel', 'peach', 'puppy']

or

['ape', 'apple', 'peach', 'puppy']

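One idea I tried was to iterate through the list for each item: if get_close_matches returns more than one match, use it; otherwise keep the word as it is. This partly worked, but it can suggest apple for appel and then appel for apple, so the two words simply swap places and nothing is actually reduced.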
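The swap described above can be reproduced with a short sketch. It assumes `difflib.get_close_matches` with its default cutoff of 0.6; the helper name `closest_other` is made up for illustration and is not part of difflib.

```python
import difflib

words = ['ape', 'appel', 'apple', 'peach', 'puppy']

def closest_other(word, candidates):
    """Return the best difflib match for `word` among the other candidates, or None."""
    others = [w for w in candidates if w != word]
    matches = difflib.get_close_matches(word, others, n=1)
    return matches[0] if matches else None

# Replacing every word by its closest match does not shrink the list:
# 'appel' and 'apple' point at each other, so they only trade places.
for w in words:
    print(w, '->', closest_other(w, words))
```

Because the lookup is symmetric for the closest pair, a per-word replacement needs some rule for which of the two words wins.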

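I would appreciate any pointers, library names, and so on.

Note: performance also matters. We have a list of 300,000 items, and get_close_matches seems a bit slow. Does anyone know of a C/C++ based solution?

Thanks,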

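Note: further investigation showed that k-medoids is the right algorithm (as is hierarchical clustering). K-medoids does not need computed "centers"; it uses the data points themselves as centers (these points are called medoids, hence the name). For word grouping, the medoid would be the representative element of its group/cluster.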
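To make the medoid idea concrete, here is a minimal sketch that picks the medoid of one already-formed group: the member with the smallest total distance to all other members. It is not a full k-medoids implementation; plain Levenshtein distance is an assumed choice of metric, and the function names are illustrative.

```python
def levenshtein(a, b):
    """Edit distance (insert, delete, substitute) using two rows of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]

def medoid(group):
    """Return the member with the smallest summed distance to the whole group."""
    return min(group, key=lambda w: sum(levenshtein(w, other) for other in group))

print(medoid(['apple', 'apples', 'appl']))
```

Unlike a mean, the medoid is always one of the original words, which is why it can serve directly as the group's representative.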


2020-07-28

1 answer

小编典典

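You need to normalize the groups. In each group, pick one word or code that represents the group. Then group the words by their representative.

Some possible ways:

Pick the first word encountered.
Pick the lexicographically first word.
Derive a pattern from all the words.
Pick a unique index.
Use the soundex as the pattern.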
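As one illustration of the options above, the following sketch groups words by a Soundex-style code and then keeps the first word of each group as its representative. The `soundex` function is a simplified version written for this example (ASCII letters only), not a library call, and `group_by_key` is a hypothetical helper name.

```python
from collections import defaultdict

# Letter-to-digit table of the classic Soundex scheme.
_CODES = {}
for letters, digit in [('bfpv', '1'), ('cgjkqsxz', '2'), ('dt', '3'),
                       ('l', '4'), ('mn', '5'), ('r', '6')]:
    for ch in letters:
        _CODES[ch] = digit

def soundex(word):
    """Simplified Soundex: first letter plus up to three digits, zero-padded."""
    word = word.lower()
    if not word:
        return ''
    result = word[0].upper()
    prev = _CODES.get(word[0], '')
    for ch in word[1:]:
        code = _CODES.get(ch, '')
        if code and code != prev:
            result += code
        if ch not in 'hw':  # h and w do not separate letters with equal codes
            prev = code
    return (result + '000')[:4]

def group_by_key(words, key):
    """Collect words that share the same key, preserving input order."""
    groups = defaultdict(list)
    for w in words:
        groups[key(w)].append(w)
    return dict(groups)

words = ['ape', 'appel', 'apple', 'peach', 'puppy']
groups = group_by_key(words, soundex)
representatives = [members[0] for members in groups.values()]
print(groups)
print(representatives)
```

A fixed code like this sidesteps the question of which word is the representative, at the cost of grouping only words that happen to share a code.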

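Grouping the words can be difficult, though. If A is similar to B, and B is similar to C, A and C are not necessarily similar to each other. If B is the representative, both A and C can be included in the group. But if A or C is the representative, the other one cannot be included.

Going with the first option (first word encountered):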
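The non-transitivity can be seen with three words. This sketch uses `difflib.SequenceMatcher` ratios with an arbitrary threshold of 0.7; both the threshold and the helper names `similar` and `members` are assumptions made for the example.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.7  # assumed cutoff, chosen only for this illustration

def similar(a, b):
    """True if the difflib similarity ratio of a and b reaches THRESHOLD."""
    return SequenceMatcher(None, a, b).ratio() >= THRESHOLD

def members(representative, words):
    """Words that would join a group led by `representative`."""
    return [w for w in words if similar(representative, w)]

words = ['ape', 'apple', 'apples']  # A, B, C

print(members('apple', words))   # B as representative
print(members('ape', words))     # A as representative
print(members('apples', words))  # C as representative
```

With 'apple' as representative all three words fall into one group, while 'ape' and 'apples' each exclude the other, so the resulting groups depend on which word is chosen.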

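import itertools

class Seeder:
    """Maps each word to a seed: a previously seen word within LIMIT edits."""

    def __init__(self):
        self.seeds = set()   # words chosen as group representatives
        self.cache = dict()  # word -> seed, so repeated lookups are stable

    def get_seed(self, word):
        LIMIT = 2  # maximum edit distance for two words to share a group
        seed = self.cache.get(word, None)
        if seed is not None:
            return seed
        # Note: set iteration order is arbitrary, so when several seeds
        # are within LIMIT, which one wins is not fixed.
        for seed in self.seeds:
            if self.distance(seed, word) <= LIMIT:
                self.cache[word] = seed
                return seed
        # No seed is close enough: the word becomes a new seed.
        self.seeds.add(word)
        self.cache[word] = word
        return word

    def distance(self, s1, s2):
        """Levenshtein edit distance between s1 and s2."""
        l1 = len(s1)
        l2 = len(s2)
        # list(range(...)) because the rows are modified in place (Python 3).
        matrix = [list(range(zz, zz + l1 + 1)) for zz in range(l2 + 1)]
        for zz in range(0, l2):
            for sz in range(0, l1):
                if s1[sz] == s2[zz]:
                    matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz])
                else:
                    matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz] + 1)
        return matrix[l2][l1]

def group_similar(words):
    seeder = Seeder()
    # Sort by seed so that groupby sees each group's members consecutively.
    words = sorted(words, key=seeder.get_seed)
    groups = itertools.groupby(words, key=seeder.get_seed)
    return [list(v) for k, v in groups]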

Example:

import pprint

pprint.pprint(group_similar([
    'the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have',
    'I', 'it', 'for', 'not', 'on', 'with', 'he', 'as', 'you',
    'do', 'at', 'this', 'but', 'his', 'by', 'from', 'they', 'we',
    'say', 'her', 'she', 'or', 'an', 'will', 'my', 'one', 'all',
    'would', 'there', 'their', 'what', 'so', 'up', 'out', 'if',
    'about', 'who', 'get', 'which', 'go', 'me', 'when', 'make',
    'can', 'like', 'time', 'no', 'just', 'him', 'know', 'take',
    'people', 'into', 'year', 'your', 'good', 'some', 'could',
    'them', 'see', 'other', 'than', 'then', 'now', 'look',
    'only', 'come', 'its', 'over', 'think', 'also', 'back',
    'after', 'use', 'two', 'how', 'our', 'work', 'first', 'well',
    'way', 'even', 'new', 'want', 'because', 'any', 'these',
    'give', 'day', 'most', 'us'
]), width=120)

Output (the exact groups depend on the order in which the seed set is iterated):

[['after'],
 ['also'],
 ['and', 'a', 'in', 'on', 'as', 'at', 'an', 'one', 'all', 'can', 'no', 'want', 'any'],
 ['back'],
 ['because'],
 ['but', 'about', 'get', 'just'],
 ['first'],
 ['from'],
 ['good', 'look'],
 ['have', 'make', 'give'],
 ['his', 'her', 'if', 'him', 'its', 'how', 'us'],
 ['into'],
 ['know', 'new'],
 ['like', 'time', 'take'],
 ['most'],
 ['of', 'I', 'it', 'for', 'not', 'he', 'you', 'do', 'by', 'we', 'or', 'my', 'so', 'up', 'out', 'go', 'me', 'now'],
 ['only'],
 ['over', 'our', 'even'],
 ['people'],
 ['say', 'she', 'way', 'day'],
 ['some', 'see', 'come'],
 ['the', 'be', 'to', 'that', 'this', 'they', 'there', 'their', 'them', 'other', 'then', 'use', 'two', 'these'],
 ['think'],
 ['well'],
 ['what', 'who', 'when', 'than'],
 ['with', 'will', 'which'],
 ['work'],
 ['would', 'could'],
 ['year', 'your']]
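For comparison, here is a compact, order-deterministic sketch of the same first-encountered-word idea, applied to the list from the question. It is a variant written for illustration, not the code above: seeds are kept in insertion order in a dict, and the name `group_first_seen` and its `limit` parameter are assumptions.

```python
def levenshtein(a, b):
    """Edit distance (insert, delete, substitute) using two rows of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def group_first_seen(words, limit=2):
    """Greedy grouping: each word joins the earliest seed within `limit` edits."""
    groups = {}  # seed -> members; dicts keep insertion order
    for word in words:
        for seed in groups:
            if levenshtein(seed, word) <= limit:
                groups[seed].append(word)
                break
        else:
            groups[word] = [word]  # no seed close enough: start a new group
    return groups

words = ['ape', 'appel', 'apple', 'peach', 'puppy']
groups = group_first_seen(words)
print(list(groups))           # one representative per group
print(list(groups.values()))  # the members behind each representative
```

With a limit of 2, 'ape' absorbs both 'appel' and 'apple'. A stricter limit of 1 keeps all five words apart, because 'appel' and 'apple' are two edits apart under plain Levenshtein distance (a transposition counts as two substitutions), so the threshold and metric need tuning for the data.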

2020-07-28