Gensim Word2Vec: selecting a subset of word vectors from a pre-trained model

小编典典

python

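I have a large pre-trained Word2Vec model in gensim, and I want to use its pre-trained word vectors as the embedding layer of a Keras model.

The problem is that the embedding is huge and I don't need most of the word vectors, because I know in advance which words can appear as input. So I want to drop the rest to reduce the size of the embedding layer.

Is there a way to keep only the word vectors I need (including their corresponding indices!), based on a whitelist of words?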

2021-01-20

1 answer

小编典典

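Thanks to this answer (I modified the code slightly to improve it), you can solve the problem with the code below.

restricted_word_set holds all the words we want to keep (it can be a list or a set), and w2v is our model. Here is the function: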

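import numpy as np

def restrict_w2v(w2v, restricted_word_set):
    """Shrink a gensim 3.x KeyedVectors in place to the words in restricted_word_set.

    Relies on the gensim 3.x attributes vocab, index2entity and vectors_norm.
    vectors_norm must already be populated (see the warning below).
    """
    restricted_word_set = set(restricted_word_set)  # fast membership tests, also accepts a list
    new_vectors = []
    new_vocab = {}
    new_index2entity = []
    new_vectors_norm = []

    for i in range(len(w2v.vocab)):
        word = w2v.index2entity[i]
        vec = w2v.vectors[i]
        vocab = w2v.vocab[word]
        vec_norm = w2v.vectors_norm[i]
        if word in restricted_word_set:
            # Re-number the kept word so its index matches its new row.
            vocab.index = len(new_index2entity)
            new_index2entity.append(word)
            new_vocab[word] = vocab
            new_vectors.append(vec)
            new_vectors_norm.append(vec_norm)

    # Overwrite the model's word-related attributes with the reduced versions.
    w2v.vocab = new_vocab
    w2v.vectors = np.array(new_vectors)
    w2v.index2entity = new_index2entity  # gensim stores these as plain lists
    w2v.index2word = new_index2entity
    w2v.vectors_norm = np.array(new_vectors_norm)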

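Warning: right after the model is created, vectors_norm is None, so calling this function at that point raises an error. vectors_norm becomes a numpy.ndarray after it is first used. Before calling the function, run something like most_similar("cat") so that vectors_norm is no longer None.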
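For context, in gensim 3.x vectors_norm holds an L2-normalized copy of vectors that is filled in lazily by the first similarity query (calling w2v.init_sims() should have the same effect there). The sketch below shows that normalization in plain NumPy on a toy matrix. unit_normalize is a hypothetical helper written for illustration, not a gensim function.

```python
import numpy as np

def unit_normalize(vectors):
    """Return a copy of `vectors` with every row scaled to unit L2 length."""
    # keepdims=True keeps the norms as a column so the division broadcasts per row.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / norms

# Toy 2-word, 2-dimensional "embedding matrix" (made-up values).
vectors = np.array([[3.0, 4.0], [0.0, 2.0]])
normed = unit_normalize(vectors)
```

Because every row has length 1 after this step, the cosine similarity between two words reduces to a dot product of their normalized rows.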

The function overwrites every word-related attribute of the model, following the layout of Word2VecKeyedVectors.
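The same bookkeeping can be sketched without gensim, which may help when checking that words and rows stay aligned. This is a minimal illustration on a plain word list and NumPy matrix. restrict_embeddings and the toy data are hypothetical names, not part of gensim.

```python
import numpy as np

def restrict_embeddings(words, vectors, allowed_words):
    """Keep only the rows of `vectors` whose word is in `allowed_words`.

    Returns the kept words (original order preserved), the reduced matrix,
    and a fresh word -> row-index mapping for the reduced matrix.
    """
    allowed = set(allowed_words)
    kept_rows = np.array(
        [i for i, word in enumerate(words) if word in allowed], dtype=np.intp
    )
    new_words = [words[i] for i in kept_rows]
    new_vectors = vectors[kept_rows]  # integer-array indexing copies the rows
    word_to_index = {word: i for i, word in enumerate(new_words)}
    return new_words, new_vectors, word_to_index

# Toy vocabulary of 4 words with 3-dimensional vectors (made-up values).
words = ["beer", "wine", "cat", "python"]
vectors = np.arange(12, dtype=np.float32).reshape(4, 3)

new_words, new_vectors, word_to_index = restrict_embeddings(
    words, vectors, {"beer", "python"}
)
```

The returned word_to_index plays the role of the re-numbered vocab.index values: row i of the reduced matrix always belongs to new_words[i].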

Usage:

from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz", binary=True)
w2v.most_similar("beer")  # this first query also populates vectors_norm

[('beers', 0.8409687876701355),
 ('lager', 0.7733745574951172),
 ('Beer', 0.71753990650177),
 ('drinks', 0.668931245803833),
 ('lagers', 0.6570086479187012),
 ('Yuengling_Lager', 0.655455470085144),
 ('microbrew', 0.6534324884414673),
 ('Brooklyn_Lager', 0.6501551866531372),
 ('suds', 0.6497018337249756),
 ('brewed_beer', 0.6490240097045898)]

restricted_word_set = {"beer", "wine", "computer", "python", "bash", "lagers"}
restrict_w2v(w2v, restricted_word_set)
w2v.most_similar("beer")

[('lagers', 0.6570085287094116),
 ('wine', 0.6217695474624634),
 ('bash', 0.20583480596542358),
 ('computer', 0.06677375733852386),
 ('python', 0.005948573350906372)]

It can also be used to remove specific words.
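One way to do that, sketched under the assumption that you start from the full vocabulary list, is to build the whitelist as "everything except the unwanted words" and pass it to restrict_w2v. words_to_keep is a hypothetical helper, and the vocabulary below is a toy stand-in for w2v.index2word.

```python
def words_to_keep(all_words, words_to_remove):
    """Return the set of words in `all_words` that are not in `words_to_remove`."""
    remove = set(words_to_remove)
    return {word for word in all_words if word not in remove}

# Toy stand-in for the model's vocabulary list.
vocabulary = ["beer", "wine", "computer", "python", "bash", "lagers"]
keep = words_to_keep(vocabulary, {"bash", "python"})

# With a gensim 3.x model this would presumably become:
# restrict_w2v(w2v, words_to_keep(w2v.index2word, {"bash", "python"}))
```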

2021-01-20