I'm new to scikit-learn and have been using TfidfVectorizer to find the tf-idf values of terms in a set of documents. I used the following code to obtain them.
    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 5), lowercase=True)
    X = vectorizer.fit_transform(lectures)
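Now, if I print X, I can see all the entries in the matrix, but how can I find the top n entries based on tf-idf score? In addition, is there any method that will help me find the top n entries by tf-idf score per n-gram, i.e. the top entries among unigrams, bigrams, trigrams and so on?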
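Since version 0.15, the global term weighting of the features learned by a TfidfVectorizer can be accessed through the attribute idf_, which returns an array whose length equals the number of features. Sort the features by this weighting to get the top-weighted features: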
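    from sklearn.feature_extraction.text import TfidfVectorizer
    import numpy as np

    lectures = ["this is some food", "this is some drink"]
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(lectures)

    # Feature indices sorted by idf weight, highest first
    indices = np.argsort(vectorizer.idf_)[::-1]
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    features = vectorizer.get_feature_names_out()
    top_n = 2
    top_features = [features[i] for i in indices[:top_n]]
    print(top_features)

Output:

    ['food', 'drink']

Both terms have the same idf weight here, so their relative order is not significant.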
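The second question, getting the top features per n-gram, can be handled with the same idea, plus a few extra steps to split the features into groups by n-gram length: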
    from sklearn.feature_extraction.text import TfidfVectorizer
    from collections import defaultdict

    lectures = ["this is some food", "this is some drink"]
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    X = vectorizer.fit_transform(lectures)

    # Group (feature, idf) pairs by the number of words in the feature
    features_by_gram = defaultdict(list)
    for f, w in zip(vectorizer.get_feature_names_out(), vectorizer.idf_):
        features_by_gram[len(f.split(' '))].append((f, w))

    top_n = 2
    for gram, features in features_by_gram.items():
        # Sort each group by idf weight, highest first
        top_features = sorted(features, key=lambda x: x[1], reverse=True)[:top_n]
        top_features = [f[0] for f in top_features]
        print('{}-gram top:'.format(gram), top_features)

Output:

    1-gram top: ['drink', 'food']
    2-gram top: ['some drink', 'some food']