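I know how to use NLTK to get bigram and trigram collocations and apply them to my own corpus. The code is below.
However, I'm not sure (1) how to get the collocations for a specific word, and (2) whether NLTK has a collocation measure based on the log-likelihood ratio.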
    import nltk
    from nltk.collocations import *
    from nltk.tokenize import word_tokenize

    text = "this is a foo bar bar black sheep foo bar bar black sheep foo bar bar black sheep shep bar bar black sentence"

    trigram_measures = nltk.collocations.TrigramAssocMeasures()
    finder = TrigramCollocationFinder.from_words(word_tokenize(text))

    # Print every trigram with its PMI score, highest score first
    for i in finder.score_ngrams(trigram_measures.pmi):
        print(i)
Try this code:
    import nltk
    from nltk.collocations import *

    bigram_measures = nltk.collocations.BigramAssocMeasures()
    trigram_measures = nltk.collocations.TrigramAssocMeasures()

    # Filter function: returns True for n-grams that do NOT contain
    # 'creature', which are the ones apply_ngram_filter removes
    creature_filter = lambda *w: 'creature' not in w

    ## Bigrams
    finder = BigramCollocationFinder.from_words(
        nltk.corpus.genesis.words('english-web.txt'))
    # only bigrams that appear 3+ times
    finder.apply_freq_filter(3)
    # only bigrams that contain 'creature'
    finder.apply_ngram_filter(creature_filter)
    # return the 10 n-grams with the highest likelihood ratio
    print(finder.nbest(bigram_measures.likelihood_ratio, 10))

    ## Trigrams
    finder = TrigramCollocationFinder.from_words(
        nltk.corpus.genesis.words('english-web.txt'))
    # only trigrams that appear 3+ times
    finder.apply_freq_filter(3)
    # only trigrams that contain 'creature'
    finder.apply_ngram_filter(creature_filter)
    # return the 10 n-grams with the highest likelihood ratio
    print(finder.nbest(trigram_measures.likelihood_ratio, 10))
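It uses the likelihood ratio measure (`likelihood_ratio`) and also filters out any n-grams that do not contain the word 'creature'.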
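The same idea can be applied to the sample text from the question. Below is a minimal sketch, assuming you want the scores as well as the n-grams (hence `score_ngrams` instead of `nbest`). The helper name `collocations_for` and its `min_freq` parameter are made up for illustration and are not part of NLTK, and the text is split on whitespace rather than with `word_tokenize` so that no tokenizer data has to be downloaded.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder


def collocations_for(word, tokens, min_freq=1):
    """Hypothetical helper: bigrams containing `word`, scored by likelihood ratio."""
    finder = BigramCollocationFinder.from_words(tokens)
    # Drop bigrams that occur fewer than min_freq times
    finder.apply_freq_filter(min_freq)
    # apply_ngram_filter removes every n-gram for which the function returns True,
    # so return True for n-grams that do NOT contain the target word
    finder.apply_ngram_filter(lambda *w: word not in w)
    # List of (ngram, score) pairs, highest score first
    return finder.score_ngrams(BigramAssocMeasures().likelihood_ratio)


text = ("this is a foo bar bar black sheep foo bar bar black sheep "
        "foo bar bar black sheep shep bar bar black sentence")
tokens = text.split()  # assumption: plain whitespace split is good enough here

scored = collocations_for("foo", tokens)
for ngram, score in scored:
    print(ngram, score)
```

Only bigrams that include 'foo' should remain. For a real corpus you would probably raise `min_freq`, as in the `apply_freq_filter(3)` calls above, since association scores for very rare n-grams tend to be unreliable.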