I want to count the occurrences of all bigrams (pairs of adjacent words) in a file using Python. I am dealing with very large files, so I am looking for an efficient way to do this. I tried a count method over the file contents with the regular expression "\w+\s\w+", but it did not prove to be efficient.
For example, say I want to count the bigrams of a file a.txt, which has the following content:
"the quick person did not realize his speed and the quick person bumped "
For the above file, the set of bigrams and their counts would be:
(the, quick) = 2
(quick, person) = 2
(person, did) = 1
(did, not) = 1
(not, realize) = 1
(realize, his) = 1
(his, speed) = 1
(speed, and) = 1
(and, the) = 1
(person, bumped) = 1
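The question does not show the exact attempt, but note that a plain re.findall over "\w+\s\w+" only returns non-overlapping matches, so roughly every other pair is missed and the counts come out wrong regardless of speed. A hypothetical sketch of that approach and of a lookahead variant that does produce overlapping pairs (neither is the recommended solution below):

import re
from collections import Counter

text = "the quick person did not realize his speed and the quick person bumped"

# Non-overlapping matches: only 'the quick', 'person did', ... are found,
# because each new search starts after the previous match ends.
print(Counter(re.findall(r"\w+\s\w+", text)))

# A zero-width lookahead anchored at word boundaries captures every adjacent
# pair, but still re-runs the regex engine at each word position.
print(Counter(re.findall(r"\b(?=(\w+)\s+(\w+))", text)))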
I came across an example of Counter objects in Python, which is used to count unigrams (single words). It also uses a regex approach.
The example is as follows:
>>> # Find the ten most common words in Hamlet
>>> import re
>>> from collections import Counter
>>> words = re.findall('\w+', open('a.txt').read())
>>> print Counter(words).most_common(10)
The output of the above code is:
[('the', 2), ('quick', 2), ('person', 2), ('did', 1), ('not', 1), ('realize', 1), ('his', 1), ('speed', 1), ('bumped', 1)]
I was wondering if it is possible to use the Counter object to get counts of bigrams. Any approach other than Counter objects or regex will also be appreciated.
Some itertools magic:
>>> import re
>>> from collections import Counter
>>> from itertools import islice, izip
>>> words = re.findall("\w+", "the quick person did not realize his speed and the quick person bumped")
>>> print Counter(izip(words, islice(words, 1, None)))
Output:
Counter({('the', 'quick'): 2, ('quick', 'person'): 2, ('person', 'did'): 1, ('did', 'not'): 1, ('not', 'realize'): 1, ('and', 'the'): 1, ('speed', 'and'): 1, ('person', 'bumped'): 1, ('his', 'speed'): 1, ('realize', 'his'): 1})
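On Python 3, itertools.izip no longer exists; the built-in zip is already lazy, so a minimal equivalent of the snippet above (not part of the original answer) would be:

import re
from collections import Counter
from itertools import islice

words = re.findall(r"\w+", "the quick person did not realize his speed and the quick person bumped")
# zip pairs each word with its successor; islice avoids copying the list.
print(Counter(zip(words, islice(words, 1, None))))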
Bonus
Get the frequency of any n-gram:
from itertools import tee, islice

def ngrams(lst, n):
    tlst = lst
    while True:
        # Duplicate the iterator: read the next n items from one copy,
        # then resume from the other copy advanced by a single element.
        a, b = tee(tlst)
        l = tuple(islice(a, n))
        if len(l) == n:
            yield l
            next(b)
            tlst = b
        else:
            break

>>> Counter(ngrams(words, 3))
Counter({('the', 'quick', 'person'): 2, ('and', 'the', 'quick'): 1, ('realize', 'his', 'speed'): 1, ('his', 'speed', 'and'): 1, ('person', 'did', 'not'): 1, ('quick', 'person', 'did'): 1, ('quick', 'person', 'bumped'): 1, ('did', 'not', 'realize'): 1, ('speed', 'and', 'the'): 1, ('not', 'realize', 'his'): 1})
This works for lazy iterables and generators too. So you can write a generator that reads a file line by line, yielding words, and pass it to ngrams to consume lazily, without reading the whole file into memory.
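A minimal sketch of such a generator, assuming the file is a.txt and \w+ tokens are good enough (words_from_file is a hypothetical helper, not part of the answer):

import re
from collections import Counter

def words_from_file(path):
    # Yield words one line at a time, so the whole file never sits in memory.
    with open(path) as fh:
        for line in fh:
            for word in re.findall(r"\w+", line):
                yield word

# ngrams() is the generator defined above; n=2 gives bigrams.
print(Counter(ngrams(words_from_file("a.txt"), 2)).most_common(10))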