n-grams in Python: four-, five-, six-grams?

小编典典

python

I'm looking for a way to split a text into n-grams. Normally I would do something like this:

import nltk
from nltk import bigrams
string = "I really like python, it's pretty awesome."
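# bigrams() pairs up the items of a sequence, so pass word tokens rather than the raw string
string_bigrams = bigrams(string.split())
# bigrams() returns a generator in current nltk; materialize it to see the pairs
print(list(string_bigrams))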

I know that nltk only offers bigrams and trigrams, but is there a way to split my text into four-grams, five-grams or even hundred-grams?

Thanks!

2021-01-20

1 answer

小编典典

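Other users have already given great answers based on native Python. But here is the nltk approach (just in case the OP gets penalized for reinventing what already exists in the nltk library).

nltk has an ngrams function that people seldom use. That is not because n-grams are hard to extract, but because training a model on n-grams with n > 3 leads to severe data sparsity.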

from nltk import ngrams

sentence = 'this is a foo bar sentences and i want to ngramize it'

n = 6
sixgrams = ngrams(sentence.split(), n)

for grams in sixgrams:
    print(grams)  # each item is a tuple of n consecutive tokens
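
For comparison with the native-Python answers mentioned above, here is a minimal sketch of the same idea without nltk. The helper name ngrams_pure is made up for this illustration and is not part of any library; it assumes the text has already been split into tokens.

```python
def ngrams_pure(tokens, n):
    """Return an iterator over tuples of n consecutive tokens (hypothetical helper)."""
    tokens = list(tokens)
    # Build n views of the list, each shifted one position further,
    # and zip them together; zip stops at the shortest view, so only
    # complete n-grams are produced.
    return zip(*(tokens[i:] for i in range(n)))

sentence = 'this is a foo bar sentences and i want to ngramize it'
fourgrams = list(ngrams_pure(sentence.split(), 4))

for grams in fourgrams:
    print(grams)
```

Changing n gives five-grams, six-grams and so on; if the text has fewer than n tokens, the result is simply empty.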

2021-01-20