Python - Text Classification

We often need to sort available text into categories according to predefined criteria. nltk provides this kind of categorization as part of several of its corpora. In the example below, we look at the movie reviews corpus and check which categories it defines.

    # Let's see how the movie reviews are categorized
    from nltk.corpus import movie_reviews

    all_cats = []
    for w in movie_reviews.categories():
        all_cats.append(w.lower())
    print(all_cats)

When we run the above program, we get the following output:

    ['neg', 'pos']

Now let us look at the content of one of the files containing a positive review. We split the file into sentences and print the first four as a sample.

    from nltk.corpus import movie_reviews
    from nltk.tokenize import sent_tokenize

    # Read the raw text of one positive review and split it into sentences
    sample = movie_reviews.raw("pos/cv944_13521.txt")
    token = sent_tokenize(sample)
    for lines in range(4):
        print(token[lines])

When we run the above program, we get the following output:

    meteor threat set to blow away all volcanoes & twisters !
    summer is here again !
    this season could probably be the most ambitious = season this decade with hollywood churning out films like deep impact , = godzilla , the x-files , armageddon , the truman show , all of which has but = one main aim , to rock the box office .
    leading the pack this summer is = deep impact , one of the first few film releases from the = spielberg-katzenberg-geffen's dreamworks production company .

Next, we collect the word tokens from every file in the corpus and find the most frequently used ones with the FreqDist class from nltk.

    import nltk
    from nltk.corpus import movie_reviews

    # Lowercase every token in the corpus, then count occurrences
    all_words = []
    for w in movie_reviews.words():
        all_words.append(w.lower())

    all_words = nltk.FreqDist(all_words)
    print(all_words.most_common(10))

When we run the above program, we get the following output:

    [(',', 77717), ('the', 76529), ('.', 65876), ('a', 38106), ('and', 35576), ('of', 34123), ('to', 31937), ("'", 30585), ('is', 25195), ('in', 21822)]
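The frequency list above is dominated by punctuation tokens such as ',' and '.', which say little about how reviews differ between categories. As a minimal sketch, not part of the original example, the counts can be restricted to alphabetic tokens and computed per category. The helper names `top_words` and `top_words_by_category` are hypothetical, and `collections.Counter` stands in for `FreqDist` here so that the first helper runs without the corpus installed.

```python
from collections import Counter


def top_words(tokens, n=10):
    """Return the n most frequent alphabetic tokens, lowercased."""
    # str.isalpha() drops punctuation-only tokens such as ',' and '.'
    counts = Counter(t.lower() for t in tokens if t.isalpha())
    return counts.most_common(n)


def top_words_by_category(n=10):
    """Hypothetical helper: top words for each movie_reviews category."""
    # Imported lazily so top_words() works without the corpus downloaded.
    from nltk.corpus import movie_reviews

    return {
        cat: top_words(movie_reviews.words(categories=cat), n)
        for cat in movie_reviews.categories()
    }


# Small in-memory example that needs no corpus download.
sample = ["The", "movie", ",", "the", "movie", ".", "THE", "end", "!"]
print(top_words(sample, 2))  # [('the', 3), ('movie', 2)]
```

With the corpus available, `top_words_by_category(10)` would return one list for 'neg' and one for 'pos'. Common function words such as "the" will still rank highest in both, so a stop-word filter is a natural next step if the goal is to compare the categories.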