I need to run nltk.pos_tag on a large dataset, and I need its output to look like what the Stanford tagger produces.
For example, when I run the following code:
import nltk
text = nltk.word_tokenize("We are going out.Just you and me.")
print(nltk.pos_tag(text))
the output is:
[('We', 'PRP'), ('are', 'VBP'), ('going', 'VBG'), ('out.Just', 'IN'), ('you', 'PRP'), ('and', 'CC'), ('me', 'PRP'), ('.', '.')]
In this case, I need it to look like this:
We/PRP are/VBP going/VBG out.Just/NN you/PRP and/CC me/PRP ./.
I'd prefer not to use string functions and to get this output directly, because there is a huge amount of text and it adds a lot of time overhead to the processing.
In short:
' '.join([word + '/' + pos for word, pos in tagged_sent])
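Incidentally, NLTK ships a small helper for exactly this word/TAG convention, so the join above can also be written with nltk.tag.tuple2str. A minimal sketch, assuming the default '/' separator (the tags shown in the comment are illustrative and may vary by tagger version):

from nltk import word_tokenize, pos_tag
from nltk.tag import tuple2str  # formats ('word', 'TAG') as 'word/TAG'

tagged_sent = pos_tag(word_tokenize("We are going out.Just you and me."))
print(' '.join(tuple2str(t) for t in tagged_sent))
# e.g. We/PRP are/VBP going/VBG out.Just/IN you/PRP and/CC me/PRP ./.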
In long:
I think you are over-estimating the cost of using string functions to join the strings; it's really not that expensive.
import time
from nltk.corpus import brown

tagged_corpus = brown.tagged_sents()

start = time.time()
with open('output.txt', 'w') as fout:
    for i, sent in enumerate(tagged_corpus):
        print(' '.join([word + '/' + pos for word, pos in sent]), end='\n', file=fout)
end = time.time() - start
print(i, end)
On my laptop, it took 2.955 seconds for all 57339 sentences of the Brown corpus.
[out]:

$ head -n1 output.txt
The/AT Fulton/NP-TL County/NN-TL Grand/JJ-TL Jury/NN-TL said/VBD Friday/NR an/AT investigation/NN of/IN Atlanta's/NP$ recent/JJ primary/NN election/NN produced/VBD ``/`` no/AT evidence/NN ''/'' that/CS any/DTI irregularities/NNS took/VBD place/NN ./.
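If the bottleneck is the tagging itself rather than the join, it usually helps to tag in batches with nltk.pos_tag_sents, which processes a whole list of tokenized sentences in one call instead of one call per sentence. A minimal sketch, where sentences is a hypothetical stand-in for your own tokenized data:

from nltk import pos_tag_sents

# Hypothetical input: a list of already-tokenized sentences.
sentences = [['We', 'are', 'going', 'out', '.'],
             ['Just', 'you', 'and', 'me', '.']]

with open('output.txt', 'w') as fout:
    # pos_tag_sents tags the whole batch in one call.
    for sent in pos_tag_sents(sentences):
        fout.write(' '.join(word + '/' + pos for word, pos in sent) + '\n')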
But joining the word and the POS with a string will cause trouble later, when you need to read the tagged output back, e.g.
>>> from nltk import pos_tag
>>> tagged_sent = pos_tag('cat / dog'.split())
>>> tagged_sent_str = ' '.join([word + '/' + pos for word, pos in tagged_sent])
>>> tagged_sent_str
'cat/NN //CD dog/NN'
>>> [tuple(wordpos.split('/')) for wordpos in tagged_sent_str.split()]
[('cat', 'NN'), ('', '', 'CD'), ('dog', 'NN')]
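One way to soften this round-trip problem is to split on the last separator rather than on every separator; as far as I know, nltk.tag.str2tuple does exactly that, so a lone '/' token survives (although a tag that itself contains '/' would still be ambiguous):

>>> from nltk.tag import str2tuple
>>> [str2tuple(wordpos) for wordpos in 'cat/NN //CD dog/NN'.split()]
[('cat', 'NN'), ('/', 'CD'), ('dog', 'NN')]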
If you want to save the tagged output and then read it back later, it's better to pickle the tagged output, e.g.
>>> import pickle
>>> tagged_sent = pos_tag('cat / dog'.split())
>>> with open('tagged_sent.pkl', 'wb') as fout:
...     pickle.dump(tagged_sent, fout)
...
>>> tagged_sent = None
>>> tagged_sent
>>> with open('tagged_sent.pkl', 'rb') as fin:
...     tagged_sent = pickle.load(fin)
...
>>> tagged_sent
[('cat', 'NN'), ('/', 'CD'), ('dog', 'NN')]
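If the full tagged corpus is too large to pickle as a single object, one option is to append one pickle record per sentence and read them back until the file is exhausted. A minimal sketch with made-up data:

import pickle

# Hypothetical tagged data; substitute your own pos_tag output.
tagged_sents = [[('cat', 'NN'), ('/', 'CD'), ('dog', 'NN')],
                [('We', 'PRP'), ('go', 'VBP'), ('.', '.')]]

# Write one pickle record per sentence.
with open('tagged.pkl', 'wb') as fout:
    for sent in tagged_sents:
        pickle.dump(sent, fout)

# Read the records back one at a time until EOF.
with open('tagged.pkl', 'rb') as fin:
    while True:
        try:
            print(pickle.load(fin))
        except EOFError:
            break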