I have a DataFrame like the following:
    ID  Word      Synonyms
    ------------------------
    1   drove     drive
    2   office    downtown
    3   everyday  daily
    4   day       daily
    5   work      downtown
I am reading a sentence and want to replace the words in it with the synonyms defined above. Here is my code:
    import nltk
    import pandas as pd
    import string

    sdf = pd.read_excel('C:\synonyms.xlsx')
    sd = sdf.apply(lambda x: x.astype(str).str.lower())

    words = 'i drove to office everyday in my car'

    #######
    def tokenize(text):
        text = ''.join([ch for ch in text if ch not in string.punctuation])
        tokens = nltk.word_tokenize(text)
        synonym = synonyms(tokens)
        return synonym

    def synonyms(words):
        for word in words:
            if(sd[sd['Word'] == word].index.tolist()):
                idx = sd[sd['Word'] == word].index.tolist()
                word = sd.loc[idx]['Synonyms'].item()
            else:
                word
        return word

    print(tokenize(words))
The code above tokenizes the input sentence. I want to get the following output:

    In:  i drove to office everyday in my car
    Out: i drive to downtown daily in my car
But the output I actually get is:

    Out: car
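If I skip the synonyms function, the output is fine and the sentence is split into individual words. I am trying to understand what I am doing wrong in the synonyms function. Also, please let me know if there is a better solution.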
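I would take advantage of Pandas/NumPy indexing. Since your synonym mapping is many-to-one, you can re-index using the Word column, which turns the table into a lookup Series keyed by word.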
    # Note: DataFrame.applymap was renamed to DataFrame.map in pandas 2.1.
    sd = sd.applymap(str.strip).applymap(str.lower).set_index('Word').Synonyms
    print(sd)

    Word
    drove          drive
    office      downtown
    everyday       daily
    day            daily
    Name: Synonyms, dtype: object
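Then you can easily align the list of tokens with their respective synonyms. Tokens that have no entry in the lookup come back as NaN.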
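    words = nltk.word_tokenize(u'i drove to office everyday in my car')
    # Older pandas also accepted sd[words]; recent versions raise a KeyError
    # for labels missing from the index, so reindex is used here instead.
    sentence = sd.reindex(words).reset_index()
    print(sentence)

           Word  Synonyms
    0         i       NaN
    1     drove     drive
    2        to       NaN
    3    office  downtown
    4  everyday     daily
    5        in       NaN
    6        my       NaN
    7       car       NaN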
Now all that remains is to use the token from Synonyms where one is available and fall back to Word otherwise. This can be achieved with:
    sentence = sentence.Synonyms.fillna(sentence.Word)
    print(sentence.values)

    [u'i' 'drive' u'to' 'downtown' 'daily' u'in' u'my' u'car']
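As for why the original synonyms function prints only car: the loop appears to rebind the local name word on each iteration and return it once the loop has finished, so only the last token survives. The sketch below puts the steps above together and, for comparison, shows a plain-Python loop that collects every token. It rests on a few assumptions that are not in the question: the synonym table is built in memory instead of being read from synonyms.xlsx, a simple whitespace split stands in for nltk.word_tokenize, and the names lookup, replace_with_pandas and replace_with_loop are made up for illustration.

```python
import pandas as pd

# Hypothetical in-memory stand-in for the synonyms.xlsx sheet in the question.
sdf = pd.DataFrame({
    "Word": ["drove", "office", "everyday", "day", "work"],
    "Synonyms": ["drive", "downtown", "daily", "daily", "downtown"],
})

# Lookup Series: index is the word, value is its synonym.
lookup = sdf.set_index("Word")["Synonyms"].str.lower()


def replace_with_pandas(text):
    """Align tokens with the lookup, then fall back to the original token."""
    tokens = text.split()  # assumption: whitespace tokenization is enough here
    # reindex yields NaN for tokens that have no synonym; rename_axis makes
    # sure the token column is called "Word" after reset_index.
    aligned = lookup.reindex(tokens).rename_axis("Word").reset_index()
    replaced = aligned["Synonyms"].fillna(aligned["Word"])
    return " ".join(replaced)


def replace_with_loop(text):
    """Loop version: append each result instead of overwriting one variable."""
    mapping = lookup.to_dict()
    result = []
    for word in text.split():
        # dict.get returns the synonym if present, else the word itself.
        result.append(mapping.get(word, word))
    return " ".join(result)


print(replace_with_pandas("i drove to office everyday in my car"))
print(replace_with_loop("i drove to office everyday in my car"))
```

Both calls should print i drive to downtown daily in my car. The loop version is probably the smallest change to the original code, while the pandas version may be more convenient when many sentences are processed against the same table.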