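I'm new to Python programming. I'm trying to work with a CSV file that has two columns of string values, and I want to compare the similarity ratio of the strings between the two columns. I then want to take those values and write the ratios to another file.
The CSV might look like this: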
    Column 1|Column 2
    tomato|tomatoe
    potato|potatao
    apple|appel
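For each row, I want the output file to show how similar the string in column 1 is to the string in column 2. I'm using difflib to produce the ratio score.
Here is the code I have so far: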
    import csv
    import difflib

    f = open('test.csv')

    csf_f = csv.reader(f)

    row_a = []
    row_b = []

    for row in csf_f:
        row_a.append(row[0])
        row_b.append(row[1])

    a = row_a
    b = row_b

    def similar(a, b):
        return difflib.SequenceMatcher(a, b).ratio()

    match_ratio = similar(a, b)

    match_list = []
    for row in match_ratio:
        match_list.append(row)

    with open("output.csv", "wb") as f:
        writer = csv.writer(f, delimiter=',')
        writer.writerows(match_list)

    f.close()
I get this error:
    Traceback (most recent call last):
      File "comparison.py", line 24, in <module>
        for row in match_ratio:
    TypeError: 'float' object is not iterable
I suspect I'm not importing the column lists correctly and running them against the SequenceMatcher function.
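For reference, the traceback seems to come from two things in the code above: `similar(a, b)` is called once on the whole lists rather than once per row, and the first positional argument of `difflib.SequenceMatcher` is `isjunk`, not the first sequence. The call therefore returns a single float, which the `for` loop cannot iterate over. Below is a minimal sketch of a per-row version using only `csv` and `difflib`. It assumes a comma-delimited file with a header row; the helper `write_ratios`, the `ratio` column name, and the in-memory sample data are illustrative, not part of the original code.

```python
import csv
import difflib
import io


def similar(a, b):
    # The first positional argument of SequenceMatcher is `isjunk`,
    # so pass None explicitly before the two strings being compared.
    return difflib.SequenceMatcher(None, a, b).ratio()


def write_ratios(infile, outfile):
    """Read two-column rows from infile; write each row plus its ratio to outfile."""
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    header = next(reader)  # assumption: the first row is a header
    writer.writerow(header + ["ratio"])
    for row in reader:
        a, b = row[0].strip(), row[1].strip()
        writer.writerow([a, b, similar(a, b)])  # one ratio per row


# In-memory stand-ins for test.csv and output.csv (hypothetical sample data).
source = io.StringIO(
    "Column 1,Column 2\ntomato,tomatoe\npotato,potatao\napple,appel\n"
)
result = io.StringIO()
write_ratios(source, result)
print(result.getvalue())
```

With real files on Python 3, the same function can be given `open('test.csv', newline='')` and `open('output.csv', 'w', newline='')`; the `"wb"` mode in the original only suits Python 2's `csv` module.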
Here is another way to do this, using pandas:
Suppose your CSV data looks like this:
    Column 1,Column 2
    tomato,tomatoe
    potato,potatao
    apple,appel
Code
    import pandas as pd
    import difflib as diff

    # Read the CSV
    df = pd.read_csv('datac.csv')

    # Create a new column 'diff' holding the similarity ratio for each row.
    # .iloc is used for positional access, since the column labels are strings.
    df['diff'] = df.apply(
        lambda x: diff.SequenceMatcher(None, x.iloc[0].strip(), x.iloc[1].strip()).ratio(),
        axis=1)

    # Save the DataFrame to CSV; other formats such as Excel or HTML also work
    df.to_csv('outdata.csv', index=False)
Result
    Column 1,Column 2 ,diff
    tomato,tomatoe ,0.923076923077
    potato,potatao ,0.923076923077
    apple,appel ,0.8
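As a sanity check on these numbers, the difflib documentation describes `ratio()` as `2.0 * M / T`, where `M` is the number of matched elements and `T` is the combined length of both sequences. The sketch below recomputes that by hand from `get_matching_blocks()`; the helper name `ratio_by_hand` is made up for illustration.

```python
import difflib


def ratio_by_hand(a, b):
    """Recompute SequenceMatcher.ratio() as 2.0 * M / T."""
    matcher = difflib.SequenceMatcher(None, a, b)
    # M: total size of all matching blocks (the final dummy block has size 0).
    matched = sum(block.size for block in matcher.get_matching_blocks())
    total = len(a) + len(b)  # T: combined length of both strings
    if total == 0:
        return 1.0  # difflib treats two empty sequences as identical
    return 2.0 * matched / total


for left, right in [("tomato", "tomatoe"), ("potato", "potatao"), ("apple", "appel")]:
    print(left, right, ratio_by_hand(left, right))
```

For `tomato` and `tomatoe`, six characters match out of thirteen in total, so the score is 12/13, which is about 0.923; for `apple` and `appel`, four characters match out of ten, giving 0.8.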