I am trying to highlight exactly what changed between two dataframes.
Suppose I have two Python Pandas dataframes:

"StudentRoster Jan-1":

    id   Name  score  isEnrolled  Comment
    111  Jack  2.17   True        He was late to class
    112  Nick  1.11   False       Graduated
    113  Zoe   4.12   True

"StudentRoster Jan-2":

    id   Name  score  isEnrolled  Comment
    111  Jack  2.17   True        He was late to class
    112  Nick  1.21   False       Graduated
    113  Zoe   4.12   False       On vacation
My goal is to output an HTML table:

"StudentRoster Difference Jan-1 - Jan-2":

    id   Name  score                 isEnrolled            Comment
    112  Nick  was 1.11 | now 1.21   False                 Graduated
    113  Zoe   4.12                  was True | now False  was "" | now "On vacation"

I suppose I could do a row-by-row and column-by-column comparison, but is there an easier way?
The first part is similar to Constantine's answer: you can get a boolean mask of which rows differ*:
    In [21]: ne = (df1 != df2).any(axis=1)

    In [22]: ne
    Out[22]:
    0    False
    1     True
    2     True
    dtype: bool
Then we can see which entries have changed:
    In [23]: ne_stacked = (df1 != df2).stack()

    In [24]: changed = ne_stacked[ne_stacked]

    In [25]: changed.index.names = ['id', 'col']

    In [26]: changed
    Out[26]:
    id  col
    1   score         True
    2   isEnrolled    True
        Comment       True
    dtype: bool
Here the first entry is the index and the second is the column that changed.
    In [27]: difference_locations = np.where(df1 != df2)

    In [28]: changed_from = df1.values[difference_locations]

    In [29]: changed_to = df2.values[difference_locations]

    In [30]: pd.DataFrame({'from': changed_from, 'to': changed_to}, index=changed.index)
    Out[30]:
                   from           to
    id col
    1  score       1.11         1.21
    2  isEnrolled  True        False
       Comment     None  On vacation
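The session above starts from frames that already exist, so here is a minimal self-contained sketch of the same steps. The frame construction is an assumption based on the question's sample data (default 0..2 index, empty comment stored as None), and `diff_mask` and `report` are illustrative names. The extra `isna` term is an addition of mine, not part of the steps above: a plain `!=` reports NaN versus NaN as a difference, so cells missing in both frames are excluded.

```python
import numpy as np
import pandas as pd

# Sample rosters from the question, with the default 0..2 row index.
columns = ["id", "Name", "score", "isEnrolled", "Comment"]
df1 = pd.DataFrame(
    [[111, "Jack", 2.17, True, "He was late to class"],
     [112, "Nick", 1.11, False, "Graduated"],
     [113, "Zoe", 4.12, True, None]],
    columns=columns,
)
df2 = pd.DataFrame(
    [[111, "Jack", 2.17, True, "He was late to class"],
     [112, "Nick", 1.21, False, "Graduated"],
     [113, "Zoe", 4.12, False, "On vacation"]],
    columns=columns,
)

# True where cells differ; cells missing in both frames are not counted.
diff_mask = (df1 != df2) & ~(df1.isna() & df2.isna())

# Long form: keep only the (row, column) pairs that changed.
stacked = diff_mask.stack()
changed = stacked[stacked]
changed.index.names = ["id", "col"]

# Positional lookup of the old and new values at the changed cells.
rows, cols = np.where(diff_mask)
report = pd.DataFrame(
    {"from": df1.to_numpy()[rows, cols], "to": df2.to_numpy()[rows, cols]},
    index=changed.index,
)
print(report)
```

Both `stack()` and `np.where` walk the mask row by row, which is why the values line up with `changed.index`.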
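* Note: it's important that df1 and df2 share the same index here. To overcome this ambiguity, you can make sure you only look at the shared labels using df1.index & df2.index (in current pandas, df1.index.intersection(df2.index), since & no longer performs a set intersection on an Index), but I think I'll leave that as an exercise.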
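One possible take on that exercise, as a sketch rather than the definitive approach: restrict both frames to the labels they share before comparing. The frames `old` and `new` below are hypothetical, keyed by student id, and `Index.intersection` is used for the set intersection.

```python
import pandas as pd

# Hypothetical rosters keyed by student id: 113 only in old, 114 only in new.
old = pd.DataFrame({"score": [2.17, 1.11, 4.12]}, index=[111, 112, 113])
new = pd.DataFrame({"score": [2.17, 1.21, 3.50]}, index=[111, 112, 114])

# Labels present in both frames.
shared_rows = old.index.intersection(new.index)
shared_cols = old.columns.intersection(new.columns)

# Restrict both frames to the shared labels so they are identically labeled.
old_shared = old.loc[shared_rows, shared_cols]
new_shared = new.loc[shared_rows, shared_cols]

# The element-wise comparison is now well defined.
changed_rows = (old_shared != new_shared).any(axis=1)
print(changed_rows)
```

Without this step, comparing `old != new` directly raises a ValueError because the two frames are not identically labeled. Rows that exist in only one frame (113 and 114 here) are dropped from the comparison and would need to be reported separately.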
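To get from the comparison to the HTML table the question asks for, one option is to overwrite each changed cell with a "was ... | now ..." string and render only the changed rows. This is a rough sketch under assumptions: the frames are rebuilt from the question's sample data and indexed by id, and `show`, `display` and `diff_mask` are illustrative names, not part of pandas.

```python
import pandas as pd

columns = ["id", "Name", "score", "isEnrolled", "Comment"]
df1 = pd.DataFrame(
    [[111, "Jack", 2.17, True, "He was late to class"],
     [112, "Nick", 1.11, False, "Graduated"],
     [113, "Zoe", 4.12, True, None]],
    columns=columns,
).set_index("id")
df2 = pd.DataFrame(
    [[111, "Jack", 2.17, True, "He was late to class"],
     [112, "Nick", 1.21, False, "Graduated"],
     [113, "Zoe", 4.12, False, "On vacation"]],
    columns=columns,
).set_index("id")

# True where cells differ; cells missing in both frames are not counted.
diff_mask = (df1 != df2) & ~(df1.isna() & df2.isna())


def show(value):
    """Render a cell for display; missing values appear as an empty quoted string."""
    return '""' if pd.isna(value) else str(value)


# Keep only rows with at least one change; object dtype lets cells hold text.
display = df2[diff_mask.any(axis=1)].astype(object)
for row_label in display.index:
    for col in display.columns:
        if diff_mask.at[row_label, col]:
            old, new = df1.at[row_label, col], df2.at[row_label, col]
            display.at[row_label, col] = f"was {show(old)} | now {show(new)}"

html = display.to_html(na_rep="")
print(html)
```

Unchanged cells in the kept rows show the current (Jan-2) value, and unchanged rows such as Jack's are left out, matching the layout in the question.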