I have two DataFrames (each with a DatetimeIndex) and want to update the first frame (the older one) with data from the second frame (the newer one).
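The new frame may contain more recent data for rows that already exist in the old frame. In that case, the data in the old frame should be overwritten with the data from the new frame. The newer frame may also have more columns or rows than the first one. In that case, the old frame should be enlarged with the data from the new frame.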
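The pandas documentation states that "the .loc/.ix/[] operations can perform enlargement when setting a non-existent key for that axis" and that "a DataFrame can be enlarged on either axis via .loc".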
However, this does not seem to work and raises a KeyError. Example:
    In [195]: df1
    Out[195]:
                         A  B  C
    2015-07-09 12:00:00  1  1  1
    2015-07-09 13:00:00  1  1  1
    2015-07-09 14:00:00  1  1  1
    2015-07-09 15:00:00  1  1  1

    In [196]: df2
    Out[196]:
                         A  B  C  D
    2015-07-09 14:00:00  2  2  2  2
    2015-07-09 15:00:00  2  2  2  2
    2015-07-09 16:00:00  2  2  2  2
    2015-07-09 17:00:00  2  2  2  2

    In [197]: df1.loc[df2.index] = df2
    ---------------------------------------------------------------------------
    KeyError                                  Traceback (most recent call last)
    <ipython-input-197-74e630e87cf8> in <module>()
    ----> 1 df1.loc[df2.index] = df2

    /.../pandas/core/indexing.pyc in __setitem__(self, key, value)
        112
        113     def __setitem__(self, key, value):
    --> 114         indexer = self._get_setitem_indexer(key)
        115         self._setitem_with_indexer(indexer, value)
        116

    /.../pandas/core/indexing.pyc in _get_setitem_indexer(self, key)
        107
        108         try:
    --> 109             return self._convert_to_indexer(key, is_setter=True)
        110         except TypeError:
        111             raise IndexingError(key)

    /.../pandas/core/indexing.pyc in _convert_to_indexer(self, obj, axis, is_setter)
       1110             mask = check == -1
       1111             if mask.any():
    -> 1112                 raise KeyError('%s not in index' % objarr[mask])
       1113
       1114         return _values_from_object(indexer)

    KeyError: "['2015-07-09T18:00:00.000000000+0200' '2015-07-09T19:00:00.000000000+0200'] not in index"
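What is the best way (in terms of performance, since my real data is much larger) to get the desired updated and enlarged DataFrame? This is the result I would like to see: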
                         A  B  C    D
    2015-07-09 12:00:00  1  1  1  NaN
    2015-07-09 13:00:00  1  1  1  NaN
    2015-07-09 14:00:00  2  2  2    2
    2015-07-09 15:00:00  2  2  2    2
    2015-07-09 16:00:00  2  2  2    2
    2015-07-09 17:00:00  2  2  2    2
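df2.combine_first(df1) (see the documentation) seems to do what you need: values from df2 take precedence, and the result covers the union of both frames' rows and columns. A code snippet and its output follow.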
    import pandas as pd

    print('pandas-version: ' + pd.__version__)

    # Older frame: columns A-C, indexed by the timestamp column 'Dt'
    df1 = pd.DataFrame.from_records(
        [('2015-07-09 12:00:00', 1, 1, 1),
         ('2015-07-09 13:00:00', 1, 1, 1),
         ('2015-07-09 14:00:00', 1, 1, 1),
         ('2015-07-09 15:00:00', 1, 1, 1)],
        columns=['Dt', 'A', 'B', 'C']).set_index('Dt')

    # Newer frame: overlaps df1 on two rows, adds two rows and column D
    df2 = pd.DataFrame.from_records(
        [('2015-07-09 14:00:00', 2, 2, 2, 2),
         ('2015-07-09 15:00:00', 2, 2, 2, 2),
         ('2015-07-09 16:00:00', 2, 2, 2, 2),
         ('2015-07-09 17:00:00', 2, 2, 2, 2)],
        columns=['Dt', 'A', 'B', 'C', 'D']).set_index('Dt')

    # Take df2's values first and fall back to df1 where df2 has none
    res_combine1st = df2.combine_first(df1)
    print(res_combine1st)
    pandas-version: 0.15.2
                         A  B  C    D
    Dt
    2015-07-09 12:00:00  1  1  1  NaN
    2015-07-09 13:00:00  1  1  1  NaN
    2015-07-09 14:00:00  2  2  2    2
    2015-07-09 15:00:00  2  2  2    2
    2015-07-09 16:00:00  2  2  2    2
    2015-07-09 17:00:00  2  2  2    2
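One caveat: combine_first only falls back to the other frame where the calling frame has a null value, so a NaN in the newer frame will not overwrite an existing value in the older one. If NaN values should overwrite as well, an alternative (not part of the answer above, just a minimal sketch) is to reindex the old frame to the union of both indexes and columns and then assign the new frame with .loc. The names old, new, result, and enlarged are illustrative, and the frames are rebuilt here with a real DatetimeIndex as in the question.

```python
import pandas as pd

# Illustrative frames mirroring the question, with a real DatetimeIndex.
old = pd.DataFrame(
    1,
    index=pd.to_datetime([
        "2015-07-09 12:00", "2015-07-09 13:00",
        "2015-07-09 14:00", "2015-07-09 15:00",
    ]),
    columns=["A", "B", "C"],
)
new = pd.DataFrame(
    2,
    index=pd.to_datetime([
        "2015-07-09 14:00", "2015-07-09 15:00",
        "2015-07-09 16:00", "2015-07-09 17:00",
    ]),
    columns=["A", "B", "C", "D"],
)

# Option 1: combine_first. Values from `new` win wherever they are non-null.
result = new.combine_first(old)

# Option 2: enlarge `old` to the union of labels, then overwrite with `new`.
# All labels exist after the reindex, so .loc does not raise a KeyError,
# and NaN values in `new` (if any) would overwrite the old values too.
enlarged = old.reindex(
    index=old.index.union(new.index),
    columns=old.columns.union(new.columns),
)
enlarged.loc[new.index, new.columns] = new

print(result)
print(enlarged)
```

Which option is faster on large data depends on the shapes and dtypes involved, so it is worth timing both on a realistic sample.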