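I'm analyzing a time series, and based on some conditions I can pick out the rows where an event starts or ends. At this point, my series looks like this (I've omitted some repeated values for brevity):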
    import numpy as np
    import pandas
    from pandas import Timestamp

    datadict = {'event': {
        Timestamp('2010-01-01 00:20:00', tz=None): 'event start',
        Timestamp('2010-01-01 00:30:00', tz=None): '--',
        Timestamp('2010-01-01 00:40:00', tz=None): '--',
        Timestamp('2010-01-01 00:50:00', tz=None): '--',
        Timestamp('2010-01-01 01:00:00', tz=None): '--',
        Timestamp('2010-01-01 01:10:00', tz=None): 'event end',
        Timestamp('2010-01-01 01:20:00', tz=None): '--',
        Timestamp('2010-01-01 02:20:00', tz=None): '--',
        Timestamp('2010-01-01 02:30:00', tz=None): 'event start',
        Timestamp('2010-01-01 02:40:00', tz=None): '--',
        Timestamp('2010-01-01 02:50:00', tz=None): '--',
        Timestamp('2010-01-01 03:00:00', tz=None): '--',
        Timestamp('2010-01-01 03:10:00', tz=None): '--',
        Timestamp('2010-01-01 03:20:00', tz=None): '--',
        Timestamp('2010-01-01 03:30:00', tz=None): 'event end',
    }}
    data = pandas.DataFrame.from_dict(datadict)

                               event
    2010-01-01 00:20:00  event start
    2010-01-01 00:30:00           --
    2010-01-01 00:40:00           --
    2010-01-01 00:50:00           --
    2010-01-01 01:00:00           --
    2010-01-01 01:10:00    event end
    2010-01-01 01:20:00           --
    2010-01-01 02:20:00           --
    2010-01-01 02:30:00  event start
    2010-01-01 02:40:00           --
    2010-01-01 02:50:00           --
    2010-01-01 03:00:00           --
    2010-01-01 03:10:00           --
    2010-01-01 03:20:00           --
    2010-01-01 03:30:00    event end
What I would like to end up with is an event number column like this:

                               event  event number
    2010-01-01 00:20:00  event start             1
    2010-01-01 00:30:00           --             1
    2010-01-01 00:40:00           --             1
    2010-01-01 00:50:00           --             1
    2010-01-01 01:00:00           --             1
    2010-01-01 01:10:00    event end             1
    2010-01-01 01:20:00           --            NA
    2010-01-01 02:20:00           --            NA
    2010-01-01 02:30:00  event start             2
    2010-01-01 02:40:00           --             2
    2010-01-01 02:50:00           --             2
    2010-01-01 03:00:00           --             2
    2010-01-01 03:10:00           --             2
    2010-01-01 03:20:00           --             2
    2010-01-01 03:30:00    event end             2
    2010-01-01 03:40:00           --            NA
    2010-01-01 03:50:00           --            NA
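With some optimistic assumptions about data quality (every event start is followed by exactly one event end), I can get the event numbers like this: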
    table = data[data.event != '--'].reset_index()
    table['event number'] = 1 + np.floor(table.index / 2)
    table = table.set_index('index')

                               event  event number
    index
    2010-01-01 00:20:00  event start             1
    2010-01-01 01:10:00    event end             1
    2010-01-01 02:30:00  event start             2
    2010-01-01 03:30:00    event end             2
I can then join that back onto the original dataframe and call fillna with method='ffill':
    data2 = data.join(table[['event number']])
    data2['filled'] = data2['event number'].fillna(method='ffill')

                               event  event number  filled
    2010-01-01 00:20:00  event start             1       1
    2010-01-01 00:30:00           --           NaN       1
    2010-01-01 00:40:00           --           NaN       1
    2010-01-01 00:50:00           --           NaN       1
    2010-01-01 01:00:00           --           NaN       1
    2010-01-01 01:10:00    event end             1       1
    2010-01-01 01:20:00           --           NaN       1  # <- d'oh
    2010-01-01 02:20:00           --           NaN       1  # <- d'oh
    2010-01-01 02:30:00  event start             2       2
    2010-01-01 02:40:00           --           NaN       2
    2010-01-01 02:50:00           --           NaN       2
    2010-01-01 03:00:00           --           NaN       2
    2010-01-01 03:10:00           --           NaN       2
    2010-01-01 03:20:00           --           NaN       2
    2010-01-01 03:30:00    event end             2       2
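As you can see, the time between the events (01:20 to 02:20) gets associated with event #1, because the forward fill carries the last event number across the gap.
Is there any way to skip those sections without looping?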
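You can do this by looking at the cumulative sums of the number of event start and event end rows. Start with the cumulative sum of the event start rows: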
    >>> data['event number'] = (data.event == 'event start').cumsum()
    >>> data
                               event  event number
    2010-01-01 00:20:00  event start             1
    2010-01-01 00:30:00           --             1
    2010-01-01 00:40:00           --             1
    2010-01-01 00:50:00           --             1
    2010-01-01 01:00:00           --             1
    2010-01-01 01:10:00    event end             1
    2010-01-01 01:20:00           --             1
    2010-01-01 02:20:00           --             1
    2010-01-01 02:30:00  event start             2
    2010-01-01 02:40:00           --             2
    2010-01-01 02:50:00           --             2
    2010-01-01 03:00:00           --             2
    2010-01-01 03:10:00           --             2
    2010-01-01 03:20:00           --             2
    2010-01-01 03:30:00    event end             2
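Now you just need to set the rows where there is no event to nan. Those positions are the rows where the cumulative sum of event start equals the cumulative sum of event end, with the latter shifted by 1 row so that the event end row itself still belongs to its event: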
    >>> idx = data['event number'] == (data.event.shift(1) == 'event end').cumsum()
    >>> data.loc[idx, 'event number'] = np.nan
    >>> data
                               event  event number
    2010-01-01 00:20:00  event start             1
    2010-01-01 00:30:00           --             1
    2010-01-01 00:40:00           --             1
    2010-01-01 00:50:00           --             1
    2010-01-01 01:00:00           --             1
    2010-01-01 01:10:00    event end             1
    2010-01-01 01:20:00           --           NaN
    2010-01-01 02:20:00           --           NaN
    2010-01-01 02:30:00  event start             2
    2010-01-01 02:40:00           --             2
    2010-01-01 02:50:00           --             2
    2010-01-01 03:00:00           --             2
    2010-01-01 03:10:00           --             2
    2010-01-01 03:20:00           --             2
    2010-01-01 03:30:00    event end             2

    [15 rows x 2 columns]
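If it helps, the two cumulative sums can be wrapped into one small function. This is only a sketch: number_events is a made-up helper name (not part of pandas), the shortened sample series is invented for illustration, and it assumes the markers strictly alternate between start and end. It uses Series.where instead of assigning through .loc, which should give the same result here.

```python
import pandas as pd


def number_events(events, start="event start", end="event end"):
    """Label each row with its event number, NaN outside events.

    Hypothetical helper; assumes markers alternate start, end, start, ...
    """
    # How many events have started at or before each row.
    started = (events == start).cumsum()
    # How many events ended strictly before each row; the shift keeps
    # the 'event end' row itself inside its own event.
    ended_before = (events.shift(1) == end).cumsum()
    # Inside an event, one more event has started than has ended.
    return started.where(started > ended_before)


# Shortened, made-up sample in the same shape as the question's data.
index = pd.date_range("2010-01-01 00:20", periods=8, freq="10min")
events = pd.Series(
    ["event start", "--", "event end", "--", "--",
     "event start", "event end", "--"],
    index=index,
)
numbers = number_events(events)
print(numbers)
```

Rows before the first event start also come out as NaN, since both counts are still zero there. The result is a float column because NaN cannot be stored in an integer column.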