I am currently working with an experimental time-series dataset that contains missing values. I want to compute a sliding-window average of this dataset over time while handling the NaN values. The correct approach for my purposes is to sum the finite elements inside each window and then divide by their count. This non-linearity forces me to tackle the problem with a non-convolutional approach, and that step has become a severe time bottleneck in the processing. As a code example of what I am trying to accomplish, I present the following:
import numpy as np

# Construct sample data
n = 50
n_miss = 20
win_size = 3
data = np.random.random(n)
data[np.random.randint(0, n - 1, n_miss)] = np.nan

# Compute the sliding-window mean, ignoring NaNs
result = np.zeros(data.size)
for count in range(data.size):
    part_data = data[max(count - (win_size - 1) // 2, 0):
                     min(count + (win_size + 1) // 2, data.size)]
    mask = np.isfinite(part_data)
    if np.sum(mask) != 0:
        result[count] = np.sum(part_data[mask]) / np.sum(mask)
    else:
        result[count] = np.nan

print('Input:\t', data)
print('Output:\t', result)
Output:
Input:  [ 0.47431791 0.17620835 0.78495647 0.79894688 0.58334064 0.38068788 0.87829696 nan
  0.71589171 nan 0.70359557 0.76113969 0.13694387 0.32126573 0.22730891 nan
  0.35057169 nan 0.89251851 0.56226354 0.040117 nan 0.37249799 0.77625334
  nan nan nan nan 0.63227417 0.92781944 0.99416471 0.81850753
  0.35004997 nan 0.80743783 0.60828597 nan 0.01410721 nan nan
  0.6976317 nan 0.03875394 0.60924066 0.22998065 nan 0.34476729 0.38090961
  nan 0.2021964 ]
Output: [ 0.32526313 0.47849424 0.5867039 0.72241466 0.58765847 0.61410849 0.62949242 0.79709433
  0.71589171 0.70974364 0.73236763 0.53389305 0.40644977 0.22850617 0.27428732 0.2889403
  0.35057169 0.6215451 0.72739103 0.49829968 0.30119027 0.20630749 0.57437567 0.57437567
  0.77625334 nan nan 0.63227417 0.7800468 0.85141944 0.91349722 0.7209074
  0.58427875 0.5787439 0.7078619 0.7078619 0.31119659 0.01410721 0.01410721 0.6976317
  0.6976317 0.36819282 0.3239973 0.29265842 0.41961066 0.28737397 0.36283845 0.36283845
  0.29155301 0.2021964 ]
Can this result be produced with numpy operations, without using a for loop?
Here's a convolution-based approach using np.convolve -
# Finite sum in each window divided by the count of finite values in it
mask = np.isnan(data)
K = np.ones(win_size, dtype=int)
out = np.convolve(np.where(mask, 0, data), K) / np.convolve(~mask, K)
Note that this adds one extra element on each side, because np.convolve defaults to 'full' mode.
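A minimal variant of the same idea, assuming win_size is odd: passing mode='same' to np.convolve keeps the output the same length as the input, so no slicing is needed (the helper name nanmean_filter_1d is just for illustration):

import numpy as np

def nanmean_filter_1d(data, win_size):
    # Sliding-window mean ignoring NaNs; assumes win_size is odd so that
    # mode='same' centres each window like the sliced 'full' convolution.
    mask = np.isnan(data)
    K = np.ones(win_size, dtype=int)
    sums = np.convolve(np.where(mask, 0, data), K, mode='same')  # finite sums
    counts = np.convolve(~mask, K, mode='same')                  # finite counts
    with np.errstate(invalid='ignore'):  # all-NaN windows give 0/0 -> nan
        return sums / counts

For an all-NaN window the count is 0, so the division gives NaN; the np.errstate context just suppresses the corresponding RuntimeWarning.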
If you are working with 2D data, we can use Scipy's 2D convolution, along the lines sketched below.
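A minimal sketch of the same masking trick in 2D, assuming scipy.signal.convolve2d and a square window (the helper name nanmean_filter_2d is just illustrative):

import numpy as np
from scipy.signal import convolve2d

def nanmean_filter_2d(data2d, win_size):
    # 2D sliding-window mean ignoring NaNs: convolve the zero-filled data
    # and the finite-value mask with a box kernel, then divide.
    mask = np.isnan(data2d)
    K = np.ones((win_size, win_size), dtype=int)
    sums = convolve2d(np.where(mask, 0, data2d), K, mode='same')
    counts = convolve2d((~mask).astype(int), K, mode='same')
    with np.errstate(invalid='ignore'):  # all-NaN windows -> 0/0 -> nan
        return sums / counts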
Approaches -
def original_app(data, win_size):
    # Loopy approach from the question: finite sum / finite count per window
    result = np.zeros(data.size)
    for count in range(data.size):
        part_data = data[max(count - (win_size - 1) // 2, 0):
                         min(count + (win_size + 1) // 2, data.size)]
        mask = np.isfinite(part_data)
        if np.sum(mask) != 0:
            result[count] = np.sum(part_data[mask]) / np.sum(mask)
        else:
            result[count] = np.nan
    return result

def numpy_app(data, win_size):
    # Convolution-based approach
    mask = np.isnan(data)
    K = np.ones(win_size, dtype=int)
    out = np.convolve(np.where(mask, 0, data), K) / np.convolve(~mask, K)
    return out[1:-1]  # Slice out the one-extra elems on sides
Sample run - (the RuntimeWarning from the numpy version comes from all-NaN windows, where the count from the mask convolution is 0 and the division is 0/0, giving NaN)
In [118]: #Construct sample data
     ...: n = 50
     ...: n_miss = 20
     ...: win_size = 3
     ...: data= np.random.random(50)
     ...: data[np.random.randint(0,n-1, n_miss)] = np.nan
     ...:

In [119]: original_app(data, win_size = 3)
Out[119]:
array([ 0.88356487,  0.86829731,  0.85249541,  0.83776219,         nan,
               nan,  0.61054015,  0.63111926,  0.63111926,  0.65169837,
        0.1857301 ,  0.58335324,  0.42088104,  0.5384565 ,  0.31027752,
        0.40768907,  0.3478563 ,  0.34089655,  0.55462903,  0.71784816,
        0.93195716,         nan,  0.41635575,  0.52211653,  0.65053379,
        0.76762282,  0.72888574,  0.35250449,  0.35250449,  0.14500637,
        0.06997668,  0.22582318,  0.18621848,  0.36320784,  0.19926647,
        0.24506199,  0.09983572,  0.47595439,  0.79792941,  0.5982114 ,
        0.42389375,  0.28944089,  0.36246113,  0.48088139,  0.71105449,
        0.60234163,  0.40012839,  0.45100475,  0.41768466,  0.41768466])

In [120]: numpy_app(data, win_size = 3)
__main__:36: RuntimeWarning: invalid value encountered in divide
Out[120]:
array([ 0.88356487,  0.86829731,  0.85249541,  0.83776219,         nan,
               nan,  0.61054015,  0.63111926,  0.63111926,  0.65169837,
        0.1857301 ,  0.58335324,  0.42088104,  0.5384565 ,  0.31027752,
        0.40768907,  0.3478563 ,  0.34089655,  0.55462903,  0.71784816,
        0.93195716,         nan,  0.41635575,  0.52211653,  0.65053379,
        0.76762282,  0.72888574,  0.35250449,  0.35250449,  0.14500637,
        0.06997668,  0.22582318,  0.18621848,  0.36320784,  0.19926647,
        0.24506199,  0.09983572,  0.47595439,  0.79792941,  0.5982114 ,
        0.42389375,  0.28944089,  0.36246113,  0.48088139,  0.71105449,
        0.60234163,  0.40012839,  0.45100475,  0.41768466,  0.41768466])
Runtime test -
In [122]: #Construct sample data
     ...: n = 50000
     ...: n_miss = 20000
     ...: win_size = 3
     ...: data= np.random.random(n)
     ...: data[np.random.randint(0,n-1, n_miss)] = np.nan
     ...:

In [123]: %timeit original_app(data, win_size = 3)
1 loops, best of 3: 1.51 s per loop

In [124]: %timeit numpy_app(data, win_size = 3)
1000 loops, best of 3: 1.09 ms per loop

In [125]: import pandas as pd # @jdehesa's pandas solution

In [126]: %timeit pd.Series(data).rolling(window=3, min_periods=1).mean()
100 loops, best of 3: 3.34 ms per loop