NumPy：向量化为另一个数组中的每个元素在数组中找到最接近的值

小编典典

NumPy：向量化为另一个数组中的每个元素在数组中找到最接近的值

algorithm

输入项

known_array：numpy数组; 仅由标量值组成；shape: (m, 1)

test_array：numpy数组; 仅由标量值组成；shape: (n, 1)

输出量

indices：numpy数组; shape: (n, 1); 对于in
test_array中的每个值，查找in中最接近的值的索引known_array

residual：numpy数组; shape: (n, 1); 对于in
test_array中的每个值，查找与in中最接近的值的差known_array

例

In [17]: known_array = np.array([random.randint(-30,30) for i in range(5)])

In [18]: known_array
Out[18]: array([-24, -18, -13, -30,  29])

In [19]: test_array = np.array([random.randint(-10,10) for i in range(10)])

In [20]: test_array
Out[20]: array([-6,  4, -6,  4,  8, -4,  8, -6,  2,  8])

示例实现（未完全向量化）

def find_nearest(known_array, value):
    idx = (np.abs(known_array - value)).argmin()
    diff = known_array[idx] - value
    return [idx, -diff]

In [22]: indices = np.zeros(len(test_array))

In [23]: residual = np.zeros(len(test_array))

In [24]: for i in range(len(test_array)):
   ....:     [indices[i], residual[i]] =  find_nearest(known_array, test_array[i])
   ....:

In [25]: indices
Out[25]: array([ 2.,  2.,  2.,  2.,  2.,  2.,  2.,  2.,  2.,  2.])

In [26]: residual
Out[26]: array([  7.,  17.,   7.,  17.,  21.,   9.,  21.,   7.,  15.,  21.])

加快此任务的最佳方法是什么？Cython是一个选项，但是，我始终希望能够删除for循环并让代码保留为纯NumPy。

更新

我做了一些小的基准测试来比较非矢量化和矢量化解决方案（可接受的答案）。

In [48]: [indices1, residual1] = find_nearest_vectorized(known_array, test_array)

In [53]: [indices2, residual2] = find_nearest_non_vectorized(known_array, test_array)

In [54]: indices1==indices2
Out[54]: array([ True,  True,  True,  True,  True,  True,  True,  True,  True,  True],   dtype=bool)

In [55]: residual1==residual2
Out[55]: array([ True,  True,  True,  True,  True,  True,  True,  True,  True,  True], dtype=bool)

In [56]: %timeit [indices2, residual2] = find_nearest_non_vectorized(known_array, test_array)
10000 loops, best of 3: 173 µs per loop

In [57]: %timeit [indices1, residual1] = find_nearest_vectorized(known_array, test_array)
100000 loops, best of 3: 16.8 µs per loop

加速约 10倍 ！

澄清度

known_array 未排序。

我按照下面@cyborg的答案运行了基准测试。

情况1 ：如果known_array已排序

known_array = np.arange(0,1000)
test_array = np.random.randint(0, 100, 10000)
print('Speedups:')
base_time = time_f('base')
for func_name in ['diffs', 'searchsorted1', 'searchsorted2']:
    print func_name + ' is x%.1f faster than base.' % (base_time / time_f(func_name))
    assert np.allclose(base(known_array, test_array), eval(func_name+'(known_array, test_array)'))

Speedups:
diffs is x0.4 faster than base.
searchsorted1 is x81.3 faster than base.
searchsorted2 is x107.6 faster than base.

首先，对于大型数组，diffs方法实际上要慢一些，它还会占用大量RAM，而当我在实际数据上运行它时，系统就会挂起。

情况2 ：何时known_array未排序；代表实际情况

known_array = np.random.randint(0,100,100)
test_array = np.random.randint(0, 100, 100)

Speedups:
diffs is x8.9 faster than base.
AssertionError                            Traceback (most recent call last)
<ipython-input-26-3170078c217a> in <module>()
      5 for func_name in ['diffs', 'searchsorted1', 'searchsorted2']:
      6     print func_name + ' is x%.1f faster than base.' % (base_time /  time_f(func_name))
----> 7     assert np.allclose(base(known_array, test_array),  eval(func_name+'(known_array, test_array)'))

AssertionError:


searchsorted1 is x14.8 faster than base.

我还必须评论说，该方法还应该具有存储效率。否则我的8 GB RAM不足。在基本情况下，这很容易满足。

阅读 503

2020-07-28

共1个答案

小编典典

例如，您可以计算使用中的所有差异：

differences = (test_array.reshape(1,-1) - known_array.reshape(-1,1))

以及使用argmin和花式索引以及np.diagonal获得所需的索引和差异：

indices = np.abs(differences).argmin(axis=0)
residual = np.diagonal(differences[indices,])

因此对于

>>> known_array = np.array([-24, -18, -13, -30,  29])
>>> test_array = np.array([-6,  4, -6,  4,  8, -4,  8, -6,  2,  8])

一送一

>>> indices
array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
>>> residual
array([ 7, 17,  7, 17, 21,  9, 21,  7, 15, 21])

2020-07-28