Efficient way to apply multiple filters to a pandas DataFrame or Series

小编典典


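I have a scenario where a user wants to apply several filters to a Pandas DataFrame or Series object. Essentially, I want to efficiently chain together a set of filters (comparison operations) that the user specifies at run time.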

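The filters should be additive (that is, each one applied should narrow the results).
I am currently using reindex() (as shown below), but this creates a new object each time and copies the underlying data (if I understand the documentation correctly). I want to avoid this unnecessary copying, because it is very inefficient when filtering a large Series or DataFrame.
I think that using apply(), map(), or something similar might be better. I am fairly new to Pandas, though, so I am still trying to wrap my head around everything.
I would also like to extend this so that the dictionary passed in can include the columns to operate on, and filter an entire DataFrame based on the input dictionary. However, I am assuming that whatever works for a Series can easily be extended to a DataFrame.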

TL;DR

I want to take a dictionary of the following form, apply each operation to a given Series object, and return a "filtered" Series object.

relops = {'>=': [1], '<=': [1]}

Long Example

I will start with an example of what I have currently, which filters only a single Series object. Below is the function I am currently using:

def apply_relops(series, relops):
    """
    Pass dictionary of relational operators to perform on given series object
    """
    for op, vals in relops.iteritems():
        op_func = ops[op]
        for val in vals:
            filtered = op_func(series, val)
            series = series.reindex(series[filtered])
    return series

The user provides a dictionary with the operations they want to perform:

>>> df = pandas.DataFrame({'col1': [0, 1, 2], 'col2': [10, 11, 12]})
>>> print df
   col1  col2
0     0    10
1     1    11
2     2    12

>>> from operator import le, ge
>>> ops ={'>=': ge, '<=': le}
>>> apply_relops(df['col1'], {'>=': [1]})
col1
1       1
2       2
Name: col1
>>> apply_relops(df['col1'], relops = {'>=': [1], '<=': [1]})
col1
1       1
Name: col1

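Again, the "problem" with my approach above is that I think there may be a lot of unnecessary copying of the data in the intermediate steps.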


2022-06-21

1 answer

小编典典

Pandas (and numpy) allow boolean indexing, which will be much more efficient:

In [11]: df.loc[df['col1'] >= 1, 'col1']
Out[11]: 
1    1
2    2
Name: col1

In [12]: df[df['col1'] >= 1]
Out[12]: 
   col1  col2
1     1    11
2     2    12

In [13]: df[(df['col1'] >= 1) & (df['col1'] <=1 )]
Out[13]: 
   col1  col2
1     1    11

If you want to write helper functions for this, consider something along these lines:

In [14]: def b(x, col, op, n): 
             return op(x[col],n)

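In [15]: def f(x, *b):
             # logical_and.reduce combines any number of masks (np is numpy);
             # np.logical_and(*b) would only work with exactly two
             return x[np.logical_and.reduce(b)]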

In [16]: b1 = b(df, 'col1', ge, 1)

In [17]: b2 = b(df, 'col1', le, 1)

In [18]: f(df, b1, b2)
Out[18]: 
   col1  col2
1     1    11
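The helper idea above might be extended to the dictionary-of-columns form mentioned in the question. The following is a minimal sketch under assumptions: `filter_frame`, `OPS`, and the nested dictionary layout (column name mapped to a relops dictionary) are hypothetical names and shapes chosen for illustration, not something the original post defines.

```python
from functools import reduce
from operator import and_, ge, le

import pandas as pd

# Hypothetical lookup table, mirroring the question's `ops`.
OPS = {'>=': ge, '<=': le}


def filter_frame(frame, filters):
    """Keep the rows of `frame` that pass every comparison in `filters`.

    `filters` maps a column name to a relops dictionary, for example
    {'col1': {'>=': [1]}, 'col2': {'<=': [11]}}.
    """
    # Build one boolean Series per (column, operator, value) triple.
    masks = [
        OPS[op](frame[col], val)
        for col, relops in filters.items()
        for op, vals in relops.items()
        for val in vals
    ]
    if not masks:
        return frame  # nothing to filter on
    # AND all masks together, then index the frame a single time.
    return frame[reduce(and_, masks)]


df = pd.DataFrame({'col1': [0, 1, 2], 'col2': [10, 11, 12]})
filtered = filter_frame(df, {'col1': {'>=': [1]}, 'col2': {'<=': [11]}})
```

With the sample frame, only the row at index 1 (col1 = 1, col2 = 11) passes both conditions.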

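Update: pandas 0.13 has a query method for these kinds of use cases. Assuming the column names are valid identifiers, the following works (and can be more efficient for large frames, since it uses numexpr behind the scenes):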

In [21]: df.query('col1 <= 1 & 1 <= col1')
Out[21]:
   col1  col2
1     1    11
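Since the filters in the question arrive at run time, one possible way to use query is to assemble the expression string from the relops dictionary. This is only a sketch: `build_query` and `ALLOWED_OPS` are hypothetical names, and the whitelist check is an assumption added because query evaluates the string it is given.

```python
import pandas as pd

# Operators accepted from user input (assumed whitelist for this sketch).
ALLOWED_OPS = {'>=', '<=', '>', '<', '==', '!='}


def build_query(col, relops):
    """Turn {'>=': [1], '<=': [1]} into 'col1 >= 1 and col1 <= 1'."""
    clauses = []
    for op, vals in relops.items():
        if op not in ALLOWED_OPS:
            raise ValueError(f"unsupported operator: {op!r}")
        for val in vals:
            # repr() keeps string values quoted inside the expression.
            clauses.append(f"{col} {op} {val!r}")
    return " and ".join(clauses)


df = pd.DataFrame({'col1': [0, 1, 2], 'col2': [10, 11, 12]})
expr = build_query('col1', {'>=': [1], '<=': [1]})
result = df.query(expr)
```

As with the hand-written query above, this relies on the column name being a valid identifier.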

2022-06-21