Why do capture groups make a regex search slower in Python?

小编典典

python

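I have application code that dynamically generates regular expressions from a configuration in order to do some parsing. When I time two variants, the one that wraps each alternative of the OR pattern in a capture group is noticeably slower than the plain pattern. I assume the cause is the overhead of some operation inside the regex module.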

>>> import timeit
>>> setup = '''
... import re
... '''

# no capture groups
>>> print(timeit.timeit("re.search(r'hello|bye|ola|cheers','some say hello,some say bye, or ola or cheers!')", setup=setup))
0.922958850861

# with capture groups
>>> print(timeit.timeit("re.search(r'(hello)|(bye)|(ola)|(cheers)','some say hello,some say bye, or ola or cheers!')", setup=setup))
1.44321084023

# no capture groups (second run)
>>> print(timeit.timeit("re.search(r'hello|bye|ola|cheers','some say hello,some say bye, or ola or cheers!')", setup=setup))
0.913202047348

# with capture groups (second run)
>>> print(timeit.timeit("re.search(r'(hello)|(bye)|(ola)|(cheers)','some say hello,some say bye, or ola or cheers!')", setup=setup))
1.41544604301
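
For reference, the two variants timed above could be generated from a word list roughly as follows. This is only a sketch: the `words` list and the `build_pattern` helper are hypothetical stand-ins for the real configuration and generator, which the question does not show.

```python
import re

# Hypothetical configuration: the literal words to search for.
words = ["hello", "bye", "ola", "cheers"]

def build_pattern(words, capture):
    """Join the escaped words with '|', optionally wrapping each one in a capture group."""
    parts = [re.escape(w) for w in words]
    if capture:
        parts = ["(" + p + ")" for p in parts]
    return "|".join(parts)

plain = build_pattern(words, capture=False)    # hello|bye|ola|cheers
grouped = build_pattern(words, capture=True)   # (hello)|(bye)|(ola)|(cheers)

text = "some say hello,some say bye, or ola or cheers!"
m_plain = re.search(plain, text)
m_grouped = re.search(grouped, text)

# Both variants find the same overall match; only the group bookkeeping differs.
print(m_plain.group(0), m_grouped.group(0))
```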

Question: what causes the significant drop in performance when capture groups are used?

2021-01-20

1 answer

小编典典

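Your two patterns differ only in the capture groups. When you define capture groups in a regex pattern and use that pattern with re.search, the result is a MatchObject instance. Each match object contains as many groups as there are capture groups in the pattern, even if they are empty. That is the overhead inside re: building (a list of) groups, with the memory allocation and so on that this involves. Note that groups also carry details such as the start and end indices of the text they matched (see the MatchObject reference).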
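
The extra bookkeeping can be seen by inspecting the match objects directly. The snippet below is a small illustration using the patterns and test string from the question; it shows what is stored per group and is not a measurement of the cost itself. (In Python 3 the class is named `re.Match` rather than `MatchObject`.)

```python
import re

text = "some say hello,some say bye, or ola or cheers!"
plain = re.compile(r"hello|bye|ola|cheers")
grouped = re.compile(r"(hello)|(bye)|(ola)|(cheers)")

m_plain = plain.search(text)
m_grouped = grouped.search(text)

# Number of capture groups each compiled pattern defines.
print(plain.groups, grouped.groups)   # 0 4

# The grouped match carries one slot per group, even for groups that did not participate.
print(m_plain.groups())               # ()
print(m_grouped.groups())             # ('hello', None, None, None)

# Each group also records start/end indices; (-1, -1) marks a group that did not match.
print(m_grouped.span(1))              # (9, 14)
print(m_grouped.span(2))              # (-1, -1)
```

If the individual groups are not needed, the plain alternation returns the same overall match through `group(0)` without the per-group slots.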

2021-01-20