NLTK regexp tokenizer not playing nice with decimal point in regex

小编典典

python

I'm trying to write a text normalizer, and one of the basic cases it needs to handle is turning something like 3.14 into three point one four or three point fourteen.

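I'm currently using the pattern \$?\d+(\.\d+)?%? with nltk.regexp_tokenize, which I believe should handle numbers as well as currency and percentages. However, at the moment something like $23.50 is handled perfectly (parsed to ['$23.50']), but 3.14 is parsed to ['3', '14']: the decimal point is dropped.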

I tried adding a separate \d+.\d+ pattern to my regex, but that didn't help (and shouldn't my current pattern already match that?).

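Edit 2: I also just discovered that the % part doesn't seem to work properly either: 20% just returns ['20']. I feel like something must be wrong with my regex, but I've tested it in Pythex and it looks fine?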

Edit: Here is my code.

import nltk
import re

pattern = r'''(?x)    # set flag to allow verbose regexps
            ([A-Z]\.)+        # abbreviations, e.g. U.S.A.
            | \w+([-']\w+)*        # words w/ optional internal hyphens/apostrophe
            | \$?\d+(\.\d+)?%?  # numbers, incl. currency and percentages
            | [+/\-@&*]         # special characters with meanings
            '''
words = nltk.regexp_tokenize(line, pattern)  # line holds one of the test strings below
words = [w.lower() for w in words]           # lowercase every token
print(words)

Here are some of my test strings:

32188
2598473
26 letters from A to Z
3.14 is pi.                         <-- ['3', '14', 'is', 'pi']
My weight is about 68 kg, +/- 10 grams.
Good muffins cost $3.88 in New York <-- ['good', 'muffins', 'cost', '$3.88', 'in', 'new', 'york']

2021-01-20

1 answer

小编典典

The culprit is:

\w+([-']\w+)*

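\w+ also matches digits, and because \w never matches ., this alternative consumes only the 3 in 3.14. Alternation tries the alternatives from left to right and takes the first one that matches, so the number alternative is never reached for input that starts with a digit. ($23.50 works only because \w cannot match $, so the word alternative fails there and the number alternative gets its turn; in 20% the word alternative takes 20 and nothing is left that can match the %.) Move the alternatives around a bit so that \$?\d+(\.\d+)?%? comes before the part of the regex shown above, so that the number format is tried first: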

(?x)([A-Z]\.)+|\$?\d+(\.\d+)?%?|\w+([-']\w+)*|[+/\-@&*]

regex101 demo

Or in expanded form:

pattern = r'''(?x)               # set flag to allow verbose regexps
              ([A-Z]\.)+         # abbreviations, e.g. U.S.A.
              | \$?\d+(\.\d+)?%? # numbers, incl. currency and percentages
              | \w+([-']\w+)*    # words w/ optional internal hyphens/apostrophe
              | [+/\-@&*]        # special characters with meanings
            '''
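
The effect of the reordering can be checked without NLTK. The following is only a minimal sketch under two assumptions that are not part of the original post: it calls re.findall directly instead of nltk.regexp_tokenize, and it rewrites the groups as non-capturing (?:...) so that re.findall returns whole matches rather than group contents (how nltk.regexp_tokenize treats capturing groups may depend on the NLTK version). The names ORIGINAL, REORDERED and tokenize are made up for this illustration.

```python
import re

# Assumption of this sketch: groups are non-capturing, so re.findall
# returns the full text of each match instead of the group contents.
ORIGINAL = r'''(?x)               # verbose mode: whitespace and comments ignored
      (?:[A-Z]\.)+               # abbreviations, e.g. U.S.A.
    | \w+(?:[-']\w+)*            # words, tried BEFORE numbers
    | \$?\d+(?:\.\d+)?%?         # numbers, incl. currency and percentages
    | [+/\-@&*]                  # special characters with meanings
'''

REORDERED = r'''(?x)
      (?:[A-Z]\.)+               # abbreviations, e.g. U.S.A.
    | \$?\d+(?:\.\d+)?%?         # numbers are now tried before words
    | \w+(?:[-']\w+)*            # words w/ optional internal hyphens/apostrophe
    | [+/\-@&*]                  # special characters with meanings
'''


def tokenize(text, pattern):
    """Return all non-overlapping matches of pattern in text, left to right."""
    return re.findall(pattern, text)


for sample in ["3.14 is pi.", "20%", "Good muffins cost $3.88 in New York"]:
    print(sample)
    print("  original :", tokenize(sample, ORIGINAL))
    print("  reordered:", tokenize(sample, REORDERED))
```

One consequence of putting numbers first is that a token that starts with digits and continues with letters, such as 2nd, is split into 2 and nd, so the order may need further tuning depending on the text being normalized.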

2021-01-20