Splitting words in Python with a regular expression

小编典典

python

I am designing a regular expression to split out all the actual words from a given text:

Example input:

"John's mom went there, but he wasn't there. So she said: 'Where are you'"

Expected output:

["John's", "mom", "went", "there", "but", "he", "wasn't", "there", "So", "she", "said", "Where", "are", "you"]

I came up with this regular expression:

"(([^a-zA-Z]+')|('[^a-zA-Z]+))|([^a-zA-Z']+)"

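After splitting with it in Python, the result contains None items and whitespace.

How do I get rid of the None items? And why are the spaces not matched?

Edit:
Splitting on whitespace gives items like: ["there."]
Splitting on non-letters gives items like: ["John","s"]
Splitting on non-letters except ' gives items like: ["'Where","you'"]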
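
On the two questions above: re.split puts the text of every capturing group into its result, and a group that took no part in a match shows up as None. The spaces are in fact matched; they reappear in the list only because the last group captures them. A minimal sketch, reusing the pattern from the question on a shortened input, with a non-capturing variant written here for illustration:

```python
import re

text = "mom went there, but"

# Pattern from the question: four capturing groups in total.
capturing = r"(([^a-zA-Z]+')|('[^a-zA-Z]+))|([^a-zA-Z']+)"
# Same alternatives, with the groups dropped or made non-capturing via (?:...).
non_capturing = r"(?:[^a-zA-Z]+'|'[^a-zA-Z]+)|[^a-zA-Z']+"

with_groups = re.split(capturing, text)
without_groups = re.split(non_capturing, text)

print(with_groups)     # contains None entries and the matched separators
print(without_groups)  # ['mom', 'went', 'there', 'but']
```

Dropping the groups removes the None items, but on the question's input the final you' still keeps its apostrophe, because none of the alternatives can match a lone apostrophe at the end of the string.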

2021-01-20

1 answer

小编典典

You can use string functions instead of a regular expression:

to_be_removed = ".,:!"  # all characters to be removed
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"

# Delete each unwanted character, then split on whitespace.
for c in to_be_removed:
    s = s.replace(c, '')
print(s.split())

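However, in your example you want to keep the apostrophe in John's but remove it in you!!'. Plain string operations cannot tell these two cases apart, so you need a finely tuned regular expression.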

Edit: a simple regular expression may solve your problem:

(\w[\w']*)

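It captures every run that starts with a word character (\w: a letter, digit, or underscore) and keeps capturing as long as the next character is an apostrophe or a word character.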
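
A quick check of this first pattern on the input from the question, as a sketch; the variable names are illustrative:

```python
import re

first = re.compile(r"\w[\w']*")
s = "John's mom went there, but he wasn't there. So she said: 'Where are you'"

words = first.findall(s)
print(words)
# The opening quote before Where is skipped because a match must start
# with \w, but the closing quote stays attached to the last word: "you'"
```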

(\w[\w']*\w)

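The second regular expression handles a very specific case: the first one can capture words like you'. The second avoids this and captures an apostrophe only when it is inside a word, not at its beginning or end. The cost is that the second regular expression cannot capture the apostrophe in Moss' mom. You have to decide whether to capture the trailing apostrophe in names that end in s and indicate possession.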
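
The trade-off described above can be seen by running both patterns on a possessive that ends in s. A small sketch; the sample sentence is made up for illustration:

```python
import re

first = re.compile(r"\w[\w']*")      # may end on an apostrophe
second = re.compile(r"\w[\w']*\w")   # must end on a word character
sample = "Moss' mom isn't here"

first_words = first.findall(sample)
second_words = second.findall(sample)

print(first_words)   # ["Moss'", 'mom', "isn't", 'here']
print(second_words)  # ['Moss', 'mom', "isn't", 'here']
```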

Example:

import re

# Raw string, so the backslashes reach the regex engine unchanged.
rgx = re.compile(r"([\w][\w']*\w)")
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
print(rgx.findall(s))

["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you']

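Update 2: I found a bug in my regular expression! Because it requires at least two word characters, it cannot capture a single letter, including one followed by an apostrophe such as A'. The fixed regular expression is: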

(\w[\w']*\w|\w)

import re

# The second alternative, \w, picks up single-character words.
rgx = re.compile(r"(\w[\w']*\w|\w)")
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!' 'A a'"
print(rgx.findall(s))

["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you', 'A', 'a']
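
Because \w also matches digits and underscores, the pattern above treats tokens such as 2 or b_c as words. If only ASCII letters should count, one possible alternative (an assumption about the requirements, not part of the original answer) is to match runs of letters joined by single internal apostrophes:

```python
import re

# One or more letters, optionally followed by apostrophe-plus-letters groups.
# ASCII only; the name letters_only is illustrative.
letters_only = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")
s = "John's mom has 2 cats, A' b_c 'Where are you'"

result = letters_only.findall(s)
print(result)
# ["John's", 'mom', 'has', 'cats', 'A', 'b', 'c', 'Where', 'are', 'you']
```

Like the second pattern, this never keeps a trailing apostrophe, so Moss' would come out as Moss.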

2021-01-20