I am designing a regular expression to split out all the actual words from a given text:
Example input:
"John's mom went there, but he wasn't there. So she said: 'Where are you'"
Expected output:
["John's", "mom", "went", "there", "but", "he", "wasn't", "there", "So", "she", "said", "Where", "are", "you"]
I came up with a regex like this:
"(([^a-zA-Z]+')|('[^a-zA-Z]+))|([^a-zA-Z']+)"
After splitting in Python, the result contains None items and whitespace strings.
How do I get rid of the None items? And why didn't the spaces match?
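For reference, the behavior can be reproduced with a short sketch. As far as the documented behavior of `re.split` goes, the text of every capturing group in the pattern is included in the result, and a group that did not take part in a match shows up as None. The names `capturing`, `non_capturing`, `parts` and `cleaned` are only illustrative.

```python
import re

s = "John's mom went there, but he wasn't there. So she said: 'Where are you'"

# Pattern from the question: every (...) is a capturing group, so re.split
# returns the group contents too, with None for groups that did not match.
capturing = r"(([^a-zA-Z]+')|('[^a-zA-Z]+))|([^a-zA-Z']+)"
parts = re.split(capturing, s)
print(None in parts)  # the None items come from the unmatched groups

# Same alternatives, same order, but with a non-capturing group (?:...),
# so only the pieces between separators are returned.
non_capturing = r"(?:[^a-zA-Z]+'|'[^a-zA-Z]+)|[^a-zA-Z']+"
cleaned = re.split(non_capturing, s)
print(cleaned)
```

This only removes the None and separator entries. The last item is still you' with its trailing apostrophe, because nothing follows that apostrophe for the pattern to match against.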
Edit: Splitting on whitespace gives items like ["there."]. Splitting on non-letters gives items like ["John","s"]. Splitting on non-letters except ' gives items like ["'Where","you'"].
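The three attempts described in the edit can be reproduced with a small sketch. The variable names are only illustrative, and the patterns are my reading of "non-letters" and "non-letters except '".

```python
import re

s = "John's mom went there, but he wasn't there. So she said: 'Where are you'"

# 1. Split on whitespace: punctuation stays glued to the words.
by_space = s.split()

# 2. Split on runs of non-letters: apostrophes inside words break them apart.
by_nonletter = re.split(r"[^a-zA-Z]+", s)

# 3. Split on runs of non-letters except the apostrophe:
#    quoting apostrophes stay attached to the words.
by_nonletter_keep_apostrophe = re.split(r"[^a-zA-Z']+", s)

print(by_space)
print(by_nonletter)
print(by_nonletter_keep_apostrophe)
```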
You can use string functions instead of a regex:
to_be_removed = ".,:!"  # all characters to be removed
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
for c in to_be_removed:
    s = s.replace(c, '')
s.split()
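However, in your example you do not want to remove the apostrophe in John's, but you do want it removed from you!!'. String operations fail at this point, so you need a finely tuned regex.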
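A minimal sketch of why plain character removal cannot tell the two apostrophes apart. The helper `strip_and_split` is a hypothetical wrapper around the loop above, not part of any library.

```python
def strip_and_split(text, to_be_removed):
    """Delete every character in to_be_removed, then split on whitespace."""
    for c in to_be_removed:
        text = text.replace(c, "")
    return text.split()

s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"

# Apostrophe kept: John's survives, but so do 'Where and you'.
kept = strip_and_split(s, ".,:!")

# Apostrophe removed: the quotes are gone, but John's becomes Johns.
removed = strip_and_split(s, ".,:!'")

print(kept)
print(removed)
```

Either choice gets one of the two cases wrong, which is the point where a regex becomes the more practical tool.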
Edit: Perhaps a simple regex can solve your problem:
(\w[\w']*)
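It captures everything that starts with a letter and keeps capturing as long as the next character is an apostrophe or a letter. (Strictly speaking, \w also matches digits and underscores.)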
(\w[\w']*\w)
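This second regex is for a very specific situation.… The first regex can capture words like you'. The second one avoids that and captures an apostrophe only when it is inside the word (not at the beginning or the end). But at that point another situation arises: with the second regex you cannot capture the apostrophe in Moss' mom. You must decide whether to capture the trailing apostrophe in names that end with s and denote possession.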
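The trade-off between the two patterns can be seen in a short sketch. The sample strings and the names `first` and `second` are only illustrative.

```python
import re

first = re.compile(r"\w[\w']*")      # apostrophe allowed anywhere after the first character
second = re.compile(r"\w[\w']*\w")   # apostrophe allowed only inside the word

quoted = "'Where are you'"
possessive = "Moss' mom"

print(first.findall(quoted))       # ['Where', 'are', "you'"]
print(second.findall(quoted))      # ['Where', 'are', 'you']

print(first.findall(possessive))   # ["Moss'", 'mom']
print(second.findall(possessive))  # ['Moss', 'mom']
```

The first pattern keeps the possessive apostrophe but also the closing quote. The second drops both, so the choice depends on which case matters more for your text.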
Example:
import re
rgx = re.compile(r"([\w][\w']*\w)")
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
rgx.findall(s)
# ["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you']
Update 2: I found a bug in the regex! It fails to capture a single letter followed by an apostrophe, such as A'. The fixed regex is here:
(\w[\w']*\w|\w)
import re
rgx = re.compile(r"(\w[\w']*\w|\w)")
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!' 'A a'"
rgx.findall(s)
# ["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you', 'A', 'a']
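To isolate the bug, the sketch below compares the previous pattern with the fixed one on single-letter words only. The sample string and the names `before` and `after` are illustrative.

```python
import re

before = re.compile(r"\w[\w']*\w")      # needs at least two word characters
after = re.compile(r"\w[\w']*\w|\w")    # falls back to a single word character

sample = "A' a I"

print(before.findall(sample))  # []
print(after.findall(sample))   # ['A', 'a', 'I']
```

The order of the alternatives matters: the longer branch is tried first, so multi-letter words such as John's are still matched whole and the single \w branch is used only when the longer one fails.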