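I am trying to write a regular expression that returns the strings between AUG and (UAG or UGA or UAA) from the following RNA string: AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG, so that all matches are found, including overlapping ones.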
I tried several regular expressions and ended up with something like this:

    matches = re.findall(r'(?=AUG)(\w+)(?=UAG|UGA|UAA)', "AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG")

Can you tell me what is wrong with my regex pattern?
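Doing this with a single regular expression is actually quite difficult, because most uses do not want overlapping matches. However, you can do it with some simple iteration: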

    import re

    regex = re.compile(r'(?=AUG)(\w+)(?=UAG|UGA|UAA)')
    RNA = 'AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG'

    matches = []
    pos = 0
    while True:
        # Python has no assignment inside a while condition here, so search first.
        match = regex.search(RNA, pos)
        if match is None:
            break
        matches.append(match)
        # Resume one character past this AUG so overlapping matches are found.
        pos = match.start() + 1

    for m in matches:
        print(m.group(0))

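This has some problems, though. What do you expect to get back for AUGUAGUGAUAA? Are there two strings to return, or just one? As written, your regex cannot even stop at UAG: the greedy \w+ carries on through UAGUGA and is only cut off at UAA. To counter that you could make the quantifier lazy with the ? operator, but that approach will then be unable to capture the longer substrings.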
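To see the greedy versus lazy difference concretely, here is a minimal sketch that runs the pattern from the question, and a lazy variant of it, against the short string AUGUAGUGAUAA. The variable names are mine and only illustrative.

```python
import re

seq = "AUGUAGUGAUAA"

# Greedy: \w+ consumes the whole string, then backs off to the LAST
# position that is followed by a stop codon (the UAA at the end).
greedy = re.search(r"(?=AUG)(\w+)(?=UAG|UGA|UAA)", seq)

# Lazy: \w+? grows one character at a time and stops at the FIRST
# position that is followed by a stop codon (the UAG right after AUG).
lazy = re.search(r"(?=AUG)(\w+?)(?=UAG|UGA|UAA)", seq)

print(greedy.group(1))  # AUGUAGUGA
print(lazy.group(1))    # AUG
```

Neither variant returns both candidates in one pass, which is why a single pattern is awkward here.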
Perhaps iterating over the string twice is the answer, but then what if your RNA sequence contains AUGAUGUAGUGAUAA? What is the correct behavior there?
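One way to explore that case is to put a capture group inside a lookahead, so that `re.finditer` can report matches that overlap. The sketch below is my own illustration, not a recommended final answer: it finds both AUG starts, but because the quantifier is lazy it still reports only the nearest stop codon for each start.

```python
import re

seq = "AUGAUGUAGUGAUAA"

# The outer lookahead is zero-width, so the scan advances one character
# at a time and overlapping starts are all visited.
pattern = re.compile(r"(?=(AUG\w*?)(?:UAG|UGA|UAA))")

found = [(m.start(), m.group(1)) for m in pattern.finditer(seq)]
for start, text in found:
    print(start, text)
# 0 AUGAUG
# 3 AUG
```

The longer substrings that end at UGA or UAA never show up, so the question of which results are wanted still has to be settled first.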
I might favor a regex-free approach instead, iterating over the string and its substrings:

    RNA = 'AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG'

    # Collect everything that follows each AUG. Restarting the search one
    # character past the previous hit means overlapping AUGs are found too.
    candidates = []
    start = RNA.find('AUG')
    while start > -1:
        candidates.append(RNA[start + 3:])
        start = RNA.find('AUG', start + 1)

    # Cut each candidate at every occurrence of every terminator.
    matches = []
    for candidate in candidates:
        for terminator in ['UAG', 'UGA', 'UAA']:
            # Searching from index 1 skips a terminator that directly
            # follows AUG, which would produce an empty match.
            end = candidate.find(terminator, 1)
            while end > -1:
                matches.append(candidate[:end])
                end = candidate.find(terminator, end + 1)

    for match in matches:
        print(match)

This way you are sure to get every match, no matter what.
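For comparison, the same enumeration can be written more compactly by indexing the codon positions first and then pairing them up. This is only a sketch under my own assumptions: `between_codons` is a name I made up, and the `s < e` test skips empty substrings in the same way that starting the terminator search at index 1 does.

```python
import re

RNA = ("AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUU"
       "GGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG")

def between_codons(rna, start="AUG", stops=("UAG", "UGA", "UAA")):
    """Return every non-empty substring lying between a start and a later stop."""
    # Lookahead patterns are zero-width, so finditer reports overlapping hits.
    starts = [m.start() + len(start)
              for m in re.finditer("(?=%s)" % start, rna)]
    stop_at = [m.start()
               for m in re.finditer("(?=%s)" % "|".join(stops), rna)]
    # Pair each start with every stop codon that begins after it.
    return [rna[s:e] for s in starts for e in stop_at if s < e]

result = between_codons(RNA)
print(len(result))
for text in result:
    print(text)
```

The results come out ordered by start and then by stop position, rather than grouped by terminator as in the loop version.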
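If you need to keep track of the position of each match, you can change the candidates data structure to use tuples that retain the starting position: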

    RNA = 'AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG'

    candidates = []
    start = RNA.find('AUG')
    while start > -1:
        # Keep each substring together with its offset in RNA.
        candidates.append((RNA[start + 3:], start + 3))
        start = RNA.find('AUG', start + 1)

    matches = []
    for text, offset in candidates:
        for terminator in ['UAG', 'UGA', 'UAA']:
            end = text.find(terminator, 1)
            while end > -1:
                # Store (start position, end position, substring).
                matches.append((offset, offset + end, text[:end]))
                end = text.find(terminator, end + 1)

    for match in matches:
        print("%d - %d: %s" % match)

This prints:

    7 - 49: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAU
    7 - 85: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
    7 - 31: UAGCUAACUCAGGUUACAUGGGGA
    7 - 72: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
    7 - 76: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
    7 - 11: UAGC
    7 - 66: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
    27 - 49: GGGAUGACCCCGCGACUUGGAU
    27 - 85: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
    27 - 31: GGGA
    27 - 72: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
    27 - 76: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
    27 - 66: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
    33 - 49: ACCCCGCGACUUGGAU
    33 - 85: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
    33 - 72: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
    33 - 76: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
    33 - 66: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
    78 - 85: AUCCGAG

Heck, with three more lines you can even sort the matches by where they fall in the RNA sequence:

    from operator import itemgetter
    matches.sort(key=itemgetter(1))
    matches.sort(key=itemgetter(0))

Combined with a final print, that gives you:

    007 - 011: UAGC
    007 - 031: UAGCUAACUCAGGUUACAUGGGGA
    007 - 049: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAU
    007 - 066: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
    007 - 072: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
    007 - 076: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
    007 - 085: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
    027 - 031: GGGA
    027 - 049: GGGAUGACCCCGCGACUUGGAU
    027 - 066: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
    027 - 072: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
    027 - 076: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
    027 - 085: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
    033 - 049: ACCCCGCGACUUGGAU
    033 - 066: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
    033 - 072: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
    033 - 076: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
    033 - 085: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
    078 - 085: AUCCGAG

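The two sorts work because Python's sort is stable: sorting by the secondary key first and the primary key second leaves ties in secondary order. A single sort on a compound key should give the same result, as this small sketch suggests. The sample tuples are a handful taken from the listing above, and the `%03d` format is my assumption about how the zero-padded positions were produced.

```python
from operator import itemgetter

# A few (start, end, substring) tuples, deliberately out of order.
matches = [
    (27, 31, "GGGA"),
    (7, 31, "UAGCUAACUCAGGUUACAUGGGGA"),
    (78, 85, "AUCCGAG"),
    (33, 49, "ACCCCGCGACUUGGAU"),
    (7, 11, "UAGC"),
]

# Two stable passes: secondary key (end) first, primary key (start) last.
two_pass = list(matches)
two_pass.sort(key=itemgetter(1))
two_pass.sort(key=itemgetter(0))

# One pass on the compound key (start, end).
one_pass = sorted(matches, key=itemgetter(0, 1))

for m in one_pass:
    print("%03d - %03d: %s" % m)
```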