我正在努力制作一个Python脚本,该脚本可以找到两个字符串共享的所有n字长子字符串的(最长)长度,而不必考虑尾随的标点符号。给定两个字符串:
“这是一个示例字符串” “这也是一个示例字符串”
“这是一个示例字符串”
“这也是一个示例字符串”
我希望脚本识别这些字符串具有2个共同的单词序列(“ this is”),然后是3个共同的单词序列(“示例字符串”)。这是我目前的方法:
a = "this is a sample string" b = "this is also a sample string" aWords = a.split() bWords = b.split() #create counters to keep track of position in string currentA = 0 currentB = 0 #create counter to keep track of longest sequence of matching words matchStreak = 0 #create a list that contains all of the matchstreaks found matchStreakList = [] #create binary switch to control the use of while loop continueWhileLoop = 1 for word in aWords: currentA += 1 if word == bWords[currentB]: matchStreak += 1 #to avoid index errors, check to make sure we can move forward one unit in the b string before doing so if currentB + 1 < len(bWords): currentB += 1 #in case we have two identical strings, check to see if we're at the end of string a. If we are, append value of match streak to list of match streaks if currentA == len(aWords): matchStreakList.append(matchStreak) elif word != bWords[currentB]: #because the streak is broken, check to see if the streak is >= 1. If it is, append the streak counter to out list of streaks and then reset the counter if matchStreak >= 1: matchStreakList.append(matchStreak) matchStreak = 0 while word != bWords[currentB]: #the two words don't match. If you can move b forward one word, do so, then check for another match if currentB + 1 < len(bWords): currentB += 1 #if you have advanced b all the way to the end of string b, then rewind to the beginning of string b and advance a, looking for more matches elif currentB + 1 == len(bWords): currentB = 0 break if word == bWords[currentB]: matchStreak += 1 #now that you have a match, check to see if you can advance b. If you can, do so. Else, rewind b to the beginning if currentB + 1 < len(bWords): currentB += 1 elif currentB + 1 == len(bWords): #we're at the end of string b. If we are also at the end of string a, check to see if the value of matchStreak >= 1. If so, add matchStreak to matchStreakList if currentA == len(aWords): matchStreakList.append(matchStreak) currentB = 0 break print matchStreakList
该脚本正确输出了公共字长子字符串(2、3)的(最大)长度,并且到目前为止,所有测试都已经这样做了。我的问题是:是否有两个字符串对上述方法不起作用?更重要的是:是否有现成的Python库或众所周知的方法可用于查找两个字符串共享的所有n字长的子字符串的最大长度?
[此问题与最长的公共子字符串问题不同,后者只是我要查找的特殊情况(因为我想查找所有公共子字符串,而不仅仅是最长的公共子字符串)。建议诸如1)聚类分析,2)编辑距离例程以及3)最长的公共序列算法之类的方法可能是合适的方法,但是我没有找到任何可行的解决方案,而我的问题可能稍微容易一些在链接中提到,因为我正在处理由空格限制的单词。]
编辑:
我开始悬赏这个问题。如果这对其他人有帮助,我想澄清几点。首先,@ DhruvPathak在下面建议的有用答案未找到两个字符串共享的所有最大最长n字长的子字符串。例如,假设我们正在分析的两个字符串是:
“他们刚出生的时候都是白色的一尘不染的纸,但是每只鹅毛笔都要raw草并擦干。”
和
“刚出生的时候,你全是白人,一张可爱的,一尘不染的纸;但每只鹅的鹅毛笔都要使你sc草并沾满墨水。”
在这种情况下,最大n字长的子字符串列表(不考虑结尾标点符号)为:
all are white a sheet of spotless paper when first are born but are to be scrawled and blotted by every
使用以下例程:
#import required packages import difflib #define function we'll use to identify matches def matches(first_string,second_string): s = difflib.SequenceMatcher(None, first_string,second_string) match = [first_string[i:i+n] for i, j, n in s.get_matching_blocks() if n > 0] return match a = "They all are white a sheet of spotless paper when they first are born but they are to be scrawled upon and blotted by every goose quill" b = "You are all white, a sheet of lovely, spotless paper, when you first are born; but you are to be scrawled and blotted by every goose's quill" a = a.replace(",", "").replace(":","").replace("!","").replace("'","").replace(";","").lower() b = b.replace(",", "").replace(":","").replace("!","").replace("'","").replace(";","").lower() print matches(a,b)
一个得到输出:
['e', ' all', ' white a sheet of', ' spotless paper when ', 'y', ' first are born but ', 'y', ' are to be scrawled', ' and blotted by every goose', ' quill']
首先,我不确定如何从此列表中选择仅包含整个单词的子字符串。其次,此列表不包括“ are”,即所需的最大最长公共n字长子字符串之一。有没有一种方法可以找到这两个字符串共享的所有最大最长n字长的子字符串(“您都是…”和“它们都是…”)?
这里仍然有歧义,我不想花时间争论它们。但是我想我仍然可以添加一些有用的东西;-)
我写了Python的difflib.SequenceMatcher,并花了 很多 时间来寻找预期情况下快速的方法来找到最长的公共子字符串。从理论上讲,应该使用“后缀树”或相关的“后缀数组”(以“最长公共前缀数组”增强)来完成(如果您想向Google寻求更多信息,引号中的短语为搜索字词)。这些可以解决最坏情况下的线性时间问题。但是,正如有时的情况下,最坏情况下的线性时间的算法是极度复杂和微妙,并受到大的恒定因素- 他们仍然可以还清巨大的,如果给定的语料库将是搜索 很多 次,但是这不是Python的典型案例difflib,它也不像您的案例。
difflib.SequenceMatcher
difflib
无论如何,我在这里的贡献是重写SequenceMatcher的find_longest_match()方法,以返回沿途找到的 所有 (本地)最大匹配项。笔记:
SequenceMatcher
find_longest_match()
我将使用to_words()Raymond Hettinger提供的功能,但不转换为小写字母。转换为小写字母会导致输出与您想要的输出不完全相同。
to_words()
但是,正如我已经在评论中指出的那样,这确实会输出“羽毛笔”,这不在您期望的输出列表中。我不知道为什么不是这样,因为“羽毛笔”确实出现在两个输入中。
这是代码:
import re def to_words(text): 'Break text into a list of words without punctuation' return re.findall(r"[a-zA-Z']+", text) def match(a, b): # Make b the longer list. if len(a) > len(b): a, b = b, a # Map each word of b to a list of indices it occupies. b2j = {} for j, word in enumerate(b): b2j.setdefault(word, []).append(j) j2len = {} nothing = [] unique = set() # set of all results def local_max_at_j(j): # maximum match ends with b[j], with length j2len[j] length = j2len[j] unique.add(" ".join(b[j-length+1: j+1])) # during an iteration of the loop, j2len[j] = length of longest # match ending with b[j] and the previous word in a for word in a: # look at all instances of word in b j2lenget = j2len.get newj2len = {} for j in b2j.get(word, nothing): newj2len[j] = j2lenget(j-1, 0) + 1 # which indices have not been extended? those are # (local) maximums for j in j2len: if j+1 not in newj2len: local_max_at_j(j) j2len = newj2len # and we may also have local maximums ending at the last word for j in j2len: local_max_at_j(j) return unique
然后:
a = "They all are white a sheet of spotless paper " \ "when they first are born but they are to be " \ "scrawled upon and blotted by every goose quill" b = "You are all white, a sheet of lovely, spotless " \ "paper, when you first are born; but you are to " \ "be scrawled and blotted by every goose's quill" print match(to_words(a), to_words(b))
显示:
set(['all', 'and blotted by every', 'first are born but', 'are to be scrawled', 'are', 'spotless paper when', 'white a sheet of', 'quill'])
编辑-工作原理
最好将许多序列匹配和比对算法理解为在二维矩阵上工作,并具有用于计算矩阵条目并随后解释条目含义的规则。
对于输入序列a和b,图像的矩阵M与len(a)行和len(b)列。在此应用程序中,我们要M[i, j]包含以a[i]和结尾的最长公共连续子序列的长度b[j],并且计算规则 非常 简单:
a
b
M
len(a)
len(b)
M[i, j]
a[i]
b[j]
M[i, j] = 0
a[i] != b[j]
M[i, j] = M[i-1, j-1] + 1
a[i] == b[j]
解读也很容易在这种情况下:有一个当地最大 的非空 在结束比赛a[i]和b[j],长的M[i, j],当且仅当M[i, j]是非零但M[i+1, j+1]为0或出界外。
M[i+1, j+1]
您可以使用这些规则编写带有两个循环的非常简单且紧凑的代码,以M正确计算此问题。缺点是代码将花费(最佳,平均和最差情况)O(len(a) * len(b))时间 和 空间。
O(len(a) * len(b))
虽然一开始可能令人困惑,但我发布的代码完全符合上述要求。连接被遮盖了,因为针对预期情况,代码以多种方式进行了大幅优化:
而不是进行一次遍历来计算M,而是进行另一遍来解释结果,计算和解释在一次遍历中交织在一起a。
因此,不需要存储整个矩阵。而是仅同时显示当前行(newj2len)和上一行(j2len)。
newj2len
j2len
并且由于此问题中的矩阵 通常通常 为零,因此通过将列索引映射到非零值的dict稀疏地表示此处的行。零条目是“免费的”,因为它们从未显式存储。
处理一行时,无需遍历每列:预先计算的b2jdict可以准确地告诉我们当前行中有趣的列索引(那些与wordfrom 中的当前列匹配的列a)。
b2j
word
最后,部分是偶然的,所有先前的优化都以一种这样的方式进行合谋,即无需知道当前行的索引,因此我们也不必费心计算。
编辑-简单的污垢版本
这是直接实现2D矩阵的代码,没有进行任何优化尝试(否则a Counter通常可以避免显式存储0项)。这非常简单,简短又容易:
Counter
def match(a, b): from collections import Counter M = Counter() for i in range(len(a)): for j in range(len(b)): if a[i] == b[j]: M[i, j] = M[i-1, j-1] + 1 unique = set() for i in range(len(a)): for j in range(len(b)): if M[i, j] and not M[i+1, j+1]: length = M[i, j] unique.add(" ".join(a[i+1-length: i+1])) return unique
当然,;-)返回的结果与match()我最初发布的优化结果相同。
match()
编辑-另一个没有字典
只是为了好玩:-)如果您对矩阵模型有所了解,那么此代码将很容易理解。关于此特定问题的一个值得注意的事情是,矩阵像元的值仅取决于与像元西北方向对角线的值。因此,它足以“穿越”所有主要的对角线,从西边界和北边界的所有像元向东南延伸。这样,无论输入的长度如何,都只需要很小的恒定空间:
def match(a, b): from itertools import chain m, n = len(a), len(b) unique = set() for i, j in chain(((i, 0) for i in xrange(m)), ((0, j) for j in xrange(1, n))): k = 0 while i < m and j < n: if a[i] == b[j]: k += 1 elif k: unique.add(" ".join(a[i-k: i])) k = 0 i += 1 j += 1 if k: unique.add(" ".join(a[i-k: i])) return unique