我们需要通过串联合并数据库中的3列。但是,这三列可能包含重叠的部分,并且这些部分不应重复。例如,
"a" + "b" + "c" => "abc" "abcde" + "defgh" + "ghlmn" => "abcdefghlmn" "abcdede" + "dedefgh" + "" => "abcdedefgh" "abcde" + "d" + "ghlmn" => "abcdedghlmn" "abcdef" + "" + "defghl" => "abcdefghl"
我们当前的算法非常慢,因为它使用蛮力来识别2个字符串之间的重叠部分。有谁知道一种有效的算法来做到这一点?
假设我们有2个字符串A和B。该算法需要找到最长的公共子字符串S,以便A以S结束,而B以S开头。
随附我们当前在Java中的蛮力实施以供参考,
public static String concat(String s1, String s2) { if (s1 == null) return s2; if (s2 == null) return s1; int len = Math.min(s1.length(), s2.length()); // Find the index for the end of overlapping part int index = -1; for (int i = len; i > 0; i--) { String substring = s2.substring(0, i); if (s1.endsWith(substring)) { index = i; break; } } StringBuilder sb = new StringBuilder(s1); if (index < 0) sb.append(s2); else if (index <= s2.length()) sb.append(s2.substring(index)); return sb.toString(); }
其他大多数答案都集中在恒定因数优化上,但是也可以渐近地做得更好。看一下您的算法:它是O(N ^ 2)。这似乎是一个可以比此更快解决的问题!
考虑一下Knuth Morris Pratt。它跟踪到目前为止我们匹配的最大子字符串量。这意味着它知道 在S2末尾 已经匹配了多少S1 ,这就是我们要寻找的值!只需将算法修改为继续即可,而不是在与子字符串尽早匹配时返回,而让算法返回匹配的数量,而不是最后返回0。
这为您提供了O(n)算法。真好!
int OverlappedStringLength(string s1, string s2) { //Trim s1 so it isn't longer than s2 if (s1.Length > s2.Length) s1 = s1.Substring(s1.Length - s2.Length); int[] T = ComputeBackTrackTable(s2); //O(n) int m = 0; int i = 0; while (m + i < s1.Length) { if (s2[i] == s1[m + i]) { i += 1; //<-- removed the return case here, because |s1| <= |s2| } else { m += i - T[i]; if (i > 0) i = T[i]; } } return i; //<-- changed the return here to return characters matched } int[] ComputeBackTrackTable(string s) { var T = new int[s.Length]; int cnd = 0; T[0] = -1; T[1] = 0; int pos = 2; while (pos < s.Length) { if (s[pos - 1] == s[cnd]) { T[pos] = cnd + 1; pos += 1; cnd += 1; } else if (cnd > 0) { cnd = T[cnd]; } else { T[pos] = 0; pos += 1; } } return T; }
OverlappedStringLength(“ abcdef”,“ defghl”)返回3