Efficient algorithm for concatenating overlapping strings

小编典典

algorithm

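We need to merge 3 columns in a database by concatenation. However, the 3 columns may contain overlapping parts, and those parts should not be duplicated. For example,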

  "a" + "b" + "c" => "abc"
  "abcde" + "defgh" + "ghlmn" => "abcdefghlmn"
  "abcdede" + "dedefgh" + "" => "abcdedefgh"
  "abcde" + "d" + "ghlmn" => "abcdedghlmn"
  "abcdef" + "" + "defghl" => "abcdefghl"

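Our current algorithm is quite slow because it uses brute force to find the overlapping part between 2 strings. Does anyone know an efficient algorithm for this?

Say we have 2 strings A and B. The algorithm needs to find the longest common substring S such that A ends with S and B starts with S.

Our current brute-force implementation in Java is attached for reference: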

public static String concat(String s1, String s2) {
    if (s1 == null)
        return s2;
    if (s2 == null)
        return s1;
    int len = Math.min(s1.length(), s2.length());

    // Find the index for the end of overlapping part
    int index = -1;
    for (int i = len; i > 0; i--) {
        String substring = s2.substring(0, i);
        if (s1.endsWith(substring)) {
            index = i;
            break;
        }
    }
    StringBuilder sb = new StringBuilder(s1);
    if (index < 0) 
        sb.append(s2);
    else if (index <= s2.length())
        sb.append(s2.substring(index));
    return sb.toString();
}

2020-07-28

1 answer

小编典典

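Most of the other answers have focused on constant-factor optimizations, but it is also possible to do asymptotically better. Look at your algorithm: it is O(N^2). This seems like a problem that can be solved much faster than that!

Consider Knuth-Morris-Pratt. Throughout the scan it keeps track of how much of the pattern has been matched so far. If S2 is used as the pattern and S1 as the text, that means it knows how much of S2 has been matched at the end of S1, and that is exactly the value we are looking for! Just modify the algorithm to keep going instead of returning when it finds a full match early, and have it return the number of characters matched instead of 0 at the end.

That gives you an O(n) algorithm. Nice!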

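    // Returns the length of the longest prefix of s2 that is also a suffix of s1.
    int OverlappedStringLength(string s1, string s2) {
        //An empty string cannot overlap with anything
        if (s1.Length == 0 || s2.Length == 0) return 0;

        //Trim s1 so it isn't longer than s2
        if (s1.Length > s2.Length) s1 = s1.Substring(s1.Length - s2.Length);

        int[] T = ComputeBackTrackTable(s2); //O(n)

        int m = 0; //start of the current candidate match in s1
        int i = 0; //number of characters of s2 matched so far
        while (m + i < s1.Length) {
            if (s2[i] == s1[m + i]) {
                i += 1;
                //<-- removed the return case here, because |s1| <= |s2|
            } else {
                //Mismatch: shift the candidate start and reuse what already matched
                m += i - T[i];
                if (i > 0) i = T[i];
            }
        }

        return i; //<-- changed the return here to return characters matched
    }

    int[] ComputeBackTrackTable(string s) {
        var T = new int[s.Length];
        //Guard the short cases so T[0] and T[1] are never written out of range
        if (s.Length == 0) return T;
        T[0] = -1;
        if (s.Length == 1) return T;
        T[1] = 0;

        int cnd = 0;
        int pos = 2;
        while (pos < s.Length) {
            if (s[pos - 1] == s[cnd]) {
                T[pos] = cnd + 1;
                pos += 1;
                cnd += 1;
            } else if (cnd > 0) {
                cnd = T[cnd];
            } else {
                T[pos] = 0;
                pos += 1;
            }
        }

        return T;
    }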

OverlappedStringLength("abcdef", "defghl") returns 3
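
Since the question's code is in Java, here is a rough Java sketch of the same idea, wired into a three-column merge. It is only one way to write it: it uses the common prefix-function form of the KMP failure table rather than a table that starts at -1, and the names `OverlapConcat`, `overlapLength`, `concat` and `concat3` are made up for illustration. The null handling copies the question's `concat`.

```java
final class OverlapConcat {
    // Length of the longest prefix of s2 that is also a suffix of s1.
    static int overlapLength(String s1, String s2) {
        if (s1.isEmpty() || s2.isEmpty()) return 0;

        // fail[k] = length of the longest proper prefix of s2[0..k]
        // that is also a suffix of s2[0..k].
        int[] fail = new int[s2.length()];
        for (int k = 1, len = 0; k < s2.length(); k++) {
            while (len > 0 && s2.charAt(k) != s2.charAt(len)) len = fail[len - 1];
            if (s2.charAt(k) == s2.charAt(len)) len++;
            fail[k] = len;
        }

        // Only the last |s2| characters of s1 can take part in the overlap,
        // so matched stays below s2.length() before every comparison.
        int matched = 0;
        for (int p = Math.max(0, s1.length() - s2.length()); p < s1.length(); p++) {
            while (matched > 0 && s1.charAt(p) != s2.charAt(matched)) matched = fail[matched - 1];
            if (s1.charAt(p) == s2.charAt(matched)) matched++;
        }
        return matched; // part of s2 still matched when s1 runs out
    }

    // Same null handling as the brute-force version in the question.
    static String concat(String s1, String s2) {
        if (s1 == null) return s2;
        if (s2 == null) return s1;
        return s1 + s2.substring(overlapLength(s1, s2));
    }

    // Merge three columns left to right.
    static String concat3(String a, String b, String c) {
        return concat(concat(a, b), c);
    }

    public static void main(String[] args) {
        System.out.println(concat3("abcde", "defgh", "ghlmn")); // abcdefghlmn
        System.out.println(concat3("abcdef", "", "defghl"));    // abcdefghl
    }
}
```

Each call does work linear in the length of the shorter of the two strings plus the copy into the result, so merging the three columns stays linear in their total length.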

2020-07-28