我需要三个大字符串快速函数:快速搜索,快速搜索和替换以及字符串中子字符串的快速计数。
我已经在C ++和Python中遇到过Boyer-Moore字符串搜索,但是我发现的唯一用于实现快速搜索和替换的Delphi Boyer- Moore算法是前DroopyEyes软件的Peter Morris和他的网站所组成的FastStrings的一部分。和电子邮件不再起作用。
我已经移植了FastStrings,使其在Delphi 2009/2010中的AnsiStrings上可以很好地工作,其中一个字节等于一个AnsiChar,但是在Delphi 2010中使它们也与String(UnicodeString)一起使用似乎并不简单。
使用这种Boyer- Moore算法,应该可以轻松进行不区分大小写的搜索以及不区分大小写的搜索和替换,而无需任何临时字符串(使用StrUpper等),而无需调用比Boyer- 慢的Pos()-当重复搜索同一文本时,需要进行摩尔搜索。
(编辑:我有一个部分解决方案,作为对这个问题的回答,它几乎是100%完整,甚至具有快速的字符串替换功能。我相信它一定有bug,尤其是因为它假装为Unicode能够确保由于未履行Unicode承诺而导致故障。)
(Edit2:有趣的和意外的结果;堆栈上的Unicode代码点表的堆栈很大- 以下代码中的SkipTable严重阻碍了您可以在Unicode字符串博伊尔中进行的双赢优化数量-moore字符串搜索。感谢Florent Ouchet指出了我应该立即注意到的内容。)
这个答案现在已经完成,并且适用于区分大小写的模式,但是不适用于不区分大小写的模式,并且可能还存在其他错误,因为它没有经过很好的单元测试,并且可能会进一步优化,例如,我重复了局部函数__SameChar而不是使用本来会更快的比较函数回调,实际上,允许用户传递所有这些比较函数对于希望提供一些额外逻辑(某些语言的Unicode字形的等效集合)的Unicode用户来说非常有用。 )。
基于Dorin Dominica的代码,我构建了以下代码。
{ _FindStringBoyer: Boyer-Moore search algorith using regular String instead of AnsiSTring, and no ASM. Credited to Dorin Duminica. } function _FindStringBoyer(const sString, sPattern: string; const bCaseSensitive: Boolean = True; const fromPos: Integer = 1): Integer; function __SameChar(StringIndex, PatternIndex: Integer): Boolean; begin if bCaseSensitive then Result := (sString[StringIndex] = sPattern[PatternIndex]) else Result := (CompareText(sString[StringIndex], sPattern[PatternIndex]) = 0); end; // function __SameChar(StringIndex, PatternIndex: Integer): Boolean; var SkipTable: array [Char] of Integer; LengthPattern: Integer; LengthString: Integer; Index: Integer; kIndex: Integer; LastMarker: Integer; Large: Integer; chPattern: Char; begin if fromPos < 1 then raise Exception.CreateFmt('Invalid search start position: %d.', [fromPos]); LengthPattern := Length(sPattern); LengthString := Length(sString); for chPattern := Low(Char) to High(Char) do SkipTable[chPattern] := LengthPattern; for Index := 1 to LengthPattern -1 do SkipTable[sPattern[Index]] := LengthPattern - Index; Large := LengthPattern + LengthString + 1; LastMarker := SkipTable[sPattern[LengthPattern]]; SkipTable[sPattern[LengthPattern]] := Large; Index := fromPos + LengthPattern -1; Result := 0; while Index <= LengthString do begin repeat Index := Index + SkipTable[sString[Index]]; until Index > LengthString; if Index <= Large then Break else Index := Index - Large; kIndex := 1; while (kIndex < LengthPattern) and __SameChar(Index - kIndex, LengthPattern - kIndex) do Inc(kIndex); if kIndex = LengthPattern then begin // Found, return. Result := Index - kIndex + 1; Index := Index + LengthPattern; exit; end else begin if __SameChar(Index, LengthPattern) then Index := Index + LastMarker else Index := Index + SkipTable[sString[Index]]; end; // if kIndex = LengthPattern then begin end; // while Index <= LengthString do begin end; { Written by Warren, using the above code as a starter, we calculate the SkipTable once, and then count the number of instances of a substring inside the main string, at a much faster rate than we could have done otherwise. Another thing that would be great is to have a function that returns an array of find-locations, which would be way faster to do than repeatedly calling Pos. } function _StringCountBoyer(const aSourceString, aFindString : String; Const CaseSensitive : Boolean = TRUE) : Integer; var foundPos:Integer; fromPos:Integer; Limit:Integer; guard:Integer; SkipTable: array [Char] of Integer; LengthPattern: Integer; LengthString: Integer; Index: Integer; kIndex: Integer; LastMarker: Integer; Large: Integer; chPattern: Char; function __SameChar(StringIndex, PatternIndex: Integer): Boolean; begin if CaseSensitive then Result := (aSourceString[StringIndex] = aFindString[PatternIndex]) else Result := (CompareText(aSourceString[StringIndex], aFindString[PatternIndex]) = 0); end; // function __SameChar(StringIndex, PatternIndex: Integer): Boolean; begin result := 0; foundPos := 1; fromPos := 1; Limit := Length(aSourceString); guard := Length(aFindString); Index := 0; LengthPattern := Length(aFindString); LengthString := Length(aSourceString); for chPattern := Low(Char) to High(Char) do SkipTable[chPattern] := LengthPattern; for Index := 1 to LengthPattern -1 do SkipTable[aFindString[Index]] := LengthPattern - Index; Large := LengthPattern + LengthString + 1; LastMarker := SkipTable[aFindString[LengthPattern]]; SkipTable[aFindString[LengthPattern]] := Large; while (foundPos>=1) and (fromPos < Limit) and (Index<Limit) do begin Index := fromPos + LengthPattern -1; if Index>Limit then break; kIndex := 0; while Index <= LengthString do begin repeat Index := Index + SkipTable[aSourceString[Index]]; until Index > LengthString; if Index <= Large then Break else Index := Index - Large; kIndex := 1; while (kIndex < LengthPattern) and __SameChar(Index - kIndex, LengthPattern - kIndex) do Inc(kIndex); if kIndex = LengthPattern then begin // Found, return. //Result := Index - kIndex + 1; Index := Index + LengthPattern; fromPos := Index; Inc(Result); break; end else begin if __SameChar(Index, LengthPattern) then Index := Index + LastMarker else Index := Index + SkipTable[aSourceString[Index]]; end; // if kIndex = LengthPattern then begin end; // while Index <= LengthString do begin end; end;
这确实是一个不错的算法,因为:
好吧,我用Boyer-Moore风格写了一个String Replace:
function _StringReplaceBoyer(const aSourceString, aFindString,aReplaceString : String; Flags: TReplaceFlags) : String; var errors:Integer; fromPos:Integer; Limit:Integer; guard:Integer; SkipTable: array [Char] of Integer; LengthPattern: Integer; LengthString: Integer; Index: Integer; kIndex: Integer; LastMarker: Integer; Large: Integer; chPattern: Char; CaseSensitive:Boolean; foundAt:Integer; lastFoundAt:Integer; copyStartsAt:Integer; copyLen:Integer; function __SameChar(StringIndex, PatternIndex: Integer): Boolean; begin if CaseSensitive then Result := (aSourceString[StringIndex] = aFindString[PatternIndex]) else Result := (CompareText(aSourceString[StringIndex], aFindString[PatternIndex]) = 0); end; // function __SameChar(StringIndex, PatternIndex: Integer): Boolean; begin result := ''; lastFoundAt := 0; fromPos := 1; errors := 0; CaseSensitive := rfIgnoreCase in Flags; Limit := Length(aSourceString); guard := Length(aFindString); Index := 0; LengthPattern := Length(aFindString); LengthString := Length(aSourceString); for chPattern := Low(Char) to High(Char) do SkipTable[chPattern] := LengthPattern; for Index := 1 to LengthPattern -1 do SkipTable[aFindString[Index]] := LengthPattern - Index; Large := LengthPattern + LengthString + 1; LastMarker := SkipTable[aFindString[LengthPattern]]; SkipTable[aFindString[LengthPattern]] := Large; while (fromPos>=1) and (fromPos <= Limit) and (Index<=Limit) do begin Index := fromPos + LengthPattern -1; if Index>Limit then break; kIndex := 0; foundAt := 0; while Index <= LengthString do begin repeat Index := Index + SkipTable[aSourceString[Index]]; until Index > LengthString; if Index <= Large then Break else Index := Index - Large; kIndex := 1; while (kIndex < LengthPattern) and __SameChar(Index - kIndex, LengthPattern - kIndex) do Inc(kIndex); if kIndex = LengthPattern then begin foundAt := Index - kIndex + 1; Index := Index + LengthPattern; //fromPos := Index; fromPos := (foundAt+LengthPattern); if lastFoundAt=0 then begin copyStartsAt := 1; copyLen := foundAt-copyStartsAt; end else begin copyStartsAt := lastFoundAt+LengthPattern; copyLen := foundAt-copyStartsAt; end; if (copyLen<=0)or(copyStartsAt<=0) then begin Inc(errors); end; Result := Result + Copy(aSourceString, copyStartsAt, copyLen ) + aReplaceString; lastFoundAt := foundAt; if not (rfReplaceAll in Flags) then fromPos := 0; // break out of outer while loop too! break; end else begin if __SameChar(Index, LengthPattern) then Index := Index + LastMarker else Index := Index + SkipTable[aSourceString[Index]]; end; // if kIndex = LengthPattern then begin end; // while Index <= LengthString do begin end; if (lastFoundAt=0) then begin // nothing was found, just return whole original string Result := aSourceString; end else if (lastFoundAt+LengthPattern < Limit) then begin // the part that didn't require any replacing, because nothing more was found, // or rfReplaceAll flag was not specified, is copied at the // end as the final step. copyStartsAt := lastFoundAt+LengthPattern; copyLen := Limit; { this number can be larger than needed to be, and it is harmless } Result := Result + Copy(aSourceString, copyStartsAt, copyLen ); end; end;
好的,问题:堆栈的足迹:
var skiptable : array [Char] of Integer; // 65536*4 bytes stack usage on Unicode delphi
再见,CPU地狱,Hello堆栈地狱。如果要使用动态数组,则必须在运行时调整其大小。因此这件事基本上很快,因为计算机上的虚拟内存系统不会以256K的速度闪烁,但这并不是始终是最佳的代码。但是,我的PC不会像这样的大堆栈闪烁。这样的占用空间不会成为Delphi标准库的默认值,也不会在将来赢得任何快速代码挑战。我认为,在重复搜索的情况下,上述代码应作为一个类编写,而可跳过表应作为该类中的数据字段。然后,您可以构建一次boyer- moore表,随着时间的流逝,如果字符串不变,则重复使用该对象进行快速查找。