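Done. Below is the final code, which passes all of my tests. Again, it is modeled after Murilo Vasconcelo's modified version of Steve Hanov's algorithm. Thanks to everyone who helped!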
/**
 * Computes the minimum Levenshtein Distance between the given word (represented as a list of
 * Characters) and the words stored in theTrie. This algorithm is modeled after Steve Hanov's blog
 * article "Fast and Easy Levenshtein distance using a Trie" and Murilo Vasconcelo's revised
 * version in C++.
 *
 * http://stevehanov.ca/blog/index.php?id=114
 * http://murilo.wordpress.com/2011/02/01/fast-and-easy-levenshtein-distance-using-a-trie-in-c/
 *
 * @param word the characters of an input word
 * @return the minimum Levenshtein Distance
 */
private int computeMinimumLevenshteinDistance(ArrayList<Character> word) {
    theTrie.minLevDist = Integer.MAX_VALUE;

    int iWordLength = word.size();
    int[] currentRow = new int[iWordLength + 1];

    // First row of the matrix: distance from the empty prefix to each prefix of the word.
    for (int i = 0; i <= iWordLength; i++) {
        currentRow[i] = i;
    }

    // Start one traversal per child of the root, as in the referenced versions. Passing the
    // root together with a letter of the input word would treat that letter as an extra
    // leading character of every stored word.
    for (Map.Entry<Character, TrieNode> entry : theTrie.root.children.entrySet()) {
        traverseTrie(entry.getValue(), entry.getKey(), word, currentRow);
    }

    return theTrie.minLevDist;
}

/**
 * Recursive helper function. Traverses theTrie in search of the minimum Levenshtein Distance.
 *
 * @param node the current TrieNode
 * @param letter the character stored at the current TrieNode
 * @param word a list representation of the current word
 * @param previousRow the previous row in the Levenshtein Distance matrix
 */
private void traverseTrie(TrieNode node, char letter, ArrayList<Character> word, int[] previousRow) {
    int size = previousRow.length;
    int[] currentRow = new int[size];
    currentRow[0] = previousRow[0] + 1;

    int minimumElement = currentRow[0];
    int insertCost, deleteCost, replaceCost;

    for (int i = 1; i < size; i++) {
        insertCost = currentRow[i - 1] + 1;
        deleteCost = previousRow[i] + 1;

        if (word.get(i - 1) == letter) {
            replaceCost = previousRow[i - 1];
        } else {
            replaceCost = previousRow[i - 1] + 1;
        }

        currentRow[i] = minimum(insertCost, deleteCost, replaceCost);

        // Track the smallest entry in this row so the branch can be pruned below.
        if (currentRow[i] < minimumElement) {
            minimumElement = currentRow[i];
        }
    }

    // The last entry is the distance between the whole input word and the path to this node.
    if (currentRow[size - 1] < theTrie.minLevDist && node.isWord) {
        theTrie.minLevDist = currentRow[size - 1];
    }

    // Only descend if some entry in this row can still beat the best distance found so far.
    if (minimumElement < theTrie.minLevDist) {
        for (Character c : node.children.keySet()) {
            traverseTrie(node.children.get(c), c, word, currentRow);
        }
    }
}
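Finally, I managed to get it working for most test cases. My implementation is practically a direct translation of Murilo's C++ version of Steve Hanov's algorithm. So how should I refactor and/or optimize this algorithm? Below is the code…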
public int search(String word) {
    theTrie.minLevDist = Integer.MAX_VALUE;

    int size = word.length();
    int[] currentRow = new int[size + 1];

    for (int i = 0; i <= size; i++) {
        currentRow[i] = i;
    }

    for (int i = 0; i < size; i++) {
        char c = word.charAt(i);
        if (theTrie.root.children.containsKey(c)) {
            searchRec(theTrie.root.children.get(c), c, word, currentRow);
        }
    }

    return theTrie.minLevDist;
}

private void searchRec(TrieNode node, char letter, String word, int[] previousRow) {
    int size = previousRow.length;
    int[] currentRow = new int[size];
    currentRow[0] = previousRow[0] + 1;

    int insertCost, deleteCost, replaceCost;

    for (int i = 1; i < size; i++) {
        insertCost = currentRow[i - 1] + 1;
        deleteCost = previousRow[i] + 1;

        if (word.charAt(i - 1) == letter) {
            replaceCost = previousRow[i - 1];
        } else {
            replaceCost = previousRow[i - 1] + 1;
        }

        currentRow[i] = minimum(insertCost, deleteCost, replaceCost);
    }

    if (currentRow[size - 1] < theTrie.minLevDist && node.isWord) {
        theTrie.minLevDist = currentRow[size - 1];
    }

    if (minElement(currentRow) < theTrie.minLevDist) {
        for (Character c : node.children.keySet()) {
            searchRec(node.children.get(c), c, word, currentRow);
        }
    }
}
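Thanks to everyone who contributed to this question. I tried to get Levenshtein automata working, but I couldn't make it happen.
So I am looking for suggestions on refactoring and/or optimizing the code above. Please let me know if anything is confusing. As always, I can provide the rest of the source code as needed.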
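One possible direction for the refactoring, offered only as a sketch: the shared `minLevDist` field can be dropped by passing the best distance found so far into the recursion and returning the updated value. The names `TrieDistanceSketch`, `Node`, `minimumDistance` and `search` are made up for this illustration and are not part of the code above. The sketch also makes two assumptions that differ from the code above: it starts the recursion from every child of the root (as Steve Hanov's article does) rather than only from children that match a letter of the input, and it counts the empty string as a candidate when the root is marked as a word.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, self-contained variant of the trie search with no shared mutable field.
final class TrieDistanceSketch {

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    void insert(String word) {
        Node current = root;
        for (int i = 0; i < word.length(); i++) {
            current = current.children.computeIfAbsent(word.charAt(i), k -> new Node());
        }
        current.isWord = true;
    }

    /** Returns the minimum distance to any stored word, or Integer.MAX_VALUE for an empty trie. */
    int minimumDistance(String word) {
        int[] firstRow = new int[word.length() + 1];
        for (int i = 0; i < firstRow.length; i++) {
            firstRow[i] = i;
        }
        // Assumption: the root stands for the empty string, whose distance is word.length().
        int best = root.isWord ? word.length() : Integer.MAX_VALUE;
        for (Map.Entry<Character, Node> entry : root.children.entrySet()) {
            best = search(entry.getValue(), entry.getKey(), word, firstRow, best);
        }
        return best;
    }

    /** Computes the row for this node and returns the (possibly improved) best distance. */
    private int search(Node node, char letter, String word, int[] previousRow, int best) {
        int size = previousRow.length;
        int[] currentRow = new int[size];
        currentRow[0] = previousRow[0] + 1;
        int rowMinimum = currentRow[0];

        for (int i = 1; i < size; i++) {
            int insertCost = currentRow[i - 1] + 1;
            int deleteCost = previousRow[i] + 1;
            int replaceCost = previousRow[i - 1] + (word.charAt(i - 1) == letter ? 0 : 1);
            currentRow[i] = Math.min(insertCost, Math.min(deleteCost, replaceCost));
            rowMinimum = Math.min(rowMinimum, currentRow[i]);
        }

        if (node.isWord) {
            best = Math.min(best, currentRow[size - 1]);
        }
        // Prune: descend only if some entry in this row is still below the best so far.
        if (rowMinimum < best) {
            for (Map.Entry<Character, Node> entry : node.children.entrySet()) {
                best = search(entry.getValue(), entry.getKey(), word, currentRow, best);
            }
        }
        return best;
    }
}
```

Because `best` travels through the call chain, two searches on the same trie no longer interfere with each other, which is one of the things the global variable made awkward.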
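So I implemented a simple Trie data structure and have been trying to follow Steve Hanov's Python tutorial to compute the Levenshtein distance. Actually, I am interested in computing the minimum Levenshtein distance between a given word and the words in the Trie, so I have been following Murilo Vasconcelos's version of Steve Hanov's algorithm. It is not working very well, but here is my Trie class: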
public class Trie {

    public TrieNode root;
    public int minLevDist;

    public Trie() {
        this.root = new TrieNode(' ');
    }

    public void insert(String word) {
        int length = word.length();
        TrieNode current = this.root;

        if (length == 0) {
            current.isWord = true;
        }

        for (int index = 0; index < length; index++) {
            char letter = word.charAt(index);
            TrieNode child = current.getChild(letter);

            if (child != null) {
                current = child;
            } else {
                current.children.put(letter, new TrieNode(letter));
                current = current.getChild(letter);
            }

            if (index == length - 1) {
                current.isWord = true;
            }
        }
    }
}
…and the TrieNode class:
import java.util.HashMap;
import java.util.Map;

public class TrieNode {

    // Initial capacity hint for the children map, not a hard limit on the alphabet.
    public final int ALPHABET = 26;

    public char letter;
    public boolean isWord;
    public Map<Character, TrieNode> children;

    public TrieNode(char letter) {
        this.isWord = false;
        this.letter = letter;
        children = new HashMap<Character, TrieNode>(ALPHABET);
    }

    // Returns the child for the given letter, or null if there is none.
    public TrieNode getChild(char letter) {
        if (children != null) {
            if (children.containsKey(letter)) {
                return children.get(letter);
            }
        }
        return null;
    }
}
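Now, I tried to implement the search the way Murilo Vasconcelos has it, but something is off and I need some help debugging it. Please give suggestions on how to refactor it and/or point out where the bugs are. The first thing I'd like to refactor is the global variable "minCost", but that's the least of my problems. Anyway, here is the code…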
public void search(String word) {
    int size = word.length();
    int[] currentRow = new int[size + 1];

    for (int i = 0; i <= size; i++) {
        currentRow[i] = i;
    }

    for (int i = 0; i < size; i++) {
        char c = word.charAt(i);
        if (theTrie.root.children.containsKey(c)) {
            searchRec(theTrie.root.children.get(c), c, word, currentRow);
        }
    }
}

private void searchRec(TrieNode node, char letter, String word, int[] previousRow) {
    int size = previousRow.length;
    int[] currentRow = new int[size];
    currentRow[0] = previousRow[0] + 1;

    int replace, insertCost, deleteCost;

    for (int i = 1; i < size; i++) {
        char c = word.charAt(i - 1);

        insertCost = currentRow[i - 1] + 1;
        deleteCost = previousRow[i] + 1;
        replace = (c == letter) ? previousRow[i - 1] : (previousRow[i - 1] + 1);

        currentRow[i] = minimum(insertCost, deleteCost, replace);
    }

    if (currentRow[size - 1] < minCost && !node.isWord) {
        minCost = currentRow[size - 1];
    }

    Integer minElement = minElement(currentRow);
    if (minElement < minCost) {
        for (Map.Entry<Character, TrieNode> entry : node.children.entrySet()) {
            searchRec(node, entry.getKey(), word, currentRow);
        }
    }
}
I apologize for the lack of comments. So what am I doing wrong?
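I have been reading the article Fast and Easy Levenshtein distance using a Trie, hoping to figure out an efficient way to compute the Levenshtein distance between two strings. My main goal is, given a large set of words, to be able to find the minimum Levenshtein distance between an input word and this set of words.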
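In my trivial implementation, for each input word I compute the Levenshtein distance between the input word and every word in the set, and return the minimum. It works, but it is not efficient…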
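For reference, the trivial approach described above might look roughly like the following. This is only a sketch of the idea, not my actual code: the class name `NaiveLevenshtein` and its method names are invented for illustration. It runs the standard two-row dynamic-programming distance against every word in the set, so the cost grows with the number of words times the product of the word lengths.

```java
import java.util.List;

// Hypothetical baseline: compare the input against every dictionary word, one at a time.
final class NaiveLevenshtein {

    /** Classic two-row dynamic-programming Levenshtein distance. */
    static int distance(String a, String b) {
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];

        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            current[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int substitution = previous[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int deletion = previous[j] + 1;
                int insertion = current[j - 1] + 1;
                current[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            // Reuse the two arrays instead of allocating a new row each time.
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.length()];
    }

    /** Smallest distance to any dictionary word; Integer.MAX_VALUE if the dictionary is empty. */
    static int minimumDistance(String input, List<String> dictionary) {
        int best = Integer.MAX_VALUE;
        for (String candidate : dictionary) {
            best = Math.min(best, distance(input, candidate));
        }
        return best;
    }
}
```

Every dictionary word is processed from scratch here, even when many words share a prefix. Sharing the rows computed for common prefixes is what the Trie-based approach in the article is meant to achieve.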
I have been looking for Trie implementations in Java and came across two seemingly good sources:
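However, these implementations seem too complicated for what I am trying to do. As I read through them to understand how they work, and how Trie data structures work in general, I only became more confused.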
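So how would I implement a simple Trie data structure in Java? My intuition tells me that each TrieNode should store the String it represents, along with references to child nodes for the letters that follow it, which need not cover the whole alphabet. Is my intuition correct?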
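Once that is implemented, the next task is to compute the Levenshtein distance. I read through the Python code example in the article above, but I don't speak Python, and my Java implementation runs out of heap memory once it reaches the recursive search. So how would I compute the Levenshtein distance using the Trie data structure? I have a trivial implementation, modeled after this source code, but it doesn't use a Trie… it is inefficient.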
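It would be really nice to see some code in addition to your comments and suggestions. After all, this is a learning process for me… I have never implemented a Trie… so I have plenty to learn from this experience.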
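Thanks.
p.s. I can provide any source code if need be. Also, I have already read through and tried using a BK-Tree as suggested in Nick Johnson's blog, but it is not as efficient as I think it could be… or maybe my implementation is wrong.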
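I have implemented the algorithm described in the article "Fast and Easy Levenshtein distance using a Trie" in C++, and it is really fast. If you want (and if you understand C++ better than Python), I can paste the code somewhere.
Edit: I posted it on my blog.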