在VBA中加权的Damerau-Levenshtein

我正在为Microsoft Office套件构build私人拼写检查器。我正在做string比较的错别字和他们的潜在的修复，以确定我想包括哪些更正。

我已经看了一个加权的 Damerau-Levenshtein公式进行string比较的高低，因为我想交换，插入，删除和replace都有不同的权重，而不仅仅是“1”的权重，所以我可以优先考虑一些更正超过其他人。例如，错字“agmes”在理论上可以修正为“游戏” 或 “年龄”，因为两者都只需要一个编辑就可以移动到拼写正确的单词，但是我想给“swap”编辑一个较低的权重“游戏”将显示为首选的修正。

我正在使用Excel进行分析，所以我使用的任何代码都需要在Visual Basic for Applications（VBA）中。我能find的最好的就是这个例子，这看起来不错，但它是用Java编写的。我尽我所能转换，但我远离专家，可以使用一点帮助！

任何人都可以看看附带的代码，并帮助我找出什么是错的？

谢谢！

编辑：我得到它自己的工作。这里是VBA中加权的Damerau-Levenshtein公式。它使用Excel的内置math函数进行一些评估。当比较一个拼写错误和两个可能的更正时，成本最高的更正是首选的词。这是因为两次交换的成本必须大于删除和插入的成本，如果以最低的成本（我认为是理想的）分配交换，这是不可能的。查看凯文的博客，如果你需要更多的信息。

Public Function WeightedDL(source As String, target As String) As Double Dim deleteCost As Double Dim insertCost As Double Dim replaceCost As Double Dim swapCost As Double deleteCost = 1 insertCost = 1.1 replaceCost = 1.1 swapCost = 1.2 Dim i As Integer Dim j As Integer Dim k As Integer If Len(source) = 0 Then WeightedDL = Len(target) * insertCost Exit Function End If If Len(target) = 0 Then WeightedDL = Len(source) * deleteCost Exit Function End If Dim table() As Double ReDim table(Len(source), Len(target)) Dim sourceIndexByCharacter() As Variant ReDim sourceIndexByCharacter(0 To 1, 0 To Len(source) - 1) As Variant If Left(source, 1) <> Left(target, 1) Then table(0, 0) = Application.Min(replaceCost, (deleteCost + insertCost)) End If sourceIndexByCharacter(0, 0) = Left(source, 1) sourceIndexByCharacter(1, 0) = 0 Dim deleteDistance As Double Dim insertDistance As Double Dim matchDistance As Double For i = 1 To Len(source) - 1 deleteDistance = table(i - 1, 0) + deleteCost insertDistance = ((i + 1) * deleteCost) + insertCost If Mid(source, i + 1, 1) = Left(target, 1) Then matchDistance = (i * deleteCost) + 0 Else matchDistance = (i * deleteCost) + replaceCost End If table(i, 0) = Application.Min(Application.Min(deleteDistance, insertDistance), matchDistance) Next For j = 1 To Len(target) - 1 deleteDistance = table(0, j - 1) + insertCost insertDistance = ((j + 1) * insertCost) + deleteCost If Left(source, 1) = Mid(target, j + 1, 1) Then matchDistance = (j * insertCost) + 0 Else matchDistance = (j * insertCost) + replaceCost End If table(0, j) = Application.Min(Application.Min(deleteDistance, insertDistance), matchDistance) Next For i = 1 To Len(source) - 1 Dim maxSourceLetterMatchIndex As Integer If Mid(source, i + 1, 1) = Left(target, 1) Then maxSourceLetterMatchIndex = 0 Else maxSourceLetterMatchIndex = -1 End If For j = 1 To Len(target) - 1 Dim candidateSwapIndex As Integer candidateSwapIndex = -1 For k = 0 To UBound(sourceIndexByCharacter, 2) If sourceIndexByCharacter(0, k) = Mid(target, j + 1, 1) Then candidateSwapIndex = sourceIndexByCharacter(1, k) Next Dim jSwap As Integer jSwap = maxSourceLetterMatchIndex deleteDistance = table(i - 1, j) + deleteCost insertDistance = table(i, j - 1) + insertCost matchDistance = table(i - 1, j - 1) If Mid(source, i + 1, 1) <> Mid(target, j + 1, 1) Then matchDistance = matchDistance + replaceCost Else maxSourceLetterMatchIndex = j End If Dim swapDistance As Double If candidateSwapIndex <> -1 And jSwap <> -1 Then Dim iSwap As Integer iSwap = candidateSwapIndex Dim preSwapCost If iSwap = 0 And jSwap = 0 Then preSwapCost = 0 Else preSwapCost = table(Application.Max(0, iSwap - 1), Application.Max(0, jSwap - 1)) End If swapDistance = preSwapCost + ((i - iSwap - 1) * deleteCost) + ((j - jSwap - 1) * insertCost) + swapCost Else swapDistance = 500 End If table(i, j) = Application.Min(Application.Min(Application.Min(deleteDistance, insertDistance), matchDistance), swapDistance) Next sourceIndexByCharacter(0, i) = Mid(source, i + 1, 1) sourceIndexByCharacter(1, i) = i Next WeightedDL = table(Len(source) - 1, Len(target) - 1) End Function

我可以看到你自己已经回答了这个问题：我几年前为地址匹配编写了一个修改的Levenshtein编辑距离algorithm：

 http://hairyears.livejournal.com/115867.html

…但是这样做performance不佳，而且“共同的string”方法足以完成手头的任务：

 http://excellerando.blogspot.com/2010/03/vlookup-with-fuzzy-matching-to-get.html

该代码可能需要重新testing和重新工作。

看看你的代码，如果你想重温它，这里有一个速度提示 ：

 Dim arrByte（）As Byte 
 Dim byteChar As Byte 

 arrByte = strSource

对于我= LBound（arrByte）到UBound（arrByte）步骤2 
 byteChar = arrByte（i）'我将在char上使用整数运算做一些比较操作
接下来我

即使使用Mid $（）而不是Mid（），但在VBA中的string处理非常缓慢，但是数字操作是相当不错的：string实际上是字节数组，而编译器将以字面值接受。

循环中2的“step”是跳过unicodestring中的高位字节 – 你可能正在用plain-vanilla ASCII文本运行你的string比较，你会看到字节数组（比如说） “ABCd”是（00，65，00，66，00，67，00，100）。西欧国家的大部分拉丁字母 – 重音符号，变音符号，重音符号和所有 – 都将在255以下，并且不会冒险进入在这个例子中显示为零的高位字节。

您将用希伯来语，希腊语，俄语和阿拉伯语进行严格的单语string比较，因为每个字母表中的高字节是恒定的：希腊语“αβγδ”是字节数组（03,12-03,02,03,12 ，03,12）。但是，这是草率的编码，它会咬你（或字节）你的屁股，当你尝试string比较跨语言。而且它永远不会用东方字母飞行。

相信这些线路是错误的 ： –

 deleteDistance = table(0, j - 1) + insertCost insertDistance = ((j + 1) * insertCost) + deleteCost

认为应该是： –

 deleteDistance = ((j + 1) * insertCost) + deleteCost insertDistance = table(0, j - 1) + insertCost

没有通过代码来找出发生了什么，但下面是奇怪的！

 If Left(source, 1) <> Left(target, 1) Then table(0, 0) = Application.Min(replaceCost, (deleteCost + insertCost)) End If

因为您将需要replace，删除或插入可能应该是： –

 If Left(source, 1) <> Left(target, 1) Then table(0, 0) = Application.Min(replaceCost, Application.Min(deleteCost, insertCost)) End If

在VBA中加权的Damerau-Levenshtein

如果当前没有安装，可以添加对Excel版本的引用吗？

Excel打印一个特定的数字在其他列的行

如何根据另一个单元格的值在单元格中发布自定义方程？

Excel：我怎么能给不同的人不同的保护和访问权限

Excel VBA拖放

Excel，Libreoffice / OpenOffice Calc：计数“正确”的答案

如何使这个代码不依赖于文件名

增加一个文件在excel中查找

Excel visual basic 1004错误，对象'范围'的方法'值'失败

以编程方式从C＃创buildExcel VBA代码和button