VBA – 匹配2个sortingstring数组,其中一些元素没有匹配优化

我有2个数据集有很多string,我需要匹配。 第一个是1200行,第二个是大约800000.我通过Excelsorting,通过VBA调用这两个集合,所以他们在升序,因此我可以显着优化search速度,通过开始下一个迭代的第二个数据集比赛。

不幸的是,当没有find匹配的时候,即使对我search的字词检查string在字母表(>我的string)中,也不会遇到Exit For 。 我试图实现比较If vData1(arrayIndex1, 1) < vData2(arrayIndex2, 1) Then (也许与'国防部'检查之前,如果这样做在每个迭代将是缓慢的),但我遇到不正确的比较值,例如?"µm">"zzzzz"在string以“a”开始之前返回true,而在数据集中它是应该的。

有没有可靠的方法来解决这个问题?

 Dim optimizedCounter as long, arrayIndex1 as long, arrayIndex2 as long Dim vData1 as variant, vData2 as variant 'sort 2 data sets in Excel ascending 'assign data sets to arrays vData1 and vData2 optimizedCounter = LBound(vData2) For arrayIndex1 = LBound(vData1) To UBound(vData1) For arrayIndex2 = optimizedCounter To UBound(vData2) If vData1(arrayIndex1, 1) = vData2(arrayIndex2, 1) Then 'do action when strings match optimizedCounter = arrayIndex2 'narrow down 2nd data set's list, arrayIndex2 + 1 if vData1 has no duplicates Exit For 'match has been found, exit loop and continue matching for next element in 1st data set End If Next arrayIndex2 Next arrayIndex1 

编辑:

谢谢大家的精彩build议。 现在ASH的Application.Evaluate / StrComp解决scheme为我做了窍门。 因为我使用默认的二进制比较vData1(arrayIndex1, 1) = vData2(arrayIndex2, 1) ,我想保留当前的速度,我不能使用选项比较文本。

 For arrayIndex1 = LBound(vData1) To UBound(vData1) For arrayIndex2 = optimizedCounter To UBound(vData2) If vData1(arrayIndex1, 1) = vData2(arrayIndex2, 1) Then 'do action when strings match optimizedCounter = arrayIndex2 'narrow down 2nd data set's list, arrayIndex2 + 1 if vData1 has no duplicates Exit For 'match has been found, exit loop and continue matching for next element in 1st data set ElseIf arrayIndex2 Mod 1000 = 0 Then If Application.Evaluate("""" & vData2(arrayIndex2, 1) & _ """ > """ & vData1(arrayIndex1, 1) & """") Then Exit For 'line below can be used instead of Application.Evaluate, the same speed, easier structure 'If StrComp(vData2(arrayIndex2, 1), vData1(arrayIndex1, 1), vbTextCompare) = 1 Then Exit For End If Next arrayIndex2 Next arrayIndex1 

由于这个方法需要一些时间,为了获得性能增益,我不得不每隔n次迭代使用它。 根据数据集的长度和匹配值的百分比,最优的mod值将会不同。

作为检查组合数量的评论,我的search条件列表包含重复。

香草代码:

执行时间:12.76

处理组合:144596591

Application.Evaluate或StrComp:

执行时间:17.30

处理的组合:1192341

在条件mod 50 = 0时的Application.Evaluate或StrComp:

执行时间:0.48

加工组合:1201717

条件mod 1000 = 0时的Application.Evaluate或StrComp:

执行时间:0.16

处理的组合:1376317

由于处理的组合数量较多,增加mod值会增加此时的执行时间。

我试过在主循环之外放置With Application ,使用.Evaluate,对速度完全没有影响。

编辑2:

Application.MatchApplication.Vlookup不适用于具有> 65536行的数组。 正如评论指出的那样,他们确实为范围工作。

 Dim vMatch as Variant, myRng as Range 'myRng is set to one-column range of values to look for, about 800K rows For arrayIndex1 = LBound(vData1) To UBound(vData1) vMatch = Application.Match(vData1(arrayIndex1, 1), myRng, 0) If Not IsError(vMatch) Then 'do action when strings match End If Next arrayIndex1 

MatchType = 0的Application.Match:

执行时间:28.81

查找次数:1200

If vData1(arrayIndex1, 1) < vData2(arrayIndex2, 1) Then …我遇到不正确的比较值,例如?"µm">"zzzzz"返回true,而在数据集中它是它应该是,在string之前从“a”开始。

事实上,如果string比较操作在先前的sorting和代码中不同,那么先前的sorting就变得毫无用处。 而这是因为

VBA中的比较默认是二进制的

 ?"µm">"zzzzz" ---> True ?Application.Evaluate("""µm"">""zzzzz""") ---> False ?StrComp("µm", "zzzzz") ---> 1 ?StrComp("µm", "zzzzz", vbTextCompare) ---> -1 ^^^^^^^^^^^^^^ 

ps,除非您在注释中strComp Option Compare TextstrComp ,或者使用Excel的比较:

 If Application.Evaluate("""" & vData1(arrayIndex1, 1) & _ """ < """ & vData2(arrayIndex2, 1) & """") Then 

这将解决比较不匹配的问题。 事实上,基于<比较来停止你的循环会让它快得多。 这是否是最好的algorithm是另一个争论。 你的数组正在sorting, 二进制search应该是一个完美的候选人。

除非进行二分search,否则考虑使用Excel的内置函数,即Application.VLookupApplication.Match ,它们几乎比VBA循环快一个数量级,即使后者在预取arrays上工作。

我用一些二进制匹配函数进行了一些testing,它在大约3秒内运行了2个数据集129K行对780K行,并进行了335K比较检查。 这是二进制search的function+一点点调整。

一些修改后的“二进制search”实用function:

 Public Function wsArrayBinaryMatch( _ ByVal val As Variant, _ arr() As Variant, _ ByVal searchCol As Long, _ Optional optimalStart As Long, Optional optimalEnd As Long, Optional exactMatch As Boolean = True) As Variant Dim a As Long, z As Long, curr As Long wsArrayBinaryMatch = "Not Found in Range" a = IIf(optimalStart, optimalStart, LBound(arr)) z = IIf(optimalEnd, optimalEnd, UBound(arr)) If compare(arr(a, searchCol), val) = 1 Then Exit Function End If If compare(arr(a, searchCol), val) = 0 Then wsArrayBinaryMatch = a Exit Function End If If compare(arr(z, searchCol), val) = -1 Then Exit Function End If While z - a > 1 curr = Round((CLng(a) + CLng(z)) / 2, 0) If compare(arr(curr, searchCol), val) = 0 Then z = curr wsArrayBinaryMatch = curr End If If compare(arr(curr, searchCol), val) = -1 Then a = curr Else z = curr End If Wend If compare(arr(z, searchCol), val) = 0 Then wsArrayBinaryMatch = z Else If Not exactMatch Then wsArrayBinaryMatch = a Else 'approx match to val was found inside the range... wsArrayBinaryMatch = "ApproxIndex" & a End If End If End Function Public Function wsArrayBinaryLookup( _ ByVal val As Variant, _ arr() As Variant, _ ByVal searchCol As Long, _ ByVal returnCol As Long, _ Optional exactMatch As Boolean = True) As Variant Dim a As Long, z As Long, curr As Long wsArrayBinaryLookup = CVErr(xlErrNA) a = LBound(arr) z = UBound(arr) If compare(arr(a, searchCol), val) = 1 Then Exit Function End If If compare(arr(a, searchCol), val) = 0 Then wsArrayBinaryLookup = arr(a, returnCol) Exit Function End If If compare(arr(z, searchCol), val) = -1 Then Exit Function End If While z - a > 1 curr = Round((CLng(a) + CLng(z)) / 2, 0) If compare(arr(curr, searchCol), val) = 0 Then z = curr wsArrayBinaryLookup = arr(curr, returnCol) End If If compare(arr(curr, searchCol), val) = -1 Then a = curr Else z = curr End If Wend If compare(arr(z, searchCol), val) = 0 Then wsArrayBinaryLookup = arr(z, returnCol) Else If Not exactMatch Then wsArrayBinaryLookup = arr(a, returnCol) End If End If End Function Public Function compare(ByVal x As Variant, ByVal y As Variant) As Long If IsNumeric(x) And IsNumeric(y) Then Select Case x - y Case Is = 0 compare = 0 Case Is > 0 compare = 1 Case Is < 0 compare = -1 End Select Else If TypeName(x) = "String" And TypeName(y) = "String" Then compare = StrComp(x, y, vbTextCompare) End If End If End Function 

然后,我写了一个试图充分利用sorting数据的子(可以转换为函数),并提高了限制search范围的效率。 这涉及在第一个数据集中寻找低和高的项目之间交替。

请注意,两个数据集每个只有两列,而且每个数据集都从第一列search匹配项。 如果find匹配,则在第一组中设置第二列的值。

用string返回大约匹配的方式有点不好意思,但是我累了…

 Sub BinaryMatchInSortedSets() Dim set1() As Variant, set2() As Variant set1 = Sheet1.Range("E2:F129601").Value '129K rows of strings and column F says 'Default' set2 = Sheet1.Range("I2:J780001").Value '780K rows of strings and numbers Dim low1 As Long, high1 As Long Dim low2 As Long, high2 As Long Dim indexToFind As Long, approxIndex As Long low1 = LBound(set1) high1 = UBound(set1) low2 = LBound(set2) high2 = UBound(set2) Dim errString As String Dim matchIndex As Variant Dim searchingForLow As Boolean: searchingForLow = True While low1 <= high1 And low2 < high2 indexToFind = IIf(searchingForLow, low1, high1) matchIndex = wsArrayBinaryMatch(set1(indexToFind, 1), set2, 1, low2, high2, True) If IsNumeric(matchIndex) Then 'match found low2 = IIf(searchingForLow, matchIndex, low2) high2 = IIf(searchingForLow, high2, matchIndex) 'do all other stuff in here that needs doing when match is found... set1(indexToFind, 2) = set2(matchIndex, 2) 'Just an example of what you could do Else 'no match, so set up efficient search range if possible If Left(matchIndex, 11) = "ApproxIndex" Then approxIndex = Mid(matchIndex, 12) If searchingForLow Then low2 = approxIndex + 1 Else high2 = approxIndex - 1 End If End If End If If searchingForLow Then low1 = low1 + 1 Else high1 = high1 - 1 End If searchingForLow = Not searchingForLow Wend Sheet1.Range("L2").Resize(UBound(set1) - LBound(set1) + 1, 2).Value = set1 End Sub