什么是最快的方法来转换arrays字母数字的每个成员?

最后的最终结果:

我想知道下面的结果是否改变,如果string更长。 我在同一台计算机上运行完全相同的testing,除了每个单元有一个34个字符的随机string,而不是四个。 这是结果:

Comintern (Regexp): 136.1 ms brettdj (Regexp): 139.9 ms Slai (Regexp): 158.4 ms *Original Regex: 161.0 ms* Comintern (AN): 170.1 ms Comintern (Hash): 183.6 ms ThunderFrame: 232.9 ms *Original replace: 372.9 ms* *Original InStr: 478.1 ms* CallumDA33: 1218.1 ms 

这真正显示了正则expression式的速度 – 使用Regex.replace的所有解决scheme都显着加快,最好的是实现了共产国际。

总之,如果string很长,使用数组,如果它们很短,则使用剪贴板。 如果不确定,最好的结果是使用数组,但是这可能会牺牲短string的一些性能。

最终结果:

非常感谢您的build议,显然我还有很多东西要学。 我昨天一直在想这个,所以我决定在家里重新运行一切。 这是最后的结果,基于将其中的每一个应用到三万四个字符的string。

我家里的电脑是英特尔i7 @ 3.6 GHz,8GB内存,64位Windows 10和Excel 2016.与之前类似的条件,我有进程在后台运行,但我没有主动做任何事情在整个testing。

 Original replace: 97.67 ms Original InStr: 106.54 ms Original Regex: 113.46 ms ThunderFrame: 82.21 ms Comintern (AN): 96.98 ms Comintern (OR): 81.87 ms Comintern (Hash): 101.18 ms brettdj: 81.66 ms CallumDA33: 201.64 ms Slai: 68.38 ms 

因此,我接受了Slai的答案,因为它显然是最快的一般实现方式,但是我将重新运行它们以对付实际数据,以便检查它是否仍然有效。


原文:

我在Excel中有一个数组是一个零件号列表。 例如,我需要将数组的每个成员变成字母数字

 ABC123-001 -> ABC123001 ABC123/001 -> ABC123001 ABC123001 -> ABC123001 

这样做的最快方法是什么?

对于上下文,我们的零件号码可以有不同的forms,所以我正在写一个函数,在给定的范围内find最好的匹配。 目前,使字母数字化的function部分需要大约50ms的运行时间,而function的其余部分总共需要大约30ms。 我也无法避免使用Excel。

我自己做了一些工作(见下面的答案),但主要的问题是我必须循环遍历数组中的每个元素 – 能有更好的方法吗? 我也从来没有运行testing之前,所以任何改进他们的反馈将不胜感激。

这是我到目前为止所尝试的。

我正在使用MicroTimer ,我的电脑有一个Intel i5 @ 2.5GHz,4GB的内存,64位的Windows 7.我有进程在后台运行,但是我没有主动做任何事情,而这些都运行。

我用这个代码创build了30,000行随机符号:

 =CHAR(RANDBETWEEN(1,60))&CHAR(RANDBETWEEN(48,57))&CHAR(RANDBETWEEN(37,140))&CHAR(RANDBETWEEN(37,140)) 

(请注意,我们如何在60处停止第一个字符,因为'='是char(61) ,我们希望避免Excel将其解释为公式。另外我们强制第二个字符是一个数字,这样我们可以保证至less一个字母数字字符在那里。)

1.基于案例使用循环。 平均时间:175ms

使用这篇文章中的函数,我们将范围加载到数组中,将该函数应用到数组的每个元素并将其粘贴回来。 码:

 Function AlphaNumericOnly(strSource As Variant) As String Dim i As Integer Dim strResult As String For i = 1 To Len(strSource) Select Case Asc(Mid(strSource, i, 1)) Case 48 To 57, 65 To 90, 97 To 122: 'include 32 if you want to include space strResult = strResult & Mid(strSource, i, 1) End Select Next AlphaNumericOnly = strResult End Function Sub Replace() Dim inputSh As Worksheet Dim inputRng As Range Set inputSh = Sheets("Data") Set inputRng = inputSh.Range("A1:A30000") Dim outputSh As Worksheet Dim outputRng As Range Set outputSh = Sheets("Replace") Set outputRng = outputSh.Range("A1:A30000") Dim time1 As Double, time2 As Double time1 = MicroTimer Dim arr As Variant arr = inputRng Dim i As Integer For i = LBound(arr) To UBound(arr) arr(i, 1) = AlphaNumericOnly(arr(i, 1)) Next i outputRng = arr time2 = MicroTimer Debug.Print (time2 - time1) * 1000 End Sub 

2.使用InStr()检查每个字符。 平均时间:201ms

定义一个有效值的string。 如果有效值出现在数组元素中,请逐个检查:

 Sub InStr() Dim inputSh As Worksheet Dim inputRng As Range Set inputSh = Sheets("Data") Set inputRng = inputSh.Range("A1:A30000") Dim outputSh As Worksheet Dim outputRng As Range Set outputSh = Sheets("InStr") Set outputRng = outputSh.Range("A1:A30000") Dim time1 As Double, time2 As Double time1 = MicroTimer Dim arr As Variant arr = inputRng Dim validValues As String validValues = "01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" 'put numbers and capitals at the start as they are more likely' Dim i As Integer, j As Integer Dim result As String For i = LBound(arr) To UBound(arr) result = vbNullString For j = 1 To Len(arr(i, 1)) If InStr(validValues, Mid(arr(i, 1), j, 1)) <> 0 Then result = result & Mid(arr(i, 1), j, 1) End If Next j arr(i, 1) = result Next i outputRng = arr time2 = MicroTimer Debug.Print (time2 - time1) * 1000 End Sub 

3.在数组上使用regex.Replace。 时间:171ms

定义一个正则expression式,并用它来replace数组中的每个元素。

 Sub Regex() Dim inputSh As Worksheet Dim inputRng As Range Set inputSh = Sheets("Data") Set inputRng = inputSh.Range("A1:A30000") Dim outputSh As Worksheet Dim outputRng As Range Set outputSh = Sheets("Regex") Set outputRng = outputSh.Range("A1:A30000") Dim time1 As Double, time2 As Double time1 = MicroTimer Dim arr As Variant arr = inputRng Dim objRegex As Object Set objRegex = CreateObject("vbscript.regexp") With objRegex .Global = True .ignorecase = True .Pattern = "[^\w]" End With Dim i As Integer For i = LBound(arr) To UBound(arr) arr(i, 1) = objRegex.Replace(arr(i, 1), vbNullString) Next i outputRng = arr time2 = MicroTimer Debug.Print (time2 - time1) * 1000 End Sub 

编辑:

@ThunderFrame – 我们的零件编号通常采用以下格式:

  • 所有的数字(例如32523452)
  • 字母和数字的组合(例如AB324K234或123H45645)
  • 字母和数字的混合,每个都由非字母数字字符(例如ABC001-001,ABC001 / 001,123 / 4557-121)

我想在每个string上使用regex.test之前启动到replace,但我不知道这是否只是复制string,然后testing它,在这种情况下,我可能只是做replace开始。

@Slai – 感谢您的链接 – 我会更详细地研究

不知道这是否会更快,因为它取决于太多因素,但可能值得testing。 而不是Regex。分别replace每个值,您可以从剪贴板中获取复制的Range文本,并一次replace所有值。 请注意, \w匹配下划线和Unicode字母,所以在正则expression式中更具体可以使其更快。

 '[a1:b30000] = [{"ABC123-009",""}]: Dim t As Double: t = Timer ' used for testing Dim r As Range, s As String Set r = ThisWorkbook.Worksheets("Data").UsedRange.Resize(, 1) ' Data!A1:A30000 With New MSForms.DataObject ' needs reference to "Microsoft Forms 2.0 Object Library" or use a bit slower late binding - With CreateObject("New:{1C3B4210-F441-11CE-B9EA-00AA006B1A69}") r.Copy .GetFromClipboard Application.CutCopyMode = False s = .GetText .Clear ' optional - clear the clipboard if using Range.PasteSpecial instead of Worksheet.PasteSpecial "Text" With New RegExp ' needs reference to "Microsoft VBScript Regular Expressions 5.5" or use a bit slower late binding - With CreateObject("VBScript.RegExp") .Global = True '.IgnoreCase = False ' .IgnoreCase is False by default .Pattern = "[^0-9A-Za-z\r\n]+" ' because "[^\w\r\n]+" also matches _ and Unicode letters s = .Replace(s, vbNullString) End With .SetText s .PutInClipboard End With ' about 70% of the time is spent here in pasting the data r(, 2).PasteSpecial 'xlPasteValues ' paste the text from clipboard in B1 'Debug.Print Timer - t 

由于剪贴板的开销,我预计这会因为较less的值而变慢,并且由于需要内存,可能会因为更多的值而变慢。

禁用事件似乎没有在我的testing中有所作为,但可能值得尝试。

请注意,在macros使用它的时候,另一个应用程序使用剪贴板的可能性很小。

如果早期绑定导致在不同机器上运行相同的编译macros的问题,您可以searchmacros反编译器或删除引用,并切换到后期绑定。

tl; dr – 正则expression式销毁VBA实现。 如果这是一个代码挑战,@布赖特或@Slai应该赢得它。

有一些技巧使AlphaNumericOnly更快。

首先,你可以把大部分的函数调用当作一个字节数组而不是一个string。 这将删除所有对Mid$Asc的调用。 尽pipe这些function非常快,但是它们仍然会增加开销,从而跳出调用堆栈。 这加起来超过了几十万次迭代。

如果可以避免,则第二个优化是不使用Case x To y语法。 原因在于它如何编译 – 它不会编译成像Case = Condition >= x And Condition <= y ,它实际上会创build一个循环,提前退出,如下所示:

 Case = False For i = x To y If Condition = i Then Case = True End If Next 

再次,不是一个巨大的性能打击,但它加起来。 第三个优化是以一种方式对你的testing进行sorting,使他们按照你的数据集中最可能的命中sorting。 我在下面主要用字母来定制我的例子,其中大部分都是大写。 你可以用不同的顺序做得更好。 把它放在一起,你得到了这样的东西:

 Public Function ByteAlphaNumeric(source As Variant) As String Dim chars() As Byte Dim outVal() As Byte chars = CStr(source) 'Load the array up. Dim bound As Long bound = UBound(chars) 'Size the outbound array. ReDim outVal(bound) Dim i As Long, pos As Long For i = 0 To bound Step 2 'Wide characters, only care about the ASCII range. Dim temp As Byte temp = chars(i) 'Pointer math isn't free. Cache it. Select Case True 'Order is important here. Case temp > 64 And temp < 91 outVal(pos) = temp pos = pos + 2 'Advance the output pointer. Case temp < 48 Case temp > 122 Case temp > 96 outVal(pos) = temp pos = pos + 2 Case temp < 58 outVal(pos) = temp pos = pos + 2 End Select Next 'This is likely the most expensive operation. ReDim Preserve outVal(pos) 'Trim the output array. ByteAlphaNumeric = outVal End Function 

它是如何做的? 很不错:

 Public Sub Benchmark() Dim starting As Single, i As Long, dummy As String, sample As Variant sample = GetRandomString starting = Timer For i = 1 To 1000000 dummy = AlphaNumericOnlyOP(sample) Next i Debug.Print "OP's AlphaNumericOnly: ", Timer - starting starting = Timer For i = 1 To 1000000 dummy = AlphaNumericOnlyThunderframe(sample) Next i Debug.Print "ThunderFrame's AlphaNumericOnly: ", Timer - starting starting = Timer For i = 1 To 1000000 dummy = AlphaNumeric(sample) Next i Debug.Print "CallumDA33's AlphaNumeric: ", Timer - starting starting = Timer For i = 1 To 1000000 dummy = ByteAlphaNumeric(sample) Next i Debug.Print "ByteAlphaNumeric: ", Timer - starting Dim cast As String cast = CStr(sample) starting = Timer For i = 1 To 1000000 dummy = ByteAlphaNumericString(cast) Next i Debug.Print "ByteAlphaNumericString: ", Timer - starting Set stripper = Nothing starting = Timer For i = 1 To 1000000 dummy = OptimizedRegex(sample) Next i Debug.Print "OptimizedRegex: ", Timer - starting End Sub Private Function GetRandomString() As Variant Dim chars(30) As Byte, i As Long Randomize For i = 0 To 30 Step 2 chars(i) = Int(96 * Rnd + 32) Next i Dim temp As String temp = chars GetRandomString = CVar(temp) End Function 

结果用15个字符随机String

 OP`s AlphaNumericOnly: 6.565918 ThunderFrame`s AlphaNumericOnly: 3.617188 CallumDA33`s AlphaNumeric: 23.518070 ByteAlphaNumeric: 2.354980 

请注意,我省略了不太可能转换为函数的提交。 您可能会注意到2个额外的testing – ByteAlphaNumericStringByteAlphaNumeric函数完全相同,但它将一个String作为input,而不是一个Variant并且摆脱了转换。 这不是微不足道的:

 ByteAlphaNumericString: 2.226074 

最后,难以捉摸的OptimizedRegex函数(基本上是用于比较时序的函数forms的@ brettj的代码):

 Private stripper As RegExp 'Module level Function OptimizedRegex(strSource As Variant) As String If stripper Is Nothing Then Set stripper = New RegExp With stripper .Global = True .Pattern = "[^0-9A-Za-z]" End With End If OptimizedRegex = stripper.Replace(strSource, vbNullString) End Function 
 OptimizedRegex: 1.094727 

编辑:奖金实施!

在我看来,哈希表查找可能比Select Case结构快,所以我使用Scripting.Dictionary创build了一个:

 Private hash As Scripting.Dictionary 'Module level Function HashLookups(source As Variant) As String Dim chars() As Byte Dim outVal() As Byte chars = CStr(source) Dim bound As Long bound = UBound(chars) ReDim outVal(bound) Dim i As Long, pos As Long With hash For i = 0 To bound Step 2 Dim temp As Byte temp = chars(i) If .Exists(temp) Then outVal(pos) = temp pos = pos + 2 End If Next End With ReDim Preserve outVal(pos) HashLookups = outVal End Function Private Sub LoadHashTable() Set hash = New Scripting.Dictionary Dim i As Long For i = 48 To 57 hash.Add i, vbNull Next For i = 65 To 90 hash.Add i, vbNull Next For i = 97 To 122 hash.Add i, vbNull Next End Sub 'Test code: starting = Timer LoadHashTable For i = 1 To 1000000 dummy = HashLookups(sample) Next i Debug.Print "HashLookups: ", Timer - starting 

事实certificate,这不是太破旧:

 HashLookups: 1.655273 

最终版本

醒来,我想我会尝试一个vector查找,而不是一个哈希查找(只需填写一个值的字节数组保持和使用的testing)。 这似乎是合理的,因为它只是一个256个元素的数组 – 基本上是一个真值表:

 Private lookup(255) As Boolean 'Module level Function VectorLookup(source As Variant) As String Dim chars() As Byte Dim outVal() As Byte chars = CStr(source) Dim bound As Long bound = UBound(chars) ReDim outVal(bound) Dim i As Long, pos As Long For i = 0 To bound Step 2 Dim temp As Byte temp = chars(i) If lookup(temp) Then outVal(pos) = temp pos = pos + 2 End If Next ReDim Preserve outVal(pos) VectorLookup = outVal End Function Private Sub GenerateTable() Dim i As Long For i = 48 To 57 lookup(i) = True Next For i = 65 To 90 lookup(i) = True Next For i = 97 To 122 lookup(i) = True Next End Sub 

假设查找表只生成一次,它的计时速度比上面任何一种纯VBA方法快10-15%。

感谢ThunderFrame(我是LHS Mid$的吸血鬼),但是我从早期的RegExp获得了更好的性能,并进行了额外的小调整:

  • 使用Value2而不是Value
  • 声明你的循环不是整数
  • .ignorecase = True是多余的

  Sub Replace2() Dim inputSh As Worksheet Dim inputRng As Range Set inputSh = Sheets("Data") Set inputRng = inputSh.Range("A1:A30000") Dim outputSh As Worksheet Dim outputRng As Range Set outputSh = Sheets("Replace") Set outputRng = outputSh.Range("A1:A30000") Dim time1 As Double, time2 As Double time1 = MicroTimer Dim arr As Variant Dim objRegex As VBScript_RegExp_55.RegExp Dim i As Long Set objRegex = CreateObject("vbscript.regexp") With objRegex .Global = True .Pattern = "[^\w]" End With arr = inputRng.Value2 For i = LBound(arr) To UBound(arr) arr(i, 1) = objRegex.Replace(arr(i, 1), vbNullString) Next i outputRng.Value2 = arr time2 = MicroTimer Debug.Print (time2 - time1) * 1000 End Sub 

如果您将第一个function(目前最佳的function)更改为以下function,那么根据您的数据,您至less可以获得40-50%的性能提升:

 Function AlphaNumericOnly(strSource As Variant) As String Dim i As Long Dim charCount As Long Dim strResult As String Dim char As String strResult = Space$(Len(strSource)) For i = 1 To Len(strSource) char = Mid$(strSource, i, 1) Select Case Asc(char) Case 48 To 57, 65 To 90, 97 To 122: 'include 32 if you want to include space charCount = charCount + 1 Mid$(strResult, charCount, 1) = char End Select Next AlphaNumericOnly = Left$(strResult, charCount) End Function 

我使用了一些优化,但主要是在一个循环中多次重新分配strResult ,而这个循环非常昂贵,而当您的string更长(并且循环运行次数更多)时,则更为昂贵。 更好地使用Mid$

而且,使用$ -suffixed函数对string进行了优化,所以你也可以获得更好的性能

优化RegEx版本

您的正则expression式方法具有合理的性能,但是您使用的是后期绑定的CreateObject ,这会比早期绑定的强types引用快得多。

而且,你的正则expression式模式和选项每次都是一样的,你可以将正则expression式对象声明为variables,只有当它不存在时才创build它,然后每次重新使用现有的正则expression式。

我会把它扔到那里,如果没有别的,看看它是如何执行的。 我相信它也可以被整理一下。

我希望testing一个字符是一个字母的方法会变得更快。 我敢肯定,testing一个数字可以做得更快一点。

 Function AlphaNumeric(s As String) As String Dim char As String, tempStr As String Dim i As Integer Dim t As Variant For i = 1 To Len(s) char = Mid(s, i, 1) If IsLetter(char) Or IsNumber(char) Then tempStr = tempStr & char End If Next i AlphaNumeric = tempStr End Function Private Function IsLetter(s As String) As Boolean If UCase(s) = s And LCase(s) = s Then IsLetter = False Else: IsLetter = True End If End Function Private Function IsNumber(s As String) On Error GoTo 1 s = s * 1 IsNumber = True Exit Function 1: IsNumber = False End Function