Excel – string删除重复

我正在处理一些英国地址数据，这些数据在Excel单元格中以逗号分隔。

我有一些VBA，我从网上已经删除了一些确切的重复条目，但我留下了大量的数据，其中有一些顺序和一些非顺序重复段。

随附的是一幅突出显示我正在努力实现的图像，包含了我迄今使用的不是我的代码，以向您展示我一直在寻找的方向。任何人都可以进一步思考如何实现这一目标？

Function stringOfUniques(inputString As String, delimiter As String) Dim xVal As Variant Dim dict As Object Set dict = CreateObject("Scripting.Dictionary") For Each xVal In Split(inputString, delimiter) dict(xVal) = xVal Next xVal stringOfUniques = Join(dict.Keys(), ",") End Function

这确实设法摆脱了一些他们，但是我有很大的人口，所以自动化这将是不可思议的。

理想的结果

可能不是最优雅的答案，但这是诀窍。这里我使用Split命令来分割每个逗号的string。从这返回的结果是

 bat ball banana

码：

 Option Explicit Private Sub test() Dim Mystring As String Dim StrResult As String Mystring = "bat,ball,bat,ball,banana" StrResult = shed_duplicates(Mystring) End Sub Private Function shed_duplicates(ByRef Mystring As String) As String Dim MySplitz() As String Dim J As Integer Dim K As Integer Dim BooMatch As Boolean Dim StrTemp(10) As String ' assumes no more than 10 possible splits! Dim StrResult As String MySplitz = Split(Mystring, ",") For J = 0 To UBound(MySplitz) BooMatch = False For K = 0 To UBound(StrTemp) If MySplitz(J) = StrTemp(K) Then BooMatch = True Exit For End If Next K If Not BooMatch Then StrTemp(J) = MySplitz(J) End If Next For J = 0 To UBound(StrTemp) If Len(StrTemp(J)) > 0 Then ' ignore blank entries StrResult = StrResult + StrTemp(J) + " " End If Next J Debug.Print StrResult End Function

你可能真的使用正则expression式replace：

 ^(\d*\s*([^,]*),.*)\2(,|$)

replace模式是

 $1$3

查看正则expression式演示。 模式说明 ：

^ – 开始一个string（或者一行，如果.MultiLine = True ）
(\d*\s*([^,]*),.*) – 第1组（稍后引用与$1反向引用从replace模式）匹配：
- \d* – 0+数字后跟
- \s* – 0+空白字符
- ([^,]*) – 组2（稍后我们可以使用\2 in-pattern backreference来引用此子模式捕获的值）匹配0以外的逗号
- ,.* – 一个逗号，后跟除换行符以外的0个以上的字符
\2 – 第2组捕获的文本
(,|$) – 组3（稍后引用replace模式中的$3 – 恢复逗号）匹配逗号或string结尾（或者如果.MultiLine = True则.MultiLine = True行）。

注意：如果您只是检查包含一个地址的单个单元格，则不需要.MultiLine = True 。

下面是一个示例VBA Sub展示了如何在VBA中使用它：

 Sub test() Dim regEx As Object Set regEx = CreateObject("VBScript.RegExp") With regEx .pattern = "^(\d*\s*([^,]*),.*)\2(,|$)" .Global = True .MultiLine = True ' Remove if individual addresses are matched End With s = "66 LAUSANNE ROAD,LAUSANNE ROAD,HORNSEY" & vbCrLf & _ "9 CARNELL LANE,CARNELL LANE,FERNWOOD" & vbCrLf & _ "35 FLAT ANDERSON HEIGHTS,1001 LONDON ROAD,FLAT ANDERSON HEIGHTS" & vbCrLf & _ "27 RUSSELL BANK ROAD,RUSSEL BANK,SUTTON COLDFIELD" MsgBox regEx.Replace(s, "$1$3") End Sub

在这里输入图像描述

第一个解决scheme是使用字典来获取唯一段的列表。然后就像在跳过分段之前跳过第一个地址号一样简单：

 Function RemoveDuplicates1(text As String) As String Static dict As Object If dict Is Nothing Then Set dict = CreateObject("Scripting.Dictionary") dict.CompareMode = 1 ' set the case sensitivity to All Else dict.RemoveAll End If ' Get the position just after the address number Dim c&, istart&, segment For istart = 1 To Len(text) c = Asc(Mid$(text, istart, 1)) If (c < 48 Or c > 57) And c <> 32 Then Exit For ' if not [0-9 ] Next ' Split the segments and add each one of them to the dictionary. No need to keep ' a reference to each segment since the keys are returned by order of insertion. For Each segment In Split(Mid$(text, istart), ",") If Len(segment) Then dict(segment) = Empty Next ' Return the address number and the segments by joining the keys RemoveDuplicates1 = Mid$(text, 1, istart - 1) & Join(dict.keys(), ",") End Function

第二个解决scheme是提取所有的段，然后search它们中的每一个是否存在于先前的位置：

 Function RemoveDuplicates2(text As String) As String Dim c&, segments$, segment$, length&, ifirst&, istart&, iend& ' Get the position just after the address number For ifirst = 1 To Len(text) c = Asc(Mid$(text, ifirst, 1)) If (c < 48 Or c > 57) And c <> 32 Then Exit For ' if not [0-9 ] Next ' Get the segments without the address number and add a leading/trailing comma segments = "," & Mid$(text, ifirst) & "," istart = 1 ' iterate each segment Do While istart < Len(segments) ' Get the next segment position iend = InStr(istart + 1, segments, ",") - 1 And &HFFFFFF If iend - istart Then ' Get the segment segment = Mid$(segments, istart, iend - istart + 2) ' Rewrite the segment if not present at a previous position If InStr(1, segments, segment, vbTextCompare) = istart Then Mid$(segments, length + 1) = segment length = length + Len(segment) - 1 End If End If istart = iend + 1 Loop ' Return the address number and the segments RemoveDuplicates2 = Mid$(text, 1, ifirst - 1) & Mid$(segments, 2, length - 1) End Function

第三种解决scheme是使用正则expression式来删除所有重复的段：

 Function RemoveDuplicates3(ByVal text As String) As String Static re As Object If re Is Nothing Then Set re = CreateObject("VBScript.RegExp") re.Global = True re.IgnoreCase = True ' Match any duplicated segment separated by a comma. ' The first segment is compared without the first digits. re.Pattern = "((^\d* *|,)([^,]+)(?=,).*),\3?(?=,|$)" End If ' Remove each matching segment Do While re.test(text) text = re.Replace(text, "$1") Loop RemoveDuplicates3 = text End Function

这些是10000次迭代的执行时间（越低越好）：

 input text : "123 abc,,1 abc,abc 2,ABC,abc,a,c" output text : "123 abc,1 abc,abc 2,a,c" RemoveDuplicates1 (dictionary) : 718 ms RemoveDuplicates2 (text search) : 219 ms RemoveDuplicates3 (regex) : 1469 ms

Excel – string删除重复

在stringvba excel中匹配date模式

这是RegEx是否与Excel公式中的任何单元格引用匹配？

从具有模式的Excel公式提取列名称

使用JQuery / Node将Excel工作表上传到外部API的过程是什么？

VBA 6：正则expression式不识别复杂的string

在Excel文件中find与单独文件中的文本匹配的文本，然后将填充颜色应用于行

使Excel公式将string视为正则expression式

使用正则expression式创build分隔文本

记事本++search并用制表符分隔文件replace

使用fputcsv（）将数据正确添加到csv文件