删除重复项,保留最后一项 – 优化

我正在处理一个macros,它将通过一个电子表格并删除重复的条目(行)根据两列(列Q和D)分开提供的两个标准。

这是我的。 我testing了一个小的数据集,它很

Sub RemoveDupesKeepLast() dim i As Integer dim criteria1, criteria2 As String Application.ScreenUpdating = False Application.Calculation = xlCalculationManual Application.EnableEvents = False 'start at bottom of sheet, go up For i = ActiveSheet.UsedRange.Rows.Count to 2 Step -1 'if there is no entry, go to next row If Cells(i, "Q").Value = "" Then GoTo gogo: End If 'set criteria that we will filter for criteria1 = Cells(i, "D").Value criteria2 = Cells(i, "Q").Value 'filter for criteria2, then criteria1 to get duplicates ActiveSheet.Range("A":"CI").AutoFilter field:=17, Criteria1:=criteria2, Operator:=xlFilterValues ActiveSheet.Range("A":"CI").AutoFilter field:=4, Criteria1:=criteria1, Operator:=xlFilterValues 'if there are duplicates, keep deleting rows until only bottom-most entry is left behind Do While Range("Q2", Cells(Rows.Count, "Q").End(xlUp)).Cells.SpecialCells(xlCellTypeVisible).Count > 1 ActiveSheet.AutoFilter.Range.Offset(1).SpecialCells(xlCellTypeVisible).Cells(1,17).EntireRow.Delete Loop 'reset autofilter If ActiveSheet.FilterMode Then Cells.AutoFilter End If gogo: Next i Application.EnableEvents = True Application.Calculation = xlCalculationAutomatic Application.ScreenUpdating = True End Sub 

有没有不同的方法可以解决这个问题,以加快速度? 就目前而言,我基本上是检查每一行,直到达到顶端。 床单实际上从30,000行到最大的任何地方。 在我看来,应该有一个更快,更干净的方式来实现我正在做的事情,但我似乎无法想到一个。

40.3秒100,00行×87列。

如果你的数据集是以30K行开始的,而且只是变得更大,你应该尽可能地在内存中处理¹。 我已经调整了此解决scheme中使用的方法,以更加紧密地遵循您的要求。

以下批量将所有值加载到variables数组中,并从结果中构build一个Scripting.Dictionary对象。 使用向字典添encryption钥的“覆盖”方法,只保留最后一个。

在执行sorting规则时,这些值将返回到重新定义的变体数组,并一起恢复到工作表。

模块1(代码)

 Option Explicit Sub removeDupesKeepLast() Dim d As Long, dDQs As Object, ky As Variant Dim r As Long, c As Long, vVALs As Variant, vTMP As Variant 'appTGGL bTGGL:=False 'uncomment this when you have finished debugging Set dDQs = CreateObject("Scripting.Dictionary") dDQs.comparemode = vbTextCompare 'step 1 - bulk load the values With Worksheets("Sheet1") 'you should know what worksheet you are on With .Cells(1, 1).CurrentRegion 'block of data radiating out from A1 With .Resize(.Rows.Count - 1, .Columns.Count).Offset(1, 0) 'step off the header row vVALs = .Value 'use .Value2 if you do not have dates in unformatted cells End With End With End With 'step 2 - build the dictionary ReDim vTMP(UBound(vVALs, 2) - 1) For r = LBound(vVALs, 1) To UBound(vVALs, 1) For c = LBound(vVALs, 2) To UBound(vVALs, 2) vTMP(c - 1) = vVALs(r, c) Next c dDQs.Item(vVALs(r, 4) & ChrW(8203) & vVALs(r, 17)) = vTMP Next r 'step 3 - put the de-duplicated values back into the array r = 0 ReDim vVALs(1 To dDQs.Count, LBound(vVALs, 2) To UBound(vVALs, 2)) For Each ky In dDQs r = r + 1 vTMP = dDQs.Item(ky) For c = LBound(vTMP) To UBound(vTMP) vVALs(r, c + 1) = vTMP(c) Next c Next ky 'step 4 - clear the destination; put the de-duplicated values back into the worksheet and reset .UsedRange With Worksheets("Sheet1") 'you should know what worksheet you are on With .Cells(1, 1).CurrentRegion 'block of data radiating out from A1 With .Resize(.Rows.Count - 1, .Columns.Count).Offset(1, 0) 'step off the header row .ClearContents 'retain formatting if it is there .Cells(1, 1).Resize(UBound(vVALs, 1), UBound(vVALs, 2)) = vVALs End With End With .UsedRange 'assert the UsedRange property (refreshes it) End With dDQs.RemoveAll: Set dDQs = Nothing appTGGL End Sub Public Sub appTGGL(Optional bTGGL As Boolean = True) With Application .ScreenUpdating = bTGGL .EnableEvents = bTGGL .DisplayAlerts = bTGGL .AutoRecover.Enabled = bTGGL 'no interruptions with an auto-save .Calculation = IIf(bTGGL, xlCalculationAutomatic, xlCalculationManual) .CutCopyMode = False .StatusBar = vbNullString End With Debug.Print Timer End Sub 

我的示例工作簿占用了100K行×87列,重复率为〜24%,并在大约40秒内处理了所有重复项(保留最后一项)。 上面回写到Sheet1; 我的testing运行写回到Sheet2为了保留原始数据。 如果您select回写到其他工作表,请确保有一些从A1开始的值,以便可以正确识别Range.CurrentRegion属性 。 testing机器是运行32位Excel 2010的旧笔记本电脑; 你自己的结果可能会有所不同。


¹ 请参阅突出显示复制和按颜色过滤选项 ,以处理Excel中处理大型数据集的提示。

这个过程删除所有由D列和Q列标识的重复行。在重复列表中,它将使行最靠近表格的底部。 基本上,在右侧创build索引列来sorting并移动底部的所有重复行,以便可以在一次调用中删除它们。 请注意,它不会更改单元格公式或格式(如果有)。

 Sub DeleteDuplicatedRows() Dim rgTable As Range, rgIndex As Range, dataColD(), dataColQ() Set rgTable = ActiveSheet.UsedRange ' load each column representing the identifier in an array dataColD = rgTable.Columns("D").value ' load values from column D dataColQ = rgTable.Columns("Q").value ' load values from column Q ' get each unique row number with a dictionary Dim dict As New VBA.collection, indexes(), r&, rr On Error Resume Next For r = UBound(dataColD) To 1 Step -1 dict.Add r, dataColD(r, 1) & vbNullChar & dataColQ(r, 1) Next On Error GoTo 0 ' index all the unique rows in an array ReDim indexes(1 To UBound(dataColD), 1 To 1) For Each rr In dict: indexes(rr, 1) = rr: Next ' insert the indexes in the last column on the right Set rgIndex = rgTable.Columns(rgTable.Columns.count + 1) rgIndex.value = indexes ' sort the rows on the indexes, duplicates will move at the end Union(rgTable, rgIndex).Sort key1:=rgIndex, Orientation:=xlTopToBottom, Header:=xlYes ' delete the index column on the right and the empty rows at the bottom rgIndex.EntireColumn.Delete rgTable.Resize(UBound(dataColD) - dict.count + 1).offset(dict.count).EntireRow.Delete End Sub