聚集在Excel中

你有没有人知道是否有可能在不使用VBA的情况下在Excel中实现聚类algorithm，如k-means，dbscan？

如果可能的话，你能给我一点帮助吗？（一个小例子会帮助我）

非常感谢

克劳德

你为什么 ？ Excel并不意味着这一点。

聚类algorithm经常从使用索引结构中受益匪浅，以智能的方式组织数据在内存中。例如R * – 树，kd树等

这是一个巨大的差异。没有索引的DBSCAN的复杂度是O(n^2) ，而索引结构只有O(n log n)复杂度。

你可能可以做到这一点在VBA（这不是真的Excel了，但Visual Basic），我想。但是使用R * – 树的现有代码更有意义。

检查这个

k-Meansalgorithm

k-Meansalgorithm是以下步骤的迭代，直到达到稳定性，即个体logging的群集分配不再改变。

确定质心的坐标。（最初质心是随机的，唯一的点，此后，群的成员的平均坐标被分配给质心）。确定每条logging到每个质心的欧氏距离。分组logging与他们最接近的质心。代码

首先，我创build了一个私有types来表示我们的logging和质心，并创build了两个类级别的数组来保存它们以及一个类级variables来保存正在执行分析的表。

 Private Type Records Dimension() As Double Distance() As Double Cluster As Integer End Type Dim Table As Range Dim Record() As Records Dim Centroid() As Records User Interface

下面的方法， Run()可以作为一个起点，并挂钩到button等。

 Sub Run() 'Run k-Means If Not kMeansSelection Then Call MsgBox("Error: " & Err.Description, vbExclamation, "kMeans Error") End If End Sub

接下来，创build一个方法，提示用户select要分析的表格，并input数据应分组到的所需数量的簇。该函数不需要任何参数，并返回一个布尔值，指示是否遇到任何错误。

 Function kMeansSelection() As Boolean 'Get user table selection On Error Resume Next Set Table = Application.InputBox(Prompt:= _ "Please select the range to analyse.", _ title:="Specify Range", Type:=8) If Table Is Nothing Then Exit Function 'Cancelled 'Check table dimensions If Table.Rows.Count < 4 Or Table.columns.Count < 2 Then Err.Raise Number:=vbObjectError + 1000, Source:="k-Means Cluster Analysis", Description:="Table has insufficent rows or columns." End If 'Get number of clusters Dim numClusters As Integer numClusters = Application.InputBox("Specify Number of Clusters", "k Means Cluster Analysis", Type:=1) If Not numClusters > 0 Or numClusters = False Then Exit Function 'Cancelled End If If Err.Number = 0 Then If kMeans(Table, numClusters) Then outputClusters End If End If kMeansSelection_Error: kMeansSelection = (Err.Number = 0) End Function

如果已经select了一个表并且适当地定义了多个群集，则使用表和群集数作为参数调用kMeans（Table，numClusters）方法。

如果kMeans (Table, numClusters)方法执行时没有错误，则调用最后一个方法outputClusters（），它将在活动工作簿中创build一个新工作表并输出分析结果。

分配logging到集群

这就是对logging进行实际分析的地方，也是进行分组分配的地方。首先，该方法是用函数kMeans(Table As Range, Clusters As Integer) As Boolean 。该函数带有两个参数，表格被分析为一个Excel范围对象和集群，一个整数表示要创build的集群数量。

 Function kMeans(Table As Range, Clusters As Integer) As Boolean 'Table - Range of data to group. Records (Rows) are grouped according to attributes/dimensions(columns) 'Clusters - Number of clusters to reduce records into. On Error Resume Next 'Script Performance Variables Dim PassCounter As Integer 'Initialize Data Arrays ReDim Record(2 To Table.Rows.Count) Dim r As Integer 'record Dim d As Integer 'dimension index Dim d2 As Integer 'dimension index Dim c As Integer 'centroid index Dim c2 As Integer 'centroid index Dim di As Integer 'distance Dim x As Double 'Variable Distance Placeholder Dim y As Double 'Variable Distance Placeholder

On error Resume Next用于将错误传递给调用方法，并声明一些数组索引variables。 x和y被声明为以后在math运算中使用。

第一步是将Record()数组的大小设置为表中的行数。（2到Table.Rows.Count）被使用，因为它假设（和要求）表的第一行包含列标题。

然后，对于Record()数组中的每个logging，Recordtypes的Dimension()数组的大小为列数（再次假定第一列包含行标题）， Distance()数组的大小为集群。内部循环然后将行中列的值分配给Dimension()数组。

对于r = LBound（logging）到UBound（logging）'初始化维值数组ReDimlogging（r）.Dimension（2到Table.columns.Count）'初始化距离数组ReDimlogging（r）.Distance（1到簇） d = LBound（Record（r）.Dimension）到UBound（Record（r）.Dimension）Record（r）.Dimension（d）= Table.Rows（r）.Cells（d）.Value Next d Next r

以同样的方式，初始质心必须被初始化。我已经将前几个logging的坐标分配为初始质心坐标，以检查每个新质心具有唯一坐标。如果不是，则脚本简单地移动到下一个logging，直到为质心find唯一的一组坐标。

欧几里得距离此处用于计算质心唯一性的方法几乎与以后用于计算单个logging与质心之间距离的方法完全相同。在这里通过测量它们的尺寸与0的距离来检查质心的唯一性。

  'Initialize Initial Centroid Arrays ReDim Centroid(1 To Clusters) Dim uniqueCentroid As Boolean For c = LBound(Centroid) To UBound(Centroid) 'Initialize Centroid Dimension Depth ReDim Centroid(c).Dimension(2 To Table.columns.Count) 'Initialize record index to next record r = LBound(Record) + c - 2 Do ' Loop to ensure new centroid is unique r = r + 1 'Increment record index throughout loop to find unique record to use as a centroid 'Assign record dimensions to centroid For d = LBound(Centroid(c).Dimension) To UBound(Centroid(c).Dimension) Centroid(c).Dimension(d) = Record(r).Dimension(d) Next d uniqueCentroid = True For c2 = LBound(Centroid) To c - 1 'Loop Through Record Dimensions and check if all are the same x = 0 y = 0 For d2 = LBound(Centroid(c).Dimension) To _ UBound(Centroid(c).Dimension) x = x + Centroid(c).Dimension(d2) ^ 2 y = y + Centroid(c2).Dimension(d2) ^ 2 Next d2 uniqueCentroid = Not Sqr(x) = Sqr(y) If Not uniqueCentroid Then Exit For Next c2 Loop Until uniqueCentroid Next c The next step is to calculate each records distance from each centroid and assign the record to the nearest cluster.

Dim lowestDistance As Double – LowestDistancevariables保存迄今测量的logging和质心之间测量的最短距离，以便对后续测量进行评估。 Dim lastCluster As Integer – lastClustervariables包含logging在创build任何新分配之前分配的簇，并用于评估是否已达到稳定性。 Dim ClustersStable As Boolean – 重复聚类分配和质心重新计算阶段，直到ClustersStable = true.

Dim lowestDistance As Double Dim lastCluster As Integer Dim ClustersStable As Boolean

做'群集不稳定'

 PassCounter = PassCounter + 1 ClustersStable = True 'Until Proved otherwise 'Loop Through Records

对于r = LBound（logging）到UBound（logging）

  lastCluster = Record(r).Cluster lowestDistance = 0 'Reset lowest distance 'Loop through record distances to centroids For c = LBound(Centroid) To UBound(Centroid) '====================================================== ' Calculate Euclidean Distance '====================================================== ' d(p,q) = Sqr((q1 - p1)^2 + (q2 - p2)^2 + (q3 - p3)^2) '------------------------------------------------------ ' X = (q1 - p1)^2 + (q2 - p2)^2 + (q3 - p3)^2 ' d(p,q) = X x = 0 y = 0 'Loop Through Record Dimensions For d = LBound(Record(r).Dimension) To _ UBound(Record(r).Dimension) y = Record(r).Dimension(d) - Centroid(c).Dimension(d) y = y ^ 2 x = x + y Next d x = Sqr(x) 'Get square root 'If distance to centroid is lowest (or first pass) assign record to centroid cluster. If c = LBound(Centroid) Or x < lowestDistance Then lowestDistance = x 'Assign distance to centroid to record Record(r).Distance(c) = lowestDistance 'Assign record to centroid Record(r).Cluster = c End If Next c 'Only change if true If ClustersStable Then ClustersStable = Record(r).Cluster = lastCluster Next r

一旦每个logging被分配到一个簇，簇的质心被重新定位到簇的平均坐标。在质心移动之后，每个logging最接近的质心被重新评估并且迭代该过程直到达到稳定（即，聚类指派不再改变）。

 'Move Centroids to calculated cluster average For c = LBound(Centroid) To UBound(Centroid) 'For every cluster 'Loop through cluster dimensions For d = LBound(Centroid(c).Dimension) To _ UBound(Centroid(c).Dimension) Centroid(c).Cluster = 0 'Reset nunber of records in cluster Centroid(c).Dimension(d) = 0 'Reset centroid dimensions 'Loop Through Records For r = LBound(Record) To UBound(Record) 'If Record is in Cluster then If Record(r).Cluster = c Then 'Use to calculate avg dimension for records in cluster 'Add to number of records in cluster Centroid(c).Cluster = Centroid(c).Cluster + 1 'Add record dimension to cluster dimension for later division Centroid(c).Dimension(d) = Centroid(c).Dimension(d) + _ Record(r).Dimension(d) End If Next r 'Assign Average Dimension Distance Centroid(c).Dimension(d) = Centroid(c).Dimension(d) / _ Centroid(c).Cluster Next d Next c Loop Until ClustersStable kMeans = (Err.Number = 0) End Function

显示结果

outputClusters()方法在两个表中输出结果。第一个表包含每个logging名称和分配的簇号，第二个表包含质心坐标。

 Function outputClusters() As Boolean Dim c As Integer 'Centroid Index Dim r As Integer 'Row Index Dim d As Integer 'Dimension Index Dim oSheet As Worksheet On Error Resume Next Set oSheet = addWorksheet("Cluster Analysis", ActiveWorkbook) 'Loop Through Records Dim rowNumber As Integer rowNumber = 1 'Output Headings With oSheet.Rows(rowNumber) With .Cells(1) .Value = "Row Title" .Font.Bold = True .HorizontalAlignment = xlCenter End With With .Cells(2) .Value = "Centroid" .Font.Bold = True .HorizontalAlignment = xlCenter End With End With 'Print by Row rowNumber = rowNumber + 1 'Blank Row For r = LBound(Record) To UBound(Record) oSheet.Rows(rowNumber).Cells(1).Value = Table.Rows(r).Cells(1).Value oSheet.Rows(rowNumber).Cells(2).Value = Record(r).Cluster rowNumber = rowNumber + 1 Next r 'Print Centroids - Headings rowNumber = rowNumber + 1 For d = LBound(Centroid(LBound(Centroid)).Dimension) To UBound(Centroid(LBound(Centroid)).Dimension) With oSheet.Rows(rowNumber).Cells(d) .Value = Table.Rows(1).Cells(d).Value .Font.Bold = True .HorizontalAlignment = xlCenter End With Next d 'Print Centroids rowNumber = rowNumber + 1 For c = LBound(Centroid) To UBound(Centroid) With oSheet.Rows(rowNumber).Cells(1) .Value = "Centroid " & c .Font.Bold = True End With 'Loop through cluster dimensions For d = LBound(Centroid(c).Dimension) To UBound(Centroid(c).Dimension) oSheet.Rows(rowNumber).Cells(d).Value = Centroid(c).Dimension(d) Next d rowNumber = rowNumber + 1 Next c oSheet.columns.AutoFit '//AutoFit columns to contents outputClusters_Error: outputClusters = (Err.Number = 0) End Function

这种types的输出不太可能用得上，但它可以用来演示如何在自己的解决scheme中访问logging集群分配或集群logging。

outputClusters()函数使用另一个自定义方法：addWorksheet（），它将工作表添加到指定名称的指定/活动工作簿。如果已经存在具有相同名称的工作表，则outputClusters（）函数将添加/递增附加到工作表名称的数字。 WorksheetExists()函数也包含在以下内容中：

 Function addWorksheet(Name As String, Optional Workbook As Workbook) As Worksheet On Error Resume Next '// If a Workbook wasn't specified, use the active workbook If Workbook Is Nothing Then Set Workbook = ActiveWorkbook Dim Num As Integer '// If a worksheet(s) exist with the same name, add/increment a number after the name While WorksheetExists(Name, Workbook) Num = Num + 1 If InStr(Name, " (") > 0 Then Name = Left(Name, InStr(Name, " (")) Name = Name & " (" & Num & ")" Wend '//Add a sheet to the workbook Set addWorksheet = Workbook.Worksheets.Add '//Name the sheet addWorksheet.Name = Name End Function Public Function WorksheetExists(WorkSheetName As String, Workbook As Workbook) As Boolean On Error Resume Next WorksheetExists = (Workbook.Sheets(WorkSheetName).Name <> "") On Error GoTo 0 End Function