Multi-criteria statistics (average, standard deviation, z-scores) for large datasets in Excel / VBA

I am calculating statistics on a large dataset in Excel and am running into problems because of the dataset's size.

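VBA looks like the way to go, because copying AVERAGEIF and STDEV array formulas down a dataset of this size leads to very long calculation times. I would appreciate any solution or code that could be used here.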

Goals:

  • Calculate statistics (avg, std dev, z-scores) grouped by 2 identifiers (e.g. the average of all heights on 01/01/10)
  • Be able to handle large datasets (100k+ data points)

Sample data:

Date     | User ID | Indicator | Data Point
01/01/10 | 1       | Height    | 150
01/01/10 | 1       | Weight    | 123
01/01/10 | 2       | Height    | 146
01/01/10 | 2       | Weight    | 123
01/02/10 | 1       | Height    | 156
01/02/10 | 1       | Weight    | 160
01/02/10 | 2       | Height    | 103
01/02/10 | 2       | Weight    | 109

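Edit:

The expected output would ideally be a z-score for each data point, written to a new column. For example, the first z-score, taken over all heights on 01/01/10, would be standardized as: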

 (150 - avg) / stdev 
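
As a worked check on the sample data, assuming the sample standard deviation (the n - 1 form that Excel's STDEV uses), the two heights on 01/01/10 are 150 and 146:

```latex
\bar{x} = \frac{150 + 146}{2} = 148, \qquad
s = \sqrt{\frac{(150 - 148)^2 + (146 - 148)^2}{2 - 1}} = \sqrt{8} \approx 2.828, \qquad
z = \frac{150 - 148}{2.828} \approx 0.707
```

Any group that contains exactly two distinct values gives z-scores of about +0.707 and -0.707 under this definition, so the sample table yields the same magnitude for every row.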

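I'm not sure about the z-scores, because I get the same (+/-) value for all data points. But I'm sure you will be able to modify the code to get what you want. The data should be in a sheet named "Data", which has a command button named Go that runs the code. Beware! The code clears everything from column E onward.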

    Dim lLastRowDB As Long
    Dim dU1 As Object, cU1 As Variant, iU1 As Long, lrU As Long
    Dim dU2 As Object, cU2 As Variant, iU2 As Long
    Dim MyArray() As Variant
    Dim lAV As Double
    Dim lSD As Double
    'Counters are Long rather than Integer: k runs over rows and Integer overflows past 32,767
    Dim i As Long
    Dim j As Long
    Dim k As Long

    Private Sub Go_Click()
        Worksheets("Data").Columns("E:EZ").Delete Shift:=xlToLeft 'Clear previous results
        lLastRowDB = Worksheets("Data").Cells(2, 1).End(xlDown).Row 'Assuming your data starts in A2

        'Indexes from Column1 (Dates)
        Set dU1 = CreateObject("Scripting.Dictionary")
        lrU = Cells(Rows.Count, 1).End(xlUp).Row
        cU1 = Range("A2:A" & lrU)
        For iU1 = 1 To UBound(cU1, 1)
            dU1(cU1(iU1, 1)) = 1
        Next iU1

        'Indexes from Column3 (Indicators)
        Set dU2 = CreateObject("Scripting.Dictionary")
        cU2 = Range("C2:C" & lrU)
        j = 0
        For iU2 = 1 To UBound(cU2, 1)
            dU2(cU2(iU2, 1)) = 1
        Next iU2

        'If you want to see the values in the dictionaries, uncomment the following six lines
        'For i = 0 To dU1.Count - 1
        '    MsgBox "dU1 has " & dU1.Count & " elements and key#" & i & " is " & dU1.Keys()(i)
        'Next
        'For i = 0 To dU2.Count - 1
        '    MsgBox "dU2 has " & dU2.Count & " elements and key#" & i & " is " & dU2.Keys()(i)
        'Next

        'The following code will look in the complete set of data for each index
        'This accounts for unsorted data, but is resource-consuming
        'If your data is sorted for sure, just loop over the desired rows
        For i = 0 To dU1.Count - 1 'for each Date
            For j = 0 To dU2.Count - 1 'for each Indicator
                ReDim MyArray(1 To 1) As Variant 'reset the array
                For k = 2 To lLastRowDB 'Scan all rows
                    If (Worksheets("Data").Cells(k, 1).Value = dU1.keys()(i)) Then
                        If (Worksheets("Data").Cells(k, 3).Value = dU2.keys()(j)) Then
                            MyArray(UBound(MyArray)) = Worksheets("Data").Cells(k, 4).Value 'add found value to array
                            ReDim Preserve MyArray(1 To UBound(MyArray) + 1) As Variant 'now array is 1 element longer
                        End If
                    End If
                Next
                'Now MyArray contains desired data.

                'Get average and SD
                lAV = Application.WorksheetFunction.Average(MyArray)
                lSD = Application.WorksheetFunction.StDev(MyArray)

                'Titles
                Worksheets("Data").Cells(1, 5) = "Average"
                Worksheets("Data").Cells(1, 6) = "SD"
                Worksheets("Data").Cells(1, 7) = "z-scores"

                For k = 2 To lLastRowDB
                    If (Worksheets("Data").Cells(k, 1).Value = dU1.keys()(i)) Then
                        If (Worksheets("Data").Cells(k, 3).Value = dU2.keys()(j)) Then
                            Worksheets("Data").Cells(k, 5) = lAV
                            Worksheets("Data").Cells(k, 6) = lSD
                            If lSD = 0 Then
                                Worksheets("Data").Cells(k, 7) = "SD is zero. Unable to calculate z-scores"
                            Else
                                Worksheets("Data").Cells(k, 7) = (Worksheets("Data").Cells(k, 4).Value - lAV) / lSD 'z-scores
                            End If
                        End If
                    End If
                Next
            Next
        Next
    End Sub
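
The code above rescans every row for each Date/Indicator pair and reads cells one at a time, which is likely to be slow at 100k+ rows. A possible alternative, offered only as a sketch, reads the range into an array once, accumulates per-group statistics in a Scripting.Dictionary, and writes all results back in a single assignment. It swaps in Welford's online algorithm for the running mean and variance, which is not part of the original answer. The procedure name ZScoresByGroup and the "|" key separator are hypothetical. It assumes the same layout as above (sheet "Data", headers in row 1, Date in column A, Indicator in column C, numeric data in column D) and the sample standard deviation, to match StDev.

```vba
Sub ZScoresByGroup()
    Dim ws As Worksheet
    Dim lastRow As Long, r As Long, n As Long
    Dim data As Variant, out() As Variant
    Dim stats As Object
    Dim key As String
    Dim acc As Variant          ' acc(0) = count, acc(1) = mean, acc(2) = sum of squared deviations
    Dim x As Double, delta As Double, sd As Double

    Set ws = Worksheets("Data")
    lastRow = ws.Cells(ws.Rows.Count, 1).End(xlUp).Row
    If lastRow < 2 Then Exit Sub            ' no data rows

    data = ws.Range("A2:D" & lastRow).Value ' one read instead of one per cell
    n = UBound(data, 1)
    Set stats = CreateObject("Scripting.Dictionary")

    ' Pass 1: update count, mean and sum of squared deviations per group (Welford)
    For r = 1 To n
        key = data(r, 1) & "|" & data(r, 3)
        If stats.Exists(key) Then
            acc = stats(key)
        Else
            acc = VBA.Array(0#, 0#, 0#)     ' VBA.Array is always zero-based
        End If
        x = data(r, 4)                      ' assumes column D is numeric
        acc(0) = acc(0) + 1
        delta = x - acc(1)
        acc(1) = acc(1) + delta / acc(0)
        acc(2) = acc(2) + delta * (x - acc(1))
        stats(key) = acc
    Next r

    ' Pass 2: look up each row's group and compute its z-score
    ReDim out(1 To n, 1 To 3)
    For r = 1 To n
        acc = stats(data(r, 1) & "|" & data(r, 3))
        out(r, 1) = acc(1)
        If acc(0) > 1 Then
            sd = Sqr(acc(2) / (acc(0) - 1)) ' sample standard deviation (n - 1)
            out(r, 2) = sd
            ' z-score is left blank when the SD is zero
            If sd > 0 Then out(r, 3) = (data(r, 4) - acc(1)) / sd
        End If
    Next r

    ' One write for the headers and one for all results
    ws.Range("E1:G1").Value = Array("Average", "SD", "z-score")
    ws.Range("E2").Resize(n, 3).Value = out
End Sub
```

Unlike Go_Click, this sketch overwrites columns E to G without deleting anything else, and it leaves the SD and z-score cells empty for groups with a single data point. Each row is visited twice regardless of how many groups exist, so run time should grow roughly linearly with the row count, though it has not been timed here on a 100k-row sheet.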