在VBA中处理大的分隔文本文件

使用VBA,我需要将当前处于分隔文本文件(数百列,数以万计行)中的数据“转移”为规范化forms(四列数百万行)。 也就是说,生成的表格将包含对于每个单元格的列:

  • 识别原始表格/文件;
  • 识别原表中的单元格的行;
  • 识别原表中的单元格列;
  • 包含该单元格的值。

对于如何有效完成这一任务,我一般会感激不尽。

到目前为止,我已经考虑使用ADODB构build一个SELECT INTO ... UNION ...查询来构build输出表,但是默认的文本文件提供者可悲的限制在255列(有没有哪个不是?)。

SébastienLorion构build了一个非常棒的Fast CSV Reader ,我很喜欢它,但是我不知道如何在VBA中使用它 – 感激不尽的想法(我不认为它已经被编译为导​​出COM接口,我没有工具来重新编译它)。 对于这个问题,微软也提供了一个TextFieldParser类,但我不知道是否/如何从VBA中使用。

另一种方法可能是让Excel> = 2007打开源文件,然后从那里构build输出表,但是直觉上“感觉”好像会浪费大量的开销。

编译但未经testing

 Sub UnpivotFile(sPath As String) Const DELIM As String = "," Const QUOTE As String = """" Dim FSO As New FileSystemObject Dim arrHeader Dim arrContent Dim lb As Integer, ub As Integer Dim x As Integer Dim inData As Boolean Dim l As String, fName As String Dim fIn As Scripting.TextStream Dim fOut As Scripting.TextStream Dim tmp As String Dim lineNum As Long fName = FSO.GetFileName(sPath) Set fIn = FSO.OpenTextFile(sPath, ForReading) Set fOut = FSO.OpenTextFile(sPath & "_out", ForWriting) lineNum = 0 Do While Not fIn.AtEndOfStream l = fIn.ReadLine lineNum = lineNum + 1 arrContent = ParseLineToArray(l, DELIM, QUOTE) If Not inData Then arrHeader = arrContent lb = LBound(arrHeader) ub = UBound(arrHeader) inData = True Else For x = lb To ub fOut.WriteLine Join(Array(fName, lineNum, _ QID(arrHeader(x), DELIM, QUOTE), _ QID(arrContent(x), DELIM, QUOTE)), DELIM) Next x End If Loop fIn.Close fOut.Close End Sub 'quote if delimiter found Function QID(s, d As String, q As String) QID = IIf(InStr(s, d) > -1, q & s & q, s) End Function 'Split a string into an array based on a Delimiter and a Text Identifier Private Function ParseLineToArray(sInput As String, m_Delim As String, _ m_TextIdentifier As String) As Variant 'Dim vArr As Variant Dim sArr() As String Dim bInText As Boolean Dim i As Long, n As Long Dim sTemp As String, tmp As String If sInput = "" Or InStr(1, sInput, m_Delim) = 0 Then 'zero length string, or delimiter not present 'dump all input into single-element array (minus Text Identifier) ReDim sArr(0) sArr(0) = Replace(sInput, m_TextIdentifier, "") ParseLineToArray = sArr() Else If InStr(1, sInput, m_TextIdentifier) = 0 Then 'no text identifier so just split and return sArr() = Split(sInput, m_Delim) ParseLineToArray = sArr() Else 'found the text identifier, so do it the long way bInText = False sTemp = "" n = 0 For i = 1 To Len(sInput) tmp = Mid(sInput, i, 1) If tmp = m_TextIdentifier Then 'just toggle the flag - don't add to string bInText = Not bInText Else If tmp = m_Delim Then If Not bInText Then 'delimiter not within quoted text, so add next array member ReDim Preserve sArr(n) sArr(n) = sTemp sTemp = "" n = n + 1 Else sTemp = sTemp & tmp End If Else sTemp = sTemp & tmp End If 'character is a delimiter End If 'character is a quote marker Next i ReDim Preserve sArr(n) sArr(n) = sTemp ParseLineToArray = sArr() End If 'has any quoted text End If 'parseable End Function 

这应该是足够快的(我的机器上的18MB文件需要8秒,但我只复制数据,我不重构 – 如果你不做计算,但只重新sorting的东西,你应该得到相同的性能)。 即使行数/列数不适合电子表格,它也可以工作。

TODO :它有点长,但你应该能够(a)复制粘贴它(b)改变文件名和(c)修改操作数据函数以适合你的需要。 其余的代码是一堆可重用的实用程序function,你不需要改变。

我不确定使用VBA可以快得多 – 如果你需要更快的速度,你应该考虑使用另一种语言。 通常,Java或C#中的相同代码会更短,因为它们已经有标准库来读取/写入文件等,而且速度也会更快。

 Option Explicit Public Sub doIt() Dim sourceFile As String Dim destinationFile As String Dim data As Variant Dim result As Variant sourceFile = "xxxxxxx" destinationFile = "xxxxxxx" data = getDataFromFile(sourceFile, ",") If Not isArrayEmpty(data) Then result = manipulateData(data) writeToCsv result, destinationFile, "," Else MsgBox ("Empty file") End If End Sub Function manipulateData(sourceData As Variant) As Variant Dim result As Variant Dim i As Long Dim j As Long Dim k As Long Dim m As Long 'redim the result array to the right size - here I only copy so same size as source ReDim result(1 To UBound(sourceData, 1), 1 To UBound(sourceData, 2)) As Variant For i = LBound(sourceData, 1) To UBound(sourceData, 1) For j = LBound(sourceData, 2) To UBound(sourceData, 2) k = i 'k to be defined - here I only copy data m = j 'm to be defined - here I only copy data result(k, m) = sourceData(i, j) Next j Next i manipulateData = result End Function Public Sub writeToCsv(parData As Variant, parFileName As String, parDelimiter As String) If getArrayNumberOfDimensions(parData) <> 2 Then Exit Sub Dim i As Long Dim j As Long Dim fileNum As Long Dim locLine As String Dim locCsvString As String fileNum = FreeFile If Dir(parFileName) <> "" Then Kill (parFileName) Open parFileName For Binary Lock Read Write As #fileNum For i = LBound(parData, 1) To UBound(parData, 1) locLine = "" For j = LBound(parData, 2) To UBound(parData, 2) If IsError(parData(i, j)) Then locLine = locLine & "#N/A" & parDelimiter Else locLine = locLine & parData(i, j) & parDelimiter End If Next j locLine = Left(locLine, Len(locLine) - 1) If i <> UBound(parData, 1) Then locLine = locLine & vbCrLf Put #fileNum, , locLine Next i error_handler: Close #fileNum End Sub Public Function isArrayEmpty(parArray As Variant) As Boolean 'Returns false if not an array or dynamic array that has not been initialised (ReDim) or has been erased (Erase) If IsArray(parArray) = False Then isArrayEmpty = True On Error Resume Next If UBound(parArray) < LBound(parArray) Then isArrayEmpty = True: Exit Function Else: isArrayEmpty = False End Function Public Function getArrayNumberOfDimensions(parArray As Variant) As Long 'Returns the number of dimension of an array - 0 for an empty array. Dim i As Long Dim errorCheck As Long If isArrayEmpty(parArray) Then Exit Function 'returns 0 On Error GoTo FinalDimension 'Visual Basic for Applications arrays can have up to 60000 dimensions For i = 1 To 60001 errorCheck = LBound(parArray, i) Next i 'Not supposed to happen getArrayNumberOfDimensions = 0 Exit Function FinalDimension: getArrayNumberOfDimensions = i - 1 End Function Private Function getDataFromFile(parFileName As String, parDelimiter As String, Optional parExcludeCharacter As String = "") As Variant 'parFileName is supposed to be a delimited file (csv...) 'parDelimiter is the delimiter, "," for example in a comma delimited file 'Returns an empty array if file is empty or can't be opened 'number of columns based on the line with the largest number of columns, not on the first line 'parExcludeCharacter: sometimes csv files have quotes around strings: "XXX" - if parExcludeCharacter = """" then removes the quotes Dim locLinesList() As Variant Dim locData As Variant Dim i As Long Dim j As Long Dim locNumRows As Long Dim locNumCols As Long Dim fso As Variant Dim ts As Variant Const REDIM_STEP = 10000 Set fso = CreateObject("Scripting.FileSystemObject") On Error GoTo error_open_file Set ts = fso.OpenTextFile(parFileName) On Error GoTo unhandled_error 'Counts the number of lines and the largest number of columns ReDim locLinesList(1 To 1) As Variant i = 0 Do While Not ts.AtEndOfStream If i Mod REDIM_STEP = 0 Then ReDim Preserve locLinesList(1 To UBound(locLinesList, 1) + REDIM_STEP) As Variant End If locLinesList(i + 1) = Split(ts.ReadLine, parDelimiter) j = UBound(locLinesList(i + 1), 1) 'number of columns If locNumCols < j Then locNumCols = j If j = 13 Then j = j End If i = i + 1 Loop ts.Close locNumRows = i If locNumRows = 0 Then Exit Function 'Empty file ReDim locData(1 To locNumRows, 1 To locNumCols + 1) As Variant 'Copies the file into an array If parExcludeCharacter <> "" Then For i = 1 To locNumRows For j = 0 To UBound(locLinesList(i), 1) If Left(locLinesList(i)(j), 1) = parExcludeCharacter Then If Right(locLinesList(i)(j), 1) = parExcludeCharacter Then locLinesList(i)(j) = Mid(locLinesList(i)(j), 2, Len(locLinesList(i)(j)) - 2) 'If locTempArray = "", Mid returns "" Else locLinesList(i)(j) = Right(locLinesList(i)(j), Len(locLinesList(i)(j)) - 1) End If ElseIf Right(locLinesList(i)(j), 1) = parExcludeCharacter Then locLinesList(i)(j) = Left(locLinesList(i)(j), Len(locLinesList(i)(j)) - 1) End If locData(i, j + 1) = locLinesList(i)(j) Next j Next i Else For i = 1 To locNumRows For j = 0 To UBound(locLinesList(i), 1) locData(i, j + 1) = locLinesList(i)(j) Next j Next i End If getDataFromFile = locData Exit Function error_open_file: 'returns empty variant unhandled_error: 'returns empty variant End Function 

我决定围绕VB.NET中的TextFieldParser构build一个微型的COM感知包装器。 不理想,但目前我可以提出最好的。

我以前亲自使用CSV阅读器来parsing巨大的CSV文件(高达1 GB)。 性能和简单是令人难以置信的。 我强烈build议你使用它。

既然你说你使用VB.NET,我build议你build立一个简单的控制台应用程序,引用CSV阅读器。 这个控制台应用程序将把一个csv文件的path作为一个命令行参数“unpivot”。 然后,从VBA中,您可以使用VBA.Shell来运行您的控制台应用程序,并为其提供CSV文件的path作为参数。