Cleaning a 1-million-row data set – .NET

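I am building a simple system to clean some raw transaction data. It uses a home-built dictionary to scan certain columns for keywords and categorize the rows.

The problem is that the program runs slowly. On a data set of one million rows it takes about 60 minutes.

Is there a way to make it run faster? Here is the skeleton of my program (written in .NET):

***Read the source file (Excel .xlsx) over an OleDB connection and fill it into a DataTable with a DataAdapter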

    Function ReadExcelToDatatable(filepath As String, sourceTblName As String, dataTblName As String) As DataTable
        ReadExcelToDatatable = New DataTable(dataTblName)
        Dim ext As String
        If Right(filepath, 4) = "xlsx" Then
            ext = "Xml"
        ElseIf Right(filepath, 4) = "xlsm" Then
            ext = "Macro"
        Else
            ext = ""
        End If
        Try
            Dim conn As New OleDb.OleDbConnection("Provider=Microsoft.ACE.OLEDB.12.0; Data Source=" & filepath & ";Extended Properties = ""Excel 12.0 " & ext & "; HDR=YES; IMEX=1""")
            Dim adapter As New OleDb.OleDbDataAdapter("SELECT * FROM [" & sourceTblName & "]", conn)
            adapter.Fill(ReadExcelToDatatable)
            adapter.Dispose()
            conn.Dispose()
        Catch ex As Exception
            Console.WriteLine(ex)
            Console.WriteLine("Cannot read " & dataTblName & " to data table.")
        End Try
    End Function

***For each dictionary entry, filter the data table with DataTable.Select(filter, sort) and apply the changes

    Sub DoFilterTables(rawTable As DataTable, dictTable As DataTable)
        For Each dictRow As DataRow In dictTable.Rows
            Try
                Dim rows As DataRow() = rawTable.Select(dictRow("IF COLUMN NAME 1") & " LIKE '%" & dictRow("KEYWORD 1") & "%'")
                For Each selectedRow As DataRow In rows
                    If IsDBNull(selectedRow(CStr(dictRow("THEN COLUMN NAME")))) Then
                        selectedRow(CStr(dictRow("THEN COLUMN NAME"))) = 1
                    Else
                        selectedRow(CStr(dictRow("THEN COLUMN NAME"))) = dictRow("ASSIGNS KEYWORD 3") + selectedRow(CStr(dictRow("THEN COLUMN NAME")))
                    End If
                    selectedRow.AcceptChanges()
                Next
            Catch ex As Exception
                Console.WriteLine(ex)
                Console.WriteLine("Failed to filter")
            End Try
        Next
    End Sub

***Save the result as a delimited text file

    Sub DataTable2CSV(ByVal table As DataTable, ByVal filename As String, _
                      ByVal sepChar As String)
        Dim writer As System.IO.StreamWriter
        Try
            writer = New System.IO.StreamWriter(filename)
            Dim str As String = ""
            Dim builder As New System.Text.StringBuilder
            For Each col As DataColumn In table.Columns
                str = str & col.ColumnName & sepChar
            Next
            str = str & vbCrLf
            writer.Write(str)
            Dim str2 As String = ""
            Dim ct As Long = 0
            For Each row As DataRow In table.Rows
                str2 = ""
                For Each col As DataColumn In table.Columns
                    Try
                        str2 = str2 & CStr(row(col.ColumnName)) & sepChar
                    Catch ex As Exception
                        str2 = str2 & sepChar
                    End Try
                Next
                str2 = str2 & vbCrLf
                writer.Write(str2)
            Next
        Finally
        End Try
        writer.Flush()
        writer.Close()
    End Sub

    End Module

Any input would be appreciated. Thanks!


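Edit

It turns out that 95% of the time is spent reading the Excel worksheet into the DataTable with OleDB and the DataAdapter.

Is OleDB -> DataAdapter the most efficient approach?

Would CSV -> DataTable be faster?

How does Interop compare in terms of performance?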

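I happen to have a 5 GB CSV file sitting on my computer, with more than 4 million rows, so I wrote a routine that reads only the first million rows, with no processing. On my machine this takes 7 seconds.

To process the file line by line, you can use something like the following snippet: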

    Dim counter As Integer
    Dim lStart, lEnd As Long
    lStart = Environment.TickCount
    Using r = System.IO.File.AppendText("C:\...\test.csv")
        For Each line As String In System.IO.File.ReadLines("C:\...\source.csv")
            r.WriteLine(line)
            counter += 1
            If counter = 1000000 Then
                Exit For
            End If
        Next
    End Using
    lEnd = Environment.TickCount
    MsgBox("done: " & (lEnd - lStart))
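
On the CSV -> DataTable question, the same line-by-line read can probably feed a DataTable directly. The following is only a rough sketch and has not been timed against the OleDB route. The names `CsvLoaderSketch` and `LoadCsvToDataTable` are made up for illustration. It assumes the first line is a header with unique column names, that no field contains the separator inside quotes, that no row has more fields than the header, and that every column can be loaded as a String.

```vbnet
Imports System.Data
Imports System.IO

Module CsvLoaderSketch

    ' Hypothetical helper (not from the original post): loads a delimited
    ' text file into a DataTable, using the first line as column names.
    ' Assumption: fields are never quoted and never contain sepChar.
    Function LoadCsvToDataTable(filePath As String, sepChar As Char,
                                Optional maxRows As Integer = Integer.MaxValue) As DataTable
        Dim table As New DataTable("csv")
        Dim headerRead As Boolean = False
        Dim loaded As Integer = 0

        ' Turn off notifications, index maintenance and constraints while loading.
        table.BeginLoadData()
        ' File.ReadLines streams the file instead of reading it all at once.
        For Each line As String In File.ReadLines(filePath)
            Dim fields As String() = line.Split(sepChar)
            If Not headerRead Then
                ' The first line defines the columns; everything is kept as String.
                For Each name As String In fields
                    table.Columns.Add(name, GetType(String))
                Next
                headerRead = True
            Else
                If loaded >= maxRows Then Exit For
                ' True = accept changes immediately, so rows end up Unchanged.
                table.LoadDataRow(fields, True)
                loaded += 1
            End If
        Next
        table.EndLoadData()

        Return table
    End Function

End Module
```

A call might look like `Dim raw As DataTable = LoadCsvToDataTable("C:\...\source.csv", ","c, 1000000)`. If the data can contain quoted fields, a real CSV parser such as `Microsoft.VisualBasic.FileIO.TextFieldParser` would be a safer choice than `String.Split`, at some cost in speed.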