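How to find and list duplicate rows in a CSV file based on a column using C#. Matching/grouping rows.

I converted an Excel file to a CSV file. The file contains more than 100,000 records. I want to search the full-name column and return the duplicated rows. If the full name matches, I want the program to return the entire duplicated row. I started with code that returns a list of full names, but that is as far as I got.

Here is the code I have so far: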

    public static void readCells()
    {
        var dictionary = new Dictionary<string, int>();
        Console.WriteLine("started");
        var counter = 1;
        var readText = File.ReadAllLines(path);
        var duplicatedValues = dictionary.GroupBy(fullName => fullName.Value).Where(fullName => fullName.Count() > 1);
        foreach (var s in readText)
        {
            var values = s.Split(new Char[] { ',' });
            var fullName = values[3];
            if (!dictionary.ContainsKey(fullName))
            {
                dictionary.Add(fullName, 1);
            }
            else
            {
                dictionary[fullName] += 1;
            }
            Console.WriteLine("Full Name Is: " + values[3]);
            counter++;
        }
    }

I changed the dictionary to use the full name as the key:

    public static void readCells()
    {
        var dictionary = new Dictionary<string, List<List<string>>>();
        Console.WriteLine("started");
        var counter = 1;
        var readText = File.ReadAllLines(path);
        var duplicatedValues = dictionary.GroupBy(fullName => fullName.Value).Where(fullName => fullName.Count() > 1);
        foreach (var s in readText)
        {
            List<string> values = s.Split(new Char[] { ',' }).ToList();
            string fullName = values[3];
            if (!dictionary.ContainsKey(fullName))
            {
                List<List<string>> newList = new List<List<string>>();
                newList.Add(values);
                dictionary.Add(fullName, newList);
            }
            else
            {
                dictionary[fullName].Add(values);
            }
            Console.WriteLine("Full Name Is: " + values[3]);
            counter++;
        }
    }

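I find that Microsoft's built-in TextFieldParser (which you can use from C# even though it lives in the Microsoft.VisualBasic.FileIO namespace) simplifies reading and parsing CSV files.

Using this type, your ReadCells() method can be modified into the following extension methods: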

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using Microsoft.VisualBasic.FileIO;

    public static class TextFieldParserExtensions
    {
        // Returns the groups of rows that share the same value in column keyCellIndex.
        public static List<IGrouping<string, string[]>> ReadCellsWithDuplicatedCellValues(string path, int keyCellIndex, int nRowsToSkip = 0)
        {
            using (var stream = File.OpenRead(path))
            using (var parser = new TextFieldParser(stream))
            {
                parser.SetDelimiters(new string[] { "," });
                var values = parser.ReadAllFields()
                    // If your CSV file contains header row(s) you can skip them by passing a value for nRowsToSkip
                    .Skip(nRowsToSkip)
                    .GroupBy(row => row.ElementAtOrDefault(keyCellIndex))
                    .Where(g => g.Count() > 1)
                    .ToList();
                return values;
            }
        }

        // Lazily yields the parsed fields of each remaining row.
        public static IEnumerable<string[]> ReadAllFields(this TextFieldParser parser)
        {
            if (parser == null)
                throw new ArgumentNullException(nameof(parser));
            while (!parser.EndOfData)
                yield return parser.ReadFields();
        }
    }

You would call it like this:

    var groups = TextFieldParserExtensions.ReadCellsWithDuplicatedCellValues(path, 3);
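Each returned group holds the complete rows that share one full name, which is what the question asks for. As a rough sketch of how the result might be consumed, the helper below turns the groups into printable lines. The names DuplicateReport and Describe are hypothetical and not part of the answer's code.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (illustration only): formats duplicate groups for display.
public static class DuplicateReport
{
    public static List<string> Describe(IEnumerable<IGrouping<string, string[]>> groups)
    {
        var lines = new List<string>();
        foreach (var group in groups)
        {
            // Header line: the shared key and how many rows carry it.
            lines.Add($"{group.Key} ({group.Count()} rows)");
            // One indented line per full row in the group.
            foreach (var row in group)
                lines.Add("  " + string.Join(",", row));
        }
        return lines;
    }
}
```

Calling `DuplicateReport.Describe(groups)` and writing each returned string with Console.WriteLine would list every duplicated row under its full name.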

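Notes:

TextFieldParser correctly handles cells with escaped, embedded commas, which s.Split(new Char[] { ',' }) does not.
Since your CSV file has more than 100,000 records, I adopted a streaming strategy to avoid allocating the intermediate string[] readText array.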
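To illustrate the first note, here is a small sketch comparing the two approaches on a single line whose second field contains a quoted comma. The class name CsvSplitDemo, its method names, and the sample line are made up for this illustration.

```csharp
using System.IO;
using Microsoft.VisualBasic.FileIO;

// Hypothetical demo class (illustration only).
public static class CsvSplitDemo
{
    // Naive approach: splits on every comma, including commas inside quotes.
    public static string[] NaiveSplit(string line)
    {
        return line.Split(',');
    }

    // TextFieldParser approach: treats a quoted field as a single cell.
    public static string[] ParseLine(string line)
    {
        using (var parser = new TextFieldParser(new StringReader(line)))
        {
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true;
            return parser.ReadFields();
        }
    }
}
```

For the line `1,"Smith, John",NY`, NaiveSplit yields four pieces because the name is cut in two, while ParseLine yields three fields with `Smith, John` intact. With the naive split, every column after the name is shifted by one, so grouping on a fixed index such as values[3] could silently compare the wrong column.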

You can try Cinchoo ETL, an open-source library, to parse the CSV file and identify duplicates with a few lines of code.

Sample CSV file (EmpDuplicates.csv):

    Id,Name
    1,Tom
    2,Mark
    3,Lou
    3,Lou
    4,Austin
    4,Austin
    4,Austin

Here is how to parse the file and identify the duplicate records:

    using (var parser = new ChoCSVReader("EmpDuplicates.csv").WithFirstLineHeader())
    {
        foreach (dynamic c in parser.GroupBy(r => r.Id).Where(g => g.Count() > 1).Select(g => g.FirstOrDefault()))
            Console.WriteLine(c.DumpAsJson());
    }

Output:

    {
      "Id": 3,
      "Name": "Lou"
    }
    {
      "Id": 4,
      "Name": "Austin"
    }
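Note that the query above keeps only the first record of each duplicate group. If every duplicated row is wanted, as in the question, the groups can be flattened with SelectMany instead of Select. The sketch below shows the idea with plain LINQ over in-memory tuples. It does not use the Cinchoo API, and the class and method names are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (illustration only): plain LINQ, not the Cinchoo API.
public static class DuplicateRows
{
    public static List<(int Id, string Name)> AllDuplicates(IEnumerable<(int Id, string Name)> records)
    {
        return records
            .GroupBy(r => r.Id)           // bucket records by key
            .Where(g => g.Count() > 1)    // keep only keys that occur more than once
            .SelectMany(g => g)           // flatten back to every record in those buckets
            .ToList();
    }
}
```

Applied to the sample data, this would return both Lou rows and all three Austin rows rather than one representative per key.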

Hope this helps.

For more detailed usage of this library, see the CodeProject article at https://www.codeproject.com/Articles/1145337/Cinchoo-ETL-CSV-Reader
