How can I split large ".csv" files into smaller files based on multiple conditions?

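I have large .csv files (about 40 MB each) that I want to split into smaller files on several conditions, naming each output file after the data it contains:

  1. Split the file by the content of column 3,
  2. Split the output of step 1 by the content of column 4,

Here is the tricky part:

  1. For the output created by the two previous operations, check whether there is any data in column 11. If there is, split that data by its content and then by the content of column 17 -> then save the outputs /OR/ AND/
  2. If there is no data in column 11, check column 15 and split accordingly. Next check column 17 and split the data by column 17 -> save the outputs.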

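I have something like this in VBA, but it is too slow for large files, and Excel sometimes crashes. With many files like this, it takes a long time to open each one by hand and then run the VBA on it.

Is it possible to cut up files on this many conditions?

Thanks in advance for your help.

Example (the header row gives the column numbers):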

    1 2 3 4 11 15 17
    Date Time COUNTRY CITY CHECK TEST TEST2
    2015-08-20 11:54 ENGLAND ABINGDON 1 1
    2015-08-21 12:54 ENGLAND BATLEY 2 5
    2015-08-22 13:54 ENGLAND FROME 2 6
    2015-08-23 14:54 ENGLAND FROME 2 1
    2015-08-24 15:54 USA CALIFORNIA 4 8
    2015-08-25 16:54 USA CONNECTICUT 4 9
    2015-08-26 17:54 USA DELAWARE 1 8
    2015-08-27 18:54 GERMANY SAXONY 6 9
    2015-08-28 19:54 GERMANY SAXONY 6 10
    2015-08-27 18:54 GERMANY SAXONY 4 11
    2015-08-28 19:54 GERMANY SAXONY 4 14
    2015-08-29 20:54 GERMANY HESSE 8
    2015-08-29 20:54 GERMANY HESSE 1 8

    File1
    2015-08-20 11:54 ENGLAND ABINGDON 1 1

    File2
    2015-08-21 12:54 ENGLAND BATLEY 2 5

    File3
    2015-08-22 13:54 ENGLAND FROME 2 6

    File4
    2015-08-23 14:54 ENGLAND FROME 2 1

    File5
    2015-08-24 15:54 USA CALIFORNIA 4 8

    File6
    2015-08-25 16:54 USA CONNECTICUT 4 9

    File7
    2015-08-26 17:54 USA DELAWARE 1 8

    File8
    2015-08-27 18:54 GERMANY SAXONY 4 9
    2015-08-28 19:54 GERMANY SAXONY 4 10

    File9
    2015-08-27 18:54 GERMANY SAXONY 6 11
    2015-08-28 19:54 GERMANY SAXONY 6 14

    File10
    2015-08-29 20:54 GERMANY HESSE 8

    File11
    2015-08-29 20:54 GERMANY HESSE 1 8

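Your data is all over the place! It is not in the columns you describe, nor is it tab-delimited. You are not making life easy!

Try this awk on your real data and see whether it produces output file names you can use.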

    awk -F'\t' '{
      f = $3 "_" $4                        # filename = field3 _ field4
      if (length($11)) {                   # if field 11 is not empty
        f = f "_A_" $11 "_" $17            # filename = filename _A_ field11 _ field17
      } else {                             # else
        f = f "_B_" $15 "_" $17            # filename = filename _B_ field15 _ field17
      }
      print f
    }' file.csv

You should get something like this

    ENGLAND_ABINGDON_A_3_1
    ENGLAND_ABINGDON_A_4_2
    GERMANY_SAXONY_B_5_3

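Basically, it uses awk and tells it that your field separator is a tab. It then looks at each line and builds an output file name in the variable f by examining the fields the way you describe.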
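Since everything here depends on the file really being tab-separated, it may be worth checking how many fields awk sees per line before splitting anything. The following is only a sketch: `delim_demo.txt` is a made-up two-line sample in which the first line uses tabs and the second uses spaces. On real data you would point the last command at your own file and expect every line to report at least 17 fields.

```shell
# Hypothetical sample: line 1 is tab-separated, line 2 is space-separated.
printf 'Date\tTime\tCOUNTRY\n2015-08-20 11:54 ENGLAND\n' > delim_demo.txt

# Tally the lines by the number of fields awk finds when splitting on tabs.
awk -F'\t' '{ n[NF]++ }
            END { for (k in n) print k " field(s): " n[k] " line(s)" }' delim_demo.txt
```

A line that reports a single field contains no tab at all, so the separator is something else and `-F` needs to be changed to match.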

If that is how you want your data split, you can write each line to its output file by simply changing the last line:

    awk -F'\t' '{
      f = $3 "_" $4                        # filename = field3 _ field4
      if (length($11)) {                   # if field 11 is not empty
        f = f "_A_" $11 "_" $17            # filename = filename _A_ field11 _ field17
      } else {                             # else
        f = f "_B_" $15 "_" $17            # filename = filename _B_ field15 _ field17
      }
      print > f
    }' file.csv

Basically, it writes the line to the file instead of printing the file's name if you change

    print f

to

    print > f
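One thing to watch for: awk keeps every output file open until the program ends, and some awk implementations limit how many files can be open at once. If the split produces a great many files, a possible workaround is to append to the file and close it again after every line. The sketch below assumes that approach. The directory `split_demo`, the input `in.tsv` and the `dir` variable are made up for this test and are not part of the original command.

```shell
# Throwaway test area (hypothetical names); start clean because >> appends.
rm -rf split_demo && mkdir split_demo

# Build three tab-separated rows with 17 fields each; two rows share a key.
awk 'BEGIN { OFS = "\t"
  $0 = ""; $3 = "ENGLAND"; $4 = "FROME"; $11 = 2; $17 = 6; print
  $0 = ""; $3 = "ENGLAND"; $4 = "FROME"; $11 = 2; $17 = 6; print
  $0 = ""; $3 = "GERMANY"; $4 = "HESSE"; $15 = 1; $17 = 8; print
}' > split_demo/in.tsv

# Same filename logic as above, but only one output file is open at a time.
awk -F'\t' -v dir="split_demo" '{
  f = dir "/" $3 "_" $4
  if (length($11)) f = f "_A_" $11 "_" $17
  else             f = f "_B_" $15 "_" $17
  print >> f      # append, because the file is reopened for every line
  close(f)        # release the file handle straight away
}' split_demo/in.tsv

ls split_demo
```

Reopening a file for every line is slower than keeping it open, so this is probably only worth doing if awk actually complains about too many open files. Because `>>` appends, output left over from an earlier run has to be removed first.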

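Handling headers

If you want a header in each output file, we will need to work a little harder...

First, we need to save the header from the original file, so if we assume it is the first record, we would do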

    awk -F'\t' '
      NR==1 {header=$0; next}              # save first line as header, then skip it
      {
        f = $3 "_" $4                      # filename = field3 _ field4
        ...
        ... as before

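Now we need to emit a header line whenever we start writing a new file, which is "fun" because we are creating the output file name dynamically for every line! So we need to "remember" which files we have already written to, and emit a header only when we write to a new one. I don't have a decent data set here, so I am guessing at this!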

    awk -F'\t' '
      NR==1 {header=$0; next}              # save first line as header, then skip it
      {
        f = $3 "_" $4                      # filename = field3 _ field4
        if (length($11)) {                 # if field 11 is not empty
          f = f "_A_" $11 "_" $17          # filename = filename _A_ field11 _ field17
        } else {                           # else
          f = f "_B_" $15 "_" $17          # filename = filename _B_ field15 _ field17
        }
        # Emit header if this is the first write to this filename
        if (!(f in fileswritten)) {
          fileswritten[f]++                # note that we have written to this file
          print header > f                 # emit header
        }
        print > f
      }' file.csv
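Because the header logic above is untested, here is a small way to try it out before running it on a 40 MB file. It is only a sketch: the directory `hdr_demo`, the column names `col1` to `col17` and the sample rows are all invented, and the file is assumed to have a single header line. It uses `next` so that the header line itself is not treated as data.

```shell
# Throwaway test area with made-up names.
rm -rf hdr_demo && mkdir hdr_demo

# One header row plus three data rows, 17 tab-separated fields each.
awk 'BEGIN { OFS = "\t"
  $0 = ""; for (i = 1; i <= 17; i++) $i = "col" i; print
  $0 = ""; $3 = "USA"; $4 = "DELAWARE"; $11 = 1; $17 = 8; print
  $0 = ""; $3 = "USA"; $4 = "DELAWARE"; $11 = 1; $17 = 8; print
  $0 = ""; $3 = "GERMANY"; $4 = "HESSE"; $15 = 1; $17 = 8; print
}' > hdr_demo/in.tsv

# Header-aware split: remember which files already received the header.
awk -F'\t' -v dir="hdr_demo" '
  NR == 1 { header = $0; next }
  {
    f = dir "/" $3 "_" $4
    if (length($11)) f = f "_A_" $11 "_" $17
    else             f = f "_B_" $15 "_" $17
    if (!(f in seen)) { seen[f] = 1; print header > f }
    print > f
  }' hdr_demo/in.tsv

# Each output file should start with the header and then hold its own rows.
for out in hdr_demo/*_*; do
  echo "== $out"
  cut -f 1-4 "$out"
done
```

If each listed file begins with the `col1 ... col4` line followed only by rows for its own country and city, the same program should behave the same way on the real data, provided that data really is tab-separated.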

This answer is incomplete, but it roughly shows what you need to do:

    #!/bin/bash
    # Get the list of countries (column 3), skipping the two header rows:
    countries=$(cut -f 3 -d$'\t' file1.csv | grep -v -x -e 3 -e COUNTRY | sort -u)
    for country in ${countries}; do
      # Get the list of cities (column 4) for this country:
      cities=$(grep "${country}" file1.csv | cut -f 4 -d$'\t' | sort -u)
      # Get the data for this country:
      grep "${country}" file1.csv > "file1-${country}.csv"
      for city in ${cities}; do
        echo "${country}:${city}"
        # Get the data for this city in this country:
        grep "${country}" file1.csv | grep "${city}" > "file1-${country}-${city}.csv"
        # From the output of the two previous operations, check whether there is any
        # data in column 11; if so, split by its content and then by column 17.
        # Column 11 is at position 5 in your data.
        checks=$(grep "${country}" file1.csv | grep "${city}" | cut -f 5 -d$'\t' | sort -u)
        for check in ${checks}; do
          echo "${country}:${city}:${check}"
          # TODO: Filter on ${check} and split further; as written this copies
          # every row for the city. I assume you get the drift by now.
          grep "${country}" file1.csv | grep "${city}" > "file1-${country}-${city}-${check}.csv"
        done
      done
      # If there is no data in column 11, check column 15 and split accordingly.
      # Next check column 17 and split this data by column 17 -> save outputs.
      # TODO: Further split this, I assume you get the drift by now.
    done

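I suggest writing a small program that uses a Java CSV library (the classes below, such as CSVFormat, are from Apache Commons CSV):

    private static final String[] FILE_HEADER_MAPPING = {"Date", "Time", "COUNTRY", .... };

    // Assumption: fileReader is a Reader opened on the .csv file, and
    // csvFileFormat is a CSVFormat that uses the header names above.
    CSVFormat csvFileFormat = CSVFormat.DEFAULT.withHeader(FILE_HEADER_MAPPING);
    CSVParser csvFileParser = new CSVParser(fileReader, csvFileFormat);
    List<CSVRecord> csvRecords = csvFileParser.getRecords();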

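Then, to access column 11, you have to do

    // Start at 1 to skip the header record.
    for (int i = 1; i < csvRecords.size(); i++) {
        boolean publishAccount = true;
        CSVRecord record = csvRecords.get(i);
        // This is how to access the field; pass the header name of column 11.
        record.get("Field column 11");
    }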