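Grab one specific column from multiple CSV files and merge into one

I just want to grab the data in the 4th column of every CSV file and write it into a single file. Each 4th column has a unique header name, made of the root folder's name plus the CSV's name (e.g. FolderA1).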

FolderA/

    1.csv | INFO   INFO   INFO   FolderA1  INFO
            Apple  Apple  Apple  Orange    Apple

    2.csv | INFO   INFO   INFO   FolderA2  INFO
            Apple  Apple  Apple  Cracker   Apple

    3.csv | INFO   INFO   INFO   FOLDERA3  INFO
            Apple  Apple  Apple  Orange    Apple

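How can I get only the 4th-column data filtered into one .xlsx file, with the next folder's CSVs placed on a new worksheet, or otherwise kept separate from the previous folder's CSVs?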

    concentrated.xlsx | FOLDERA1  FOLDERA2  FOLDERA3  FOLDERB1  FOLDERB2  FOLDERB3
                        ORANGE    CRACKER   ORANGE    ORANGE    CRACKER   ORANGE

I would use the usecols parameter that comes with pandas.read_csv.

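    import pandas as pd

    def read_4th(fn):
        # usecols=[3] keeps only the fourth column (zero-based index 3).
        # sep=r'\s+' splits on runs of whitespace, like delim_whitespace,
        # which newer pandas releases deprecate.
        return pd.read_csv(fn, sep=r'\s+', usecols=[3])

    files = ['./1.csv', './2.csv', './3.csv']
    big_df = pd.concat([read_4th(fn) for fn in files], axis=1)
    big_df.to_excel('./mybigdf.xlsx')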

For multiple folders, use glob.

Suppose you have two folders, "FolderA" and "FolderB", both located in the folder "./", and you want all the CSV files inside them.

    from glob import glob

    files = glob('./*/*.csv')

Then run the rest as described above.
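The question also asks for each folder to end up on its own worksheet, which a flat glob result does not give you directly. As a minimal sketch, the paths could first be grouped by their parent folder. The helper names `group_by_folder` and `column_label` are hypothetical, and the sketch assumes the header should be the folder name followed by the file name without its extension.

```python
from collections import defaultdict
from pathlib import Path


def group_by_folder(paths):
    """Map each parent folder name to its CSV paths, sorted by file name."""
    groups = defaultdict(list)
    for p in map(Path, paths):
        groups[p.parent.name].append(p)
    return {folder: sorted(files) for folder, files in sorted(groups.items())}


def column_label(path):
    """Build a header such as 'FolderA1' from 'FolderA/1.csv' (assumed naming)."""
    path = Path(path)
    return path.parent.name + path.stem


# Example with hard-coded paths; in practice pass the result of glob().
groups = group_by_folder(['./FolderA/2.csv', './FolderB/1.csv', './FolderA/1.csv'])
for folder, files in groups.items():
    print(folder, [column_label(f) for f in files])
```

Each group could then go through the same concat step as above and be written to its own sheet, for example by calling `to_excel(writer, sheet_name=folder)` inside a `pd.ExcelWriter` context. Note that the file names sort as text, so `10.csv` would come before `2.csv`.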

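Other answers suggest using Pandas as an option, and that will certainly work, but if you are looking for a solution that uses only the Python standard library, you can try the csv module and iterators.

The caveat here is that, depending on the number of files you need to concatenate, you might run into memory limits. That aside, here is one approach.

Plain Python libraries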

    import csv
    from glob import glob
    from itertools import zip_longest

    # Use glob to collect the CSV files. Adjust the pattern to your needs.
    # A list is kept (not a generator) so the files can be closed afterwards.
    input_files = [open(file_path, newline='') for file_path in glob('*.csv')]

    # Using a generator, wrap all the CSV files in reader instances
    input_readers = (csv.reader(input_file) for input_file in input_files)

    with open('output.csv', 'w', newline='') as output_file:
        output_writer = csv.writer(output_file)

        # zip_longest returns a tuple holding the next value of every
        # iterable passed to it; here, the next row of each input reader.
        # Readers that are already exhausted contribute None.
        for rows in zip_longest(*input_readers):
            # Extract the fourth column of every row.
            # Note that this presumes every file has a fourth column;
            # some error checking/handling might be required if
            # you are not sure that's the case.
            fourth_columns = [row[3] if row else '' for row in rows]

            # Write one output row made of the fourth columns of all readers
            output_writer.writerow(fourth_columns)

    # Clean up the opened files
    for f in input_files:
        f.close()

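By using generators, you minimize the amount of data that has to be loaded into memory at once, while keeping a Pythonic approach to the problem.

Using the glob module makes it easier to load multiple files that match a known pattern, which seems to be your case. Feel free to replace it with another form of file lookup, such as os.walk, if that is more appropriate.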
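The code comments above note that some error handling may be needed when a file lacks a fourth column, and the files may also differ in length. The following is only a sketch of one way to handle both cases, using in-memory data instead of real files. The function name `fourth_columns` and the `fill` parameter are made up for illustration.

```python
import csv
import io
from itertools import zip_longest


def fourth_columns(readers, fill=''):
    """Yield one output row per input row index, built from column 4 of each reader."""
    for rows in zip_longest(*readers):
        # zip_longest gives None for readers that have run out of rows,
        # and a row can also be shorter than four fields.
        yield [row[3] if row is not None and len(row) > 3 else fill
               for row in rows]


# Two small in-memory "files" of different lengths stand in for real CSVs.
a = io.StringIO("i,i,i,FolderA1,i\nx,x,x,Orange,x\n")
b = io.StringIO("i,i,i,FolderA2,i\nx,x,x,Cracker,x\nx,x,x,Apple,x\n")

merged = list(fourth_columns([csv.reader(a), csv.reader(b)]))
for row in merged:
    print(row)
```

Because `fourth_columns` is itself a generator, rows are still produced one at a time; the `list(...)` call is only there to make the small example easy to inspect.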

Something like this should work:

    import pandas as pd

    input_file_paths = ['1.csv', '2.csv', '3.csv']
    dfs = (pd.read_csv(fname) for fname in input_file_paths)

    master_df = pd.concat(
        (df[[c for c in df.columns if c.lower().startswith('folder')]] for df in dfs),
        axis=1)

    master_df.to_excel('smth.xlsx')

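The line df[[c for c in df.columns if c.lower().startswith('folder')]] is there because the folder column in your example files is not formatted consistently (FolderA1 vs. FOLDERA3).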
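To see what that filter does without involving pandas, here is a small sketch that applies the same case-insensitive test to plain lists of header names taken from the question's sample files. The helper name `folder_columns` is hypothetical.

```python
def folder_columns(header, prefix='folder'):
    """Return the header names that start with prefix, ignoring case."""
    prefix = prefix.lower()
    return [name for name in header if name.lower().startswith(prefix)]


# Header rows as they appear in the question's sample files
headers = [
    ['INFO', 'INFO', 'INFO', 'FolderA1', 'INFO'],
    ['INFO', 'INFO', 'INFO', 'FOLDERA3', 'INFO'],
]

selected = [folder_columns(h) for h in headers]
print(selected)
```

Lower-casing both sides before comparing is what lets `FolderA1` and `FOLDERA3` match the same prefix.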
