Python Pandas – Reading a csv file that contains multiple tables

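I have a single .csv file that contains multiple tables.

Using pandas, what would be the best strategy to get two DataFrames, inventory and HPBladeSystemRack, from this one file?

The input .csv looks like this: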

    Inventory
    System Name            IP Address    System Status
    dg-enc05                             Normal
    dg-enc05_vc_domain                   Unknown
    dg-enc05-oa1           172.20.0.213  Normal

    HP BladeSystem Rack
    System Name            Rack Name     Enclosure Name
    dg-enc05               BU40
    dg-enc05-oa1           BU40          dg-enc05
    dg-enc05-oa2           BU40          dg-enc05

The best I have come up with so far is to convert this .csv file into an Excel workbook (xlsx), split the tables into separate sheets, and use:

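    import pandas as pd

    # Read the converted workbook; the keyword is `skiprows`, not `skiprow`
    inventory = pd.read_excel('path_to_file.xlsx', sheet_name='sheet1', skiprows=1)
    HPBladeSystemRack = pd.read_excel('path_to_file.xlsx', sheet_name='sheet2', skiprows=2)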

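However:

This approach requires the xlrd module.
These log files have to be analyzed in real time, so it would be much better to find a way to parse them directly as they come out of the logs.
The real logs contain far more tables than these two.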

If you know the table names beforehand, then something like this:

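    import pandas as pd

    # Read everything as raw rows with three unnamed columns
    df = pd.read_csv("jahmyst2.csv", header=None, names=range(3))
    table_names = ["Inventory", "HP BladeSystem Rack", "Network Interface"]
    # Each title row starts a new group: the running count of title rows is the group id
    groups = df[0].isin(table_names).cumsum()
    # Key: the title in the group's first row; value: the rows below it
    tables = {g.iloc[0, 0]: g.iloc[1:] for k, g in df.groupby(groups)}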

should produce a dictionary with the table names as keys and the subtables as values.

    >>> list(tables)
    ['HP BladeSystem Rack', 'Inventory']
    >>> for k,v in tables.items():
    ...     print("table:", k)
    ...     print(v)
    ...     print()
    ...
    table: HP BladeSystem Rack
                  0          1               2
    6   System Name  Rack Name  Enclosure Name
    7      dg-enc05       BU40             NaN
    8  dg-enc05-oa1       BU40        dg-enc05
    9  dg-enc05-oa2       BU40        dg-enc05

    table: Inventory
                        0             1              2
    1         System Name    IP Address  System Status
    2            dg-enc05           NaN         Normal
    3  dg-enc05_vc_domain           NaN        Unknown
    4        dg-enc05-oa1  172.20.0.213         Normal

Once you have that, you can set the column names from the first row of each subtable, and so on.
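Promoting the first row of a subtable to column names could look roughly like the sketch below. The helper name `promote_header` and the toy frame are assumptions made for illustration; the frame only mimics the shape of one entry of `tables` as printed above.

```python
import pandas as pd


def promote_header(raw):
    """Return a copy of `raw` whose first row becomes the column labels."""
    header = raw.iloc[0]
    # Drop padding columns that have no header cell (hypothetical cleanup step)
    keep = header.notna().to_numpy()
    body = raw.iloc[1:, keep].copy()
    body.columns = header[keep].tolist()
    return body.reset_index(drop=True)


# Toy stand-in for one raw subtable (integer column labels, header in row 0)
raw = pd.DataFrame(
    [["System Name", "Rack Name", "Enclosure Name"],
     ["dg-enc05", "BU40", None],
     ["dg-enc05-oa1", "BU40", "dg-enc05"]],
    index=[6, 7, 8],
)
rack = promote_header(raw)
print(rack)
```

Applied to the whole dictionary, this would be something like `{name: promote_header(t) for name, t in tables.items()}`.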

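I assume you know the names of the tables you want to parse out of the csv file. If so, you can retrieve the index position of each one and select the relevant slices accordingly. As a sketch, this could look like: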

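    import pandas as pd

    # 'path_to_file' and 'col_with_table_names' are placeholders;
    # table_names must list the titles in the order they appear in the file
    df = pd.read_csv('path_to_file')
    index_positions = []
    for table in table_names:
        index_positions.append(df[df['col_with_table_names'] == table].index.tolist()[0])

    # Add an end marker one past the last row so the final table has an upper bound
    index_positions.append(df.index.tolist()[-1] + 1)

    tables = {}
    for table_no, position in enumerate(index_positions[:-1]):
        # .loc slicing is inclusive, so stop one row before the next table's title row
        tables[table_names[table_no]] = df.loc[position:index_positions[table_no + 1] - 1]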

There are certainly more elegant solutions, but this should give you a dictionary with the table names as keys and the corresponding tables as values.
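To try the slicing idea without a file on disk, here is a small self-contained sketch. It assumes a comma-separated file in which each table title sits alone in the first column; the sample text and the variable names `starts` and `bounds` are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical sample: two tables, each introduced by a title row
sample = """Inventory,,
System Name,IP Address,System Status
dg-enc05,,Normal
dg-enc05-oa1,172.20.0.213,Normal
HP BladeSystem Rack,,
System Name,Rack Name,Enclosure Name
dg-enc05-oa1,BU40,dg-enc05
"""
table_names = ["Inventory", "HP BladeSystem Rack"]

df = pd.read_csv(io.StringIO(sample), header=None, names=range(3))

# Row positions of the title rows, in file order
starts = df.index[df[0].isin(table_names)].tolist()
# One past the last row closes the final table
bounds = starts + [len(df)]

# Key: title cell; value: the rows between this title and the next one
tables = {
    df.iloc[start, 0]: df.iloc[start + 1:stop].reset_index(drop=True)
    for start, stop in zip(bounds[:-1], bounds[1:])
}

for name, table in tables.items():
    print("table:", name)
    print(table)
```

Because the boundaries are found by scanning the title column, the tables do not have to be listed in file order here, which is one difference from the loop above.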

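Pandas doesn't seem to be ready to do this out of the box, so I ended up writing my own split_csv function. It only needs the table names, and it writes one .csv file named after each table.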

    import csv
    import sys
    from os.path import dirname  # gets parent folder in a path
    from os.path import join  # concatenate paths

    table_names = ["Inventory", "HP BladeSystem Rack", "Network Interface"]


    def split_csv(csv_path, table_names):
        tables_infos = detect_tables_from_csv(csv_path, table_names)
        for table_info in tables_infos:
            split_csv_by_indexes(csv_path, table_info)


    def split_csv_by_indexes(csv_path, table_info):
        title, start_index, end_index = table_info
        print(title, start_index, end_index)
        dir_ = dirname(csv_path)
        output_path = join(dir_, title) + ".csv"
        # newline='' is what the csv module expects for file objects in Python 3
        with open(output_path, 'w', newline='') as output_file, \
                open(csv_path, newline='') as input_file:
            writer = csv.writer(output_file)
            reader = csv.reader(input_file)
            # Copy only the rows between start_index and end_index (inclusive)
            for i, line in enumerate(reader):
                if i < start_index:
                    continue
                if i > end_index:
                    break
                writer.writerow(line)


    def detect_tables_from_csv(csv_path, table_names):
        output = []
        previous_match = None  # title of the table currently being scanned
        start_index = 0
        idx = -1
        with open(csv_path, newline='') as csv_file:
            reader = csv.reader(csv_file)
            for idx, row in enumerate(reader):
                for col in row:
                    match = [title for title in table_names if title in col]
                    if match:
                        if previous_match is not None:
                            # close the previous table on the row before this title
                            output.append((previous_match, start_index, idx - 1))
                        print("Found new table", col)
                        start_index = idx
                        previous_match = match[0]  # get the first matching element
        if previous_match is not None:
            # the last table runs to the end of the file
            output.append((previous_match, start_index, idx))
        return output


    if __name__ == '__main__':
        csv_path = 'switch_records.csv'
        try:
            split_csv(csv_path, table_names)
        except IOError as e:
            print("This file doesn't exist. Aborting.")
            print(e)
            sys.exit(1)
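If writing intermediate files is not wanted (the question mentions real-time analysis), the same title-detection idea can be kept in memory. The following is only a sketch using the standard library: `iter_tables` is a hypothetical helper, the sample text is invented, and it matches a title by exact equality with the first cell rather than by the substring test used above.

```python
import csv
import io


def iter_tables(lines, table_names):
    """Yield (title, rows) pairs from an iterable of CSV text lines."""
    title, rows = None, []
    for row in csv.reader(lines):
        first = row[0].strip() if row else ""
        if first in table_names:
            # A new title row closes the table collected so far
            if title is not None:
                yield title, rows
            title, rows = first, []
        elif title is not None and any(cell.strip() for cell in row):
            # Keep non-empty rows that belong to the current table
            rows.append(row)
    if title is not None:
        yield title, rows


# Hypothetical sample standing in for a log stream
sample = io.StringIO(
    "Inventory,,\n"
    "System Name,IP Address,System Status\n"
    "dg-enc05,,Normal\n"
    "\n"
    "HP BladeSystem Rack,,\n"
    "System Name,Rack Name,Enclosure Name\n"
    "dg-enc05-oa1,BU40,dg-enc05\n"
)
result = dict(iter_tables(sample, ["Inventory", "HP BladeSystem Rack"]))
for name, rows in result.items():
    print(name, rows)
```

Since `csv.reader` accepts any iterable of strings, `lines` could just as well be an open file object; each `rows` list still has its header as the first element.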
