Python Pandas – read a csv file containing multiple tables

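I have a single .csv file that contains multiple tables.

Using pandas, what would be the best strategy to get two DataFrames, inventory and HPBladeSystemRack, from this one file?

The input .csv looks like this: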

    Inventory
    System Name            IP Address    System Status
    dg-enc05                             Normal
    dg-enc05_vc_domain                   Unknown
    dg-enc05-oa1           172.20.0.213  Normal
    HP BladeSystem Rack
    System Name            Rack Name     Enclosure Name
    dg-enc05               BU40
    dg-enc05-oa1           BU40          dg-enc05
    dg-enc05-oa2           BU40          dg-enc05

The best I have come up with so far is to convert this .csv file into an Excel workbook (xlsx), split the tables into separate sheets, and use:

    import pandas as pd

    inventory = pd.read_excel('path_to_file.xlsx', sheet_name='sheet1', skiprows=1)
    HPBladeSystemRack = pd.read_excel('path_to_file.xlsx', sheet_name='sheet2', skiprows=2)

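However:

  • This approach requires the xlrd module.
  • These log files have to be analyzed in real time, so it would be much better to find a way to analyze them as they come out of the logs.
  • The real logs contain many more tables than these two.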

If you know the table names beforehand, then something like this:

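    import pandas as pd

    # Read everything as three unnamed columns so the title rows survive
    df = pd.read_csv("jahmyst2.csv", header=None, names=range(3))
    table_names = ["Inventory", "HP BladeSystem Rack", "Network Interface"]
    # Each title row starts a new group: the running count increases at every title
    groups = df[0].isin(table_names).cumsum()
    # Key: the title in the group's first row; value: the rows below it
    tables = {g.iloc[0,0]: g.iloc[1:] for k,g in df.groupby(groups)}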

should work to produce a dictionary with the table names as keys and the subtables as values.

    >>> list(tables)
    ['HP BladeSystem Rack', 'Inventory']
    >>> for k,v in tables.items():
    ...     print("table:", k)
    ...     print(v)
    ...     print()
    ...
    table: HP BladeSystem Rack
                  0          1               2
    6   System Name  Rack Name  Enclosure Name
    7      dg-enc05       BU40             NaN
    8  dg-enc05-oa1       BU40        dg-enc05
    9  dg-enc05-oa2       BU40        dg-enc05

    table: Inventory
                        0             1              2
    1         System Name    IP Address  System Status
    2            dg-enc05           NaN         Normal
    3  dg-enc05_vc_domain           NaN        Unknown
    4        dg-enc05-oa1  172.20.0.213         Normal

Once you have that, you can set the column names to the first row of each subtable, and so on.
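Promoting the first row of each subtable to column names could look like the minimal sketch below. It rebuilds the `tables` dictionary with the same grouping approach, but from a tiny inline sample rather than the real file; the sample data and the helper name `promote_header` are made up for illustration.

```python
import io

import pandas as pd

# Tiny inline sample standing in for the real file (illustrative only)
sample = io.StringIO(
    "Inventory,,\n"
    "System Name,IP Address,System Status\n"
    "dg-enc05,,Normal\n"
    "dg-enc05-oa1,172.20.0.213,Normal\n"
    "HP BladeSystem Rack,,\n"
    "System Name,Rack Name,Enclosure Name\n"
    "dg-enc05-oa1,BU40,dg-enc05\n"
)
table_names = ["Inventory", "HP BladeSystem Rack", "Network Interface"]

df = pd.read_csv(sample, header=None, names=list(range(3)))
groups = df[0].isin(table_names).cumsum()
tables = {g.iloc[0, 0]: g.iloc[1:] for k, g in df.groupby(groups)}


def promote_header(table):
    """Hypothetical helper: use the first row as column names and drop it from the data."""
    result = table.iloc[1:].reset_index(drop=True)
    result.columns = table.iloc[0].tolist()
    return result


tables = {name: promote_header(t) for name, t in tables.items()}
print(tables["Inventory"])
```

Because everything was read as generic columns, the values stay as strings (or NaN for empty cells); any numeric conversion would be a separate step.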

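I assume you know the names of the tables you want to parse out of the csv file. If so, you could retrieve the index position of each one and select the relevant slices accordingly. As a sketch, this could look like: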

    import pandas as pd

    # table_names is assumed to be a list of the known table titles, in file order
    df = pd.read_csv('path_to_file')
    index_positions = []
    for table in table_names:
        index_positions.append(df[df['col_with_table_names']==table].index.tolist()[0])

    ## Include end of table for last slice, omit for iteration below
    index_positions.append(df.index.tolist()[-1])

    tables = {}
    for position in index_positions[:-1]:
        table_no = index_positions.index(position)
        # .loc slices are inclusive, so each slice also contains the next table's title row
        tables[table_names[table_no]] = df.loc[position:index_positions[table_no+1]]

There are certainly more elegant solutions, but this should give you a dictionary with the table names as keys and the corresponding tables as values.

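Pandas does not seem to be ready to do this easily, so I ended up writing my own split_csv function. It only requires the table names and outputs one .csv file named after each table.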

    import csv
    import sys
    from os.path import dirname  # gets parent folder in a path
    from os.path import join  # concatenate paths

    table_names = ["Inventory", "HP BladeSystem Rack", "Network Interface"]


    def split_csv(csv_path, table_names):
        """Write each detected table to its own .csv file next to csv_path."""
        tables_infos = detect_tables_from_csv(csv_path, table_names)
        for table_info in tables_infos:
            split_csv_by_indexes(csv_path, table_info)


    def split_csv_by_indexes(csv_path, table_info):
        """Copy rows start_index..end_index (inclusive) into '<title>.csv'."""
        title, start_index, end_index = table_info
        print(title, start_index, end_index)
        dir_ = dirname(csv_path)
        output_path = join(dir_, title) + ".csv"
        with open(output_path, 'w', newline='') as output_file, \
                open(csv_path, newline='') as input_file:
            writer = csv.writer(output_file)
            reader = csv.reader(input_file)
            for i, line in enumerate(reader):
                if i < start_index:
                    continue
                if i > end_index:
                    break
                writer.writerow(line)


    def detect_tables_from_csv(csv_path, table_names):
        """Return one (title, start_index, end_index) tuple per table found."""
        output = []
        previous_match = None  # title of the table currently being read
        start_index = 0
        idx = -1  # stays -1 if the file is empty
        with open(csv_path, newline='') as csv_file:
            reader = csv.reader(csv_file)
            for idx, row in enumerate(reader):
                for col in row:
                    match = [title for title in table_names if title in col]
                    if match:
                        if previous_match is not None:
                            # a new title closes the previous table
                            output.append((previous_match, start_index, idx - 1))
                        print("Found new table", col)
                        start_index = idx
                        previous_match = match[0]  # get the first matching element
        if previous_match is not None:
            # the last table runs to EOF
            output.append((previous_match, start_index, idx))
        return output


    if __name__ == '__main__':
        csv_path = 'switch_records.csv'
        try:
            split_csv(csv_path, table_names)
        except IOError as e:
            print("This file doesn't exist. Aborting.")
            print(e)
            sys.exit(1)
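If writing intermediate files is undesirable, for example when the log text arrives in real time, the same title-detection idea can be applied in memory. The following is only a sketch and not part of the function above: `split_tables` is a hypothetical name, the sample text is made up, and it assumes each table title sits in the first cell of its own row and matches a known name exactly (the function above matches by substring instead).

```python
import csv
import io


def split_tables(text, table_names):
    """Hypothetical helper: map each table title to the list of rows below it."""
    tables = {}
    current = None  # title of the table being collected, if any
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines between tables
        if row[0] in table_names:
            # a title row starts a new table
            current = row[0]
            tables[current] = []
        elif current is not None:
            tables[current].append(row)
    return tables


# Illustrative sample only, not the real log
sample = (
    "Inventory,,\n"
    "System Name,IP Address,System Status\n"
    "dg-enc05,,Normal\n"
    "\n"
    "HP BladeSystem Rack,,\n"
    "System Name,Rack Name,Enclosure Name\n"
    "dg-enc05-oa1,BU40,dg-enc05\n"
)
tables = split_tables(sample, ["Inventory", "HP BladeSystem Rack"])
print(tables["Inventory"][0])
```

Each list of rows could then be handed to pandas, for instance with `pd.DataFrame(rows[1:], columns=rows[0])`, which treats the first row under the title as the header.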