Tidying data from a multi-level Excel file with pandas

I want to produce tidy data from an Excel file that looks like this, with three levels of "merged" headers:

[image: the Excel sheet with three levels of merged headers]

pandas reads the file fine, including the multi-level header:

    # df = pandas.read_excel('test.xlsx', header=[0, 1, 2])

For reproducibility, you can copy and paste this:

    import pandas

    df = pandas.DataFrame({
        ('Unnamed: 0_level_0', 'Unnamed: 0_level_1', 'a'): {1: 'aX', 2: 'aY'},
        ('Unnamed: 1_level_0', 'Unnamed: 1_level_1', 'b'): {1: 'bX', 2: 'bY'},
        ('Unnamed: 2_level_0', 'Unnamed: 2_level_1', 'c'): {1: 'cX', 2: 'cY'},
        ('level1_1', 'level2_1', 'level3_1'): {1: 1, 2: 10},
        ('level1_1', 'level2_1', 'level3_2'): {1: 2, 2: 20},
        ('level1_1', 'level2_2', 'level3_1'): {1: 3, 2: 30},
        ('level1_1', 'level2_2', 'level3_2'): {1: 4, 2: 40},
        ('level1_2', 'level2_1', 'level3_1'): {1: 5, 2: 50},
        ('level1_2', 'level2_1', 'level3_2'): {1: 6, 2: 60},
        ('level1_2', 'level2_2', 'level3_1'): {1: 7, 2: 70},
        ('level1_2', 'level2_2', 'level3_2'): {1: 8, 2: 80},
    })

I want to normalize this so that the header levels end up as variables in the rows, while keeping columns a, b and c as they are:

[image: desired output]

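Without the multi-level header, I would call pandas.melt(df, id_vars=['a', 'b', 'c']) to get what I want. pandas.melt(df) gives me the three variable columns I want, but it obviously does not keep columns a, b and c.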
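For illustration, here is a minimal sketch of the single-level case described above. The frame `flat` and its column names are made up for the example and are not the real spreadsheet:

```python
import pandas

# Hypothetical single-level frame: three id columns plus two value columns.
flat = pandas.DataFrame({
    'a': ['aX', 'aY'],
    'b': ['bX', 'bY'],
    'c': ['cX', 'cY'],
    'level3_1': [1, 10],
    'level3_2': [2, 20],
})

# id_vars are kept as-is; every other column becomes a (variable, value) pair.
long_flat = pandas.melt(flat, id_vars=['a', 'b', 'c'])
print(long_flat)
```

With a three-level header, the same call has no plain 'a', 'b', 'c' labels to refer to, which is where the difficulty comes from.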

It should be as simple as:

    wide_df = pandas.read_excel(xlfile, sheetname, header=[0, 1, 2], index_col=[0, 1, 2, 3])
    long_df = wide_df.stack().stack().stack()

Here is an example with a mock CSV file (note that the fourth row labels the index and the first column labels the header levels):

    from io import StringIO
    from textwrap import dedent

    import pandas

    mockcsv = StringIO(dedent("""\
        num,,,this1,this1,this1,this1,that1,that1,that1,that1
        let,,,thisA,thisA,thatA,thatA,thisB,thisB,thatB,thatB
        animal,,,cat,dog,bird,lizard,cat,dog,bird,lizard
        a,b,c,,,,,,,,
        a1,b1,c1,x1,x2,x3,x4,x5,x6,x7,x8
        a1,b1,c2,y1,y2,y3,y4,y5,y6,y7,y8
        a1,b2,c1,z1,z2,z3,z4,z5,6z,zy,z8
    """))
    wide_df = pandas.read_csv(mockcsv, index_col=[0, 1, 2], header=[0, 1, 2])
    long_df = wide_df.stack().stack().stack()

So wide_df looks like this:

    num      this1                  that1
    let      thisA     thatA        thisB     thatB
    animal     cat dog  bird lizard   cat dog  bird lizard
    a  b  c
    a1 b1 c1    x1  x2    x3     x4    x5  x6    x7     x8
          c2    y1  y2    y3     y4    y5  y6    y7     y8
       b2 c1    z1  z2    z3     z4    z5  6z    zy     z8

And long_df looks like this:

    a   b   c   animal  let    num
    a1  b1  c1  bird    thatA  this1    x3
                        thatB  that1    x7
                cat     thisA  this1    x1
                        thisB  that1    x5
                dog     thisA  this1    x2
                        thisB  that1    x6
                lizard  thatA  this1    x4
                        thatB  that1    x8
            c2  bird    thatA  this1    y3
                        thatB  that1    y7
                cat     thisA  this1    y1
                        thisB  that1    y5
                dog     thisA  this1    y2
                        thisB  that1    y6
                lizard  thatA  this1    y4
                        thatB  that1    y8
        b2  c1  bird    thatA  this1    z3
                        thatB  that1    zy
                cat     thisA  this1    z1
                        thisB  that1    z5
                dog     thisA  this1    z2
                        thisB  that1    6z
                lizard  thatA  this1    z4
                        thatB  that1    z8
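If plain columns are preferred over a multi-level index, the stacked Series can probably be flattened with `rename` and `reset_index`. This is a minimal sketch on a smaller made-up CSV. The two header levels, the two index columns and the `value` column name are assumptions chosen for brevity:

```python
from io import StringIO
from textwrap import dedent

import pandas

# Hypothetical, smaller mock file: two header levels, two index columns.
small_csv = StringIO(dedent("""\
    num,,this1,this1,that1,that1
    let,,thisA,thatA,thisA,thatA
    a,b,,,,
    a1,b1,x1,x2,x3,x4
    a1,b2,y1,y2,y3,y4
"""))
small_wide = pandas.read_csv(small_csv, index_col=[0, 1], header=[0, 1])

# One stack() per header level, then move every index level into a column.
tidy = small_wide.stack().stack().rename('value').reset_index()
print(tidy)
```

Recent pandas versions may emit a FutureWarning about the `stack` implementation here. The row order can also differ between versions, so it is safer not to rely on it.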

Using the literal data shown in the OP, you can get there without modifying anything by doing the following:

    index_names = ['a', 'b', 'c']
    col_names = ['Level1', 'Level2', 'Level3']
    df = (
        pandas.read_excel('Book1.xlsx', header=[0, 1, 2], index_col=[0, 1, 2, 3])
            .reset_index(level=0, drop=True)
            .rename_axis(index_names, axis='index')
            .rename_axis(col_names, axis='columns')
            .stack()
            .stack()
            .stack()
            .to_frame()
    )

I think the tricky part will be inspecting each of your files to work out what index_names should be.
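When the frame has already been read with `header=[0, 1, 2]` but without `index_col`, as in the question's copy-paste snippet, a similar result can probably be reached by promoting the three unnamed columns to the index first. The sketch below rebuilds the question's frame compactly. The `startswith('Unnamed')` filter and the `value` column name are assumptions made for the example:

```python
from itertools import product

import pandas

# Rebuild the question's frame: three id columns, then 2 x 2 x 2 value columns.
data = {}
for i, name in enumerate('abc'):
    data[(f'Unnamed: {i}_level_0', f'Unnamed: {i}_level_1', name)] = {1: name + 'X', 2: name + 'Y'}
for n, (l1, l2, l3) in enumerate(product((1, 2), repeat=3), start=1):
    data[(f'level1_{l1}', f'level2_{l2}', f'level3_{l3}')] = {1: n, 2: 10 * n}
df = pandas.DataFrame(data)

# Assumption: the id columns are exactly those whose top header cell was empty.
id_cols = [col for col in df.columns if col[0].startswith('Unnamed')]

wide = df.set_index(id_cols)
wide.index.names = ['a', 'b', 'c']
wide.columns = wide.columns.remove_unused_levels().set_names(['Level1', 'Level2', 'Level3'])

# Each stack() moves the innermost remaining header level into the row index.
long_df = wide.stack().stack().stack().to_frame('value')
print(long_df)
```

Calling `long_df.reset_index()` afterwards turns the six index levels into ordinary columns.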

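Split the DataFrame into two parts so that melting is easier, then join them back together.

    import pandas as pd

    first_half = df.iloc[:, :3]   # the a, b and c columns
    second_half = df.iloc[:, 3:]  # the value columns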

Melt the second part.

    melt_second_half = pd.melt(second_half)

Repeat the rows of the first part so that it matches the melted frame. The number of repeats is the row count of the melted frame divided by the row count of the first part.

    repeats = int(melt_second_half.shape[0] / first_half.shape[0])
    first_reps = pd.concat([first_half] * repeats, ignore_index=True)
    col_names = first_reps.columns.get_level_values(2)
    melt_first_half = pd.DataFrame(first_reps.values, columns=col_names)

Concatenate the two side by side and sort the resulting DataFrame by the value column.

    df_concat = pd.concat([melt_first_half, melt_second_half], axis=1)
    df_concat.sort_values('value').reset_index(drop=True)
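The steps above can be collected into one function. This is only a sketch: `melt_keep_ids` and its `n_id` parameter are hypothetical names, and it assumes the id columns come first and that their real labels sit in the last header level:

```python
from itertools import product

import pandas as pd


def melt_keep_ids(frame, n_id):
    """Melt all but the first n_id columns and glue the id columns back on."""
    ids, values = frame.iloc[:, :n_id], frame.iloc[:, n_id:]
    melted = pd.melt(values)
    # melt walks the value columns one by one, so the id block repeats once per column.
    repeats = len(melted) // len(ids)
    id_reps = pd.concat([ids] * repeats, ignore_index=True)
    id_flat = pd.DataFrame(id_reps.values, columns=ids.columns.get_level_values(-1))
    return pd.concat([id_flat, melted], axis=1)


# Rebuild the question's frame: three id columns, then 2 x 2 x 2 value columns.
data = {}
for i, name in enumerate('abc'):
    data[(f'Unnamed: {i}_level_0', f'Unnamed: {i}_level_1', name)] = {1: name + 'X', 2: name + 'Y'}
for n, (l1, l2, l3) in enumerate(product((1, 2), repeat=3), start=1):
    data[(f'level1_{l1}', f'level2_{l2}', f'level3_{l3}')] = {1: n, 2: 10 * n}
df = pd.DataFrame(data)

tidy = melt_keep_ids(df, n_id=3).sort_values('value').reset_index(drop=True)
print(tidy)
```

Because the header levels are unnamed, `pd.melt` labels them `variable_0`, `variable_1` and `variable_2`. They can be renamed afterwards if needed.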
