Trying to merge into one dataframe, but it keeps creating new columns

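I am trying to open files and derive 2 columns (1 row each) from multiple spreadsheets, then merge them into a base spreadsheet. The base dataframe (from a spreadsheet of which I only need 3 columns) looks like this: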

    Model | Roadmap  | Family
    a     | 08/12/17 | ROW
    b     | 08/14/17 | MACRO
    c     | 08/15/17 | CONN
    d     | 08/27/17 | MACRO

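The dataframes from the multiple spreadsheets (the model name is the spreadsheet name; each spreadsheet yields several dataframes, one date per gate) have the following format: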

    df1 (part1 - the dataframe derived from the spreadsheet with model a for gate 0):
    Model | Gate 0
    a     | 02/01/18

    df1 (dataframe derived from the spreadsheet with model a for gate 1):
    Model | Gate 1
    a     | 03/01/18

    df2 (part1):
    Model | Gate 0
    b     | 04/23/18

    df2 (part1):
    Model | Gate 1
    b     | 05/23/18

The output it produces is:

    Model | Roadmap | Family | Gate 0_x | Gate 1_x | gate 0_y | Gate 1_y
    a 08/12/17 ROW 02/01/18 03/01/18
    b 08/14/17 MACRO 04/23/18 05/23/18
    c 08/15/17 CONN
    d 08/27/17 MACRO

The output I want:

    Model | Roadmap  | Family | Gate 0   | Gate 1
    a     | 08/12/17 | ROW    | 02/01/18 | 03/01/18
    b     | 08/14/17 | MACRO  | 04/23/18 | 05/23/18
    ..

Here is the code I am using:

    import glob
    import pandas as pd
    import re
    import ntpath

    extension = 'xlsx'
    d = 'Final.xlsx'
    c = 'Roadmap.xlsx'
    dflist = []
    z = []
    result = [i for i in glob.glob('*.{}'.format(extension))]

    for b in result:
        if b == c:
            base_file = pd.read_excel(b, sheet_name='Antennas', header=7)
            ind1 = base_file.set_index('Model')
            ind1 = base_file[['Model', 'Roadmap', 'Family']]
            #print(ind1)
            ind1.to_excel('Final.xlsx')
            file3 = pd.read_excel('Final.xlsx')
            file3 = file3.replace(r'[,\"\']', '', regex=True).replace(r'\s*([^\s]+)\s*', r'\1', regex=True)

    for a in result:
        if a == c:
            base_file = pd.read_excel(a, sheet_name='Antennas', header=7)
            ind1 = base_file.set_index('Model')
            ind1 = base_file[['Model', 'Roadmap', 'Family']]
            ind1.to_excel('Final.xlsx')
        elif a != d:
            gates = ['Gate 0 Complete', 'Gate 1 Complete']
            file1 = pd.read_excel('Final.xlsx')
            file1 = file1.replace(r'[,\"\']', '', regex=True).replace(r'\s*([^\s]+)\s*', r'\1', regex=True)
            #print(file1)
            file = pd.read_excel(a, sheet_name='Timeline')
            #print(file)
            models = pd.DataFrame([['', '']], columns=['Model', gates])
            for g in gates:
                z = file.loc[file['Task'] == g, 'Complete'].iloc[0]
                v = ntpath.basename(a)
                v = v[5:-5]
                models = pd.DataFrame([[v, z]], columns=['Model', g])
                file1 = pd.merge(file1, models, how='left', on='Model')
            file3 = pd.merge(file3, file1, how='left', on=['Model', 'Roadmap', 'Family'])

    file3.to_excel('new.xlsx')

file3 is the dataframe opened from the base file before the for loop. Let me know if anything is unclear.

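Currently you are merging twice, but you really only need to merge the base with the individual dfs and then append them all together with pd.concat.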
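To see where the `_x`/`_y` columns come from, here is a minimal sketch with made-up frames (`base`, `gate_a` and `gate_b` are illustrative names, not taken from the question's files). When a second merge brings in a column name that already exists on the left side, pandas keeps both and appends its default suffixes; stacking the per-model frames first avoids the clash.

```python
import pandas as pd

# Hypothetical stand-ins for the base sheet and two per-model frames
base = pd.DataFrame({"Model": ["a", "b"], "Roadmap": ["08/12/17", "08/14/17"]})
gate_a = pd.DataFrame({"Model": ["a"], "Gate 0": ["02/01/18"]})
gate_b = pd.DataFrame({"Model": ["b"], "Gate 0": ["04/23/18"]})

# Merging the per-model frames one after another: the second merge finds
# "Gate 0" on both sides, so pandas disambiguates with _x / _y
chained = base.merge(gate_a, how="left", on="Model").merge(gate_b, how="left", on="Model")
print(list(chained.columns))  # ['Model', 'Roadmap', 'Gate 0_x', 'Gate 0_y']

# Stacking the per-model frames first, then merging once, keeps one column
stacked = pd.concat([gate_a, gate_b], ignore_index=True)
merged = base.merge(stacked, how="left", on="Model")
print(list(merged.columns))  # ['Model', 'Roadmap', 'Gate 0']
```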

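The example below recreates the data you posted, assuming your Excel files share the same structure, and demonstrates the merge and append steps. Note that drop_duplicates is used because the left-join merges produce identical rows. Keep or drop this call on your real data.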

Data

    from io import StringIO
    import pandas as pd

    txt = '''
    Model Roadmap Family
    a some_date some
    b some_date some
    c some_date some
    d some_date some
    '''
    base_df = pd.read_table(StringIO(txt), sep=r"\s+")

    txt = '''
    Model "Gate 0" "Gate 1"
    a some_date some
    '''
    df1 = pd.read_table(StringIO(txt), sep=r"\s+")

    txt = '''
    Model "Gate 0" "Gate 1"
    b some_date some
    '''
    df2 = pd.read_table(StringIO(txt), sep=r"\s+")

Merge and append (using a list comprehension)

    finaldf = pd.concat([pd.merge(base_df, df, how='left', on='Model')
                         for df in [df1, df2]],
                        ignore_index=True).drop_duplicates()
    print(finaldf)
    #   Model    Roadmap Family     Gate 0 Gate 1
    # 0     a  some_date   some  some_date   some
    # 1     b  some_date   some        NaN    NaN
    # 2     c  some_date   some        NaN    NaN
    # 3     d  some_date   some        NaN    NaN
    # 4     a  some_date   some        NaN    NaN
    # 5     b  some_date   some  some_date   some

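To integrate this into your current process, consider appending the individual models to a list so they can be concatenated and merged at the end. Build base_df as in the example you posted above.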

    ...
    dfList = []

    for g in gates:
        z = file.loc[file['Task'] == g, 'Complete'].iloc[0]
        v = ntpath.basename(a)
        v = v[5:-5]
        mod = pd.DataFrame([[v, z]], columns=['Model', g])
        models = pd.merge(models, mod, how='left', on='Model')

    dfList.append(models)

    finaldf = pd.merge(base_df, pd.concat(dfList), how='left', on='Model')
    finaldf.to_excel('Final_Dataset.xlsx')

I figured out how to do it. Let me know if you find any issues.

    import glob
    import pandas as pd
    import re
    import ntpath

    extension = 'xlsx'
    d = 'Final.xlsx'
    c = 'Roadmap.xlsx'
    dflist = []
    z = []
    result = [i for i in glob.glob('*.{}'.format(extension))]

    for a in result:
        if a == c:
            base_file = pd.read_excel(a, sheet_name='Antennas', header=7)
            ind1 = base_file.set_index('Model')
            ind1 = base_file[['Model', 'Roadmap', 'Family']]
            #print(ind1)
            ind1.to_excel('Final.xlsx')
        elif a != d:
            v = ntpath.basename(a)
            v = v[5:-5]
            gates = ['Gate 0 Complete', 'Gate 1 Complete', 'Gate 2 Complete']
            file1 = pd.read_excel('Final.xlsx')
            file1 = file1.replace(r'[,\"\']', '', regex=True).replace(r'\s*([^\s]+)\s*', r'\1', regex=True)
            #print(file1)
            file = pd.read_excel(a, sheet_name='Timeline')
            #print(file)
            models = pd.DataFrame([[v]], columns=['Model'])
            #print(models)
            for g in gates:
                z = file.loc[file['Task'] == g, 'Complete'].iloc[0]
                #print(z)
                #v = re.findall(r'Scrum(\w+)', a)
                #print(v)
                #df1=pd.DataFrame([[v,z]], columns = ['Model',g])
                mod = pd.DataFrame([[v, z]], columns=['Model', g])
                models = pd.merge(models, mod, how='left', on='Model')
                #print(models)
            dflist.append(models)
            #print(dflist)

    file1 = pd.merge(file1, pd.concat(dflist), how='left', on='Model')
    file1.to_excel('new.xlsx')
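As a possible simplification of the inner loop, the gate dates for one model can be collected in a plain dict and turned into a one-row frame, which avoids merging a single-cell frame per gate. This is only a sketch: `gate_row` is a hypothetical helper, and the small in-memory `timeline_a`/`timeline_b` frames stand in for the 'Timeline' sheets, assuming they have the 'Task' and 'Complete' columns used above.

```python
import pandas as pd

def gate_row(timeline, model, gates):
    """Hypothetical helper: one-row frame with the model name and one column per gate."""
    row = {"Model": model}
    for g in gates:
        # First 'Complete' value whose 'Task' equals the gate name, as in the loop above
        row[g] = timeline.loc[timeline["Task"] == g, "Complete"].iloc[0]
    return pd.DataFrame([row])

gates = ["Gate 0 Complete", "Gate 1 Complete"]

# Made-up stand-ins for two 'Timeline' sheets and the base sheet
timeline_a = pd.DataFrame({"Task": gates, "Complete": ["02/01/18", "03/01/18"]})
timeline_b = pd.DataFrame({"Task": gates, "Complete": ["04/23/18", "05/23/18"]})
base = pd.DataFrame({
    "Model": ["a", "b", "c", "d"],
    "Roadmap": ["08/12/17", "08/14/17", "08/15/17", "08/27/17"],
    "Family": ["ROW", "MACRO", "CONN", "MACRO"],
})

# One row per model, stacked once, then merged into the base once
rows = [gate_row(timeline_a, "a", gates), gate_row(timeline_b, "b", gates)]
final = base.merge(pd.concat(rows, ignore_index=True), how="left", on="Model")
print(final)
```

Models without a spreadsheet (c and d here) keep their base columns and get NaN in the gate columns.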

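I assume your raw data is as follows:

  1. Step 0, Part 1: df_base. You load df_base.
  2. Step 0, Part 2: df1, df2, etc. You load df1, df2 and so on, one df per sheet.

My approach would then be to perform the following steps (in order):

  1. Vertically concatenate the dfs of all sheets into a single df named df_sheets
  2. Merge df_base and df_sheets to get the desired output

Based on this, my approach is: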

    import pandas as pd

    # STEP 0.
    cv = ['a', 'b', 'c', 'd']
    nr = 4

    # STEP 0 - Part 1. Load Base DF
    cv = cv[:nr]
    df_base = pd.DataFrame(list(zip(*[cv, ['some_date']*nr, ['some']*nr])),
                           columns=['Model', 'Roadmap', 'Family'])

    # STEP 0 - Part 2. Load Sheets DataFrames
    df_sheets = []
    for alph in cv:
        df_sheet = pd.DataFrame(list(zip(*[[alph]*nr, ['some_date_'+alph]*nr, ['some_'+alph]*nr])),
                                columns=['Model', 'Gate0', 'Gate1'])
        df_sheets.append(df_sheet)
    print('Base DF:\n{}'.format(df_base))

    # STEP 1. Vertically concatenate all sheets DataFrames together
    df_sheets = pd.concat(df_sheets, axis=0).reset_index(drop=True)
    print('\nDataFrames for all sheets (vertically concatenated into single DataFrame):\n{}'.format(df_sheets))

    # STEP 2. base INNER JOIN sheets USING ('Model')
    ndf = df_base.merge(df_sheets, on='Model', how='inner')
    print('\nOutput DataFrame:\n{}'.format(ndf))

The output is:

    Base DF:
      Model    Roadmap Family
    0     a  some_date   some
    1     b  some_date   some
    2     c  some_date   some
    3     d  some_date   some

    DataFrames for all sheets (vertically concatenated into single DataFrame):
       Model        Gate0   Gate1
    0      a  some_date_a  some_a
    1      a  some_date_a  some_a
    2      a  some_date_a  some_a
    3      a  some_date_a  some_a
    4      b  some_date_b  some_b
    5      b  some_date_b  some_b
    6      b  some_date_b  some_b
    7      b  some_date_b  some_b
    8      c  some_date_c  some_c
    9      c  some_date_c  some_c
    10     c  some_date_c  some_c
    11     c  some_date_c  some_c
    12     d  some_date_d  some_d
    13     d  some_date_d  some_d
    14     d  some_date_d  some_d
    15     d  some_date_d  some_d

    Output DataFrame:
       Model    Roadmap Family        Gate0   Gate1
    0      a  some_date   some  some_date_a  some_a
    1      a  some_date   some  some_date_a  some_a
    2      a  some_date   some  some_date_a  some_a
    3      a  some_date   some  some_date_a  some_a
    4      b  some_date   some  some_date_b  some_b
    5      b  some_date   some  some_date_b  some_b
    6      b  some_date   some  some_date_b  some_b
    7      b  some_date   some  some_date_b  some_b
    8      c  some_date   some  some_date_c  some_c
    9      c  some_date   some  some_date_c  some_c
    10     c  some_date   some  some_date_c  some_c
    11     c  some_date   some  some_date_c  some_c
    12     d  some_date   some  some_date_d  some_d
    13     d  some_date   some  some_date_d  some_d
    14     d  some_date   some  some_date_d  some_d
    15     d  some_date   some  some_date_d  some_d

Is this what you are after?
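One caveat, assuming (as in the question) that only some models have their own sheet: an inner join drops the base rows with no match, while a left join keeps them with NaN in the gate columns. A minimal sketch with made-up values, using one row per model in df_sheets:

```python
import pandas as pd

df_base = pd.DataFrame({
    "Model": ["a", "b", "c", "d"],
    "Roadmap": ["some_date"] * 4,
    "Family": ["some"] * 4,
})
# Only models a and b have sheet data in this illustration
df_sheets = pd.DataFrame({
    "Model": ["a", "b"],
    "Gate0": ["some_date_a", "some_date_b"],
    "Gate1": ["some_a", "some_b"],
})

inner = df_base.merge(df_sheets, on="Model", how="inner")
left = df_base.merge(df_sheets, on="Model", how="left")
print(inner["Model"].tolist())  # ['a', 'b']
print(left["Model"].tolist())   # ['a', 'b', 'c', 'd']
```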