Compare two Excel files with different headers but the same row data using Pandas DataFrames

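I am trying to compare two Excel files. server_report has 42 columns and email_report has 19 columns (5 of which have no match at all in server_report). 14 columns in each report match, but under different headers. When I open the two files, I sort each on three columns so that the rows line up: "Delivery", "Pick Quantity", "Batch" for server_report, and "Delivery", "Purchase Quantity", "Batch Pick" for email_report.

What I need is to compare the sorted email_report against server_report (each file has the same number of rows and can be indexed on the "Delivery" column). If information is "missing" in server_report, it needs to be filled in with the information taken from email_report.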

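After that, two new files need to be generated:

A new server_report containing all 42 original columns, with the changes from email_report applied.
A new file containing the changes made during the comparison.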

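My question is the title of this post: how can I compare two files that have different columns/headers (not all of which can be mapped to the other file)?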

In this solution, I assume that the two reports have the same number of rows and the same index.
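One way to make that assumption hold is to sort each report on its own key columns and then reset the index, as the question describes doing by hand. The following is only a sketch: the helper name `align_reports` and the miniature reports are hypothetical, and the column names are placeholders for the real headers.

```python
import pandas as pd

def align_reports(server, email, server_keys, email_keys):
    """Sort each report on its own key columns and reset the index
    so that row i of one report corresponds to row i of the other."""
    server_sorted = server.sort_values(server_keys).reset_index(drop=True)
    email_sorted = email.sort_values(email_keys).reset_index(drop=True)
    if len(server_sorted) != len(email_sorted):
        raise ValueError("reports have different numbers of rows")
    return server_sorted, email_sorted

# Hypothetical miniature reports with differently named key columns
server = pd.DataFrame({"Delivery": [3, 1, 2], "Batch": ["c", "a", "b"]})
email = pd.DataFrame({"Delivery": [2, 3, 1], "Batch Pick": ["b", "c", "a"]})

server_sorted, email_sorted = align_reports(
    server, email, ["Delivery", "Batch"], ["Delivery", "Batch Pick"]
)
print(server_sorted)
print(email_sorted)
```

This only lines rows up by position. It does not check that the key values themselves agree between the two reports.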

    import copy
    import pandas as pd
    import numpy as np

    # Example reports
    email_report = pd.DataFrame({'a': np.arange(6),
                                 'b': np.arange(6),
                                 'c': ['A', 'B', 'C', 'D', 'E', 'F'],
                                 'extra_col': np.zeros(6)})
    print(email_report)
    #    a  b  c  extra_col
    # 0  0  0  A        0.0
    # 1  1  1  B        0.0
    # 2  2  2  C        0.0
    # 3  3  3  D        0.0
    # 4  4  4  E        0.0
    # 5  5  5  F        0.0

    # np.random.randint excludes the upper bound, so (0, 6) draws from 0..5
    # (it replaces the deprecated np.random.random_integers(0, 5, 5))
    server_report = pd.DataFrame({'d': np.append(np.random.randint(0, 6, 5), 5),
                                  'e': np.append(np.random.randint(0, 6, 5), 5),
                                  'f': ['A', 'B', 'C', 'D', 'E', 'F'],
                                  'g': ['a', 'b', 'c', 'd', 'e', 'f']})
    print(server_report)
    # Output from one run (columns d and e are random and will differ):
    #    d  e  f  g
    # 0  0  2  A  a
    # 1  1  0  B  b
    # 2  3  3  C  c
    # 3  1  3  D  d
    # 4  5  4  E  e
    # 5  5  5  F  f

    # Identify the columns of interest in both reports and map between them
    col_map = {'d': 'a', 'e': 'b', 'f': 'c'}  # server_report column -> email_report column
    server_report_cols = ['d', 'e', 'f']
    email_report_cols = [col_map[c] for c in server_report_cols]

Now we run the diff.

    # Run the diff report
    def report_diff(x):
        # Return the value unchanged if both sides agree, else "old ---> new"
        return x[0] if x[0] == x[1] else '{} ---> {}'.format(*x)

    def make_diff(df1, df2, df1_cols, df2_cols):
        """
        I am assuming that df1_cols and df2_cols are both sorted in the same order.
        """
        temp = pd.concat([df1[df1_cols], df2[df2_cols]], axis=1)
        diff_rows = []
        for _, row in temp.iterrows():
            diff_rows.append([report_diff((row[a], row[b]))
                              for a, b in zip(df1_cols, df2_cols)])
        diff = copy.deepcopy(df2)
        diff[df2_cols] = pd.DataFrame(diff_rows, columns=df2_cols, index=df2.index)
        return diff

    diff = make_diff(email_report, server_report, email_report_cols, server_report_cols)
    print(diff)
    # Output from one run (server_report is random, so the numbers will differ):
    #           d         e  f  g
    # 0  0 ---> 2  0 ---> 5  A  a
    # 1  1 ---> 0  1 ---> 4  B  b
    # 2  2 ---> 1  2 ---> 0  C  c
    # 3  3 ---> 0  3 ---> 2  D  d
    # 4  4 ---> 5         4  E  e
    # 5         5         5  F  f
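For larger reports, the row-by-row loop can be replaced with whole-column operations. The version below is a sketch of that idea rather than a drop-in replacement: the function name `vectorized_diff` and the sample frames are hypothetical, `col_map` maps server column to email column as above, and missing values (NaN) always count as "changed" because NaN never compares equal to itself.

```python
import pandas as pd

def vectorized_diff(email, server, col_map):
    """Return (diff frame with server headers, boolean Series of changed rows)."""
    server_cols = list(col_map)
    # Take the mapped email columns and give them the server headers
    email_part = email[[col_map[c] for c in server_cols]].copy()
    email_part.columns = server_cols
    server_part = server[server_cols]

    changed = email_part.ne(server_part)  # True where the two reports disagree
    arrows = email_part.astype(str) + " ---> " + server_part.astype(str)

    out = server.copy()
    # Keep the server value where both agree, otherwise show "email ---> server"
    out[server_cols] = server_part.astype(object).where(~changed, arrows)
    return out, changed.any(axis=1)

# Hypothetical sample data
email = pd.DataFrame({"a": [0, 1, 2], "c": ["A", "B", "C"]})
server = pd.DataFrame({"d": [0, 5, 2], "f": ["A", "B", "X"], "g": ["x", "y", "z"]})

diff_frame, rows_changed = vectorized_diff(email, server, {"d": "a", "f": "c"})
print(diff_frame)
print(rows_changed)
```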

And create the two output files.

    # Get a boolean series indicating which rows will be changed
    change_check = ~(email_report[email_report_cols] ==
                     server_report[server_report_cols].rename(columns=col_map)).all(axis=1)

    # Output_1: Corrected "server report"
    server_report[server_report_cols] = email_report[email_report_cols]  # Overwrite the server report
    server_report.to_excel('my-diff-sap.xlsx', index=False)

    # Output_2: Record of corrections using the headers from the server report
    diff[change_check].to_excel('changed_rows.xlsx', index=False)

    print(server_report)
    #    d  e  f  g
    # 0  0  0  A  a
    # 1  1  1  B  b
    # 2  2  2  C  c
    # 3  3  3  D  d
    # 4  4  4  E  e
    # 5  5  5  F  f

    print(diff[change_check])
    # Output from one run (server_report is random, so the numbers will differ):
    #           d         e  f  g
    # 0  0 ---> 2  0 ---> 1  A  a
    # 1  1 ---> 0  1 ---> 5  B  b
    # 2  2 ---> 0         2  C  c
    # 3  3 ---> 5  3 ---> 4  D  d
    # 4  4 ---> 1         4  E  e
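The code above overwrites every mapped column in server_report. If only the "missing" cells should be filled, as the question asks, something along these lines may be closer. This is a sketch that assumes "missing" means empty cells that pandas reads as NaN; the function name `fill_missing` and the sample frames are hypothetical.

```python
import numpy as np
import pandas as pd

def fill_missing(server, email, col_map):
    """Fill NaN cells in the mapped server columns from the email report.

    col_map maps server column -> email column. Returns the filled server
    report and a frame holding only the values that were filled in.
    """
    server_cols = list(col_map)
    email_part = email[[col_map[c] for c in server_cols]].copy()
    email_part.columns = server_cols  # use the server headers

    # Cells that are empty in the server report but present in the email report
    missing = server[server_cols].isna() & email_part.notna()

    filled = server.copy()
    filled[server_cols] = server[server_cols].fillna(email_part)

    # Record of changes: only rows where something was filled, other cells NaN
    changes = email_part.where(missing).loc[missing.any(axis=1)]
    return filled, changes

# Hypothetical sample data: row 1 of the server report has gaps
server = pd.DataFrame({"d": [1.0, np.nan, 3.0],
                       "f": ["A", None, "C"],
                       "g": ["x", "y", "z"]})
email = pd.DataFrame({"a": [1, 2, 9], "c": ["A", "B", "C"]})

filled, changes = fill_missing(server, email, {"d": "a", "f": "c"})
print(filled)
print(changes)
```

Values that exist in both reports but disagree (row 2, column d here) are left as they are in the server report. Both returned frames can be written out with `to_excel` in the same way as above.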
