Compare 2 Excel files in Python, keeping one column of one sheet fixed and matching it against the same column in the other file

We have 2 Excel files, one with 7.5k records and the other with 7k records. We need to compare one sheet against the other while keeping one particular column fixed as the key.

For example, sheet 1:

| Emp_ID | Name | Phone | Address |
|--------|------|-------|---------|
| 1      | A    | 123   | ABC     |
| 2      | B    | 456   | CBD     |
| 3      | C    | 789   | S       |

And for example, sheet 2:

| Emp_ID | Name | Phone | Address |
|--------|------|-------|---------|
| 1      | A    | 123   | ABC     |
| 3      | C    | 789   | S       |

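When the Python script runs, it should do the comparison on the basis of Emp_ID and report Emp_ID = 2 as missing, with Emp_ID passed in as a parameter. I am trying to use the xlrd module, but it only compares cell by cell at the same position; it does not hold one column fixed and then compare the matching row with the other Excel file.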

    import xlrd
    from itertools import zip_longest

    def compareexcel(oldSheet, newSheet):
        rowb2 = xlrd.open_workbook(oldSheet)
        rowb1 = xlrd.open_workbook(newSheet)
        sheet1 = rowb1.sheet_by_index(0)
        sheet2 = rowb2.sheet_by_index(0)
        # Rows are paired by position only, not by Emp_ID
        for rownum in range(max(sheet1.nrows, sheet2.nrows)):
            if rownum < sheet1.nrows and rownum < sheet2.nrows:
                row_rb1 = sheet1.row_values(rownum)
                row_rb2 = sheet2.row_values(rownum)
                for colnum, (c1, c2) in enumerate(zip_longest(row_rb1, row_rb2)):
                    if c1 != c2:
                        print("Row {} Col {} - {} != {}".format(rownum+1, colnum+1, c1, c2))

I have written a function that searches for a column value in the other sheet, and I do the comparison on that basis:

    import xlrd
    from itertools import zip_longest

    def search(sheet2, s):
        # Return (row, 0) of the first match in column 0, or (9, 9) if not found
        for row in range(sheet2.nrows):
            if s == sheet2.cell(row, 0).value:
                return (row, 0)
        return (9, 9)

    def compare(oldPerPaxSheet, newPerPaxSheet):
        rb1 = xlrd.open_workbook(oldPerPaxSheet)
        rb2 = xlrd.open_workbook(newPerPaxSheet)
        sheet1 = rb1.sheet_by_index(0)
        sheet2 = rb2.sheet_by_index(0)
        for rownum in range(max(sheet1.nrows, sheet2.nrows)):
            if rownum < sheet1.nrows:
                row_rb1 = sheet1.row_values(rownum)
                print("row_rb1 : ", row_rb1)
                search_str = sheet1.cell(rownum, 0).value
                r, c = search(sheet2, search_str)
                if c != 9:
                    row_rb2 = sheet2.row_values(r)
                    for colnum, (c1, c2) in enumerate(zip_longest(row_rb1, row_rb2)):
                        if c1 != c2:
                            print("Row {} Col {} - {} != {}".format(rownum+1, colnum+1, c1, c2))
                else:
                    print("Row does not exist in the other sheet")
            else:
                print("Row {} missing".format(rownum+1))
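The `search` function above rescans the second sheet for every row of the first. A minimal sketch of the same keyed comparison with a dict lookup instead, assuming rows are plain lists (the shape `sheet.row_values` returns) and that key values are unique within each sheet. `index_rows` and `compare_rows` are hypothetical helper names, not part of xlrd.

```python
from itertools import zip_longest


def index_rows(rows, key_col=0):
    """Map each row's key-column value to the full row (assumes unique keys)."""
    return {row[key_col]: row for row in rows}


def compare_rows(old_rows, new_rows, key_col=0):
    """Return (missing_in_new, missing_in_old, cell_diffs), matching rows on key_col."""
    old_by_key = index_rows(old_rows, key_col)
    new_by_key = index_rows(new_rows, key_col)
    missing_in_new = sorted(set(old_by_key) - set(new_by_key))
    missing_in_old = sorted(set(new_by_key) - set(old_by_key))
    cell_diffs = []
    # Only keys present on both sides can be compared cell by cell
    for key in sorted(set(old_by_key) & set(new_by_key)):
        pairs = zip_longest(old_by_key[key], new_by_key[key])
        for colnum, (c1, c2) in enumerate(pairs):
            if c1 != c2:
                cell_diffs.append((key, colnum, c1, c2))
    return missing_in_new, missing_in_old, cell_diffs


# Data rows from the question's two sheets (header row left out)
sheet1_rows = [[1, "A", 123, "ABC"], [2, "B", 456, "CBD"], [3, "C", 789, "S"]]
sheet2_rows = [[1, "A", 123, "ABC"], [3, "C", 789, "S"]]

result = compare_rows(sheet1_rows, sheet2_rows, key_col=0)
print(result)  # ([2], [], [])
```

Passing `key_col` makes the fixed column a parameter, so the same function works if the key is not in the first column.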

You can do this quite easily with pandas.read_excel.

I would create 2 DataFrames, using Emp_ID as the index:

    import pandas as pd

    # excel_filename, old_sheet and new_sheet stand for your own file and sheet names
    sheets = pd.read_excel(excel_filename, sheet_name=[old_sheet, new_sheet], index_col=0)
    sheet1 = sheets[old_sheet]
    sheet2 = sheets[new_sheet]

I added some rows so that the differences are more explicit.

Sheet1

            Name  Phone Address
    Emp_ID
    1          A    123     ABC
    2          B    456     CBD
    3          C    789       S
    5          A    123     ABC

Sheet2

            Name  Phone Address
    Emp_ID
    1          A    123     ABC
    3          C    789       S
    4          D     12       A
    5          E    123     ABC

Computing the missing Emp_IDs then becomes very simple:

    missing_in_1 = set(sheet2.index) - set(sheet1.index)
    missing_in_2 = set(sheet1.index) - set(sheet2.index)

`missing_in_1, missing_in_2`

    ({4}, {2})

So sheet1 lacks Emp_ID 4, which is present in sheet2, and sheet2 lacks Emp_ID 2, as expected.

Then, to look for differences, we do an inner join on the two sheets:
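The same missing-key check can also be written as an outer join, which labels each key with the side it came from. This is a sketch of an alternative to the set arithmetic, not a replacement for it; the two frames are rebuilt inline from the sample data so the block runs without an Excel file, and `presence`, `only_in_1` and `only_in_2` are illustrative names.

```python
import pandas as pd

# Sample frames matching Sheet1 and Sheet2 above, indexed by Emp_ID
sheet1 = pd.DataFrame(
    {"Name": ["A", "B", "C", "A"], "Phone": [123, 456, 789, 123],
     "Address": ["ABC", "CBD", "S", "ABC"]},
    index=pd.Index([1, 2, 3, 5], name="Emp_ID"),
)
sheet2 = pd.DataFrame(
    {"Name": ["A", "C", "D", "E"], "Phone": [123, 789, 12, 123],
     "Address": ["ABC", "S", "A", "ABC"]},
    index=pd.Index([1, 3, 4, 5], name="Emp_ID"),
)

# Outer-join only the keys; the _merge column records which side had each key
presence = pd.merge(
    sheet1.index.to_frame(index=False),
    sheet2.index.to_frame(index=False),
    on="Emp_ID", how="outer", indicator=True,
)
only_in_1 = presence.loc[presence["_merge"] == "left_only", "Emp_ID"].tolist()
only_in_2 = presence.loc[presence["_merge"] == "right_only", "Emp_ID"].tolist()
print(only_in_1, only_in_2)  # [2] [4]
```

Here `only_in_1` corresponds to `missing_in_2` above (keys that sheet2 lacks), and `only_in_2` to `missing_in_1`.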

    combined = pd.merge(sheet1, sheet2, left_index=True, right_index=True, suffixes=('_1', '_2'))

`combined`

           Name_1  Phone_1 Address_1 Name_2  Phone_2 Address_2
    Emp_ID
    1           A      123       ABC      A      123       ABC
    3           C      789         S      C      789         S
    5           A      123       ABC      E      123       ABC

Then iterate over the columns of sheet1 to find the differences and save them in a dict:

    differences = {}
    for column in sheet1.columns:
        diff = combined[column+'_1'] != combined[column+'_2']
        if diff.any():
            differences[column] = list(combined[diff].index)

`differences`

    {'Name': [5]}

If you want the full differing rows rather than just their Emp_IDs, you can change the last line to `differences[column] = combined[diff]`.

`differences`

    {'Name':        Name_1  Phone_1 Address_1 Name_2  Phone_2 Address_2
     Emp_ID
     5           A      123       ABC      E      123       ABC}
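Since the question asks for the key column to be passed as a parameter, the steps above could be wrapped into one function. This is a minimal sketch under the assumption that both frames contain the key column and that its values are unique; `compare_frames` is a hypothetical name, and the sample frames are built inline so the block runs without an Excel file.

```python
import pandas as pd


def compare_frames(old, new, key="Emp_ID"):
    """Match rows of two DataFrames on `key`, then report missing keys and differing columns."""
    old = old.set_index(key)
    new = new.set_index(key)
    missing_in_new = old.index.difference(new.index).tolist()
    missing_in_old = new.index.difference(old.index).tolist()
    # Inner join keeps only the keys present in both frames
    combined = pd.merge(old, new, left_index=True, right_index=True,
                        suffixes=("_1", "_2"))
    differences = {}
    for column in old.columns:
        if column not in new.columns:
            continue  # suffixes are only added to columns present on both sides
        diff = combined[column + "_1"] != combined[column + "_2"]
        if diff.any():
            differences[column] = combined[diff].index.tolist()
    return missing_in_new, missing_in_old, differences


# Sample data from the answer, with Emp_ID as an ordinary column
old = pd.DataFrame({"Emp_ID": [1, 2, 3, 5], "Name": ["A", "B", "C", "A"],
                    "Phone": [123, 456, 789, 123],
                    "Address": ["ABC", "CBD", "S", "ABC"]})
new = pd.DataFrame({"Emp_ID": [1, 3, 4, 5], "Name": ["A", "C", "D", "E"],
                    "Phone": [123, 789, 12, 123],
                    "Address": ["ABC", "S", "A", "ABC"]})

report = compare_frames(old, new, key="Emp_ID")
print(report)  # ([2], [4], {'Name': [5]})
```

With two real workbooks, each frame would come from its own `pd.read_excel` call. One caveat of the `!=` comparison: NaN never equals NaN, so a cell that is empty in both sheets is reported as a difference unless the frames are filled first (for example with `fillna`).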