Replicating Excel's INDEX/MATCH in Python with pandas

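I have an Excel spreadsheet that I update regularly (2-3 times a day). The update requires running an INDEX/MATCH to pull values from a table in another spreadsheet and write them into a column in the first one. The values overwrite the old ones rather than creating a new column.

I would like to automate this process using pandas (and xlwings to write the data to the spreadsheet, but I have no problems with that part). The first step is to replicate Excel's INDEX(MATCH()) in pandas. In general, the function should: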

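  • Take as arguments the string headers of the column to read from, the column to write to, and the columns holding the values used to match the read and write columns

  • Iterate over the write column; at each iteration, search the read column for the value whose match-column value equals the match-column value of the current write row

  • If there is no matching value, write NaN or "#N/A" to the dataframe (it is important to distinguish between 0 and no match)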

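I expected pandas to have a native vlookup / index-match function, but the only things I could find were about joining or merging dataframes, which is not what I want to do: I want to overwrite individual values in the dataframe, in arbitrary index order.

I have managed to get this working with a very ugly script-specific function, but I figured it would be useful to generalize the function for other uses. After some cleanup and rewriting, I have the following: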

    ## Index Match in Python with pandas
    # Remember that dataframes start at 0, Excel starts at 1
    # This only works if both DFs have the same indices (integers, strings, whatever)

    import numpy as np
    import pandas as pd

    # sample dataframes
    d = {'Match Column': [0., 1., 2., 3., 4., 7., 'string'],
         'Read Column': ['zero', 'one', 'two', 'three', 'four', 'seven', 'string']}
    dfRead = pd.DataFrame(d)
    d2 = {'Match Column': [0., 1., 2., 3., 4., 5., 6., 7., '8'],
          'Write Column': [0, 0, 0, 0, 0, 0, 0, 0, '0']}
    dfWrite = pd.DataFrame(d2)

    # test arguments
    ReadColumn = 'Read Column'
    WriteColumn = 'Write Column'
    ReadMatchColumn = 'Match Column'
    WriteMatchColumn = 'Match Column'

    def indexmatch(dfRead, dfWrite, ReadColumn, WriteColumn,
                   ReadMatchColumn, WriteMatchColumn, skiprows=0):
        # convert the string inputs to a column number for each dataframe
        RCNum = np.where(dfRead.columns == ReadColumn)[0][0]
        WCNum = np.where(dfWrite.columns == WriteColumn)[0][0]
        RMCNum = np.where(dfRead.columns == ReadMatchColumn)[0][0]
        WMCNum = np.where(dfWrite.columns == WriteMatchColumn)[0][0]

        for i in range(skiprows, len(dfWrite.index), 1):
            # the value we're using to match the columns
            match = dfWrite.loc[dfWrite.index[i]].iloc[WMCNum]
            try:
                matchind = dfRead.index[np.where(dfRead[ReadMatchColumn] == match)[0][0]]
                # replaces DF NaN values with Excel's #N/A, optional method
                value = dfRead.fillna('#N/A').loc[matchind].iloc[RCNum]
                dfWrite.at[dfWrite.index[i], WriteColumn] = value
            except (KeyError, IndexError):
                # if there is no match, write NaN to the 'cell'
                dfWrite.at[dfWrite.index[i], WriteColumn] = np.nan

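This works, but it is not pretty, and it does not work when you want to match a column against another dataframe's index (for example, matching a dataframe against a pivot-table dataframe).

Is there a more robust and concise way to do this?

As requested, expected input and output: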

    In [2]: dfRead
    Out[2]:
      Match Column Read Column
    0            0        zero
    1            1         one
    2            2         two
    3            3       three
    4            4        four
    5            7       seven
    6       string      string

    In [3]: dfWrite
    Out[3]:
      Match Column Write Column
    0            0            0
    1            1            0
    2            2            0
    3            3            0
    4            4            0
    5            5            0
    6            6            0
    7            7            0
    8            8            0

    In [4]: indexmatch(dfRead, dfWrite, 'Read Column', 'Write Column', 'Match Column', 'Match Column')

    In [5]: dfWrite
    Out[5]:
      Match Column Write Column
    0            0         zero
    1            1          one
    2            2          two
    3            3        three
    4            4         four
    5            5          NaN
    6            6          NaN
    7            7        seven
    8            8          NaN

pd.Series.map accepts a Series as its argument and treats it like a dictionary whose keys are that Series' index.

Applied here, it looks like

    dfWrite['Write Column'] = dfWrite['Match Column'].map(dfRead.set_index('Match Column')['Read Column'])

    dfWrite
    Out[409]:
      Match Column Write Column
    0            0         zero
    1            1          one
    2            2          two
    3            3        three
    4            4         four
    5            5          NaN
    6            6          NaN
    7            7        seven
    8            8          NaN

This gives the same output as

    indexmatch(dfRead, dfWrite, 'Read Column', 'Write Column', 'Match Column', 'Match Column')

    dfWrite
    Out[413]:
      Match Column Write Column
    0            0         zero
    1            1          one
    2            2          two
    3            3        three
    4            4         four
    5            5          NaN
    6            6          NaN
    7            7        seven
    8            8          NaN
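If the lookup needs to be reused, the map call can be wrapped in a small function. What follows is only a minimal sketch, not something from the original posts: the name index_match and its parameter names are hypothetical. Dropping duplicate keys with keep='first' is an assumption meant to mimic Excel's exact-match MATCH, which returns the first hit; without that step, map raises an error when the lookup Series has a non-unique index.

```python
import pandas as pd


def index_match(df_read, df_write, read_col, write_col, read_key, write_key):
    """Hypothetical helper: return a copy of df_write with write_col
    overwritten by values looked up from df_read."""
    # Build the lookup Series: index = match keys, values = column to read.
    # keep="first" is an assumption that mirrors Excel's first-match behaviour.
    lookup = (
        df_read.drop_duplicates(subset=read_key, keep="first")
        .set_index(read_key)[read_col]
    )
    out = df_write.copy()
    # Keys that are absent from the lookup come back as NaN.
    out[write_col] = out[write_key].map(lookup)
    return out


# Sample frames taken from the question.
df_read = pd.DataFrame({
    "Match Column": [0.0, 1.0, 2.0, 3.0, 4.0, 7.0, "string"],
    "Read Column": ["zero", "one", "two", "three", "four", "seven", "string"],
})
df_write = pd.DataFrame({
    "Match Column": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, "8"],
    "Write Column": [0, 0, 0, 0, 0, 0, 0, 0, "0"],
})

result = index_match(
    df_read, df_write,
    "Read Column", "Write Column",
    "Match Column", "Match Column",
)
print(result)
```

Returning a copy rather than mutating df_write in place is a design choice of this sketch; assign the result back if the in-place behaviour of the original function is wanted.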

To match on dfRead's index, skip the .set_index(...) step. To match on dfWrite's index, replace dfWrite['Match Column'].map with dfWrite.index.to_series().map.
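The two index variants can be sketched as follows. The frames here are small made-up stand-ins (not the data from the question), with the match keys stored in the index, as they would be in a pivot-table result.

```python
import pandas as pd

# Illustrative lookup table whose keys live in the index.
read = pd.DataFrame(
    {"Read Column": ["zero", "one", "seven"]},
    index=pd.Index([0, 1, 7], name="Match Column"),
)

# Variant 1: the key is a column of the write frame and the index of the
# read frame, so no set_index step is needed.
write_by_col = pd.DataFrame({"Match Column": [0, 1, 5, 7], "Write Column": 0})
write_by_col["Write Column"] = write_by_col["Match Column"].map(read["Read Column"])

# Variant 2: the key is the index of the write frame as well, so the index
# is turned into a Series before mapping.
write_by_idx = pd.DataFrame(
    {"Write Column": 0},
    index=pd.Index([0, 1, 5, 7], name="Match Column"),
)
write_by_idx["Write Column"] = write_by_idx.index.to_series().map(read["Read Column"])

print(write_by_col)
print(write_by_idx)
```

In both variants the key 5 has no counterpart in the lookup, so its cell ends up as NaN.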

You can also use the merge function:

    dfWrite = pd.merge(left=dfWrite.loc[:, ['Match Column']], right=dfRead, on='Match Column', how='left')
    dfWrite.rename(columns={'Read Column': 'Write Column'}, inplace=True)
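Since the question stresses telling "no match" apart from an ordinary value, one possible extension of the merge approach is the indicator=True option, which adds a _merge column recording whether each row found a partner. This is a sketch under assumptions: the frames are small made-up examples, and writing "#N/A" for unmatched rows is just one convention.

```python
import pandas as pd

# Illustrative data: key 7 exists in the lookup but its value is missing,
# key 5 does not exist in the lookup at all.
df_read = pd.DataFrame({
    "Match Column": [0, 1, 7],
    "Read Column": ["zero", "one", None],
})
df_write = pd.DataFrame({"Match Column": [0, 1, 5, 7], "Write Column": 0})

merged = pd.merge(
    df_write[["Match Column"]],
    df_read,
    on="Match Column",
    how="left",
    indicator=True,  # adds a "_merge" column: "both" or "left_only"
).rename(columns={"Read Column": "Write Column"})

# True where the key was found in df_read, even if the value there is missing.
matched = merged["_merge"] == "both"

# Mark only genuine non-matches, then drop the helper column.
merged.loc[~matched, "Write Column"] = "#N/A"
merged = merged.drop(columns="_merge")

print(merged)
```

With this, the row for key 5 carries "#N/A", while the row for key 7 keeps its missing value, so the two cases stay distinguishable.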