pandas: extract columns from multiple DataFrames into a new DataFrame based on a common column name

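I imported 4 datasets from Excel, containing the total_budget for the academic years 2013, 2014, 2015 and 2016. All datasets share one common column, the ID code of each school (column LAESTAB).

I want a new dataset with the common column LAESTAB on the left (the values are the same across the 4 datasets), followed by total2013, total2014, total2015 and total2016 (each from a different dataset).

I also want to get rid of the rest of the data, including school IDs that are not present in all datasets.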

I'll try to elaborate with an example.

Here is an example of one of the Excel datasets:

    >>> print cuts2016.head()
       LA_code local_authority_name    UPIN  URN  LAESTAB  \
    0      201       City of London  500000  0.0  2013614
    1      202               Camden  500005  0.0  2022095
    2      202               Camden  500007  0.0  2022219
    3      202               Camden  500012  0.0  2022502
    4      202               Camden  500014  0.0  2022603

                                     School Name Academy?    Phase Provider Type  \
    0  Sir John Cass's Foundation Primary School       No  Primary        School
    1                     Carlton Primary School       No  Primary        School
    2                       Fleet Primary School       No  Primary        School
    3                        Rhyl Primary School       No  Primary        School
    4                    Torriano Primary School       No  Primary        School

       MFG protection (+ve) or capping/scaling (-ve)  total2016  \
    0                                          35000    1659000
    1                                          68000    1956000
    2                                         -10000    1059000
    3                                          97000    2234000
    4                                              0    2284000

Another Excel dataset, for 2015:

    >>> print cuts2015.head()
       LA_code local_authority_name  UPIN     URN  LAESTAB  \
    0      201       City of London   NaN  100000  2013614
    1      202               Camden   NaN  100008  2022019
    2      202               Camden   NaN  100009  2022036
    3      202               Camden   NaN  100010  2022065
    4      202               Camden   NaN  100011  2022078

                                     school_name    Phase Provider Type  \
    0  Sir John Cass's Foundation Primary School  Primary        School
    1                      Argyle Primary School  Primary        School
    2                    Beckford Primary School  Primary        School
    3                   Brecknock Primary School  Primary        School
    4                  Brookfield Primary School  Primary        School

      Basic Entitlement Total Funding Deprivation Total Funding total_pre_MFG  \
    0                       1,206,000                   215,000     1,644,000
    1                       1,333,000                   367,000     2,068,000
    2                       1,482,000                   359,000     2,221,000
    3                       1,234,000                   348,000     1,974,000
    4                       1,436,000                   256,000     2,028,000

      MFG protection (+ve) or capping/scaling (-ve)  total2015  \
    0                                             0    1644000
    1                                        25,000    2093000
    2                                             0    2221000
    3                                        72,000    2046000
    4                                       -58,000    1970000

The final result I need looks like this (it should also include total2014 and total2013):

    LAESTAB  total2016  total2015  etc...\
    2013614    1956000    1644000
    2022019    1059000    2093000
    2022036    2234000    2221000
    2022065    2284000    1970000
    ...

I have tried `reduce` as shown below, but it returns 0 rows × 66 columns.

    dataframe_list = [cuts2013, cuts2014, cuts2015, cuts2016]
    df_final = reduce(lambda left,right: pd.merge(left,right,on='LAESTAB'), dataframe_list)

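One way is to use merge, as Mainul Islam pointed out. With that approach you need 3 merge operations to combine the 4 DataFrames. Alternatively, you can concatenate all 4 DataFrames and then do a groupby: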

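    dataframe_list = [cuts2013, cuts2014, cuts2015, cuts2016]
    # Stack all four frames; each row has a value in only one totalYYYY column.
    total = pd.concat(dataframe_list)
    # Select the columns with a list (double brackets), then sum per school.
    total = total.groupby('LAESTAB')[['total2013', 'total2014', 'total2015', 'total2016']].sum().reset_index()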
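Note that a plain `sum()` reports 0 for a year in which a school has no row, so schools missing from some datasets are not dropped. A minimal sketch of one way to drop them, assuming each LAESTAB appears at most once per dataset; the two small frames below are hypothetical stand-ins for the real `cuts2015` and `cuts2016`, and only two years are shown to keep it short:

```python
import pandas as pd

# Hypothetical miniature stand-ins for the real Excel imports.
cuts2015 = pd.DataFrame({
    "LAESTAB": [2013614, 2022019, 2022036],
    "school_name": ["A", "B", "C"],
    "total2015": [1644000, 2093000, 2221000],
})
cuts2016 = pd.DataFrame({
    "LAESTAB": [2013614, 2022019, 2022095],
    "School Name": ["A", "B", "D"],
    "total2016": [1659000, 1956000, 1059000],
})

total_cols = ["total2015", "total2016"]

# Stack the frames; columns absent from a frame are filled with NaN.
stacked = pd.concat([cuts2015, cuts2016], sort=False)

# min_count=1 keeps NaN (instead of 0) when a school has no value for a year.
totals = stacked.groupby("LAESTAB")[total_cols].sum(min_count=1)

# Keep only schools that have a total for every year, then restore integers.
result = totals.dropna().astype("int64").reset_index()
print(result)
```

If a school ID can appear more than once within a single year, the totals for that year would be added together, so it may be worth checking for duplicates first.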

Merge the DataFrames on the LAESTAB column, SQL style, then drop the columns you don't need from data_merged.

    import pandas as pd

    # how='inner' is the default, so only LAESTAB values present in both frames are kept.
    data_merged = pd.merge(cuts2016, cuts2015, on="LAESTAB")
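The same idea extends to all four years by trimming each frame to LAESTAB plus its total column before merging, which also avoids the pile of duplicated, suffixed columns. The sketch below additionally converts the key to integers in every frame. This rests on an assumption: a key read as text in one file and as a number in another is one possible cause of a merge returning 0 rows, but the question does not confirm that this is what happened. The frames and the `key_and_total` helper are hypothetical, and only three years are shown:

```python
from functools import reduce

import pandas as pd

# Hypothetical stand-ins; cuts2014 deliberately stores LAESTAB as text.
cuts2014 = pd.DataFrame({"LAESTAB": ["2013614", "2022019"],
                         "total2014": [1600000, 2000000]})
cuts2015 = pd.DataFrame({"LAESTAB": [2013614, 2022019, 2022036],
                         "URN": [100000, 100008, 100009],
                         "total2015": [1644000, 2093000, 2221000]})
cuts2016 = pd.DataFrame({"LAESTAB": [2013614, 2022019, 2022095],
                         "UPIN": [500000, 500005, 500007],
                         "total2016": [1659000, 1956000, 1059000]})


def key_and_total(df, total_col):
    """Return only LAESTAB and one total column, with LAESTAB as int64."""
    out = df[["LAESTAB", total_col]].copy()
    # Unparseable keys become NaN and are dropped before the integer cast.
    out["LAESTAB"] = pd.to_numeric(out["LAESTAB"], errors="coerce")
    out = out.dropna(subset=["LAESTAB"])
    out["LAESTAB"] = out["LAESTAB"].astype("int64")
    return out


pairs = [(cuts2014, "total2014"), (cuts2015, "total2015"), (cuts2016, "total2016")]
trimmed = [key_and_total(df, col) for df, col in pairs]

# Successive inner joins keep only schools present in every frame.
df_final = reduce(
    lambda left, right: pd.merge(left, right, on="LAESTAB", how="inner"),
    trimmed,
)
print(df_final)
```

Before merging the real data, comparing `cuts2013["LAESTAB"].dtype` with the other frames' key dtypes is a quick way to see whether this assumption applies.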

For more information on merging, see the following links:

http://chrisalbon.com/python/pandas_join_merge_dataframe.html

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html