Pandas.read_excel读取一组xlsx文件的KeyError

我正在使用数据analitycs Anaconda壳Uplodingpandas一堆的Excel文件(25个文件)在这个文件https://www.dropbox.com/s/16ea1cw6k63i16p/Newdata.zip?dl=0我得到错误。 不能find为什么以及如何解决它的原因。

import pandas as pd import numpy as np import os os.chdir(r"C:\Users\Twentyouts\Desktop\Newdata" ) path = os.getcwd() files = os.listdir(path) files_xlsx = [f for f in files if f[-4:] == 'xlsx'] for f in files_xlsx: print(f) loading = pd.read_excel(f, heading = 0) df = df.append(loading) 2016-06-20–2016-06-26.xlsx 2016-06-27–2016-07-03.xlsx 2016-07-04–2016-07-10.xlsx 2016-07-11–2016-07-17.xlsx 2016-08-01–2016-08-07.xlsx 2016-08-15–2016-08-21.xlsx 

 KeyError Traceback (most recent call last) <ipython-input-23-5737d4d13b9f> in <module>() 1 df = pd.DataFrame() ----> 2 pd.read_excel('2016-08-15–2016-08-21.xlsx') C:\Users\Twentyouts\Anaconda3\lib\site-packages\pandas\io\excel.py in read_excel(io, sheetname, header, skiprows, skip_footer, index_col, names, parse_cols, parse_dates, date_parser, na_values, thousands, convert_float, has_index_names, converters, true_values, false_values, engine, squeeze, **kwds) 189 190 if not isinstance(io, ExcelFile): --> 191 io = ExcelFile(io, engine=engine) 192 193 return io._parse_excel( C:\Users\Twentyouts\Anaconda3\lib\site-packages\pandas\io\excel.py in __init__(self, io, **kwds) 247 self.book = xlrd.open_workbook(file_contents=data) 248 elif isinstance(io, compat.string_types): --> 249 self.book = xlrd.open_workbook(io) 250 else: 251 raise ValueError('Must explicitly set engine if not passing in' C:\Users\Twentyouts\Anaconda3\lib\site-packages\xlrd\__init__.py in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows) 420 formatting_info=formatting_info, 421 on_demand=on_demand, --> 422 ragged_rows=ragged_rows, 423 ) 424 return bk C:\Users\Twentyouts\Anaconda3\lib\site-packages\xlrd\xlsx.py in open_workbook_2007_xml(zf, component_names, logfile, verbosity, use_mmap, formatting_info, on_demand, ragged_rows) 831 x12sheet = X12Sheet(sheet, logfile, verbosity) 832 heading = "Sheet %r (sheetx=%d) from %r" % (sheet.name, sheetx, fname) --> 833 x12sheet.process_stream(zflo, heading) 834 del zflo 835 C:\Users\Twentyouts\Anaconda3\lib\site-packages\xlrd\xlsx.py in own_process_stream(self, stream, heading) 546 for event, elem in ET.iterparse(stream): 547 if elem.tag == row_tag: --> 548 self_do_row(elem) 549 elem.clear() # destroy all child elements (cells) 550 elif elem.tag == U_SSML12 + "dimension": C:\Users\Twentyouts\Anaconda3\lib\site-packages\xlrd\xlsx.py in do_row(self, row_elem) 743 else: 744 bad_child_tag(child_tag) --> 745 value = error_code_from_text[tvalue] 746 self.sheet.put_cell(rowx, colx, XL_CELL_ERROR, value, xf_index) 747 elif cell_type == "inlineStr": KeyError: None 

的确如@MaxU指出的那样,Excel文件格式不正确,但有趣的是,如果正确保存为.xlsx文件,则会解决这个问题。 可能的是,通过简单地将扩展名更改为.xlsx,试图从先前的.xls版本升级无效文件。 这两种文件格式不是简单的文本文件,可以改变扩展名,而不会造成危害,但是二进制格式非常不同

考虑使用wn32com模块运行COM接口,以使用Excel的Workbook.SaveAs方法将格式错误的文件正确保存到实际的OpenXML工作簿中 。 注意:此解决scheme仅适用于安装了MS Excel的Windows for Python。

 import pandas as pd import glob import win32com.client as win32 xlsxfiles = glob.glob("C:\\Path\\To\\Workbooks\\*.xlsx") def save_xlsx(srcfile): try: newfile = srcfile.replace('.xlsx', '_new.xlsx') print('Malformed file saved as {}'.format(newfile)) xlApp = win32.gencache.EnsureDispatch('Excel.Application') wb = xlApp.Workbooks.Open(srcfile) wb.SaveAs(newfile, 51) except Exception as e: print(e) finally: wb.Close(True); wb = None xlApp.Quit; xlApp = None return newfile def xl_read(): dfs = [] for f in xlsxfiles: try: df = pd.read_excel(f) except Exception as e: df = pd.read_excel(save_xlsx(f)) print('File: {}, Shape: {}'.format(f, df.shape)) dfs.append(df) return pd.concat(dfs) print('Final dataframe shape: {}'.format(xl_read().shape)) 

输出 (330257行和30列的最终dataframe)

 File: C:\Path\To\Workbooks\2016-06-20–2016-06-26.xlsx, Shape: (5912, 27) File: C:\Path\To\Workbooks\2016-06-27–2016-07-03.xlsx, Shape: (5362, 27) File: C:\Path\To\Workbooks\2016-07-04–2016-07-10.xlsx, Shape: (5387, 27) File: C:\Path\To\Workbooks\2016-07-11–2016-07-17.xlsx, Shape: (5331, 28) File: C:\Path\To\Workbooks\2016-08-01–2016-08-07.xlsx, Shape: (4965, 28) Malformed file saved as C:\Path\To\Workbooks\2016-08-15–2016-08-21_new.xlsx File: C:\Path\To\Workbooks\2016-08-15–2016-08-21.xlsx, Shape: (5315, 27) File: C:\Path\To\Workbooks\2016-08-22–2016-08-28.xlsx, Shape: (5179, 27) File: C:\Path\To\Workbooks\2016-08-29–2016-09-04.xlsx, Shape: (5855, 27) Malformed file saved as C:\Path\To\Workbooks\2016-09-05–2016-09-11_new.xlsx File: C:\Path\To\Workbooks\2016-09-05–2016-09-11.xlsx, Shape: (5838, 27) Malformed file saved as C:\Path\To\Workbooks\2016-09-12–2016-09-18_new.xlsx File: C:\Path\To\Workbooks\2016-09-12–2016-09-18.xlsx, Shape: (5729, 27) Malformed file saved as C:\Path\To\Workbooks\2016-09-19–2016-09-25_new.xlsx File: C:\Path\To\Workbooks\2016-09-19–2016-09-25.xlsx, Shape: (6401, 27) File: C:\Path\To\Workbooks\2016-09-26–2016-10-02.xlsx, Shape: (7018, 27) File: C:\Path\To\Workbooks\2016-09.xlsx, Shape: (23874, 27) File: C:\Path\To\Workbooks\2016-10-03–2016-10-09.xlsx, Shape: (6587, 27) Malformed file saved as C:\Path\To\Workbooks\2016-10-10–2016-10-12_new.xlsx File: C:\Path\To\Workbooks\2016-10-10–2016-10-12.xlsx, Shape: (2883, 27) Malformed file saved as C:\Path\To\Workbooks\2016-10-10–2016-10-13_new.xlsx File: C:\Path\To\Workbooks\2016-10-10–2016-10-13.xlsx, Shape: (4174, 27) Malformed file saved as C:\Path\To\Workbooks\2016-10-17–2016-10-20_new.xlsx File: C:\Path\To\Workbooks\2016-10-17–2016-10-20.xlsx, Shape: (4560, 27) Malformed file saved as C:\Path\To\Workbooks\2016-10-17–2016-10-23_new.xlsx File: C:\Path\To\Workbooks\2016-10-17–2016-10-23.xlsx, Shape: (7111, 27) Malformed file saved as C:\Path\To\Workbooks\2016-10-24–2016-10-27_new.xlsx File: C:\Path\To\Workbooks\2016-10-24–2016-10-27.xlsx, Shape: (4921, 27) Malformed file saved as C:\Path\To\Workbooks\2016-10-24–2016-10-30_new.xlsx File: C:\Path\To\Workbooks\2016-10-24–2016-10-30.xlsx, Shape: (8005, 27) Malformed file saved as C:\Path\To\Workbooks\2016-10-31–2016-11-06_new.xlsx File: C:\Path\To\Workbooks\2016-10-31–2016-11-06.xlsx, Shape: (7029, 27) Malformed file saved as C:\Path\To\Workbooks\2016-10_new.xlsx File: C:\Path\To\Workbooks\2016-10.xlsx, Shape: (28098, 27) Malformed file saved as C:\Path\To\Workbooks\2016-11-07–2016-11-13_new.xlsx File: C:\Path\To\Workbooks\2016-11-07–2016-11-13.xlsx, Shape: (7076, 27) Malformed file saved as C:\Path\To\Workbooks\2016-11-14–2016-11-20_new.xlsx File: C:\Path\To\Workbooks\2016-11-14–2016-11-20.xlsx, Shape: (7758, 27) Malformed file saved as C:\Path\To\Workbooks\2016-11-21_new.xlsx File: C:\Path\To\Workbooks\2016-11-21.xlsx, Shape: (1689, 27) Malformed file saved as C:\Path\To\Workbooks\2016-11-21–2016-11-23_new.xlsx File: C:\Path\To\Workbooks\2016-11-21–2016-11-23.xlsx, Shape: (4711, 27) Malformed file saved as C:\Path\To\Workbooks\2016-11-28–2016-12-04_new.xlsx File: C:\Path\To\Workbooks\2016-11-28–2016-12-04.xlsx, Shape: (9286, 27) Malformed file saved as C:\Path\To\Workbooks\2016-11_new.xlsx File: C:\Path\To\Workbooks\2016-11.xlsx, Shape: (30505, 27) Malformed file saved as C:\Path\To\Workbooks\2016-12-05–2016-12-11_new.xlsx File: C:\Path\To\Workbooks\2016-12-05–2016-12-11.xlsx, Shape: (8802, 27) Malformed file saved as C:\Path\To\Workbooks\2016-12-12–2016-12-18_new.xlsx File: C:\Path\To\Workbooks\2016-12-12–2016-12-18.xlsx, Shape: (8333, 27) Malformed file saved as C:\Path\To\Workbooks\2016-12-16–2016-12-22_new.xlsx File: C:\Path\To\Workbooks\2016-12-16–2016-12-22.xlsx, Shape: (8592, 27) Malformed file saved as C:\Path\To\Workbooks\2016-12-26–2016-12-31_new.xlsx File: C:\Path\To\Workbooks\2016-12-26–2016-12-31.xlsx, Shape: (5362, 27) Malformed file saved as C:\Path\To\Workbooks\2017-01-01–2017-01-08_new.xlsx File: C:\Path\To\Workbooks\2017-01-01–2017-01-08.xlsx, Shape: (4322, 27) Malformed file saved as C:\Path\To\Workbooks\2017-01-09–2017-01-15_new.xlsx File: C:\Path\To\Workbooks\2017-01-09–2017-01-15.xlsx, Shape: (7608, 27) Malformed file saved as C:\Path\To\Workbooks\2017-01-23–2017-01-29_new.xlsx File: C:\Path\To\Workbooks\2017-01-23–2017-01-29.xlsx, Shape: (8903, 27) Malformed file saved as C:\Path\To\Workbooks\2017-01-30–2017-02-05_new.xlsx File: C:\Path\To\Workbooks\2017-01-30–2017-02-05.xlsx, Shape: (9173, 27) Malformed file saved as C:\Path\To\Workbooks\2017-02-13–2017-02-12_new.xlsx File: C:\Path\To\Workbooks\2017-02-13–2017-02-12.xlsx, Shape: (9144, 27) Malformed file saved as C:\Path\To\Workbooks\2017-02-13–2017-02-19_new.xlsx File: C:\Path\To\Workbooks\2017-02-13–2017-02-19.xlsx, Shape: (9911, 27) File: C:\Path\To\Workbooks\test.xlsx, Shape: (5315, 27) Malformed file saved as C:\Path\To\Workbooks\Выгрузка 12-15.12_new.xlsx File: C:\Path\To\Workbooks\Выгрузка 12-15.12.xlsx, Shape: (4818, 27) Malformed file saved as C:\Path\To\Workbooks\Выгрузка 21-27_new.xlsx File: C:\Path\To\Workbooks\Выгрузка 21-27.xlsx, Shape: (8876, 27) File: C:\Path\To\Workbooks\Выгрузка 26-29.12.xlsx, Shape: (4539, 27) Final dataframe shape: (330257, 30) 

甚至考虑使用Windows的ACE引擎通过pyodbc查询相应的工作簿与pandas read_sql的数据库引擎方法,因为每个共享相同的工作表名称, TDSheet

 #...same as above import pyodbc def sql_read(): dfs = [] for f in xlsxfiles: try: conn = pyodbc.connect('Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};'+\ 'DBQ=C:\\Path\\To\\Workbooks\\{};'.format(f), autocommit=True) df = pd.read_sql('SELECT * FROM [TDSheet$];', conn) except Exception as e: conn.close() conn = pyodbc.connect('Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};'+\ 'DBQ=C:\\Path\\To\\Workbooks\\{};'.format(save_xlsx(f)), autocommit=True) df = pd.read_excel('SELECT * FROM [TDSheet$];', conn) conn.close() print('File: {}, Shape: {}'.format(f, df.shape)) dfs.append(df) 

它看起来像你的一些Excel文件格式不正确:

 import os import glob import pandas as pd excel_files_mask = r'D:\temp\.data\42468475\*.xlsx' files = glob.glob(excel_files_mask) def merge_excel_files(files, **kwargs): #return pd.concat([pd.read_excel(f, **kwargs) for f in files], # ignore_index=True) dfs = [] for f in files: #print('processing: [{}]'.format(f)) try: df = pd.read_excel(f, **kwargs) dfs.append(df) print('parsed: [{}], shape: {}'.format(f, df.shape)) except KeyError: print("ERROR: file [{}] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...".format(f)) return pd.concat(dfs, ignore_index=True) df = merge_excel_files(files, header=None, skiprows=1) print(df.shape) 

产量:

 parsed: [D:\temp\.data\42468475\2016-06-20–2016-06-26.xlsx], shape: (5912, 27) parsed: [D:\temp\.data\42468475\2016-06-27–2016-07-03.xlsx], shape: (5362, 27) parsed: [D:\temp\.data\42468475\2016-07-04–2016-07-10.xlsx], shape: (5387, 27) parsed: [D:\temp\.data\42468475\2016-07-11–2016-07-17.xlsx], shape: (5331, 28) parsed: [D:\temp\.data\42468475\2016-08-01–2016-08-07.xlsx], shape: (4965, 28) ERROR: file [D:\temp\.data\42468475\2016-08-15–2016-08-21.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... parsed: [D:\temp\.data\42468475\2016-08-22–2016-08-28.xlsx], shape: (5179, 27) parsed: [D:\temp\.data\42468475\2016-08-29–2016-09-04.xlsx], shape: (5855, 27) ERROR: file [D:\temp\.data\42468475\2016-09-05–2016-09-11.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... ERROR: file [D:\temp\.data\42468475\2016-09-12–2016-09-18.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... ERROR: file [D:\temp\.data\42468475\2016-09-19–2016-09-25.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... parsed: [D:\temp\.data\42468475\2016-09-26–2016-10-02.xlsx], shape: (7018, 27) parsed: [D:\temp\.data\42468475\2016-09.xlsx], shape: (23874, 27) parsed: [D:\temp\.data\42468475\2016-10-03–2016-10-09.xlsx], shape: (6587, 27) ERROR: file [D:\temp\.data\42468475\2016-10-10–2016-10-12.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... ERROR: file [D:\temp\.data\42468475\2016-10-10–2016-10-13.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... ERROR: file [D:\temp\.data\42468475\2016-10-17–2016-10-20.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... ERROR: file [D:\temp\.data\42468475\2016-10-17–2016-10-23.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... ERROR: file [D:\temp\.data\42468475\2016-10-24–2016-10-27.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... ERROR: file [D:\temp\.data\42468475\2016-10-24–2016-10-30.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... ERROR: file [D:\temp\.data\42468475\2016-10-31–2016-11-06.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... ERROR: file [D:\temp\.data\42468475\2016-10.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... ERROR: file [D:\temp\.data\42468475\2016-11-07–2016-11-13.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... ERROR: file [D:\temp\.data\42468475\2016-11-14–2016-11-20.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... ERROR: file [D:\temp\.data\42468475\2016-11-21.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... ERROR: file [D:\temp\.data\42468475\2016-11-21–2016-11-23.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... ERROR: file [D:\temp\.data\42468475\2016-11-28–2016-12-04.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... ERROR: file [D:\temp\.data\42468475\2016-11.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... ERROR: file [D:\temp\.data\42468475\2016-12-05–2016-12-11.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... ERROR: file [D:\temp\.data\42468475\2016-12-12–2016-12-18.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... ERROR: file [D:\temp\.data\42468475\2016-12-16–2016-12-22.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... ERROR: file [D:\temp\.data\42468475\2016-12-26–2016-12-31.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... ERROR: file [D:\temp\.data\42468475\2017-01-01–2017-01-08.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... ERROR: file [D:\temp\.data\42468475\2017-01-09–2017-01-15.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... ERROR: file [D:\temp\.data\42468475\2017-01-23–2017-01-29.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... ERROR: file [D:\temp\.data\42468475\2017-01-30–2017-02-05.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... ERROR: file [D:\temp\.data\42468475\2017-02-13–2017-02-12.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... ERROR: file [D:\temp\.data\42468475\2017-02-13–2017-02-19.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... parsed: [D:\temp\.data\42468475\test.xlsx], shape: (5315, 27) ERROR: file [D:\temp\.data\42468475\Выгрузка 12-15.12.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... ERROR: file [D:\temp\.data\42468475\Выгрузка 21-27.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ... parsed: [D:\temp\.data\42468475\Выгрузка 26-29.12.xlsx], shape: (4539, 27) (85324, 28)