Reading Excel XML .xls files with pandas

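I am aware of several previously asked questions, but none of the solutions given there work on the reproducible example I provide below.

I am trying to read .xls files from http://www.eia.gov/coal/data.cfm#production, specifically the Historical detailed coal production data (1983-2013) file coalpublic2012.xls, which is available via the dropdown. pandas cannot read it.

In contrast, the file for the most recent year, 2013, coalpublic2013.xls, works without a problem: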

    import pandas as pd
    df1 = pd.read_excel("coalpublic2013.xls")

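But the .xls files for the preceding years (2004-2012) do not load. I have opened these files in Excel; they open fine and are not corrupted.

The error I get from pandas is: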

    ---------------------------------------------------------------------------
    XLRDError                                 Traceback (most recent call last)
    <ipython-input-28-0da33766e9d2> in <module>()
    ----> 1 df = pd.read_excel("coalpublic2012.xlsx")

    /Users/jonathan/anaconda/lib/python2.7/site-packages/pandas/io/excel.pyc in read_excel(io, sheetname, header, skiprows, skip_footer, index_col, parse_cols, parse_dates, date_parser, na_values, thousands, convert_float, has_index_names, converters, engine, **kwds)
        161
        162     if not isinstance(io, ExcelFile):
    --> 163         io = ExcelFile(io, engine=engine)
        164
        165     return io._parse_excel(

    /Users/jonathan/anaconda/lib/python2.7/site-packages/pandas/io/excel.pyc in __init__(self, io, **kwds)
        204                 self.book = xlrd.open_workbook(file_contents=data)
        205             else:
    --> 206                 self.book = xlrd.open_workbook(io)
        207         elif engine == 'xlrd' and isinstance(io, xlrd.Book):
        208             self.book = io

    /Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/__init__.pyc in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
        433                 formatting_info=formatting_info,
        434                 on_demand=on_demand,
    --> 435                 ragged_rows=ragged_rows,
        436                 )
        437         return bk

    /Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in open_workbook_xls(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
         89         t1 = time.clock()
         90         bk.load_time_stage_1 = t1 - t0
    ---> 91         biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
         92         if not biff_version:
         93             raise XLRDError("Can't determine file's BIFF version")

    /Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in getbof(self, rqd_stream)
       1228             bof_error('Expected BOF record; met end of file')
       1229         if opcode not in bofcodes:
    -> 1230             bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8])
       1231         length = self.get2bytes()
       1232         if length == MY_EOF:

    /Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in bof_error(msg)
       1222         if DEBUG: print("reqd: 0x%04x" % rqd_stream, file=self.logfile)
       1223         def bof_error(msg):
    -> 1224             raise XLRDError('Unsupported format, or corrupt file: ' + msg)
       1225         savpos = self._position
       1226         opcode = self.get2bytes()

    XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '<?xml ve'

I have also tried various other things:

    df = pd.ExcelFile("coalpublic2012.xls", encoding_override='cp1252')

    import xlrd
    wb = xlrd.open_workbook("coalpublic2012.xls")

All to no avail. My pandas version: 0.17.0

I have also submitted this as a bug on the pandas GitHub issue tracker.

You can convert this Excel XML file programmatically. Requirements: only Python and pandas.

    import pandas as pd
    from xml.sax import ContentHandler, parse

    # Reference https://goo.gl/KaOBG3
    class ExcelHandler(ContentHandler):
        def __init__(self):
            self.chars = [ ]
            self.cells = [ ]
            self.rows = [ ]
            self.tables = [ ]
        def characters(self, content):
            self.chars.append(content)
        def startElement(self, name, atts):
            if name=="Cell":
                self.chars = [ ]
            elif name=="Row":
                self.cells=[ ]
            elif name=="Table":
                self.rows = [ ]
        def endElement(self, name):
            if name=="Cell":
                self.cells.append(''.join(self.chars))
            elif name=="Row":
                self.rows.append(self.cells)
            elif name=="Table":
                self.tables.append(self.rows)

    excelHandler = ExcelHandler()
    parse('coalpublic2012.xls', excelHandler)
    df1 = pd.DataFrame(excelHandler.tables[0][4:], columns=excelHandler.tables[0][3])
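
The same idea can be sketched with xml.etree.ElementTree from the standard library. This is only an illustration, not the method from the answer above: it assumes the file uses the Excel 2003 SpreadsheetML namespace (urn:schemas-microsoft-com:office:spreadsheet) and Python 3, the function name read_spreadsheetml is made up, and the SAMPLE document is invented test data rather than anything taken from coalpublic2012.xls. Unlike the SAX handler, it also pads the gaps that SpreadsheetML marks with an ss:Index attribute, so sparse rows stay aligned with their columns.

```python
import io
import xml.etree.ElementTree as ET

# Namespace used by Excel 2003 XML Spreadsheet (SpreadsheetML) documents.
SS = "urn:schemas-microsoft-com:office:spreadsheet"


def read_spreadsheetml(source, sheet=0):
    """Return one worksheet as a list of rows, each a list of strings.

    `source` is a filename or a binary file object. This helper is a
    hypothetical sketch, not part of pandas or xlrd.
    """
    root = ET.parse(source).getroot()
    worksheets = root.findall("{%s}Worksheet" % SS)
    table = worksheets[sheet].find("{%s}Table" % SS)
    rows = []
    for row in table.findall("{%s}Row" % SS):
        cells = []
        for cell in row.findall("{%s}Cell" % SS):
            # ss:Index is 1-based; it means the cells before it were skipped.
            index = cell.get("{%s}Index" % SS)
            if index is not None:
                while len(cells) < int(index) - 1:
                    cells.append("")
            data = cell.find("{%s}Data" % SS)
            cells.append("".join(data.itertext()) if data is not None else "")
        rows.append(cells)
    return rows


# Invented sample document, used only to exercise the function.
SAMPLE = b"""<?xml version="1.0"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
  <Worksheet ss:Name="Sheet1">
    <Table>
      <Row>
        <Cell><Data ss:Type="String">Year</Data></Cell>
        <Cell><Data ss:Type="String">State</Data></Cell>
        <Cell><Data ss:Type="String">Tons</Data></Cell>
      </Row>
      <Row>
        <Cell><Data ss:Type="Number">2012</Data></Cell>
        <Cell ss:Index="3"><Data ss:Type="Number">10</Data></Cell>
      </Row>
    </Table>
  </Worksheet>
</Workbook>"""

rows = read_spreadsheetml(io.BytesIO(SAMPLE))
```

For the real file, the resulting list could be handed to pd.DataFrame in the same way as above, for example pd.DataFrame(rows[4:], columns=rows[3]), assuming the header sits in the fourth row as in the answer's code.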

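The problem is that while the 2013 data is an actual Excel file, the 2012 data is an XML document, which does not seem to be supported in Python. I would say your best bet is to open it in Excel and save a copy either as a proper Excel file or as a CSV.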
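
One way to confirm this without opening Excel is to look at the first bytes of each file, which is also what the xlrd error message reports ('<?xml ve'). The sketch below is an illustration under stated assumptions: the function names and the returned labels are made up, and it only distinguishes the OLE2 container used by binary .xls files, the ZIP container used by .xlsx, and a leading XML declaration.

```python
# Signature of the OLE2 compound-document container used by binary .xls files.
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"
# Signature of a ZIP archive, the container used by .xlsx files.
ZIP_MAGIC = b"PK\x03\x04"
# A UTF-8 byte-order mark may precede the XML declaration.
UTF8_BOM = b"\xef\xbb\xbf"


def sniff_spreadsheet_format(head):
    """Guess the container format from the first bytes of a file.

    Hypothetical helper for illustration; the labels are arbitrary.
    """
    if head.startswith(OLE2_MAGIC):
        return "xls-binary"
    if head.startswith(ZIP_MAGIC):
        return "xlsx-zip"
    if head.startswith(UTF8_BOM):
        head = head[len(UTF8_BOM):]
    if head.lstrip().startswith(b"<?xml"):
        return "xml"
    return "unknown"


def sniff_file(path):
    """Read the first few bytes of `path` and classify them."""
    with open(path, "rb") as fh:
        return sniff_spreadsheet_format(fh.read(16))
```

If the diagnosis in this answer is right, sniff_file("coalpublic2013.xls") would be expected to report the binary format and sniff_file("coalpublic2012.xls") the XML one, regardless of the shared .xls extension.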

You can convert this Excel XML file programmatically. Requirements: Windows with Office installed.

1. Create the ExcelToCsv.vbs script in Notepad:

    if WScript.Arguments.Count < 3 Then
        WScript.Echo "Please specify the source and the destination files. Usage: ExcelToCsv <xls/xlsx source file> <csv destination file> <worksheet number (starts at 1)>"
        Wscript.Quit
    End If

    csv_format = 6

    Set objFSO = CreateObject("Scripting.FileSystemObject")

    src_file = objFSO.GetAbsolutePathName(Wscript.Arguments.Item(0))
    dest_file = objFSO.GetAbsolutePathName(WScript.Arguments.Item(1))
    worksheet_number = CInt(WScript.Arguments.Item(2))

    Dim oExcel
    Set oExcel = CreateObject("Excel.Application")

    Dim oBook
    Set oBook = oExcel.Workbooks.Open(src_file)
    oBook.Worksheets(worksheet_number).Activate

    oBook.SaveAs dest_file, csv_format

    oBook.Close False
    oExcel.Quit

2. Convert the Excel XML file to CSV:

$ cscript ExcelToCsv.vbs coalpublic2012.xls coalpublic2012.csv 1

3. Open the CSV file with pandas:

>>> df1 = pd.read_csv('coalpublic2012.csv', skiprows=3)

Reference: Faster way to read Excel files to pandas dataframe