Reading Excel XML .xls files with pandas

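I'm aware of several previously asked questions, but none of the solutions given there works on the reproducible example I provide below.

I am trying to read .xls files from http://www.eia.gov/coal/data.cfm#production, specifically the "Historical detailed coal production data (1983-2013)" file coalpublic2012.xls, which is freely available via the dropdown. pandas cannot read it.

In contrast, the file for the most recent year available, 2013 (coalpublic2013.xls), loads without a problem: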

    import pandas as pd

    df1 = pd.read_excel("coalpublic2013.xls")

But the .xls files for the years before that (2004-2012) do not load. I opened these files in Excel: they open fine and are not corrupted.

The error I get from pandas is:

    ---------------------------------------------------------------------------
    XLRDError                                 Traceback (most recent call last)
    <ipython-input-28-0da33766e9d2> in <module>()
    ----> 1 df = pd.read_excel("coalpublic2012.xlsx")

    /Users/jonathan/anaconda/lib/python2.7/site-packages/pandas/io/excel.pyc in read_excel(io, sheetname, header, skiprows, skip_footer, index_col, parse_cols, parse_dates, date_parser, na_values, thousands, convert_float, has_index_names, converters, engine, **kwds)
        161
        162     if not isinstance(io, ExcelFile):
    --> 163         io = ExcelFile(io, engine=engine)
        164
        165     return io._parse_excel(

    /Users/jonathan/anaconda/lib/python2.7/site-packages/pandas/io/excel.pyc in __init__(self, io, **kwds)
        204                 self.book = xlrd.open_workbook(file_contents=data)
        205             else:
    --> 206                 self.book = xlrd.open_workbook(io)
        207         elif engine == 'xlrd' and isinstance(io, xlrd.Book):
        208             self.book = io

    /Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/__init__.pyc in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
        433         formatting_info=formatting_info,
        434         on_demand=on_demand,
    --> 435         ragged_rows=ragged_rows,
        436         )
        437     return bk

    /Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in open_workbook_xls(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
         89         t1 = time.clock()
         90         bk.load_time_stage_1 = t1 - t0
    ---> 91         biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
         92         if not biff_version:
         93             raise XLRDError("Can't determine file's BIFF version")

    /Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in getbof(self, rqd_stream)
       1228             bof_error('Expected BOF record; met end of file')
       1229         if opcode not in bofcodes:
    -> 1230             bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8])
       1231         length = self.get2bytes()
       1232         if length == MY_EOF:

    /Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in bof_error(msg)
       1222         if DEBUG: print("reqd: 0x%04x" % rqd_stream, file=self.logfile)
       1223         def bof_error(msg):
    -> 1224             raise XLRDError('Unsupported format, or corrupt file: ' + msg)
       1225         savpos = self._position
       1226         opcode = self.get2bytes()

    XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '<?xml ve'

I have also tried a few other things:

    df = pd.ExcelFile("coalpublic2012.xls", encoding_override='cp1252')

    import xlrd
    wb = xlrd.open_workbook("coalpublic2012.xls")

All to no avail. My pandas version is 0.17.0.

I have also filed this as a bug on the pandas GitHub issue tracker.

You can convert this Excel XML file programmatically. Requirements: only Python and pandas.

    import pandas as pd
    from xml.sax import ContentHandler, parse

    # Reference https://goo.gl/KaOBG3
    class ExcelHandler(ContentHandler):
        def __init__(self):
            self.chars = []
            self.cells = []
            self.rows = []
            self.tables = []

        def characters(self, content):
            self.chars.append(content)

        def startElement(self, name, atts):
            if name == "Cell":
                self.chars = []
            elif name == "Row":
                self.cells = []
            elif name == "Table":
                self.rows = []

        def endElement(self, name):
            if name == "Cell":
                self.cells.append(''.join(self.chars))
            elif name == "Row":
                self.rows.append(self.cells)
            elif name == "Table":
                self.tables.append(self.rows)

    excelHandler = ExcelHandler()
    parse('coalpublic2012.xls', excelHandler)
    df1 = pd.DataFrame(excelHandler.tables[0][4:], columns=excelHandler.tables[0][3])
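To see what the handler produces without downloading anything, here is a minimal sketch of the same SAX approach run on a tiny inline workbook. The `SAMPLE` document and the `TableHandler` name are made up for illustration, and the sample assumes the Excel 2003 XML layout (`Table` > `Row` > `Cell` > `Data`) with the header in the first row; it uses only the standard library.

```python
from xml.sax import ContentHandler, parseString

# Hypothetical two-row workbook in Excel 2003 XML form (not the EIA data).
SAMPLE = b"""<?xml version="1.0"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
 <Worksheet ss:Name="Sheet1">
  <Table>
   <Row><Cell><Data ss:Type="String">Year</Data></Cell><Cell><Data ss:Type="String">Production</Data></Cell></Row>
   <Row><Cell><Data ss:Type="Number">2012</Data></Cell><Cell><Data ss:Type="Number">100.5</Data></Cell></Row>
  </Table>
 </Worksheet>
</Workbook>
"""


class TableHandler(ContentHandler):
    """Collect every Table as a list of rows, each row a list of cell strings."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.chars = []
        self.cells = []
        self.rows = []
        self.tables = []

    def characters(self, content):
        # Text may arrive in several chunks; they are joined when the cell ends.
        self.chars.append(content)

    def startElement(self, name, attrs):
        if name == "Cell":
            self.chars = []
        elif name == "Row":
            self.cells = []
        elif name == "Table":
            self.rows = []

    def endElement(self, name):
        if name == "Cell":
            self.cells.append("".join(self.chars))
        elif name == "Row":
            self.rows.append(self.cells)
        elif name == "Table":
            self.tables.append(self.rows)


handler = TableHandler()
parseString(SAMPLE, handler)

table = handler.tables[0]
header, body = table[0], table[1:]

# Every cell arrives as text, so numeric columns need an explicit conversion.
production = [float(row[1]) for row in body]

print(header)
print(production)
```

In the EIA file the header appears to sit at row index 3, which is why the code above slices `tables[0][3]` and `tables[0][4:]`. Because all values come back as strings, the resulting DataFrame columns will probably need converting as well, for example with `pd.to_numeric`. This sketch also ignores cells that Excel skips via the `ss:Index` attribute, so sparse rows may come out misaligned.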

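The problem is that while the 2013 data is an actual Excel file, the 2012 data is an XML document, which does not seem to be supported in Python. I would say your best bet is to open it in Excel and save a copy, either as a proper Excel file or as a CSV.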
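The `found '<?xml ve'` at the end of the traceback is just the first eight bytes of the file, so you can check what a given .xls really is before handing it to pandas. A minimal sketch, assuming the usual file signatures (OLE2 compound file for binary .xls, zip for .xlsx); the function name and the returned labels are made up for illustration:

```python
import io

OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # binary .xls (OLE2 compound file)
ZIP_MAGIC = b"PK\x03\x04"                          # .xlsx (zip container)
UTF8_BOM = b"\xef\xbb\xbf"


def sniff_spreadsheet_format(stream):
    """Guess the real format from the leading bytes, ignoring the extension."""
    head = stream.read(64)
    if head.startswith(OLE2_MAGIC):
        return "xls-binary"
    if head.startswith(ZIP_MAGIC):
        return "xlsx-zip"
    # Tolerate an optional UTF-8 byte order mark and leading whitespace.
    if head.startswith(UTF8_BOM):
        head = head[len(UTF8_BOM):]
    if head.lstrip().startswith(b"<?xml"):
        return "xml"
    return "unknown"


# Demo on in-memory bytes; for a real file use:
#     with open("coalpublic2012.xls", "rb") as f:
#         print(sniff_spreadsheet_format(f))
print(sniff_spreadsheet_format(io.BytesIO(b'<?xml version="1.0"?><Workbook/>')))
print(sniff_spreadsheet_format(io.BytesIO(OLE2_MAGIC + b"\x00" * 8)))
```

A result of "xml" only says the file is some XML document; it does not confirm that it is an Excel XML spreadsheet.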

You can convert this Excel XML file programmatically. Requirements: Windows with Office installed.

1. In Notepad, create the ExcelToCsv.vbs script:

    if WScript.Arguments.Count < 3 Then
        WScript.Echo "Please specify the source and the destination files. Usage: ExcelToCsv <xls/xlsx source file> <csv destination file> <worksheet number (starts at 1)>"
        Wscript.Quit
    End If

    csv_format = 6

    Set objFSO = CreateObject("Scripting.FileSystemObject")

    src_file = objFSO.GetAbsolutePathName(Wscript.Arguments.Item(0))
    dest_file = objFSO.GetAbsolutePathName(WScript.Arguments.Item(1))
    worksheet_number = CInt(WScript.Arguments.Item(2))

    Dim oExcel
    Set oExcel = CreateObject("Excel.Application")

    Dim oBook
    Set oBook = oExcel.Workbooks.Open(src_file)
    oBook.Worksheets(worksheet_number).Activate

    oBook.SaveAs dest_file, csv_format

    oBook.Close False
    oExcel.Quit

2. Convert the Excel XML file to CSV:

    $ cscript ExcelToCsv.vbs coalpublic2012.xls coalpublic2012.csv 1

3. Open the CSV file with pandas:

    >>> df1 = pd.read_csv('coalpublic2012.csv', skiprows=3)

Reference: Faster way to read Excel files to pandas dataframe